HPCwire Talks Exascale with Doug Kothe at Oak Ridge National Laboratory

2022-07-06 17:29:19 By : Ms. Fanny Yeung

Since 1987 - Covering the Fastest Computers in the World and the People Who Run Them

In this one-on-one interview, Doug Kothe – associate laboratory director, Computing and Computational Sciences at Oak Ridge National Laboratory, and director of the Exascale Computing Project (ECP) – discusses Frontier’s progress, the significance of breaking the exaflops barrier, and the first applications that will run on Frontier. As Frontier gets up and running, the ECP will benchmark and validate across a broad range of targets in support of U.S. energy and security missions.

Tiffany Trader: Hey Doug. How’s it going?

Doug Kothe: It’s going pretty well.

Trader: We’re here at Oak Ridge National Laboratory with the newly appointed Associate Director for Oak Ridge. I believe that was official on June 6.

Kothe: It was. It was. So I’m a few days in, still trying to get my feet under me. I will say too that I retained Director of the Exascale Computing Project. But Lori Diachin from Livermore, my deputy, is really taking the reins there. And she’s doing great already.

Trader: You’re wearing two really big hats now, but they’re hats that I think fit nicely together.

Kothe: I believe so as well. I guess time will tell. I very much care deeply about ECP. I really want to see it across the finish line. We can see the light at the end of the tunnel and it’s not a train.

Trader: And that June 6 date I thought was notable because that’s pretty much exactly one week from the official debut of Frontier as the first supercomputer to cross the exaflops milestone, so it seemed like good timing there. And of course, you are the director of the ECP, the project responsible for exascale-readiness and getting the applications ready for day one on exascale. And now with these new machines coming online, I think that will be a proving ground for that – and that’s where these two things are coming together. Do you want to provide a mini update on ECP, and the applications that you’ve been developing and how that will roll out to the system and especially Frontier?

Kothe: This is an incredibly exciting time. I used to play high school football, if you can believe it. It’s like we’ve been doing two-a-day practices for five years. We are ready. It’s really an exciting time. I’ll note too that Frontier being delivered was incredible. It’s not formally part of this project of ECP. But we have dear friends and colleagues and we really are right there with them watching this happen, and helping where we could from the software point of view. And the fact that it was delivered just a few weeks ago, kind of given what we’ve seen going on with COVID and supply chain and all that is incredible. To be honest, I didn’t think it would happen when it did. Every machine is unique and this one was certainly unique and complex. But what impressed me was the staff at Oak Ridge working closely with AMD and HPE – really a very cohesive team and that’s what it takes.

So on the ECP side, we are ready. We’ve been working on a couple hundred nodes of Frontier really since January. And so our software stack and our applications are running well. And by well, meaning they’re compiling and they’re getting the right answer. That’s the first step. But now doing kind of single node or multi-node performance optimizations. Performance, in particular on the MI250X GPU for most of our apps is meeting expectations. We still have a lot of work to do in scaling up and, you know, being ready because we’re going to now move from 200 nodes to 9,400 nodes. And so it’s not going to be a walk in the park. I think our teams know what’s ahead of us. And that we’ll probably go through a several week, sort of scale-up period. And we’ll be rolling the apps on in terms of readiness, who’s the most ready. But we have [24] apps ready to go, and over 70 software products ready to go. They all have signed contracts with external reviewers for quantifying and demonstrating, you know, in fair amount of detail, what we signed up to five years ago. So it’s, again, it’s exciting to be here.

Trader: Five years ago… 2016, I think was the start of ECP.

Kothe: We really started funding the teams in September of ’16 and so, you know, it’ll be six years this September. And, you know, we had to… it’s hard to sort of set specific quantitative goals that far in advance within a field that’s so agile and ever changing, but so far it’s worked out well.

Trader: So you have 24 ECP applications – 21 of those are DOE Office of Science and the NNSA [National Nuclear Security Administration] contributes three of those. Can you give us some examples of those applications and the use cases?

Kothe: You bet. It is very exciting to talk about this. So for example, let’s talk power grid, being able to simulate the entire national power grid consists of three interconnects, and if you count the points for generation, transformers, houses, you’re up to 10 to the ninth, 10 to the 10th, at some point, maybe exascale level points, we want to be able to simulate what happens on the grid when certain power sources come on, maybe due to a disaster, or when we have wind and solar that tends to sort of ebb and flow with day and night. So we want to be able to do planning so we can help prevent blackouts or brownouts. I mean, we saw that in Texas recently and other places. That’s an exciting new application that’s very non-traditional HPC.

Another example is wind farms. In talking to the experts there, for wind farms consisting of tens of turbines, maybe 50 to 100 close by, they can buffet each other, and because of the turbulence from one turbine to the other, they can lose 20 to 30 percent of the potential wind energy coming into the wind farm. So they’re not nearly as efficient as they could be. So we’re trying to understand that so we can develop more efficient wind farms. We’re simulating quantum materials. Quantum materials are materials where the electrons can flow around very freely. In quantum, we’re trying to understand what makes the material have correlation as it’s called. And that will inform how to build quantum computers, it will inform how to build room temperature superconductors or super insulators. That’s an example of a materials application.

Others include chemistry, being able to design catalysts, basically virtually design a molecule that helps catalyze a reaction without having to do any experiments. And maybe you go into the experiment, and you fabricate and you confirm, rather than just explore. My background in nuclear engineering – I’d be remiss not to mention this – being able to design in the computer small and micro reactors, and then go out and build a safe operating reactor without necessarily having to do… with very little testing. We also engaged in fusion and also clean combustion of coal, being able to burn coal or oil, and have the byproduct be just CO2 that you can then capture and reuse or sequester. So it’s typical for the Department of Energy – and I’ll emphasize Energy – we’re all into energy production, energy transmission, materials and chemistry for energy. But I’ll also mention, the Department of Energy funds fundamental science, the origin of elements in the universe, the evolution of the universe, the nuclear force, which is known as the standard model – very, very fundamental science areas that I think will lead to some fantastic new insights. So that’s just a few examples. And again, these were chosen very carefully and selectively with our sponsors, and so every one is going to have a home and have a steward post-ECP.

Trader: And I understand from speaking with Justin Whitt, who is the project director for Frontier, that it’s nearing its acceptance phase. And so what does that mean for the timeline for the ECP applications and increasing the scale that they run on?

Kothe: It’s a good question. So we’ve negotiated a timeline, it’s been fairly conservative, because fortunately, the OLCF [Oak Ridge Leadership Computing Facility] leadership team has been through many acceptances. And they know, there’ll be fits and starts. The point is probably about four to six weeks from now, some of our most ready apps will get on. Whether or not acceptance will be done, it’s hard to tell, they have an aggressive schedule. But basically, if we get on later than that we have plenty of headroom in our own schedule. So as Justin probably talked to you, the acceptance is very rigorous. There’s functionality, do basic things that we need work? There’s performance, are we getting the performance out of the system? Certainly all indications are based on the HPL [High Performance Linpack] run that we are. And then there’s stability and stability is the one that’s most challenging. Essentially, surrogate workloads that mimic actual production workloads are run for weeks on the system. And there’s very specific metrics in terms of the percent of jobs that have to complete and the percent of those jobs that get the right answer, etc. So acceptance is pretty onerous. And so we feel confident that after that period, the machine will be fairly well shaken out for us to get on.
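
[Editor's sketch: the two stability metrics Kothe mentions, the percent of jobs that complete and the percent of those that get the right answer, could be tallied roughly as below. The JobResult record, the function names, and the threshold values are hypothetical placeholders for illustration; the interview does not give the actual OLCF acceptance criteria.]

```python
from dataclasses import dataclass


@dataclass
class JobResult:
    completed: bool  # the job ran to completion
    correct: bool    # the output matched a reference answer (only meaningful if completed)


def stability_metrics(jobs):
    """Return (percent of jobs completed, percent of completed jobs that were correct)."""
    total = len(jobs)
    done = [j for j in jobs if j.completed]
    pct_completed = 100.0 * len(done) / total if total else 0.0
    pct_correct = 100.0 * sum(j.correct for j in done) / len(done) if done else 0.0
    return pct_completed, pct_correct


def passes(jobs, min_completed=95.0, min_correct=99.0):
    # Placeholder thresholds, NOT the real acceptance numbers.
    pct_completed, pct_correct = stability_metrics(jobs)
    return pct_completed >= min_completed and pct_correct >= min_correct
```

[Note that the second percentage is taken over completed jobs only, matching the phrasing "the percent of those jobs that get the right answer."]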

Trader: And Oak Ridge is hosting the HPC User Forum this week, and you gave a presentation yesterday. And one of the things that stood out to me was that you said you overestimated the time that it would take to achieve readiness. You want to talk a little bit more about that?

Kothe: Yeah, it’s interesting you picked up on that.

Trader: You don’t hear that very often.

Kothe: Well, you know, scientists, I think, tend to be more pessimistic about, you know, “gosh, I need to hypothesize, test, hypothesize, test, things are going to change, I don’t know the future, there’s lots of risk.” Certainly in software, that’s the case. But I think what we observed – and we want to write a retrospective on this – is if you have the right team together, and in the case of applications, it’s kind of five to 10 people. But not all physicists, not all engineers like myself; you need to have mathematicians, computer scientists, computational scientists, performance engineers. When you have this eclectic mix of people, everybody brings a diverse point of view and a different set of experiences. And the lesson learned there is the teams were smaller than we thought they needed to be, and I think took less time than we thought they needed. Now, we haven’t crossed the finish line yet. But it’s all about, not surprisingly, getting the right people. And so we were lucky because ECP has attracted the best and brightest. And we have great teams and teams that have been together for, you know, five plus years and learn from one another. When DOE set up this large project, yes, it’s complicated to manage, but we brought together teams of people that maybe knew of each other, maybe not, but to watch this cross-fertilization of lessons learned and best practices. And we have a lot of, you know, quarterbacks, A students, Michael Jordans, whatever you want to call them. There’s a lot of one-upmanship that goes on and that – people feed off each other. And so there’s kind of some nice competition going on within the project. So sort of all those things, they’re hard, if not impossible to measure, but that was in my head when I made the comment that in retrospect, you know, these teams pulled off more than we thought.

Trader: And next steps for ECP, I understand there are certain KPPs – key performance parameters – that need to be achieved before the project can conclude.

Kothe: Right. So for a formal project in the Department of Energy, we generally have to sign up to a small number of three to five quantitative metrics that constitute formal success. From a sort of our own staff point of view, we want to set that bar reasonably high, but we want to go beyond it. So our threshold KPP metrics have to do with demonstrating our applications are simulating real important challenge problems. Okay. A challenge problem is a problem of strategic interest to the various program offices that we’re building the apps for. And about half of the apps have to show they can do 50x performance relative to 2016 – so most of the apps benchmarked on Titan or Mira or Theta at Argonne. And so they had to sign up, you know, five-and-a-half, six years ago for 50x performance. Now, that doesn’t mean just getting an answer quicker. It also means, in probably every case I can think of, getting a better answer quicker, meaning an answer that has more physics, that is higher confidence, more predictable. So the 50x is for 11 of the 24 apps. And then the other 13 have signed up to demonstrate capabilities. So we’re on the hook for 24 apps, and our minimum performance is half. And I think we can do probably 70 – 80 percent. At least that’s our target goal.
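
[Editor's sketch: the arithmetic behind the 50x target can be illustrated as follows. The idea of a single "figure of merit" number per application (higher is better), the function names, and the sample values are assumptions made for illustration; the interview does not say how each team actually measures its speedup.]

```python
def speedup(baseline_fom, exascale_fom):
    """Ratio of an app's figure of merit on the new system to its 2016 baseline."""
    if baseline_fom <= 0:
        raise ValueError("baseline figure of merit must be positive")
    return exascale_fom / baseline_fom


def fraction_meeting_target(results, target=50.0):
    """results maps a (hypothetical) app name to (baseline_fom, exascale_fom)."""
    if not results:
        return 0.0
    met = sum(1 for base, new in results.values() if speedup(base, new) >= target)
    return met / len(results)


# Made-up numbers: app_a improves 60x, app_b only 40x.
example = {"app_a": (1.0, 60.0), "app_b": (2.0, 80.0)}
print(fraction_meeting_target(example))
```

[Under Kothe's description, a fraction of one half would be the minimum and 0.7 to 0.8 the stretch goal.]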

On the software side, the way we’ve incentivized integration and portability is, if I’m building a software product, somebody has to care about it, somebody has to use it, and it has to be on an exascale system. So somebody is typically an application. It could be another software product, it could be the facility wants it there. So our software products have signed up for generally four to eight capability integrations. So if I’m building like a linear solver library, I have to demonstrate that let me say two apps are using my library in a critical way on, say, Aurora and Frontier, because you get a kind of a point for each, or, four apps are using my capability, say, on Frontier. And you know, the point is you get a higher score if you show what you can do on both systems. So it sounds complicated, it took us a couple of years to figure out kind of a scoring metric for software integration. And here, I would just say if your stuff is used and useful, then you ought to be fine on the score. So those are the three we have to hit. And we want to hit these by less than a year from now, ideally, much less than a year.
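
[Editor's sketch: one reading of the "point for each" scoring Kothe describes is that a software product earns one point per distinct (client, system) pair in which it is used. The function names and the app labels below are hypothetical, and the real ECP scoring metric is likely more detailed than this.]

```python
def integration_score(integrations):
    """Count distinct (client, system) pairs; duplicates are only counted once."""
    return len(set(integrations))


def systems_covered(integrations):
    """Return the set of systems on which the product has at least one integration."""
    return {system for _client, system in integrations}


# Two apps using the library on both systems...
two_apps_both = [
    ("app_a", "Frontier"), ("app_a", "Aurora"),
    ("app_b", "Frontier"), ("app_b", "Aurora"),
]
# ...or four apps using it on Frontier alone.
four_apps_one = [(name, "Frontier") for name in ("app_a", "app_b", "app_c", "app_d")]

print(integration_score(two_apps_both), integration_score(four_apps_one))
```

[Both of Kothe's examples reach four integrations under this reading; the first also demonstrates portability across both systems.]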

Trader: And benchmarking these applications on Frontier and Aurora – Aurora being the Argonne National Laboratory system. So that’s the other system that is part of your goal to benchmark applications on to hit your KPPs.

Kothe: That is correct. Now, the way that KPPs are defined is we can ideally hit all of our KPPs on one system. But we really want to do far better than that, and achieve them on both. Because for the better of science, post-ECP – these are science and engineering tools – for I think decades, we want to show that this ecosystem is robust and portable, and able to deliver great answers on any number of types of hardware. So we definitely want to get on Aurora and do the same thing.

Trader: And what are the steps to go beyond ECP? What are the plans in place as you wind down ECP to prepare for future milestones?

Kothe: So the applications are going to be mature enough to be used in science campaigns by the program offices, and I’ll call science campaign as I’m using the application to discover new science to design new things, but also to further validate the code. The point is validation is comparison against experimental data. We’ve done some of that in ECP. But the program offices – and again, it’s not our decision, but we’ve been engaged with the program office stakeholders that, you know, view a given application as being in their mission space. We’ve been talking to them for the last five years. And also negotiating with ASCR, the Advanced Scientific Computing Research office, on making sure the software stack is sustainable. And like I mentioned at the HPC User Forum, we’ve been releasing our software stack now for three and a half years, every quarter. It’s called E4S – Extreme-scale Scientific Software Stack – and we’ve got really a nice cadence and a nice process for essentially handing off this software stack to DOE, and a lot of us will still be working on it and evolving it post-ECP. So I’m quite confident it will be sustained, not just released, documented and available, but further evolved. We see in the five to 10 years to come, the software stack will evolve to capture edge technologies, likely quantum capabilities – by that meaning, you know, elements of the software stack to support quantum – and, of course, more AI and machine learning. So the stack will continue to evolve as we kind of move through these next two or three tipping points.

Trader: And then high level, how do these exascale systems support the mission of the DOE and the NNSA? And why are exascale systems important?

Kothe: Very good question. So, as you probably remember, there are a number of exascale requirements workshops that were held, pointing to very specific program offices. In the case of ECP, there are on the order of 10 program offices that we’re building applications for. So we sat down with the program offices and as a part of these workshops and in private discussions, and talked about problems of strategic interest for their office that they currently cannot address or solve today that are amenable to computing, at least for part of the solution, and need exascale. So every one of our applications is targeting a very specific problem that’s really unachievable and unattainable without exascale resources. And again, you know, you need lots of memory and big compute, to go after these big problems. So without exascale, some of these, a lot of these problems would take months or years to address on a petascale system, let’s say, or they’re just not even attainable. So the exascale drivers are there. And you know, we sat down with the program offices at the very beginning and laid out plans for each app. And part of their KPP is to show they can do that problem. And that they can do that problem by fully exploiting all the breadth and depth of Frontier or Aurora. So they have very specific metrics about full system runs, doing all the science and incorporating all the science and the physics needed for a given problem with very specific outcome metrics for each problem.

Trader: Computational milestones like exascale are exciting and inspirational. What do you think when you look ahead to future milestones?

Kothe: Well, I’d like to think that I will probably be retired at that time. But these tools and technologies we’ve developed will lead to groundbreaking discoveries, Nobel Prizes, new concepts and designs for the landscape of DOE, from energy production to power grid to materials and chemistry for energy. I mean, we’re in the middle of some real challenges right now. And you know, I really need to mention the national security aspect, too. So, you know, it’s not about just exascale. In my mind as an application person, it’s about, we need to as a nation [need to], and we are, leading applications that deliver solutions to policymakers [and] to decision-makers to make consequential decisions. So I do anticipate the results of the simulation insights provided will greatly sort of de-risk decisions and give us high confidence in making decisions that we can bank on – that’s really an important role for simulation.

Trader: Great. Well, let’s leave it at that. Thanks. It’s been great talking with you.

© 2022 HPCwire. All Rights Reserved. A Tabor Communications Publication
