Challenge
The Texas Advanced Computing Center (TACC) employs some of the world’s most powerful advanced computing technologies and innovative software solutions to enable researchers to answer complex questions. Every day, researchers turn to TACC resources to help them make discoveries that can change the world. TACC supports researchers working on some of the largest problems in science today, from aerospace, engineering, weather and climate, materials, and cosmology to medicine, biology, public health, and remotely sensed environmental impacts, to name a few.
“TACC exists to provide large-scale computing support to open science,” said Dan Stanzione, Executive Director of TACC. “As soon as you can scale beyond what you can do on your laptop, we call that an advanced computing problem, and that’s where we get involved.” TACC not only provides computational resources for researchers within The University of Texas System but also supports NSF research nationwide as part of the Extreme Science and Engineering Discovery Environment (XSEDE). The original Stampede, an Intel-based cluster, was installed in 2013 to address many of those needs.
“Speed and capacity are a huge concern,” added Stanzione, “for both time to solution and how much science we can support. With the original Stampede, we were receiving five to six times the proposals we could support. We always need more capacity to solve more problems. There are more and more communities we could support.”
Solution
Serving the world of open science requires staying on top of technology with the latest and most advanced computational facilities. The Stampede series of systems at TACC began without an Intel fabric and evolved over the last four years to include the Dell* Networking H-Series* fabric based on Intel® Omni-Path Architecture (Intel® OPA). A new 500-node subsystem based on the Intel® Xeon Phi™ processor 7250, which served as a prototype for Stampede2, was installed in May 2016.
“Before the launch of Intel OPA, we worked with Dell and TACC to let them know what we were doing with high-performance fabrics,” stated Brian Dietrich, account executive with Intel. “As we were planning the future, TACC staff and faculty got the chance to see Intel OPA in a production environment on a small cluster at SC15, so they could see the fabric’s performance and management.” TACC later ordered Intel OPA for a new 508-node subsystem of Stampede that served as the prototype for Stampede2.
Stampede2 installation began in Q1 2017 with a first phase of 4,200 Dell EMC PowerEdge C6320p servers with bootable Intel Xeon Phi processors 7250 and Intel OPA. “Intel OPA is the interconnect to our storage system and everything else on the system,” commented Stanzione.
Stampede2 is a 5,936-node supercomputer with a peak performance of 18 petaflops; alongside the Intel Xeon Phi processor-based nodes, it includes 1,736 Dell EMC PowerEdge C6420 servers based on the Intel® Xeon® Scalable processor family. Stampede2 placed #12 on the November 2017 Top500 list of supercomputers with a peak performance of nearly 13 petaflops.
Results
“With the 500-node cluster, we could see how to get code to run efficiently on the Many Integrated Core architecture and high-bandwidth multi-channel DRAM (MCDRAM) memory of the Intel Xeon Phi processor,” commented Stanzione. “We’ve done a lot of good science on this system. It let us build experience for Stampede2.”
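Exploiting MCDRAM on the Intel Xeon Phi processor typically means placing bandwidth-critical arrays in the high-bandwidth memory when a node is booted in flat mode. The following is a minimal sketch in C using the open-source memkind library's hbwmalloc interface; the array and its size are illustrative, not TACC code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <hbwmalloc.h>          /* memkind's high-bandwidth memory API */

    #define N (1L << 24)            /* hypothetical working-set size */

    int main(void)
    {
        /* Prefer MCDRAM; fall back to ordinary DDR4 when it is absent. */
        int have_hbw = (hbw_check_available() == 0);
        double *a = have_hbw ? hbw_malloc(N * sizeof *a)
                             : malloc(N * sizeof *a);
        if (!a) return 1;

        for (long i = 0; i < N; i++)    /* bandwidth-bound sweep */
            a[i] = (double)i;

        printf("a[N-1] = %.1f (MCDRAM: %s)\n", a[N - 1], have_hbw ? "yes" : "no");

        if (have_hbw) hbw_free(a);      /* match the allocator that was used */
        else free(a);
        return 0;
    }

Built with a command along the lines of cc hbw.c -lmemkind. In flat mode an unmodified binary can instead be bound to MCDRAM wholesale, for example with numactl --membind pointed at the MCDRAM NUMA node; in cache mode no source changes are needed at all.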
Stampede2 is a very large heterogeneous supercomputer, supporting large-scale science and engineering applications such as NAMD, WRF, and FLASH. On the Stampede subsystem that served as the Stampede2 prototype (the two use the same processors), TACC measured single-node speedups ranging from 1.5x for FLASH to over 3x for WRF compared with nodes built on the Intel® Xeon® processor E5 family. “We have some extreme examples in some seismic and earth-related codes where we get 5x or 6x faster per node,” stated Stanzione.
But Stampede2 also runs less scalable workloads that are increasingly common in research today, including Python*- and MATLAB*-based simulations. “Most of our larger runs,” added Stanzione, “that use highly parallel MPI and OpenMP* hybrid codes are going to run very well on the Intel Xeon Phi processor nodes. Our ability to deliver lower-cost cycles, in both power and capital outlay, will work in their favor. For the newer, less tuned, less scalable codes, the Intel Xeon processor Scalable family is going to run circles around them, both because of the power management and the higher clock rate.”
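To illustrate the hybrid style Stanzione describes, here is a minimal MPI+OpenMP sketch in C; the program and its workload are illustrative, not one of TACC's production codes:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* FUNNELED: only the main thread of each rank makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double local = 0.0;
        /* OpenMP threads split the loop inside each MPI rank. */
        #pragma omp parallel for reduction(+:local)
        for (long i = 0; i < 1000000; i++)
            local += 1.0 / (double)(i + 1);     /* stand-in for real work */

        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("%d ranks, combined sum = %f\n", nranks, global);

        MPI_Finalize();
        return 0;
    }

Compiled with something like mpicc -fopenmp hybrid.c -o hybrid and launched with a few MPI ranks per node and OMP_NUM_THREADS threads each (TACC systems use the ibrun launcher), this few-ranks, many-threads decomposition is what maps well onto the 68 cores of the Intel Xeon Phi processor 7250.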