How to program large-scale heterogeneous parallel computers
According to the European Research Council, its ERC grants “aim to support up-and-coming research leaders who are about to establish a proper research team and to start conducting independent research in Europe”. Prof. Torsten Hoefler has been awarded one of these prestigious grants. His group addresses a fundamental and increasingly important challenge in computer science: how to efficiently program heterogeneous supercomputers with millions of processors of different architectures.
Minh Tran: A new list of the world’s fastest supercomputers that was published in June showed that China maintained its No. 1 ranking for the seventh consecutive time. Prof. Hoefler, why should we take the competitive threat seriously?
Torsten Hoefler: I wouldn’t say it’s a competitive threat at all. The Chinese system was developed by fellow researchers who share the same goal: to advance the forefront of computing technology so that we can solve larger and larger problems. In fact, they are very open in this process and invited an international group (which I was part of) to the supercomputing centers in Wuxi and Guangzhou to introduce and discuss their new technologies. Both systems are highly impressive: number 1, the Sunway TaihuLight machine with 125 petaflops, uses fully Chinese-made processors, and number 2, the Tianhe-2 machine with 50 petaflops, uses a Chinese-made network. We also learned that a new government program has been launched to drive next-generation supercomputing in China, with planned investments beyond CHF 1 billion. This just emphasizes that high-performance computing, while often not visible to the average person, drives the high end of technological and scientific development at national levels. We are very fortunate that the Swiss government as well as the ETH administration recognize this and enable us to operate one of the fastest supercomputers in Europe, Piz Daint at CSCS in Lugano.
MT: An important task is therefore to create a European framework for the development of the next generation of supercomputers. Does your research proposal to the European Research Council pursue that mission?
TH: The major challenge in high-performance computing is the cost of data movement. While computation consistently decreases in cost due to improvements in microprocessor manufacturing, the cost of data movement is not falling as fast. These trends have gone so far that moving the operands to the computation unit is already more expensive today than the computation itself. This has a major impact on how we should reason about computation. For example, in traditional theoretical computation models, we count the number of instructions needed to solve a certain problem. In fact, most of the computational complexity theory that we teach today is based on these principles. Yet instruction costs are negligible in today’s systems, which leads to a growing gap between theoretical performance considerations and practical systems. Furthermore, programming languages focus on operations and not on data movement, making it hard for programmers to optimize data movement. The ERC project Data-Centric Parallel Programming (DAPP) squarely tackles these problems by advocating a data-centric view of algorithms and by developing a new data-centric programming model. In this model, we aim to express data movement and dependencies as a primary concern when programming, and we allow the runtime system to perform automatic optimizations.
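To make the contrast between counting instructions and counting data movement concrete, here is a minimal sketch for a dot product of two vectors. It is an illustration only and not part of the DAPP model. The names `KernelCost`, `dot_product_cost`, and `energy_pj` are hypothetical, and the per-operation and per-byte energy figures are placeholder assumptions, not measurements of any real machine.

```python
from dataclasses import dataclass


@dataclass
class KernelCost:
    """Operation count and memory traffic of one kernel invocation."""
    flops: int        # floating-point operations executed
    bytes_moved: int  # bytes read from memory


def dot_product_cost(n: int, word_bytes: int = 8) -> KernelCost:
    """Cost of a dot product of two length-n vectors (n >= 1).

    n multiplications plus n - 1 additions; every operand is read once.
    """
    return KernelCost(flops=2 * n - 1, bytes_moved=2 * n * word_bytes)


def energy_pj(cost: KernelCost, pj_per_flop: float, pj_per_byte: float):
    """Return (compute energy, data-movement energy) in picojoules."""
    compute = cost.flops * pj_per_flop
    movement = cost.bytes_moved * pj_per_byte
    return compute, movement


if __name__ == "__main__":
    cost = dot_product_cost(1_000_000)
    # ASSUMPTION: placeholder energy figures chosen only for illustration.
    compute, movement = energy_pj(cost, pj_per_flop=10.0, pj_per_byte=100.0)
    print(f"flops={cost.flops}, bytes moved={cost.bytes_moved}")
    print(f"compute energy:  {compute:.3e} pJ")
    print(f"movement energy: {movement:.3e} pJ")
```

A classical analysis would report only the operation count, which grows linearly in n. Under the assumed figures, the movement term dominates the total, which is the gap between instruction-counting models and practical systems described above.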
MT: High-performance computing, from PCs to supercomputers, is in a confused state. Which architecture, how much parallelism, which software, and when to innovate are all commonly heard questions. Does your research on data-centric parallel programming respond to these issues?
TH: Yes. Typically, programming models and environments in high-performance computing follow architecture developments, yet the importance of data movement is not reflected in today’s languages. Our design ideas are based on today’s heterogeneous architectures, such as GPUs and Xeon Phi KNL, and on anticipated future architectures that will likely be reconfigurable and memory-centric. This work is especially interesting because HPC is often at the forefront of computer architecture, as it is used to solve the world’s most daunting simulation problems. Many people see HPC as the Formula 1 of computing, where experimental features are tested under somewhat extreme conditions. Some of those features are later adopted in general-purpose computing (e.g., vector units, general-purpose programming for GPUs, and manycore processors). This is what makes HPC ideal for a research group like SPCL, where we look at future scalable technologies such as massive parallelism but also practical quantum and FPGA computing.
MT: Supercomputers are the foundation for the continuous advancement of computer simulation, now a key element in the scientific and industrial competitiveness of knowledge-based economies in the 21st century. In which industrial fields do you expect simulations to drive significant progress in the future?
TH: One of the biggest potential breakthroughs of this century may be the wide adoption and success of machine learning and artificial intelligence. Yet one may wonder why this 40-year-old field has suddenly gained so much popularity in such a short time. The two main drivers of new data-driven modeling techniques, such as deep neural networks, are the availability of vast amounts of data from our increasingly networked environment and the computational capability to process that data. Much of that computational capability is delivered by general-purpose GPUs, a concept that was first widely adopted in HPC. Several learning approaches, for example at Baidu, use the Message Passing Interface (MPI) standard to communicate between distributed instances. Whenever performance is required to process vast amounts of data, HPC techniques are used.
MT: One of the design challenges of the fastest computers is their enormous power consumption. How does your novel data-centric programming approach achieve both the highest performance and energy efficiency at all scales, up to supercomputers?
TH: If you operate one of the largest computers on the planet, you’ll also have a large energy bill. Some of today’s machines use up to 15 megawatts, which, depending on the location of the machine, costs between CHF 1 million and CHF 2 million per month. The power consumption is so massive that some operators of large machines in the US are required to notify the electricity provider before they execute certain applications; a segmentation fault in one of these applications, and the accompanying surge in power consumption, can lead to large power outages. In the long run, the energy costs over the lifetime of a typical server already exceed its purchase cost. Even worse, future machines will likely consume more energy, and their deployment will mostly be limited by their power draw. All of that requires us to consider energy consumption as a primary optimization goal. Most often, energy consumption is minimized by improving performance so that an application completes faster. However, many of the interesting research challenges lie in the space where slowing an application down in a controlled manner (e.g., using dynamic voltage and frequency scaling) reduces its energy consumption. Large parts of the aforementioned DAPP project aim to optimize for energy.
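The arithmetic behind these figures can be sketched in a few lines. The following is a toy model under stated assumptions, not a description of any real machine or of the DAPP project. The electricity price, the 30-day month, and the power model (a constant static part plus a dynamic part that grows with the cube of the relative clock frequency, with runtime inversely proportional to frequency) are all assumptions, and the function names are hypothetical.

```python
def monthly_energy_cost_chf(power_mw: float,
                            price_chf_per_kwh: float,
                            hours: float = 24 * 30) -> float:
    """Electricity cost of running at constant power for one month.

    ASSUMPTION: a 30-day month and a flat price per kWh.
    """
    energy_kwh = power_mw * 1000.0 * hours
    return energy_kwh * price_chf_per_kwh


def energy_joules(f_rel: float,
                  static_w: float,
                  dynamic_w_nominal: float,
                  runtime_s_nominal: float) -> float:
    """Energy to finish a compute-bound job at relative frequency f_rel.

    ASSUMPTION (toy model): dynamic power scales with f_rel**3,
    static power is constant, and runtime scales with 1 / f_rel.
    """
    power_w = static_w + dynamic_w_nominal * f_rel ** 3
    runtime_s = runtime_s_nominal / f_rel
    return power_w * runtime_s


if __name__ == "__main__":
    # ASSUMPTION: CHF 0.10 per kWh is an illustrative price only.
    cost = monthly_energy_cost_chf(power_mw=15.0, price_chf_per_kwh=0.10)
    print(f"15 MW for one month: about CHF {cost:,.0f}")

    # Sweep the relative clock frequency for a hypothetical node.
    for f_rel in (1.0, 0.8, 0.55, 0.4, 0.2):
        e = energy_joules(f_rel, static_w=50.0,
                          dynamic_w_nominal=150.0,
                          runtime_s_nominal=100.0)
        print(f"f_rel={f_rel:.2f}: {e:,.0f} J")
```

In this model, lowering the frequency first reduces the total energy because the dynamic power falls faster than the runtime grows. Below a certain point the energy rises again, because the static power is paid for a longer time. Controlled slowdown therefore only helps within a range, and memory-bound applications would behave differently from the compute-bound case assumed here.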