Hello everyone, I’m Haruyuki Tago, Edge Evangelist at HACARUS Tokyo R&D center.
In this series of articles, I will share some insights from my decades of experience in the semiconductor industry and I will comment on various AI industry related topics from my unique perspective.
Today, let’s discuss the secrets of Fugaku, the world’s best performing CPU and its unique design. I was once a member of the SoC design team for embedded microprocessors and game consoles, products that rely heavily on SoCs. If you compare SoC development to building a house, my role was similar to that of a foreman. I don’t have any experience with the development of supercomputers, but I was still impressed by Fugaku’s excellent architecture.
Fugaku: The World’s Best CPU
On June 11, 2020, Fugaku took first place in the World Supercomputer Rankings, winning four benchmarks that evaluate a variety of parameters: TOP500 (LINPACK), HPCG, HPL-AI, and Graph500. Fugaku outperformed the competition by a factor of at least 2.58 (Figure 1). Another exciting feat came in the HPL-AI benchmark, where it surpassed the 1 exaFLOPS barrier (one exaFLOPS is a quintillion floating-point operations per second).
One of the goals outlined for the project was to develop the world’s first exa-scale machine, a computer that has the ability to compute more than 1 exaFLOPS (quintillion FLoating Point operations per second). Reaching this goal would be meaningless unless the machine ranked first in a variety of benchmarks and greatly surpassed the performance of the pre-exa-scale machines (which are ranked second or lower in these rankings).
Mr. Matsuoka also outlined the following nine target applications for the supercomputer during development (Figure 2):
- Development of personalized preventative medicines through genetic analysis
- Disaster prevention for earthquakes and tsunamis
- Environmental change prediction using large amounts of data
- High-efficiency energy generation, conversion, and storage
- Clean energy systems
- High-performance function devices
- Creative design in the manufacturing industry
- Manufacturing processes
- Study for the fundamental laws and evolution of the universe
World Rankings for the TOP500
Figure 3 shows the performance of the top ten supercomputers in the TOP500 benchmark test as of June 2020. It should be noted that seven of the top ten systems used GPUs while the Fugaku did not. We will explore how the Fugaku was able to win, despite this difference, by comparing it with the second place “Summit.”
The Basic Structure of the Fugaku
Moving on, let’s take a look at the basic structure of the Fugaku itself. Figure 4 outlines the architecture of the Fugaku, starting from the A64FX, which sits within the CPU memory unit. This unit is then inserted into a shelf unit, which in turn is placed inside a rack housing.
Specification Comparison for Fugaku & Summit
As mentioned above, we will now compare the Fugaku to the second place “Summit.” As shown in Figure 5, the Fugaku outperformed the Summit by around 270%. It is also worth noting that the Fugaku has 152,064 nodes, 16.5 times as many as the Summit’s 9,216 nodes.
Comparing peak node performance, a Summit node reaches 21.7 TFLOPS while a Fugaku node reaches 3.072 TFLOPS. In other words, Summit is built from a small number of powerful nodes, whereas Fugaku is built from a much larger number of moderately powerful nodes. Fugaku and Summit follow different design concepts.
Taking a look at the hardware, the two systems are configured very differently. The Summit system is based on the IBM Power9 AC922 server, in which three NVIDIA Tesla V100 GPUs are connected to each Power9 chip as accelerators via NVLink-2. In this configuration, 10 major chips are mounted on the printed circuit board.
In contrast, Fugaku’s configuration is quite simple, with only two A64FX chips. High-speed HBM2 memory and the network interface I/O circuit TNI (TofuD Network Interface) are built into the A64FX package. As for the name TofuD, Tofu stands for “Torus Fusion” or “Torus connected Full connection”, while the D stands for “high-Density” and “Dynamic” packet slicing for “dual-rail” transfer, both improvements over previous generations.
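The contrast between the two designs can be checked with a few lines of arithmetic. The sketch below uses only the node counts and per-node peak values quoted above; the variable names are illustrative.

```python
# Figures quoted in the text; variable names are illustrative.
fugaku_nodes = 152_064
summit_nodes = 9_216
fugaku_node_peak_tflops = 3.072
summit_node_peak_tflops = 21.7

# Fugaku relies on many more nodes ...
node_count_ratio = fugaku_nodes / summit_nodes
# ... while each Summit node is several times more powerful.
node_peak_ratio = summit_node_peak_tflops / fugaku_node_peak_tflops

print(f"Fugaku has {node_count_ratio:.1f}x as many nodes as Summit")
print(f"A Summit node has {node_peak_ratio:.1f}x the peak of a Fugaku node")
```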
Putting the hardware aside, let’s take a look at the time required to transmit 256kB of data from one node to another as an example (Figure 7). Here, the preparation time means only the data processing time within the sending node, excluding the time the data spends travelling through the network.
In the Summit system, the Power9 chip reads the transmitted data in the GPU HBM through the NVLink-2 (50GB/s) on the printed circuit board. This data is temporarily stored in the DRAM, denoted by the red arrow on the right side of Figure 7. After being stored, the Power9 chip reads the transmitted data from the DRAM and sends it to the Mellanox NIC, via the PCIe Gen4 (16GB/s) on the circuit board (the blue arrow on the right of Figure 7), before being sent to the inter-node network.
Shifting our look to the Fugaku, the A64FX CPU chip reads the transmitted data from the HBM2 (256GB/s) within the same package (the red arrow on the left side of Figure 7). This data is then passed to the TNI, which is located within the same chip (the blue mark on the left side of Figure 7), before being sent to the inter-node network.
Now that we understand the data preparation process, we can draw a comparison between the two systems. From the literature, it is estimated that the Summit configuration performs this task in about 100μs, while the Fugaku configuration needs only single-digit μs.
Summit’s higher latency arises because the GPU, Power9 chips, DRAM, and NIC are all separate chips, so the transmitted data must travel over NVLink and PCIe. NVLink-2’s transfer speed can reach a peak of 50GB/s, but the time needed to establish a communication link must also be taken into consideration.
For example, in order to obtain an effective transfer bandwidth of 80% of the peak, it is necessary to send around 10MB or more at one time. To optimize the performance of Summit, programs must be written to avoid transferring small amounts of data whenever possible. How much this restricts the performance of the system therefore depends on the characteristics of the application, algorithm, and programming.
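As a rough sanity check on these estimates, the sketch below computes bandwidth-only transfer times for the 256kB message from the peak link speeds quoted above. It is a lower bound under stated assumptions, not a measurement: 1 kB is taken as 1000 bytes, every link is assumed to run at its peak rate, setup overheads are ignored, and the on-chip hop from the A64FX to the TNI is treated as free. The function name is hypothetical.

```python
def transfer_time_us(size_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Bandwidth-only transfer time in microseconds (no setup overhead)."""
    return size_bytes / (bandwidth_gb_per_s * 1e9) * 1e6

MESSAGE_BYTES = 256 * 1000  # 256 kB, assuming 1 kB = 1000 bytes

# Summit: GPU HBM -> DRAM over NVLink-2, then DRAM -> NIC over PCIe Gen4.
summit_us = (transfer_time_us(MESSAGE_BYTES, 50)
             + transfer_time_us(MESSAGE_BYTES, 16))

# Fugaku: one read from HBM2 inside the A64FX package.
fugaku_us = transfer_time_us(MESSAGE_BYTES, 256)

print(f"Summit, bandwidth only: {summit_us:.2f} us")
print(f"Fugaku, bandwidth only: {fugaku_us:.2f} us")
```

Under these assumptions the Summit path needs roughly 21μs of pure transfer time and the Fugaku path about 1μs. The Summit figure is well below the 100μs estimate, which suggests that fixed per-transfer overheads, rather than raw bandwidth, account for much of the preparation time.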
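The relationship between message size and effective bandwidth can be illustrated with a simple cost model. The sketch below assumes that one transfer takes a fixed setup time plus size divided by peak bandwidth, and that the 10MB and 80% figures above refer to NVLink-2 at 50GB/s. Both are assumptions made for illustration, and the function names are hypothetical.

```python
def setup_overhead_s(size_bytes, peak_bytes_per_s, efficiency):
    """Fixed overhead t0 implied by reaching `efficiency` at `size_bytes`,
    under the assumed model: time = t0 + size / peak."""
    ideal = size_bytes / peak_bytes_per_s
    return ideal * (1.0 / efficiency - 1.0)

def effective_fraction(size_bytes, peak_bytes_per_s, t0_s):
    """Fraction of peak bandwidth achieved by a single transfer."""
    ideal = size_bytes / peak_bytes_per_s
    return ideal / (t0_s + ideal)

PEAK_BYTES_PER_S = 50e9  # NVLink-2 peak quoted in the text

# Calibrate the model on the "80% of peak at about 10 MB" data point.
t0 = setup_overhead_s(10e6, PEAK_BYTES_PER_S, 0.80)
print(f"implied setup overhead: {t0 * 1e6:.0f} us")

for size in (256e3, 1e6, 10e6, 100e6):
    frac = effective_fraction(size, PEAK_BYTES_PER_S, t0)
    print(f"{size / 1e6:7.3f} MB -> {frac * 100:5.1f}% of peak")
```

With this calibration the model implies a fixed overhead of about 50μs, and a 256kB transfer reaches less than 10% of the peak bandwidth, which is why small transfers are so costly on this kind of link.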
The Graph500 Benchmark
Turning to another benchmark, the Graph500 is a good indicator of inter-node communication performance. It is an important performance test because complex real-world phenomena are often expressed using large-scale graphs (relationships between data represented by vertices and branches). One real-life example involves social networking services, where graph analysis is used to evaluate data related to interpersonal connections.
These types of applications require high-speed computer analysis, so advances in supercomputer technologies are imperative. For this test, the Fugaku succeeded in solving a breadth-first search problem for a large-scale graph consisting of about 1.1 trillion vertices and 17.6 trillion branches in an average time of 0.25 seconds.
Going further into the results from the benchmark test, we will once again look at Figure 1. While Fugaku’s lead over the second-place system ranged from 2.58 to 4.57 times in the other tests, the performance gap for the Graph500 rose to 9.26 times. As stated before, this difference is most likely linked to inter-node communication performance and latency.
Taking advantage of a low-latency design that directly integrates the A64FX chip, memory, and network interface, the Fugaku rose easily to the top. From a business perspective, too, it would be difficult to integrate IBM’s Power9 chip and Nvidia’s Tesla V100 GPU into one chip.
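To make the task concrete, here is a minimal single-machine sketch of breadth-first search on a toy graph, followed by the traversed-edges-per-second (TEPS) arithmetic for the figures above. It is not the Graph500 reference implementation, the toy graph is invented for illustration, and counting all 17.6 trillion branches as traversed is a simplifying assumption.

```python
from collections import deque

def bfs_levels(adjacency, root):
    """Return {vertex: distance from root} using breadth-first search."""
    levels = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in levels:          # first visit fixes the BFS level
                levels[w] = levels[v] + 1
                queue.append(w)
    return levels

def teps(traversed_edges, seconds):
    """Traversed edges per second."""
    return traversed_edges / seconds

# Toy undirected graph given as adjacency lists (illustrative only).
toy_graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
levels = bfs_levels(toy_graph, 0)
print(levels)

# Figures quoted in the text: 17.6 trillion branches in 0.25 seconds.
fugaku_teps = teps(17.6e12, 0.25)
print(f"{fugaku_teps / 1e9:,.0f} GTEPS")
```

On a real supercomputer the graph is distributed across nodes, so almost every step of the queue loop turns into small messages between nodes. That is why this benchmark stresses inter-node latency rather than floating-point throughput.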
A64FX Design and Implementation
Thus far, we have mentioned the A64FX chip several times; it plays a key role in the success of the Fugaku. Manufactured using TSMC 7nm semiconductor process technology, the A64FX houses approximately 9 billion transistors and contains 52 cores (48 computational cores and 4 assistant cores). When performing double-precision floating-point arithmetic, the 48 computational cores can reach a peak performance of 3.3792 TFLOPS (in boost mode). The chip also has 4 sets of HBM2 interfaces, as well as TofuD and PCIe interfaces for input and output (Figure 8). For multi-core processors of this kind, the way the cores, the caches, and the memory are connected plays a vital role in operating performance.
In the A64FX, the cores are organized into four CMGs (Core Memory Groups). Each CMG consists of 12 computational cores, an assistant core, a secondary cache, and a memory controller. Cache consistency is maintained among the four CMGs, and the system software can treat the CMGs as NUMA nodes. As evidence of the efficiency of this in-chip architecture, DGEMM, a double-precision floating-point matrix multiplication routine, can achieve an effective performance of more than 2.7TFLOPS (more than 90% of peak performance).
The phenomenal performance of the A64FX wouldn’t be possible without the use of 2.5D packaging technology to integrate the CPU chip and 3D stacked memory into a single package, depicted on the left of Figure 9. On the right is a cross sectional view of the 2.5D package for the A64FX, where the CPU chip and HBM2 memory are placed close to each other and connected with fine, high-density wiring.
Figure 10 shows a photo of the A64FX chip. To minimize the wiring to the HBM2, the interfaces are placed on the right and left sides of the chip, and the core group and other components are carefully arranged in the center.
The A64FX Instruction Set of Fugaku
To conclude our look into the A64FX chip, let’s analyze its instruction set, Arm’s v8.2-A SVE (Scalable Vector Extension), which extends NEON’s 128-bit registers (Figure 11, left). By collaborating with Arm, Fujitsu contributed to the development of SVE so that it can execute HPC (High Performance Computing) applications, including scientific and technical computing and high-speed AI.
The SVE specification allows scalable implementations in 128-bit increments up to 2048 bits, and the A64FX implements a 512-bit width. SVE also allows the creation of binaries that run independently of the SIMD bit width of the hardware. These vector-length-independent binaries can be executed on SVE machines with different SIMD bit widths without recompilation.
Figure 11 shows a loop of 100 iterations compiled into a vector-length-independent binary. At run time, the binary obtains from the SIMD hardware the number of elements (vector length) that one instruction operates on, and the number of loop iterations adjusts accordingly. In the case shown in Figure 11, the code loops 25 times on a machine with a vector length of 4, and 13 times with a vector length of 8.
If the original loop count is not an integer multiple of the vector length, the leftover elements are masked by the predicate operation. Consequently, even if a hypothetical post-A64FX processor extended the SVE implementation from 512 bits to 1024 bits, the binary would not need to be changed.
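The behaviour described above can be imitated in plain Python. The sketch below is only a simulation of the idea, not SVE code: the function name is hypothetical, the vector length is passed as a parameter instead of being read from hardware, and the predicate is modelled as a list of booleans that switches off lanes beyond the end of the array.

```python
def vla_add(a, b, vector_length):
    """Element-wise add processed `vector_length` lanes at a time.

    Returns the result and the number of loop iterations executed.
    """
    n = len(a)
    out = [0] * n
    iterations = 0
    i = 0
    while i < n:
        # Predicate: lane k is active only if element i + k exists.
        predicate = [i + k < n for k in range(vector_length)]
        for k, active in enumerate(predicate):
            if active:
                out[i + k] = a[i + k] + b[i + k]
        i += vector_length   # advance by however many lanes the machine has
        iterations += 1
    return out, iterations

a = list(range(100))
b = list(range(100))
for vl in (4, 8):
    result, iterations = vla_add(a, b, vl)
    print(f"vector length {vl}: {iterations} iterations")
```

The same function gives identical results for both vector lengths. With a vector length of 8, the final iteration has only 4 active lanes, which corresponds to the masking of the leftover elements.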
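For the peak performance of the A64FX, consider double-precision floating-point data (64 bit), which is most often used in scientific calculations. A 512-bit SIMD register holds 8 such elements, so the peak per chip works out as follows: 48 cores × 2 SIMD pipelines × 8 elements × 2 operations (fused multiply-add) × 2.0GHz = 3.072 TFLOPS.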
Looking at the system as a whole, multiplying the per-node peak by the 158,976 nodes within the system gives a peak performance of 488 PFLOPS for the Fugaku. Finally, with the half-precision floating-point (16 bit) data used in AI, up to 32 elements can be operated on at the same time. This leads to an overall peak performance value that is four times higher than the value shown above.
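The peak figures in this article can be reproduced with a short calculation. The sketch below rests on assumptions that go beyond the text: two 512-bit fused multiply-add pipelines per core, a 2.0GHz clock in normal mode, and 2.2GHz in boost mode. These values were chosen because they reproduce the 3.072 TFLOPS and 3.3792 TFLOPS per-chip figures quoted earlier. The function and constant names are illustrative.

```python
CORES = 48                # computational cores per A64FX
SIMD_BITS = 512           # SVE width implemented by the A64FX
FMA_PIPES_PER_CORE = 2    # assumption: two SIMD FMA pipelines per core
FLOPS_PER_FMA = 2         # a fused multiply-add counts as two operations

def peak_tflops(clock_ghz: float, element_bits: int) -> float:
    """Peak chip performance in TFLOPS for a given element width."""
    lanes = SIMD_BITS // element_bits
    flops_per_cycle = CORES * FMA_PIPES_PER_CORE * lanes * FLOPS_PER_FMA
    return flops_per_cycle * clock_ghz / 1000.0

fp64_normal = peak_tflops(2.0, 64)   # assumed normal-mode clock
fp64_boost = peak_tflops(2.2, 64)    # assumed boost-mode clock
fp16_normal = peak_tflops(2.0, 16)

print(f"FP64, 2.0 GHz: {fp64_normal:.4f} TFLOPS")
print(f"FP64, 2.2 GHz: {fp64_boost:.4f} TFLOPS")
print(f"FP16, 2.0 GHz: {fp16_normal:.4f} TFLOPS")

# System level: four times the 488 PFLOPS double-precision peak.
system_fp16_pflops = 488 * fp16_normal / fp64_normal
print(f"System FP16 peak: about {system_fp16_pflops:.0f} PFLOPS")
```

Under these assumptions the half-precision system peak comes to roughly 1.95 exaFLOPS, which is consistent with Fugaku passing the exaFLOPS mark in the HPL-AI benchmark.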
Power Consumption Performance Comparison Using the Intel Xeon Server
Finally, let’s conclude the article by comparing an Intel Xeon server with a server running a single A64FX chip in terms of performance per unit of power consumed. As mentioned previously, the A64FX chip has 48 cores, while the Xeon server uses two chips with 24 cores per chip. In Figure 12, the light blue bars show the two-chip Xeon server while the orange bars represent the A64FX, clocked at 2.2GHz.
Depending on the application program, shown on the y-axis, Fujitsu’s A64FX chip outperformed the Xeon server by 2.3-3.5 times in terms of application performance per unit of power consumed. Hopefully, Fujitsu will broaden its range of applications, generating more sales and lower manufacturing costs that can drive next-generation development.
The main factors leading to Fugaku’s high performance are:
- The “application-first” approach
- A balance between the node computing power and the number of nodes
- Ingenuity incorporated to improve performance, based on analysis of actual applications on the previous-generation “Kei” supercomputer
- A new, consistent design made possible by Fujitsu’s one-stop development, such as integrating the inter-node communication interface into the A64FX, which greatly reduces communication latency
- Adoption of state-of-the-art TSMC 7nm semiconductor technology, where concurrent development of the system, chip, and process technology is the key
Finally, I would like to leave you with a few of my personal thoughts about the Fugaku supercomputer. I guess that it was a challenge to design the chip and the systems concurrently with the semiconductor process evolution. From the results, it is clear that RIKEN and Fujitsu made a great team and their combined efforts paid off. They probably faced many challenges along the way, but in the end, they were able to achieve four top rankings.
In 2018, Intel introduced AVX-512, implemented in Intel’s Xeon Skylake. In a simple comparison of register widths, it is now 512 bits, matching the A64FX. By taking advantage of the strengths of its original integrated development, hopefully the Fugaku can maintain its place at the top in the next generation of supercomputers.
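Wikipedia TOP500
蹉跌と反省の末に 富岳，世界4冠の軌跡 (After setbacks and reflection: Fugaku’s path to four world titles)
White paper FUJITSU Supercomputer PRIMEHPC FX1000 AI・エクサスケール時代を切り拓く HPC システム (An HPC system that opens up the AI and exascale era)
第55回 Top500の1位は理研の富岳スパコン，Green500はPFNのMN-3が獲得 (55th TOP500: RIKEN’s Fugaku supercomputer takes first place, PFN’s MN-3 wins the Green500)
IBM Power System AC922
WikiChip Summit (OLCF-4) – Supercomputers
Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect
高性能・高密度実装・低消費電力を実現するスーパーコンピュータ「富岳」のCPU A64FX (A64FX, the CPU of the supercomputer Fugaku, achieving high performance, high-density packaging, and low power consumption)
Wikipedia ストリーミングSIMD拡張命令 (Streaming SIMD Extensions)
日本の次世代スパコン「富岳」プロトタイプ機がGreen500の1位を獲得 – SC19 (Prototype of Japan’s next-generation supercomputer Fugaku takes first place in the Green500 – SC19)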