Hello, I’m Haruyuki Tago, Edge Evangelist at HACARUS Tokyo R&D Center.
In this series of articles, I will share some insights from my decades-long experience in the semiconductor industry and comment on various AI industry topics from my own perspective.
You can read Vol. 1 and Vol. 2 of the series by clicking the links.
Today, I would like to talk about some big news: the acquisition of Xilinx by semiconductor maker AMD. I will give my thoughts on the meaning and significance of the acquisition from the viewpoint of an experienced engineer.
AMD is currently headed by CEO Ms. Lisa Su and CTO Mr. Mark Papermaster. I was a member of the Cell microprocessor co-development team from 2001 to 2006, and was lucky enough to get acquainted with both Ms. Su and Mr. Papermaster (working for IBM at that time). I vividly remember that I was very impressed with their excellent leadership and strategy.
GPUs for rapidly expanding data centers
“In 2021, the share of GPUs in the data center processor market will grow to a scale close to that of CPUs,” said Lisa Su, president and CEO of Advanced Micro Devices (AMD), at a press briefing held in San Francisco on November 6, 2018 (US time).
According to Su, the market for data center processors is expected to grow from $20 billion at the time of the briefing to $29 billion in 2021. In particular, GPUs are expected to grow at an annual rate of several tens of percent and to account for about 40% of the total market. This is driven by the rise of large-scale simulations and deep learning, both of which are expected to keep growing.
General-purpose CPU performance improvements slow down
Now, let’s take a moment to look at CPUs. CPUs with the so-called x86 architecture (referred to as “General Purpose CPUs” in this article) are widely used from notebook PCs (Core-ix, etc.) to data center servers (Xeon, EPYC, etc.).
Figure 1 below shows the history of CPU performance improvement over a period of about 40 years (1980 to 2020). Thanks to various technological innovations, performance in 2020 reached tens of thousands of times the 1980 level, which is a tremendous improvement. However, the growth rate in recent years has stayed at about 3% per year.
The rise of Domain Specific Architecture (DSA)
We just saw that CPU performance improvements have been slowing down in recent years. On the other hand, the demand for computing power is ever-increasing.
Figure 2 shows an example of the computing power required for deep learning, a field of AI. Performance requirements are increasing about tenfold every year, and the performance improvements of general-purpose CPUs are not keeping up at all. Operations that require enormous amounts of calculation are not limited to the field of AI. For example, large-scale simulations (infection simulation, climate change prediction, etc.) and drug discovery (molecular structure calculation) typically require processing a huge amount of data.
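To make the gap concrete, here is a minimal Python sketch that compounds the two growth rates mentioned in this article: about 3% per year for general-purpose CPUs and about 10 times per year for deep learning demand. The function name and the five-year horizon are my own illustrative assumptions, not figures from the cited sources.

```python
def compound(yearly_multiplier: float, years: int) -> float:
    """Return the growth factor after `years` at a fixed yearly multiplier."""
    return yearly_multiplier ** years


CPU_YEARLY = 1.03      # about 3% per year (recent CPU trend quoted in the text)
DEMAND_YEARLY = 10.0   # about 10x per year (deep learning demand quoted in the text)

# The 5-year horizon is an arbitrary choice for illustration.
for years in range(1, 6):
    cpu = compound(CPU_YEARLY, years)
    demand = compound(DEMAND_YEARLY, years)
    print(f"after {years} yr: CPU x{cpu:.2f}, "
          f"demand x{demand:,.0f}, gap x{demand / cpu:,.0f}")
```

Under these assumed rates the CPU factor stays close to 1 while demand grows by orders of magnitude, which is the gap that accelerators are meant to close.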
This is why the idea emerged of combining a general-purpose CPU with an accelerator, so that the bulk of the calculations can be performed by (offloaded to) the accelerator. An accelerator is a combination of hardware and software mechanisms.
The general-purpose CPU core + accelerator configuration has been around for a long time, but in recent years it has come to be called “Domain Specific Architecture (DSA)” in the computer architecture field. Figure 3 shows my understanding of the relationship between system flexibility and processing performance in specific applications.
The horizontal axis represents the flexibility of the system. The further right an item is, the wider its range of applications, that is, the more flexible the system. The further left it is, the less flexible the system. The vertical axis represents the processing performance for a specific application.
Now, let’s plot “General-purpose CPU”, “GPU + General-purpose CPU”, “Accelerator ASIC + General-purpose CPU”, and “Accelerator FPGA + General-purpose CPU” on this figure. “GPU + general-purpose CPU” is a domain-specific architecture for 3D graphics, and “accelerator ASIC + general-purpose CPU” is a domain-specific architecture for deep learning. I will explain more on that later in the article.
Let’s look at the features of each. A general-purpose CPU can run spreadsheets, web browsers, and databases, as well as 3D graphics and deep learning, although its performance on those workloads is low. I consider this highly flexible.
A GPU (Graphics Processing Unit) is a kind of accelerator for speeding up 3D graphics processing. It has been used in PCs and game consoles since around 1990 and is recognized as a commercially successful accelerator.
In the GPU + general-purpose CPU configuration, the graphics performance is naturally higher than that of the general-purpose CPU. On the other hand, the GPU is not used for processing other than graphics (for example, spreadsheets), so the system might be less flexible than a general-purpose CPU.
The “ideal system!” on the upper right is a system with high system flexibility and high processing performance for specific applications. It doesn’t actually exist, but I think the real system continues to evolve towards this “ideal system” as a goal.
Example of domain-specific architecture: Google’s TPU
Google’s TPU (Tensor Processing Unit) is an accelerator developed in-house by Google that specializes in accelerating deep learning. It seems like Google itself uses it for services such as “Google Search”, “Google Translate” and “Google Photos”.
The right side of Figure 4 shows the TPU chip floor plan. The computation units, such as the “Unified Buffer”, “Matrix Multiply Unit”, and “Accumulators”, occupy about half of the chip area. The rest is occupied by the DRAM interface, PCI Express interface, etc., and the control unit (shown as “Control” in red) occupies only 2%.
In general-purpose CPUs and GPUs, the control unit occupies a much larger area. Compared with them, the TPU can be said to have a very efficient design, which was made possible by specializing in high-speed deep learning.
The performance per watt of the TPU is reported to be about 30 times that of a GPU, which is very high. While the TPU is very powerful for deep learning calculations written in TensorFlow, it would be difficult to get the same performance out of it in other applications. Hence we can say that the flexibility of the system is rather low. The TPU is shown as “Accelerator ASIC + General Purpose CPU” in Figure 3.
Challenges of the Accelerator ASIC approach
I think there are two main issues with the Accelerator ASIC approach. The first is that the flexibility of the system is low. It has very high performance for specific applications, but it cannot deliver that performance when the application changes. The second is that designing an application-specific integrated circuit (ASIC) typically takes more time and more money.
Figure 6 below compares FPGA and ASIC development, including development time. It is a rough sketch that reflects my personal opinion.
I estimate that designing an accelerator ASIC is time consuming and costly. In the case of Google’s TPU, I presume that ASIC development was feasible because three conditions were met: (1) there is a large amount of processing to speed up, (2) the algorithms to be accelerated are standard and highly parallel, and (3) a large number of accelerator ASICs (thousands to tens of thousands?) can be put to use in the data centers.
If those conditions are not met, ASIC development will not pay off, so I think there would be no choice but to continue using GPUs as accelerators.
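Whether ASIC development pays off can be sketched as a simple break-even calculation. The Python sketch below assumes a cost model that is not from this article: each option has a one-time development cost plus a per-unit cost, and the ASIC is compared with an off-the-shelf alternative such as a GPU or FPGA. All cost figures are hypothetical placeholders chosen only for illustration.

```python
import math


def total_cost(dev_cost: float, unit_cost: float, units: int) -> float:
    """Total cost = one-time development cost + per-unit cost * volume."""
    return dev_cost + unit_cost * units


def break_even_units(asic_dev: float, asic_unit: float,
                     alt_dev: float, alt_unit: float) -> int:
    """Smallest volume at which the ASIC is no more expensive than the alternative.

    Assumes the ASIC costs more to develop but less per unit.
    """
    if asic_dev <= alt_dev or asic_unit >= alt_unit:
        raise ValueError("model assumes higher ASIC development cost "
                         "and lower ASIC unit cost")
    return math.ceil((asic_dev - alt_dev) / (alt_unit - asic_unit))


# Hypothetical numbers, NOT taken from the article or from any vendor.
ASIC_DEV, ASIC_UNIT = 20_000_000.0, 500.0
ALT_DEV, ALT_UNIT = 2_000_000.0, 3_000.0

n = break_even_units(ASIC_DEV, ASIC_UNIT, ALT_DEV, ALT_UNIT)
print(f"break-even volume: {n} units")
print(f"ASIC total at break-even: {total_cost(ASIC_DEV, ASIC_UNIT, n):,.0f}")
print(f"alternative total at break-even: {total_cost(ALT_DEV, ALT_UNIT, n):,.0f}")
```

In this simplified model, deployments below the break-even volume favor the alternative, which is why the number of units that can be deployed matters so much for the ASIC decision.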
Accelerator FPGA + General Purpose CPU Approach
One way to overcome the time and cost of developing an accelerator ASIC mentioned above is to use an FPGA (Field Programmable Gate Array) instead of the accelerator ASIC to create an accelerator. This way, it might be possible to realize an accelerator optimized for each domain by taking advantage of the flexibility of FPGA.
When we reflect these characteristics in Figure 3, it sits at the position of “accelerator FPGA + general-purpose CPU” (shown in light green). It depends on the design, but the performance in a specific application would be higher than that of a general-purpose CPU. I also think that it can be made less restricted by software than a GPU.
In other words, the flexibility of the “Accelerator FPGA + General Purpose CPU” system is inferior to that of a general-purpose CPU, but not by a wide margin. Furthermore, the FPGA design period is much shorter than that of ASIC development, and the development cost can be kept low (Figure 6). Because an FPGA can be reprogrammed, we can also expect the advantage that modifications during development can be made at little or no additional cost.
Let’s take an FPGA accelerator for text search as an example of “accelerator FPGA + general-purpose CPU”. It was reported that SQL query throughput for big data could be up to 16.2 times faster (Figure 7) on systems with Altera (now Intel) Stratix IV FPGAs (250 MHz operation) connected to IBM POWER7 servers.
Approach from FPGA to Deep Learning (Vitis by Xilinx)
Xilinx’s latest FPGA devices and design environments support deep learning.
The “Vitis™ AI Development Environment” is a development kit for AI inference on Xilinx hardware platforms, including both edge devices and Alveo™ accelerator cards. It includes optimized IP cores, tools, libraries, models, and sample designs.
Designed for high efficiency and ease of use, Vitis AI aims to maximize the speed of AI inference on Xilinx FPGAs and Adaptive Compute Acceleration Platforms (ACAPs). By abstracting away the complexity of the underlying FPGA and ACAP, it helps users with no FPGA design experience easily develop deep learning inference applications.
The use of FPGAs has also begun in cloud environments. For example, the Amazon cloud environment (Amazon EC2 F1 instances) provides a virtual environment with an FPGA connected to a general-purpose CPU. If you use such a system, you no longer need to have an FPGA at hand. RISC-V simulations using this type of system have also been performed.
What is the next step for AMD?
NVIDIA provides CUDA (Compute Unified Device Architecture) for its GPUs, which can be used from languages such as C++ and Python. This is a general-purpose parallel computing platform (parallel computing architecture) and programming model for GPUs, and it is widely used in machine learning and deep learning. NVIDIA is also estimated to have a large market share in GPUs for data centers.
On the other hand, tensorflow-rocm for AMD GPUs is not widely used. Even with the expected rapid expansion of the GPU market for data centers, it may still take considerable effort for AMD GPUs to penetrate the market.
Xilinx’s aggressive commitment to AI, such as developing FPGA devices and design toolchains for AI, may have played a role in AMD’s acquisition of Xilinx. The acquisition makes it possible for AMD to offer customers the next-generation “accelerator FPGA + general-purpose CPU” platform, in addition to competing with NVIDIA on the “GPU + general-purpose CPU” front.
- AMD to Acquire Xilinx, Creating the Industry’s High Performance Computing Leader
- AMD agrees to acquire Xilinx – Competes Intel by strengthening growth areas
- CPU rivals are one after another, new rivals after GPU
- A New Golden Age for Computer Architecture, Com. ACM, 2019
- J. Dean, “1.1 The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design,” 2020 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 2020, pp. 8-14, doi: 10.1109/ISSCC19947.2020.9063049.
- N. P. Jouppi et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit”, 44th International Symposium on Computer Architecture (ISCA), Toronto, Canada, 2017
- Google’s AI chip “TPU” is 30 times faster than GPU
- Reading Chapter 7, “Domain-Specific Architecture”, of Computer Architecture 6th Edition (Sections 7.1 and 7.2)
- R. Polig et al., “Giving Text Analytics a Boost”, IEEE Micro, Vol. 34, Issue 4, 2014, https://arxiv.org/pdf/1806.01103.pdf
- Amazon EC2 F1 instance: Accelerate development and deployment of FPGA accelerators in the cloud
- FireSim, a RISC-V simulation environment that runs on Amazon EC2 F1 instances