Hello everyone, I’m Haruyuki Tago, Edge Evangelist at HACARUS Tokyo R&D center.
In this series of articles, I will share some insights from my decades of experience in the semiconductor industry and I will comment on various AI industry related topics from my unique perspective.
In volume eight of this series, we talked about Sony’s SPRESENSE device and a few potential uses for the system. In particular we discussed an experiment where SPRESENSE was used to analyze various factors affecting the housing prices in the city of Boston. If this sounds interesting to you, please check out my previous article, which can be found here: https://hacarus.com/ai-lab/20210506-spresense/.
In today’s article, I would like to introduce Xilinx Versal, the industry’s first Adaptive Compute Acceleration Platform (ACAP). Xilinx has been pushing the boundaries of AI computing, and this new intelligent engine showcases a few of its strengths. For this article, I want to provide you with an overview of the system and specifications.
How Versal Compares to Industry Standards
The concept of an Adaptive Compute Acceleration Platform isn’t new. It was first introduced back in 2018 and the first ACAP product was launched in June of 2019. Xilinx has reported that their Versal product is an intelligent, adaptive platform that has endless potential for applications in a variety of different fields.
This platform is commonly used in data centers to assist with both wired network and wireless 5G tasks. One area where the ACAP really shines is assisted driving (ADAS) computing, where it performs around 20 times faster than currently used FPGA methods. The gains are even more impressive compared to CPUs, where the ACAP can be up to 100 times faster than conventional methods.
The ACAP platform also sits above the Zynq UltraScale+ RFSoC, Xilinx's most cutting-edge SoC technology.
Before we move on, I would just like to take a moment to thank Hisa Ando, the author of several articles used as references for this article. Mr. Ando has greatly contributed to my understanding of Xilinx’s ACAP technology and its unique characteristics.
To better visualize the comparison between the ACAP and other devices, Figure 1 shows three different categories of devices and their various models. Although it isn’t listed in Figure 1, the first Versal device ever released was the VC1902.
When I read these articles, I am always fascinated by the rapid growth of today’s technology and its applications. As technology evolves, so does the terminology used within the industry. The new generation of FPGA has brought with it several name changes. First, the Processor System (PS) is now called the scalar engine. Second, the programmable logic (PL) is now called the adaptable engine. Finally, the terms Intelligent engine and AI engine are both used for the new compute block. The old and new terms are all still common, and this article will make references to both.
Now that we have clarified some of the changes in terminology, let us look at the improvements for the scalar engines. Compared to the dual-core Cortex-A53 from the previous generation, the new dual-core Cortex-A72 is a large improvement. The dual-core Cortex-R5, used for real-time OS processing, is the same as the previous generation.
The PL has also seen several upgrades between generations, in the Configurable Logic Block (CLB) and other areas. One last area of improvement is semiconductor manufacturing. While the previous generation of devices was manufactured using the TSMC 16nm process, Versal devices use the TSMC 7nm process.
The Architecture of AI Engines
Versal has a lot of architectural features that give it a leg up against conventional FPGAs. The most significant difference between the two is that Versal devices have added an AI engine and improved upon existing DSPs found in FPGAs. Improvements have also been made regarding the chips used, allowing for faster speeds to better accommodate the high memory bandwidth needs of the AI engine (Figure 3).
Next, I want to explain how the array architecture is set up for the AI engine. As shown in Figure 4, AI cores and memory are arranged as tiles in a 2D grid. These tiles are connected vertically and horizontally in a mesh-like design where data is shared between neighboring tiles. By combining 400 of these tiles, the system is able to reach a peak performance of 133TOPS (Tera-Operations Per Second) using INT8.
Using this 2D mesh design, the AI engine is able to achieve non-blocking throughput between its tiles. In total, the entire engine has 12.5MB of memory. Although this memory is shared, each tile is directly connected to its surrounding tiles and their accesses do not interfere with one another, so sharing is not an issue. Figure 5 shows how this tile-based architecture functions in greater detail.
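To put these totals in perspective, here is a rough back-of-envelope sketch that divides the quoted figures by the number of tiles. It assumes the compute and memory are spread evenly across all 400 tiles and that 1MB equals 1024KB; both are my assumptions for illustration, not statements from Xilinx.

```python
# Totals quoted in the article for the whole AI engine array.
TILES = 400
PEAK_INT8_TOPS = 133.0   # peak INT8 performance of the array
TOTAL_MEMORY_MB = 12.5   # total memory across the array

# Assumption: resources are divided evenly between tiles, 1 MB = 1024 KB.
gops_per_tile = PEAK_INT8_TOPS * 1000 / TILES
memory_kb_per_tile = TOTAL_MEMORY_MB * 1024 / TILES

print(f"{gops_per_tile:.1f} GOPS per tile")
print(f"{memory_kb_per_tile:.0f} KB per tile")
```

Under these assumptions, each tile contributes roughly a third of a TOPS and holds a few tens of kilobytes of memory, which shows how much the design relies on many small tiles working in parallel.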
Supporting this engine is a 32-bit RISC processor, which also includes a 512-bit SIMD unit for vector processing. The processor uses a 7-way VLIW (Very Long Instruction Word) design, which is outlined in Figure 6.
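As a small illustration of what a 512-bit vector width means, the sketch below counts how many elements of a given size fit into one vector. The helper name lanes is hypothetical, and the count only describes how data packs into 512 bits; it makes no claim about how many operations the hardware actually completes per cycle.

```python
VECTOR_BITS = 512  # SIMD width quoted in the article

def lanes(element_bits: int) -> int:
    """Return how many elements of the given width fit in one 512-bit vector."""
    return VECTOR_BITS // element_bits

for bits in (8, 16, 32):
    print(f"{bits}-bit elements: {lanes(bits)} per vector")
```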
Another important part of the AI engine architecture is its memory structure. Figure 7 shows how each component of the memory hierarchy works together to achieve an impressive total bandwidth of 38TB/s. The entire array is layered so that L1 SRAM is connected to L2 SRAM and the L2 SRAM is connected to the DRAM. The connections between the L1 and L2 SRAM are also capable of multicasting and broadcasting.
Systolic arrays are a way of arranging computational tiles in a 2D space where each tile is connected to those around it and shares data with them. The benefits of this arrangement have been studied for decades, with the first publication appearing around 1978.
The term systolic is taken directly from the heart’s systolic cycle. Similar to a human heart, the array focuses on supplying the entire system with a steady stream of data in the most efficient way possible.
When mentioning arrays, it is impossible not to talk about matrices and matrix multiplication. Understanding the basics behind this math will greatly improve your comprehension of systolic arrays. Figure 8 shows an example of 2 by 2 matrix multiplication.
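For readers who prefer code to figures, here is a minimal sketch of 2 by 2 matrix multiplication. The input values are reconstructed from the step-by-step walkthrough later in this article (1*5+0, 2*7+5=19, and so on); arranging them as the matrices A and B below is my assumption about how Figure 8 lays them out.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]

# Assumed layout of the example values.
A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]

C = matmul(A, B)
print(C)  # [[19, 22], [43, 50]]
```

Every element of the result is a sum of products, for example 1*5 + 2*7 = 19, which is exactly the operation a systolic array tile is built to perform.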
Although the matrix multiplication example above may appear easy, the process is a little more complicated in hardware. Figure 9 shows a simplified notation of what happens inside of a tile. On the left, you will notice there are 3 input terminals (i0, i1, and i2) and 3 output terminals (o0, o1, and o2). Starting with each input, we can observe what happens to each value (a, b, or c) as it travels through the tile. Within the tile, there is a multiplier and an adder for combining the values, as well as 3 separate registers.
The right side of Figure 9 shows the timing chart for the values as they travel through the tile. Here you can see that, because they pass through a register, a and b appear on their output terminals one clock cycle T later. The output on o2 is a*b+c, which coincides with the notation on the left.
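The behavior of a single tile can be sketched in a few lines of code. This is only a simplified model based on the description above: the class name Tile and the method name clock are hypothetical, and I am assuming that i0, i1, and i2 carry a, b, and c and that all three outputs are registered.

```python
class Tile:
    """Simplified model of one tile: three inputs, three registered outputs."""

    def __init__(self):
        # Register contents are undefined until the first clock cycle.
        self.o0 = self.o1 = self.o2 = None

    def clock(self, i0, i1, i2):
        """Advance one cycle T: latch a, b and the sum of products a*b+c."""
        self.o0 = i0            # a is passed on to a neighboring tile
        self.o1 = i1            # b is passed on to a neighboring tile
        self.o2 = i0 * i1 + i2  # a*b+c
        return self.o0, self.o1, self.o2

tile = Tile()
print(tile.clock(1, 5, 0))  # (1, 5, 5)
print(tile.clock(2, 7, 5))  # (2, 7, 19)
```

The second call feeds the previous result back in as c, which is how a partial sum grows as it moves through the array.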
Applying Matrix Multiplication to Demonstrate Systolic Arrays
Looking back to the example of 2 x 2 matrix multiplication above, we will use the concepts we just learned about systolic arrays to visually represent the matrix multiplication within the array step by step. Below I will explain each time frame one at a time to show exactly what is happening.
T0: To start, we will set up the matrix elements as shown in the diagram below. These values and colors correspond to the math example in Figure 8. The pink element at the bottom is also set to zero, which will represent the solution matrix at the end.
T1: First, looking at the yellow tiles, each element will shift one space to the right and downwards. The green tiles will also move to the left and downward by one tile each, and finally the pink element will move up one tile. In the center, we can see that there are 3 elements that have moved into the same tile, which represent the inputs shown in Figure 9, where i0=1, i1=5, and i2=0.
T2: Following the same movements as T1, the yellow and green elements shift positions by one tile. This creates the situation shown in image T2. It is important to note that while the yellow and green elements always retain the same value, the pink value changes according to the operators outlined in Figure 8. For example, in T1 the middle tile had three elements that followed the operations a*b+c (1*5+0). This gives a final value of 5 which carries over to T2 where it acts as the input c.
T3: Following the same movements as always, the green and yellow elements shift either right or left and downward. The pink values also shift upward one tile after performing their individual operations. After performing the operations from T2, the upper element changes to 19, the right element changes to 6, and the left element changes to 15.
T4: Once again, each element moves in the same direction as previously described. Carrying over the computations from T3, there are three tiles that need to be recalculated:
4*7+15=43, 2*8+6=22, & 3*6+0=18
T5: This is the final step of the matrix multiplication. Since the upper three elements just shift upward, there is only one tile that needs to be adjusted. After solving the last expression, 4*8+18=50, we now have the solution matrix, whose elements are 19, 22, 43, and 50.
This solution matrix is the exact same as the solution found in Figure 8. To recap what is happening inside of the array, let’s look back to Figure 4 and Figure 5 one more time. These figures show that data flows between adjacent tiles using the neighboring tile’s memory. The data flow is also supported by the interconnects for non-adjacent tiles.
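The timing of the walkthrough can also be replayed in code. The sketch below does not model the physical movement of elements between tiles; it only schedules each multiply-accumulate at the time step where the walkthrough performs it. The rule that the product of a[i][k] and b[k][j] fires at step i+j+k+1 is my assumption, chosen because it reproduces the order described in T1 through T5, and the function name systolic_schedule is hypothetical.

```python
from collections import defaultdict

def systolic_schedule(a, b):
    """Replay C = A x B for square matrices as skewed sum-of-product steps.

    Assumption: the product a[i][k] * b[k][j] is computed at step
    i + j + k + 1, matching the order of the article's walkthrough.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]

    # Group every (i, j, k) product by the step at which it fires.
    steps = defaultdict(list)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                steps[i + j + k + 1].append((i, j, k))

    trace = {}
    for t in sorted(steps):
        fired = []
        for i, j, k in steps[t]:
            c_in = c[i][j]                      # partial sum entering the tile
            c[i][j] = a[i][k] * b[k][j] + c_in  # a*b+c inside the tile
            fired.append(f"{a[i][k]}*{b[k][j]}+{c_in}={c[i][j]}")
        trace[t] = fired
    return c, trace

result, trace = systolic_schedule([[1, 2], [3, 4]], [[5, 6], [7, 8]])
for t, ops in trace.items():
    print(f"step {t}: {', '.join(ops)}")
print(result)  # [[19, 22], [43, 50]]
```

Note how the eight multiplications are spread over only four steps, with up to three tiles working at the same time. This overlap is where the array gets its throughput.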
Pros and Cons for Using a Systolic Array
Perhaps the most notable advantage of using a systolic array is its low power consumption. For applications that perform matrix multiplication using a conventional computer, the power consumption is significantly higher. This is because the data is stored in the main memory and transmitted repeatedly to the arithmetic operators over a long distance.
In contrast, a systolic array performs a sum-of-product function (a x b + c) within each tile and transmits the data only between neighboring tiles. This approach reduces the distance the data travels, considerably driving down the energy consumption.
Another advantage of using a systolic array is the prevalence of sum-of-product functions in deep learning. These functions are used in approximately 95% of deep learning processing. With the rising popularity of deep learning methods, finding more efficient and cost-effective methods is essential. Many companies have realized the importance of array architectures and are adopting them for practical use, with Google's TPU being a well-known example.
Although array architecture has come into the limelight for deep learning, it is not without its own flaws. While ideal for matrix multiplication, systolic arrays perform worse than conventional methods when it comes to irregular operations. For this reason, the use of array architectures has been limited to applications such as signal processing.
Capabilities of the AI Engine
So far, I have spent the majority of this article talking about the theory and architecture behind the ACAP device. While this is very important, I am sure that the question on everyone’s mind is “How does it perform?”
Looking at Figure 11, the short answer is that the Versal ACAP device vastly outperforms SoC devices. The figure shows two examples where the Versal and UltraScale+ MPSoC devices were used to perform image recognition. When using GoogleNet (left), the Versal outperformed the SoC device by about 7 times. Similarly, for the ResNet-50 case (right), it performed around 5 times faster. Although impressive, it is worth noting that actual Versal hardware was not used in this test; a projected value for the Versal VC1902 AI engine was used instead.
Programming for the AI Engine
To finish up this article, let’s briefly talk about programming the AI engine. As mentioned previously, the AI engine processor uses a 7-way VLIW configuration that requires a lot of local memory, interconnects, and controls to operate (Figure 6). According to the Xilinx documentation listed in the references, an AI engine program is described as a dataflow graph written in C++, which can be compiled and executed using the AI engine compiler.
“An Adaptive Data Flow (ADF) graph application consists of nodes and edges: nodes in this case represent computational kernel functions, and the edges represent data connections. The application kernel is a basic component of the ADF graph specification, and it can be compiled in order to run on the AI engine. ADF graphs are Kahn process networks with AI engine kernels running in parallel.” Xilinx also has an AI engine v2.0 LogiCORE IP (Figure 12), which is a functional block (IP) developed by Xilinx that connects over AXI. The advantage of this is that users only need to understand how the functional block works, without needing to program the AI engine entirely by themselves.
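To give a feel for the Kahn process network idea, here is a minimal sketch using only the Python standard library. It is not the ADF API and not how AI engine kernels are actually written (those are C++); the names source, mac_kernel, sink, and run_graph are hypothetical. Each node runs in parallel as its own thread, each edge is a FIFO queue, and a node blocks until data arrives on its input edge.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a stream

def source(out_edge, pairs):
    """Node that feeds (a, b) pairs into the graph."""
    for pair in pairs:
        out_edge.put(pair)
    out_edge.put(DONE)

def mac_kernel(in_edge, out_edge):
    """Node that accumulates a*b over the stream and emits each partial sum."""
    acc = 0
    while True:
        item = in_edge.get()  # blocking read: wait until data is available
        if item is DONE:
            break
        a, b = item
        acc += a * b
        out_edge.put(acc)
    out_edge.put(DONE)

def sink(in_edge, results):
    """Node that collects the output stream."""
    while True:
        item = in_edge.get()
        if item is DONE:
            break
        results.append(item)

def run_graph(pairs):
    """Connect source -> mac_kernel -> sink and run all nodes in parallel."""
    edge1, edge2 = queue.Queue(), queue.Queue()
    results = []
    nodes = [
        threading.Thread(target=source, args=(edge1, pairs)),
        threading.Thread(target=mac_kernel, args=(edge1, edge2)),
        threading.Thread(target=sink, args=(edge2, results)),
    ]
    for node in nodes:
        node.start()
    for node in nodes:
        node.join()
    return results

print(run_graph([(1, 5), (2, 7)]))  # [5, 19]
```

Because every node only communicates through its FIFO edges and waits for input before computing, the result is the same no matter how the threads are scheduled. That determinism is the main appeal of the Kahn process network model for parallel kernels.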
I would like to thank you for reading the ninth volume of my Edge AI Evangelist’s Thoughts series. Hopefully you were able to learn something new or spark an interest in this powerful technology. To finish this volume, I will briefly summarize the main contents of this article.
- Xilinx has developed a new line of devices known as Adaptive Compute Acceleration Platforms which will act as a successor to current FPGA and SoC devices.
- Unlike the FPGA devices from previous generations, which only included scalar and adaptive engines, Versal devices will also come equipped with an AI engine.
- With improvements to semiconductor production, Versal devices are using 7nm components from TSMC as opposed to the 16nm semiconductors found in devices from the previous generation.
- Each AI engine tile combines a 32-bit RISC processor with a 512-bit SIMD vector unit in a 7-way VLIW design, and the tiles and their local memory are arranged in a 2D array pattern.
- Memory tiles are arranged in a 2D mesh topology that allows non-blocking data transfer. The topology also establishes a dedicated connection between neighboring tiles and their memory.
- When comparing the projected image recognition performance of the VC1902 to the UltraScale+ MPSoC, the VC1902 performed about 7 times faster for GoogleNet and about 5 times faster for ResNet-50.
- AI engine programs are data flow graphs written in C++ and based on Kahn process networks. The engine is also compatible with the AI engine v2.0 LogiCORE IP developed by Xilinx.
References:
Xilinx Versal, https://japan.xilinx.com/products/silicon-devices/acap/versal.html
AIエンジンを持ったXilinxの「Versal FPGA」 (Xilinx's "Versal FPGA" with AI engines, in Japanese), https://news.mynavi.jp/article/hotchips31_ml-14/
さまざまな分野におけるAIアプリニーズに対する答えとなる「Versal」 ("Versal", the answer to AI application needs in various fields, in Japanese), https://news.mynavi.jp/article/hotchips31_ml-15/
Xilinx Developer Forum, "Versal: AI Engine & Programming Environment", https://www.xilinx.com/publications/events/developer-forum/2018-frankfurt/versal-ai-engine-and-programming-environment.pdf
H.T. Kung, C.E. Leiserson, "Systolic Arrays (for VLSI)", CMU, April 1978, https://apps.dtic.mil/dtic/tr/fulltext/u2/a066060.pdf
H.T. Kung, "Let’s Design Algorithms for VLSI Systems", CMU, Jan. 1979, https://caltechconf.library.caltech.edu/192/1/HTKung.pdf
"First In-Depth Look at Google’s TPU Architecture", https://www.nextplatform.com/2017/04/05/first-depth-look-googles-tpu-architecture/
AI Engine v2.0 LogiCORE IP Product Guide, https://www.xilinx.com/support/documentation/ip_documentation/ai_engine/v2_0/pg358-versal-ai-engine.pdf
Versal ACAP AI Engine Architecture Manual, https://www.xilinx.com/support/documentation/architecture-manuals/am009-versal-ai-engine.pdf
Versal ACAP AI Engine Programming Environment User Guide, https://www.xilinx.com/support/documentation/sw_manuals/xilinx2020_2/ug1076-ai-engine-environment.pdf
AI Engine Kernel Coding Best Practices Guide, https://www.xilinx.com/support/documentation/sw_manuals/xilinx2020_2/ug1079-ai-engine-kernel-coding.pdf
Kahn process networks, Wikipedia