# Edge AI Evangelist’s Thoughts Vol.12: Cloud-Native Processors

Hello everyone, I’m Haruyuki Tago, Edge Evangelist at HACARUS Tokyo R&D center.

In this series of articles, I will share some insights from my decades of experience in the semiconductor industry and I will comment on various AI industry-related topics from my unique perspective.

In the 12th volume of this web series, I will introduce two new series of cloud-native processors developed for use in data centers. The processors I will go over have been developed by Amazon and Ampere, two companies that are not traditionally seen as microprocessor manufacturers.

### Ampere Altra

The first processor we will look at is Ampere’s Altra. Ampere is a startup founded in 2018 by former Intel employees. The Altra is the company’s second product and integrates 80 cores built on Arm’s v8.2A+ architecture. The cores are based on Arm’s Neoverse N1, a core design aimed at data center servers [9].

The Altra is a very impressive piece of hardware and is even used by Oracle for its cloud infrastructure (OCI) [4]. Its single-threaded cores run at up to 3GHz, and each has a 64kB L1 instruction cache and a 1MB L2 cache. The chip also integrates 32MB of system-level cache (SLC). Figure 1 shows the Altra in practical use inside the Mt. Jade platform.

Figure 1. Mt. Jade Platform using Ampere’s Altra Microprocessor

Now that we are familiar with the Altra’s specs, let us see how it performs in a practical test. Figure 2 shows how the Altra compares with processors from AMD and Intel. In the CoreMark 1.0 benchmark, which measures iterations per second, the Altra and the EPYC perform at a similar level, and both greatly outperform Intel’s Xeon Platinum.

Moving to the right side of Figure 2, we begin to see the advantages of the Altra. Once power consumption is taken into account, the Altra is clearly superior: it performs almost twice as many iterations as the EPYC while consuming the same amount of power.

Figure 2. Coremark 1.0 Benchmark Results for the Mt. Jade Platform

### Amazon’s Graviton2 Processor and EC2 Instance

Our second microprocessor example is Amazon’s Graviton2, which Amazon developed for use in its EC2 cloud environment. As expected of Amazon, the EC2 cloud environment runs on some of the highest-performing processors available. As with the Altra, the CoreMark 1.0 benchmark was run on a wide variety of processors used in Amazon’s AWS cloud instances, and the results are shown in Figure 3. Coming in first place was the m6g_16xlarge using the Graviton2.

Figure 3. Amazon’s EC2 Coremark v1.0 Processor Comparison

Following these results, it is no surprise that Amazon has transitioned 49% of its microprocessors to the Graviton2. The remaining 51% are traditional x86 microprocessors from AMD and Intel. Figure 4 shows the breakdown of instance types on the left, with the Graviton2 sitting at 15% as of January 2021 [5]. The high-performance CPU options for EC2 are also shown on the right side of Figure 4.

Figure 4. Amazon’s EC2 Instance Breakdown (Left), EC2 CPU Options (Right)

The Altra and Graviton2 are both marvelous pieces of technology, but Amazon and Ampere can’t take all the credit for their development. Both of these microprocessors are based on Arm’s Neoverse design. If you look closely at Figure 5, you can see that the logos for Arm and Annapurna Labs, a design company acquired by Amazon, are engraved into the Graviton2.

Figure 5. Ampere Altra’s Block Diagram (Left), Amazon Graviton2’s Cover & CPU ID Information (Right)

The Neoverse series was first announced at Arm TechCon 2018, an event hosted by Arm in California. The introduction of the Neoverse was a big deal because until this point the Cortex had been Arm’s flagship processor line, used in devices such as smartphones. To show just how committed it was to servers and infrastructure, Arm launched the Neoverse as a brand separate from the Cortex, positioned to compete with server processors such as Intel’s Xeon [9]. We can look at the Neoverse platform’s roadmap in Figure 6 to get an idea of its future [5].

Figure 6. Arm Neoverse Platform Road Map [5]

To get a better understanding of how the software and hardware of the Neoverse function, we can look at Figure 7 [10]. This diagram includes important information, such as the number of cores and the DRAM connection width. The information provided here is also important for the register-transfer-level design for the processor. By using the efficient and powerful design of the Neoverse, both Ampere and Amazon can save a lot of time and manpower by choosing not to develop their own designs.

Figure 7. Neoverse Software & Hardware Information

### Understanding How Cache Memory Works

To understand a bit more about how microprocessors work, I want to share a practical example of cache coherency. For this example, imagine going to an ATM to withdraw money. Your bank balance is $100 and you want to withdraw $60 first. After withdrawing the money, you then attempt to transfer $50 to another account. Below we will look at several examples to explain how these transactions are computed.

Before looking at the examples, let’s go over some basic background information on microprocessors. Simply speaking, microprocessors operate by repeatedly reading data from the main memory, processing this data in the internal arithmetic processing unit, and writing the results back to the main memory. Therefore, the time to read and write data to and from the main memory has a significant impact on the overall performance. Until the early 1980s, microprocessors ran at speeds similar to those of the Dynamic Random Access Memory (DRAM) that serves as the main memory in a server. Since then, DRAM speed has increased only gradually while microprocessor speed has increased drastically, creating a performance gap that widens each year (Figure 8).

Figure 8. Microprocessor-DRAM Performance Gap [11]

Now that the background is out of the way, let’s get into the examples. First, we will explain step by step how the transactions are computed using a system consisting of only a single core and DRAM, as outlined in Figure 9.

Figure 9. One-Core System Without a Cache

Step 1: The account balance of $100 is read from the DRAM and written into the register (r0) in the core.

Step 2: The $60 withdrawal is subtracted from r0.

Step 3: The updated balance of $40 is written to the DRAM.

Step 4: The updated account balance of $40 is read from the DRAM.

Step 5: The transfer of $50 is subtracted from the balance of $40 in r0, resulting in a balance of -$10.

Step 6: The transfer will correctly fail due to an insufficient balance.

Looking back at Figure 9, we can observe that the total execution time for all these transactions was 152ns, of which 150ns was spent accessing the DRAM. This base example shows just how large the gap is between the computational speed of a processor and the speed of DRAM. The result suggests that increasing the microprocessor’s performance will have very little effect on the overall system performance, because the DRAM dominates the execution time. One simple solution to this is a cache memory system, which we will cover next.
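The 152ns total can be reproduced with a small tally of the six steps. The following is only a sketch: the 50ns DRAM latency matches the delay quoted in this article, while the 1ns cost per subtraction is an assumption chosen so that the non-DRAM time adds up to the remaining 2ns. The function and constant names are hypothetical.

```python
# Assumed latencies, chosen to match the totals quoted in the article.
DRAM_NS = 50  # one DRAM read or write
ALU_NS = 1    # one subtraction in the core (assumption: 2 ns total for two subtractions)


def run_without_cache(balance, withdrawal, transfer):
    """Replay steps 1-6 on one core with no cache.

    Returns (transfer_ok, balance_left_in_dram, elapsed_ns).
    """
    dram = {"balance": balance}
    elapsed = 0

    r0 = dram["balance"]      # Step 1: read the balance from DRAM into r0
    elapsed += DRAM_NS
    r0 -= withdrawal          # Step 2: subtract the withdrawal
    elapsed += ALU_NS
    dram["balance"] = r0      # Step 3: write the new balance to DRAM
    elapsed += DRAM_NS
    r0 = dram["balance"]      # Step 4: read the balance from DRAM again
    elapsed += DRAM_NS
    r0 -= transfer            # Step 5: subtract the transfer
    elapsed += ALU_NS
    transfer_ok = r0 >= 0     # Step 6: a negative result means the transfer fails

    return transfer_ok, dram["balance"], elapsed


print(run_without_cache(100, 60, 50))  # (False, 40, 152)
```

Three of the six steps touch the DRAM, which is why 150 of the 152 nanoseconds are memory time under these assumptions.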

Cache memory is a small-capacity, fast-access memory located on the same chip as the core, and it holds a copy of part of the main memory. While there is a lot more to cache memory than this, this will suffice for understanding the example below.

Figure 10 shows the same situation as before, except this time the system has one core with cache memory. The cache memory, which is added to the core, consists of a valid bit and the data memory. Now, let’s go over the same steps as before.

Figure 10. One-Core System With a Cache

Step 1: The account balance is read from the DRAM and written to the data memory as well as to the register r0.

Step 2: The $60 withdrawal is subtracted from r0.

Step 3: The updated account balance of $40 is written to the cache memory. Since the cache memory is located in the same core, the write takes only 2ns (denoted by the green arrow).

Step 4: The account balance data is stored in the cache memory, so it can be read into r0 as fast as 2ns.

Step 5: The transfer of $50 is subtracted from the balance of $40 in r0, resulting in a balance of -$10.

Step 6: The transfer will correctly fail due to insufficient balance.

When using cache memory, the total execution time was reduced from the previous 152ns to 56ns. This reduction is possible because, once the data has been loaded into the data memory, the subsequent write and read are served by the cache memory instead of the DRAM. While impressive, cache memory cannot reduce the time needed for the initial read from the DRAM, so there will always be a 50ns delay.

Another drawback is that this simple cache scheme only works correctly in a single-core system. In a multi-core system, errors can occur. To explain how, let’s once again look at the banking problem, this time using a 2-core system (Core0 & Core1) that also uses cache memory. As shown in Figure 11, both cores share a single DRAM through an interconnect, forming a shared-memory multiprocessor architecture. To reduce the processing time further, we will assume that Core0 performs the $60 withdrawal while Core1 performs the $50 transfer, and we will pick up the example at step 3.

Step 3: Core0 writes the updated account balance of $40 to its cache memory; the DRAM is updated only at a later time.

Step 4, 5, & 6: Core1 transfers $50 based on the old account balance of $100.

Even though the balance is insufficient, the transfer goes through by mistake. This happens because the cache memories of the two cores hold inconsistent copies of the account balance. This example of cache incoherency is shown in Figure 11. To solve this issue, a mechanism known as the Snoop mechanism has been devised. It maintains the coherency of the cache memory while still taking advantage of its high speed.

Figure 11. Two-Core System Without Cache Coherency
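The failure in Figure 11 can be mimicked in a few lines. This is a minimal sketch, not a model of any real processor: each core has a one-entry private cache (a valid bit plus the data), writes stay in the cache, and it is assumed that both cores had already read the $100 balance before step 3. The class and function names are hypothetical.

```python
class Core:
    """A core with a one-entry private cache: a valid bit and the cached data."""

    def __init__(self, dram):
        self.dram = dram
        self.valid = False
        self.data = None

    def read(self):
        if not self.valid:                 # cache miss: fetch from the shared DRAM
            self.data = self.dram["balance"]
            self.valid = True
        return self.data                   # cache hit: return the (possibly stale) copy

    def write(self, value):
        self.data = value                  # written to the cache only;
        self.valid = True                  # the DRAM would be updated later


def demo_incoherent():
    dram = {"balance": 100}
    core0, core1 = Core(dram), Core(dram)
    core0.read()
    core1.read()                           # assumption: both cores cached $100 earlier
    core0.write(core0.read() - 60)         # Step 3: Core0 caches the new balance of $40
    seen_by_core1 = core1.read()           # Step 4: Core1 still sees its own stale copy
    approved = seen_by_core1 - 50 >= 0     # Steps 5-6: the check passes by mistake
    return seen_by_core1, approved


print(demo_incoherent())  # (100, True)
```

Nothing tells Core1 that its copy is out of date, so the $50 transfer is approved against a balance that no longer exists.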

To further explain the Snoop mechanism, we will look at the banking example one last time. Focusing only on step 3, we will add the Snoop mechanism to maintain consistency between the cache memories of the two cores. This scheme is known as a cache coherency protocol and is illustrated in Figure 12. The Snoop mechanism serves two roles. First, it sends out a cache update message. Second, it constantly monitors the messages flowing through the interconnect and changes the cache state if necessary.

Figure 12. Two-Core System With Cache Coherency

Step 3-1 (Core0): When writing the updated account balance of $40 to the cache memory, Core0’s Snoop mechanism sends a cache memory update message to the interconnect.

Step 3-1 (Core1): The Snoop mechanism detects the message from Core0 and checks whether the balance data exists in Core1’s cache memory. The valid bit for that data is set to 0, so Core1’s cache memory holds no usable copy of the balance.

Step 3-2 (Core1): When Core1 reads the account balance, it first checks whether valid data exists in Core1’s cache memory. Since it does not, Core1 reads the up-to-date account balance that Core0 wrote, rather than relying on a stale copy of its own.

Using the Snoop mechanism, the correct account balance data can be shared between Core0 and Core1.
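Steps 3-1 and 3-2 can be sketched as a simplified write-invalidate scheme. The code below is an illustration under stated assumptions rather than a description of any real protocol: the interconnect simply forwards each update message to every other core, a snooping core reacts by clearing its valid bit, and on a miss the up-to-date value is assumed to come from another core’s valid cache entry (falling back to the DRAM). All class and method names are hypothetical.

```python
class Interconnect:
    """Shared interconnect: forwards update messages and serves cache misses."""

    def __init__(self, balance):
        self.dram = balance
        self.cores = []

    def attach(self, core):
        self.cores.append(core)

    def broadcast_update(self, sender):
        for core in self.cores:
            if core is not sender:
                core.snoop()               # every other core sees the message

    def fetch(self, requester):
        for core in self.cores:            # assumption: a valid copy in another
            if core is not requester and core.valid:
                return core.data           # cache is the most recent value
        return self.dram


class SnoopingCore:
    def __init__(self, bus):
        self.bus = bus
        self.valid = False
        self.data = None
        bus.attach(self)

    def read(self):
        if not self.valid:                 # Step 3-2: miss, so fetch the latest value
            self.data = self.bus.fetch(self)
            self.valid = True
        return self.data

    def write(self, value):
        self.data = value
        self.valid = True
        self.bus.broadcast_update(self)    # Step 3-1 (Core0): send the update message

    def snoop(self):
        self.valid = False                 # Step 3-1 (Core1): valid bit set to 0


def demo_coherent():
    bus = Interconnect(100)
    core0, core1 = SnoopingCore(bus), SnoopingCore(bus)
    core0.read()
    core1.read()                           # both cores start with a copy of $100
    core0.write(core0.read() - 60)         # Core0 writes $40 and notifies Core1
    seen_by_core1 = core1.read()           # Core1 misses and fetches $40
    return seen_by_core1, seen_by_core1 - 50 >= 0


print(demo_coherent())  # (40, False)
```

With the invalidation in place, Core1 sees the $40 balance and the $50 transfer is correctly rejected.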

### The Evolution of the Inter-Core Interconnect

The example above showed the versatility of cache memory and how it can be applied to multi-core systems, but in that case there were only two cores. Ampere’s Altra Max, by contrast, contains 128 cores (32 clusters) [12]. This is where cache memory can once again run into issues. As the number of cores increases, the cache memory update messages begin to overwhelm the interconnect, which becomes a bottleneck for the whole system.
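A back-of-the-envelope count shows why the update messages become a problem. The sketch below assumes the simplest possible scheme, in which every write is delivered to every other core with no filtering, and a workload in which each core performs one write. Both are illustrative assumptions, and the function name is hypothetical; this is not a description of how the Altra Max actually handles coherency traffic.

```python
def snoop_deliveries(num_cores, writes_per_core=1):
    """Count update-message deliveries under unfiltered broadcast snooping.

    Each write is delivered to the (num_cores - 1) other cores.
    """
    return num_cores * writes_per_core * (num_cores - 1)


for cores in (2, 80, 128):
    print(cores, snoop_deliveries(cores))
# 2 2
# 80 6320
# 128 16256
```

Under these assumptions the traffic grows roughly with the square of the core count, so going from 2 cores to 128 multiplies the message deliveries by several thousand.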

One solution is to increase the communication capacity of the interconnect. Figure 12 shows bus-type and ring-type interconnects. Ring-type interconnects like the one in Figure 13 have been commonly used since 2005.

Figure 13. High Scalability of CoreLink CMN (Left), Memory Coherent CCIX (Right)

Looking at the interconnect architecture of the Neoverse V1 platform, which uses the CoreLink CMN-600 Coherent Mesh Network architecture [13], we can make a comparison with the old ring-type architecture. As shown in the graph in Figure 13, the mesh network does not reach a peak in bandwidth, while the CCN architecture hits a ceiling at 8 clusters (32 cores). In the early 2000s, it was thought that the upper limit on the number of cores in shared-memory computers would be 32. Nowadays, 128 cores can be integrated into a single chip, which is quite amazing. Both the theory and implementation of semiconductor manufacturing technologies have come a long way since then.
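One intuition for why a mesh scales better than a ring is the distance a message has to travel. The sketch below compares the average hop count between two distinct nodes in an idealised bidirectional ring and an idealised square mesh of the same size. It is a generic topology calculation under those assumptions, with hypothetical function names, and it does not model the CoreLink CMN-600 or CCN designs themselves. Hop count is also only one of several factors that determine real bandwidth.

```python
from itertools import product


def ring_avg_hops(n):
    """Average shortest-path hops between distinct nodes on a bidirectional ring of n nodes."""
    total = sum(min(d, n - d) for d in range(1, n))
    return total / (n - 1)


def mesh_avg_hops(k):
    """Average Manhattan distance between distinct nodes on a k x k mesh."""
    nodes = list(product(range(k), repeat=2))
    total = sum(abs(ax - bx) + abs(ay - by)
                for (ax, ay) in nodes
                for (bx, by) in nodes)
    return total / (len(nodes) * (len(nodes) - 1))


for k in (4, 8):
    n = k * k
    print(n, round(ring_avg_hops(n), 2), round(mesh_avg_hops(k), 2))
# 16 4.27 2.67
# 64 16.25 5.33
```

In this idealised comparison the ring’s average distance grows in proportion to the node count, while the mesh’s grows only with its side length, so each message occupies fewer links as the system gets larger.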

Finally, the Neoverse V1 platform also offers CCIX (Cache Coherent Interconnect for Accelerators), shown on the right of Figure 13 [10]. CCIX is an open connectivity standard that enables memory to be shared between the main memory of a processor (e.g. the Altra) and the memory of a separate accelerator. For example, the memory of an AI accelerator card in a PCI-express slot can be seamlessly shared with the processor’s main memory, which could reduce the amount of data transferred and improve system performance.

### Summary

To finish this article, I want to recap the main points covered today. Thank you for taking the time to read this installment of my web series and I hope you found it interesting.

• Amazon and Ampere, prominent companies in the data center business, have developed “cloud-native processors” for their own data centers and have begun to deploy them as cloud instances.
• Ampere and Amazon have both developed their own microprocessors based on Arm’s Neoverse design in order to cut development time and manpower usage while still delivering a high-performing microprocessor.
• The Neoverse design has a lot of advantages. For example, Ampere’s Altra series provides high performance with power consumption half that of AMD’s EPYC microprocessor.
• The Neoverse platform has an intra-chip inter-core mesh network (CoreLink CMN-600) that maintains cache coherency, enabling memory-sharing processors with 128 cores (up to 256 cores).
• The Neoverse platform provides cache coherent interconnect for accelerators (CCIX). For example, the memory of the AI accelerator card in the PCI-express slot can be seamlessly shared with the processor’s main memory. This is expected to reduce data transfer and improve system performance.

### References

[1] Ampere Altra The World’s First Cloud Native Processor

https://amperecomputing.com/altra/

[2] A server-only MPU integrating 80 Arm cores, from a startup founded by former Intel employees (in Japanese)

https://active.nikkeibp.co.jp/atcl/act/19/00008/032300805/

[3] Ampere Altra Performance Shows It Can Compete With – Or Even Outperform – AMD EPYC & Intel Xeon

https://www.phoronix.com/scan.php?page=article&item=ampere-altra-q80&num=4

[4] Ampere A1 Compute

https://www.oracle.com/cloud/compute/arm/

[5] Arm announces “Neoverse V1”, a high-performance version of its “Neoverse” CPU IP design for data centers, and “Neoverse N2”, which supports Armv9 (in Japanese)

https://cloud.watch.impress.co.jp/docs/news/1321510.html

[6] High-performance CPU options available on Amazon EC2 (in Japanese)

https://d1.awsstatic.com/webinars/jp/pdf/services/20200707_BlackBelt_Graviton2.pdf

[7] Benchmarking Amazon’s Graviton2 Performance With 64 Neoverse N1 Cores Against Intel Xeon, AMD EPYC

https://www.phoronix.com/scan.php?page=article&item=amazon-graviton2-benchmarks&num=4

[8] Amazon’s Arm-based Graviton2 Against AMD and Intel: Comparing Cloud Compute

https://www.anandtech.com/show/15578/cloud-clash-amazon-graviton2-arm-against-intel-and-amd

[9] Taking on Xeon: Arm’s first CPU cores under “Neoverse” for servers and infrastructure (in Japanese)

https://xtech.nikkei.com/atcl/nxt/column/18/00001/01751/?P=2

[10] Boost SoC performance from edge to cloud ARM®CoreLink™ System IP

http://www.armtechforum.com.cn/attached/article/2016ATS_C1_Neil_Parris20161206151154.pdf

[11] John L. Hennessy and David A. Patterson, “Computer Architecture: A Quantitative Approach, Fifth Edition”, page 102, Morgan Kaufmann

[12] Ampere® Altra Max™ 64-Bit Multi-Core Arm® SoC Product Brief

[13] Arm, Corelink Coherent Mesh Network