Hello everyone, I’m Haruyuki Tago, Edge Evangelist at HACARUS Tokyo R&D Center.
In this series of articles, I will share insights from my decades-long experience in the semiconductor industry and comment on various AI-industry topics from my unique perspective.
Today, I would like to talk about the M1 processor by Apple, which is used in Apple’s various products – including MacBook Pro, MacBook Air, and Mac mini announced in November 2020.
Many professionals have already benchmarked it, and most are surprised by its low power consumption and high performance. How did Apple manage to realize such high performance with low energy consumption? In this article, we will try to solve the mystery by analyzing the contents and performance of Apple M1 processor.
M1 Processor Performance
In conventional Macs with Intel CPUs, the CPU, memory, Apple T2, Thunderbolt controller, and I/O chip were separate components. In the M1, all of these are integrated. By using SiP technology to mount the memory and the CPU on a single package board, memory bandwidth has been improved and memory latency reduced.
The chip is manufactured on a 5nm process and contains 16 billion transistors. Figure 1 compares the GeekBench 5 Single-Thread Score, one of the common CPU benchmarks, with conventional Macs. The score of the new MacBook Air (2020) is 1687, about 36% higher than the 1239 of the previous-generation MacBook Pro.
This is the third time Apple has made a major CPU change. Before 1994, Apple used the Motorola MC68000 series; in 1994 it switched to PowerPC, in 2006 to Intel CPUs, and in 2020 it decided to use its own chip, the M1.
Now, let’s take a look at the components of the M1. It is covered with a metal lid (left side of Figure 2); underneath is a package board carrying the M1 SoC (left), two DRAM packages (right), and capacitors (top and bottom) (right side of Figure 2). This packaging method is called SiP (System in Package).
The M1 SoC shown on the right of Figure 2, as announced by Apple, does not actually show the real silicon. If you disassemble a Mac and remove the metal lid, you see the back of the SoC, because the silicon surface faces the package substrate. As is often the case with semiconductor announcements, the right side of Figure 2 is very likely an image edited for explanation.
The M1 SoC has a built-in memory controller, and two main memory DRAMs are directly connected to the M1 SoC via a package board. It is estimated that the main memory access time is considerably shorter than before, contributing to a high CPU benchmark score.
Regarding graphics, let’s look at the memory of the old Mac using Intel CPU (left side of Figure 3) and the memory of the new Mac using M1 (right side of Figure 3).
GPU memory and CPU memory used to be separate, so data had to be copied between the two when drawing on the screen. In the M1’s memory configuration (on the right), which Apple calls UMA (Unified Memory Architecture), the CPU and GPU share the same memory. Proper use of this shared memory eliminates the need for data copying, which must have contributed to the improved graphics performance.
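The difference can be illustrated with a toy sketch (this is only an analogy, not Apple's implementation): with split memories, every frame must be copied to the GPU's buffer before drawing; with unified memory, the CPU and GPU address the same underlying storage, so the copy step disappears.

```python
# Toy illustration of split vs. unified CPU/GPU memory.
cpu_mem = bytearray(b"frame data")   # CPU-side framebuffer

# Split memory: the GPU works on its own copy, so every update
# requires an explicit transfer before the GPU can draw it.
gpu_mem = bytes(cpu_mem)             # copy made per frame

# Unified memory: both sides view the same physical buffer,
# so a CPU-side write is immediately visible with no copy.
shared = memoryview(cpu_mem)         # no copy, same storage
cpu_mem[0:5] = b"FRAME"              # CPU updates the frame

print(bytes(shared[0:5]))            # b'FRAME'  (GPU view sees it)
print(gpu_mem[0:5])                  # b'frame'  (stale copy)
```

The copied `gpu_mem` buffer is stale after the update, while the unified view is not — that per-frame copy is exactly the cost UMA removes.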
Improvement of A13, A14, M1 SoC chips
Figure 4 shows photos of three recent Apple chips. The A13 is used in the iPhone 11 and the A14 in the iPhone 12. Let’s look at the layout of the main blocks on each chip.
They share the same common components: (1) the GPU at the top of the chip, (2) the system-level cache memory below it, (3) the high-performance CPU cores (Firestorm, 4 cores) at the bottom left, (4) the high-efficiency (low-power-consumption) CPU cores (Icestorm, 4 cores) on the right side, and (5) the Neural Engine at the bottom left of the chip.
For example, the M1’s GPU has twice as many cores as the A14’s, so its GPU area is larger. However, the relative positions of the blocks have not changed, and the size of the chip and the operating clock frequency are greatly influenced by the arrangement of these main blocks.
Continuing to use the previous generation’s chip layout has an advantage: the chip-design know-how can be inherited. The M1 is Apple’s first Apple Silicon for the Mac, but in terms of chip design, I think it is an evolution of the A13 and A14 used in smartphones.
The manufacturing technologies used are TSMC 7nm (N7P) for the A13 and TSMC 5nm (N5) for the A14 and M1. The first 5nm chip, the A14, has a chip area of 88 mm², while that of the subsequent M1 is considerably larger at 119 mm².
Generally speaking, a new semiconductor manufacturing process is immature in its early stages, so moving straight to mass production of large-area chips is risky. Keeping the A14’s chip area small may have been a countermeasure to reduce that risk at the start of the new process.
Since the M1 is for the Mac, its CPU and GPU were strengthened and its chip area grew beyond the A14’s; making it the second chip on the 5nm process may likewise have been a strategy to reduce mass-production risk.
Figure 5 shows the GeekBench 5 Single-Thread Score, which measures the performance of a single CPU core. The benchmark scores have improved steadily across the A13, A14, and M1.
Difference between variable-length instruction CPUs and fixed-length instruction CPUs
– using curry cooking as an example –
It is reported that the M1-equipped Mac has a CPU about 36% faster than the conventional Intel-equipped Mac (Figure 1). Many factors affect CPU performance, and they are rather complicated to explain.
To help you understand, I will give a simplified explanation focusing on one of those factors, the instruction set. We will examine the difference between the Intel x86 instruction set and the ARMv8 instruction set (used by the M1 CPU), using a curry cooking procedure as an example.
Steps of cooking curry
Figure 6 shows the steps we need to take to make curry. For simplicity, we assume every step takes the same amount of cooking time.
Intel’s x86 instruction set has many variations, with instruction lengths ranging from a minimum of 1 byte to a maximum of 15 bytes. This is called a variable-length instruction set.
In the ARMv8 instruction set, on the other hand, all instructions are fixed at 4 bytes (32 bits) in length. This is called a fixed-length instruction set.
In short, Intel CPUs are of the variable-length instruction type, whereas ARMv8 CPUs are of the fixed-length instruction type.
In-memory instruction placement and execution
The information required to execute “Step 1: Go to the butcher and buy meat” would be (A) the action “go shopping”, (B) the address of the butcher, (C) the type of meat, and (D) the amount of meat. For the variable-length instruction CPU, let’s set this instruction’s length at 3 bytes.
By comparison, “Step 5: Stir-fry meat” requires less information, so a 1-byte instruction should be enough. We assign 1- to 3-byte instructions to all the other cooking steps, depending on the actions and information each requires.
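To make the idea concrete, here is one hypothetical 3-byte encoding of the Step 1 instruction. The opcode and field values are entirely made up for illustration; real x86 encodings look nothing like this.

```python
# Hypothetical 3-byte encoding of "Step 1: buy meat at the butcher".
# Every constant below is an assumption for illustration only.
OP_SHOP = 0x1                       # (A) action: go shopping
ADDR_BUTCHER = 0x2                  # (B) shop address code
MEAT_BEEF = 0x1                     # (C) meat type code
AMOUNT_300G = 0x3                   # (D) amount code

# Pack the four fields into 3 bytes: opcode, address,
# then type and amount as two 4-bit halves of the last byte.
instr = bytes([OP_SHOP, ADDR_BUTCHER, (MEAT_BEEF << 4) | AMOUNT_300G])

print(len(instr), instr.hex())      # 3 010213
```

A simpler step like “stir-fry meat” would need only the 1-byte opcode, which is why the variable-length program in Figure 7 mixes 1- to 3-byte instructions.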
The variable-length instruction CPU program virtually created in this way is on the left side of Figure 7. On the other hand, the right side of Figure 7 shows a fixed-length instruction CPU program. The first byte of each instruction is hatched.
Increased instruction code size and PC memory size
Comparing the instruction code sizes of the virtual programs in Figure 7, the variable-length instruction CPU needs 17 bytes while the fixed-length instruction CPU needs 40 bytes. The former is compact at 42.5% (17 bytes / 40 bytes) of the latter, so its memory utilization efficiency is higher.
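The arithmetic can be checked with a few lines of Python. The per-instruction byte lengths below are assumptions chosen to match the 10-step program and the 17-byte total of Figure 7, not values taken from a real instruction set.

```python
# Assumed byte lengths for the 10 cooking-step instructions
# (1 to 3 bytes each, chosen to sum to 17 as in Figure 7).
var_lengths = [3, 1, 2, 2, 1, 2, 2, 1, 2, 1]

variable_size = sum(var_lengths)        # variable-length program
fixed_size = len(var_lengths) * 4       # fixed-length: 4 bytes each

print(variable_size)                    # 17
print(fixed_size)                       # 40
print(variable_size / fixed_size)       # 0.425
```

The 42.5% ratio is exactly the memory-efficiency advantage the variable-length encoding buys.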
The maximum memory of the first IBM Personal Computer (released in 1981), the ancestor of today’s personal computers, was 256kB (64kB on the main board plus three expansion cards).
Considering that a typical personal computer in 2020 carries 4GB of memory, the installed amount has grown 65,536 times (2^16 times) relative to that 64kB base configuration.
In the 1980s, memory was much more valuable than it is today, and variable-length instruction CPUs with good memory utilization were a good match.
Difference between instruction code size and instruction decoding
Let’s compare instruction decoding using Figure 7. The variable-length instruction CPU reads the first byte of the first instruction, at address 0.
At that point, the CPU does not yet know the length of the instruction, and therefore does not know the start address of the next instruction. Only after decoding the instruction does it learn the length, and hence where the next instruction starts.
Since the first bytes of the instructions (the hatched areas) fall at irregular positions, the instructions must be decoded sequentially.
In a 4-byte fixed-length instruction CPU, on the other hand, the start address of each instruction is predetermined: every 4 bytes starting from address 0x0. The hatching is regular compared with the variable-length instruction code. If the circuit scale allows, the current instruction and the following instructions can all be decoded at the same time.
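The contrast can be sketched in code. With variable-length instructions, each start address is known only after the previous instruction has been decoded, so the scan is inherently serial; with fixed-length instructions, every start address is simply `i * 4` and can be computed up front. The byte lengths below are the same illustrative assumptions as before (summing to 17 bytes).

```python
# Assumed lengths of the 10 variable-length instructions (17 bytes).
var_lengths = [3, 1, 2, 2, 1, 2, 2, 1, 2, 1]

def variable_starts(lengths):
    """Sequential scan: each start depends on decoding the previous."""
    starts, addr = [], 0
    for n in lengths:
        starts.append(addr)
        addr += n            # length known only after decoding
    return starts

def fixed_starts(count, width=4):
    """All starts are known immediately, enabling parallel decode."""
    return [i * width for i in range(count)]

print(variable_starts(var_lengths))
# [0, 3, 4, 6, 8, 9, 11, 13, 14, 16]
print(fixed_starts(10))
# [0, 4, 8, 12, 16, 20, 24, 28, 32, 36]
```

A wide decoder exploits exactly this: since `fixed_starts` needs no decoding at all, eight decoders can each be pointed at a known address in the same cycle.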
With the progress of semiconductor technology, as PC memory capacities grew and more complicated circuits could be mounted on CPU chips, the compactness of variable-length instruction programs became less of an advantage.
Conversely, the simplicity of the fixed-length instruction CPU, where the start address of every instruction is trivially determined, turned into a merit. The high-performance CPU core (Firestorm) used in the A14 and M1 is reported to decode up to eight instructions at the same time.
In fact, since the Pentium Pro released in November 1995, Intel x86 CPUs have also converted x86 instructions (corresponding to the left side of Figure 7) into RISC-like fixed-length operations called μOPs inside the chip; the conversion mechanism splits a single x86 instruction into multiple μOPs where necessary.
Converting to fixed-length instructions inside the chip improved performance while maintaining machine-language-level compatibility with x86 instructions. However, the restriction that only a few registers can be used by the original x86 instructions remained, and the instruction conversion mechanism brings overhead, such as a larger chip area and additional pipeline stages.
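A schematic sketch of the splitting idea (the mnemonics and the `tmp` register are invented for illustration; this is not Intel's actual μOP format): a CISC-style instruction that both loads from memory and computes gets broken into a load μOP followed by a register-only compute μOP.

```python
# Schematic CISC-to-micro-op splitting. Instructions are modeled as
# (opcode, destination, source) tuples; "[rbx]" marks a memory operand.
def to_uops(instr):
    op, dst, src = instr
    if src.startswith("["):
        # Memory operand: split into an explicit load, then a
        # register-only operation on a temporary register.
        return [("load", "tmp", src), (op, dst, "tmp")]
    return [instr]          # register-only instructions pass through

print(to_uops(("add", "eax", "[rbx]")))
# [('load', 'tmp', '[rbx]'), ('add', 'eax', 'tmp')]
print(to_uops(("add", "eax", "ecx")))
# [('add', 'eax', 'ecx')]
```

After this step the back end sees only simple, uniform operations, which is what lets an x86 core schedule work much like a RISC core does — at the cost of the conversion hardware itself.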
How to make curry faster?
Now, let’s assume you have a request to shorten the curry cooking time. Figure 8 shows the change in cooking time when the number of chefs is increased from one to two.
If there is one chef (left figure), the cooking time is 10. If you increase the number of chefs to two, decode multiple instructions at the same time, and cook in parallel, the cooking time can be reduced to 6 (40% less), as shown on the right side of Figure 8.
Here, it is assumed that there are two or more pots and two or more stoves. The number of cooks, the number of pots, and the number of stoves correspond to the amount of computing resources in an actual CPU.
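The cooking schedule above can be simulated with a toy greedy scheduler. The step names and dependencies below are assumptions invented to reproduce the 10-vs-6 result of Figure 8 (two independent 4-step preparation chains, then two joint finishing steps), not the exact steps of Figure 6.

```python
# Toy scheduler: 10 unit-time steps with assumed dependencies.
deps = {
    "buy meat": [],                 "buy vegetables": [],
    "cut meat": ["buy meat"],       "wash vegetables": ["buy vegetables"],
    "stir-fry meat": ["cut meat"],  "cut vegetables": ["wash vegetables"],
    "season meat": ["stir-fry meat"],
    "stir-fry vegetables": ["cut vegetables"],
    "simmer with roux": ["season meat", "stir-fry vegetables"],
    "serve": ["simmer with roux"],
}

def cooking_time(deps, n_cooks):
    """Each tick, every free cook completes one ready step."""
    done, time = set(), 0
    while len(done) < len(deps):
        ready = [t for t in deps if t not in done
                 and all(d in done for d in deps[t])]
        done.update(ready[:n_cooks])   # at most n_cooks steps per tick
        time += 1
    return time

print(cooking_time(deps, 1))   # 10
print(cooking_time(deps, 2))   # 6
```

The second cook only helps because independent steps exist to hand out — which is precisely the role of finding independent instructions in a superscalar CPU.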
To use the computing resources effectively, it is important to decode multiple instructions at the same time and find ones that can be executed in parallel. A fixed-length instruction set that is easy to decode is advantageous here, and the ARMv8 instruction set used in the M1 meets this condition.
Improved AI performance
Apple has announced improvements in AI performance on the new MacBooks. ML Compute is a machine learning framework that a Mac-optimized version of TensorFlow builds on.
The new tensorflow_macos, which supports TensorFlow 2.4, makes full use of the GPU as well as the CPU on both M1 and Intel Macs; the previous version used only the CPU.
Figure 9 shows the training time per batch (seconds/batch) for five deep learning programs using TensorFlow. Black indicates TensorFlow 2.3 on a MacBook Pro (Intel CPU), orange indicates TensorFlow 2.4 on a MacBook Pro (Intel CPU), and red indicates TensorFlow 2.4 on a MacBook Pro (M1). Training time is reported to have been reduced by up to a factor of 7 on the M1 MacBook Pro.
Today, we attempted to solve the mystery of a high-performance CPU in the M1 processor used in the new MacBook Pro, MacBook Air, and Mac mini released by Apple in November 2020. It is reported that the CPU performance of the new Mac has improved by about 36% compared to the Mac using the conventional Intel CPU.
I think there are two main reasons why the new Mac has high CPU performance; (1) The Unified Memory configuration was adopted to reduce the memory access time from the CPU, and (2) the change from Intel x86 instructions to ARMv8 instructions facilitated the decoding of multiple instructions and increased the parallel execution of instructions.
I explained the point (2) using an example of curry cooking. Apple reports that the enhancements to ML Compute have made training times for deep learning programs up to 7 times faster than the Intel-powered 13-inch MacBook Pro.
Another likely reason Apple switched to its own Apple Silicon is lower CPU procurement cost. Before the new Mac, Apple had to buy Intel CPUs at prices set by Intel. With the M1, Apple designs the CPU in-house and outsources chip manufacturing to TSMC.
Apple is said to be TSMC’s largest customer and to have strong bargaining power over pricing. In other words, Apple is in a position to reduce its CPU procurement costs. The new Mac has a good reputation for cost performance, which may be partly thanks to this effort to lower procurement costs.
References
- Three Macs with “M1” Chips Now Available! Summary of Apple event announcements
- Benchmark of Apple’s original SoC “M1” chip is released, what is its ability?
- Why is the “M1” chip developed by Apple so high-performance?
- Apple M1
- Apple M1 Chip – Apple (Japan)
- The Apple A13 SoC: Lightning & Thunder
- Apple A14 Die Annotation and Analysis
- Apple Announces The Apple Silicon M1
- Qualcomm Discloses Snapdragon 888 Benchmarks: Promising Performance
- The 2020 Mac Mini Unleashed: Putting Apple Silicon M1 To The Test
- IBM Personal Computer
- Apple Announces The Apple Silicon M1: Ditching x86 – What to Expect, Based on A14
- Leveraging ML Compute for Accelerated Training on Mac