Hello everyone, I’m Haruyuki Tago, Edge Evangelist at HACARUS Tokyo R&D Center.

In this series of articles, I will share some insights from my decades-long experience in the semiconductor industry and comment on various AI-industry topics from my own perspective.

You can read Vol. 1 of the series by clicking the link.

Today, I would like to talk about NVIDIA Ampere.

In May 2020, NVIDIA unveiled the NVIDIA Ampere architecture and its product family, including the NVIDIA A100 Tensor Core GPU for data centers.

You can read their whitepapers below:

How sparsity powers AI inference (in Japanese)

NVIDIA A100 Tensor Core GPU Architecture – UNPRECEDENTED ACCELERATION AT EVERY SCALE

With the spread of deep learning inference on the edge (terminal) side, compact trained deep networks that fit within the limited memory and computing power of edge devices are required. For the training steps performed in the data center, there is active research and development on making deep networks compact while maintaining accuracy. For example, this material reports “Pruning,” “Quantization,” “Weight Sharing,” and “Huffman Coding.”

Today, we will take a look at a compression technique that utilizes the sparsity of deep learning’s weight matrices. This article touches on compression that exploits the sparseness of the weights, using the layer density of ResNet-18, one of the deep learning networks, as an example.

The left side of Figure 2 shows the original weight distribution, before the layer density is reduced, with red showing positive weights and blue showing negative weights.

The figure in the center shows the distribution when the weights near zero are set to zero (black) and the layer density is reduced to 35%, and the figure on the right shows the distribution when the layer density is further reduced to 15%. You can see that the black area increases toward the right, meaning that many weights have been set to zero.

Figure 3 shows the relationship between layer density and Top-5 accuracy. You can see that reducing the layer density to 15% lowers the Top-5 accuracy by only 0.86 percentage points (from 87.43% to 86.57%). Therefore, we can say that the deep learning weight matrix has sparsity.

| Layer density | 100% | 35% | 15% |
| --- | --- | --- | --- |
| Top-5 accuracy | 87.43% | 87.36% | 86.57% |

Figure 3 Three levels of weight reduction and Top-5 accuracy
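As a rough illustration of how a layer density such as 35% or 15% can be reached, the sketch below zeroes the smallest-magnitude weights of a matrix until the requested fraction remains. The function name `prune_to_density` and the random matrix standing in for a layer's weights are assumptions made for this example; it is not the procedure used to produce Figures 2 and 3.

```python
import numpy as np

def prune_to_density(weights: np.ndarray, density: float) -> np.ndarray:
    """Zero the smallest-magnitude weights so that about `density` of them remain."""
    flat = np.abs(weights).ravel()
    keep = int(round(density * flat.size))  # number of weights to keep
    if keep <= 0:
        return np.zeros_like(weights)
    if keep >= flat.size:
        return weights.copy()
    # Threshold = magnitude of the keep-th largest weight.
    cut = flat.size - keep
    threshold = np.partition(flat, cut)[cut]
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))  # hypothetical stand-in for one layer's weights

for d in (1.0, 0.35, 0.15):
    pruned = prune_to_density(w, d)
    print(f"target density {d:.2f} -> actual {np.count_nonzero(pruned) / pruned.size:.4f}")
```

This is unstructured pruning: the surviving weights can sit anywhere in the matrix, which is what makes them awkward to skip efficiently in hardware.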

Data structures and computational methods that efficiently handle sparse matrices are well known in the software field. Texture data compression standards are also used on GPUs, and that too is software processing.

In my opinion, what is new about NVIDIA Ampere is “a tensor core for matrix calculation that supports sparse networks in hardware.” Let’s look at some examples.

### Weight matrix compression

First of all, I would like to explain the compression process of the weight matrix with reference to the literature and my knowledge.

Figure 4 (A) is the weight matrix to be compressed. Figure 4 (B) is a two-color representation of (A) with element values. When we focus on columns 0 to 3 (blue frame in Figure 4 (C)) in one row, the two elements with the largest absolute values among the four are shown in light blue, and the other two elements, whose absolute values are smaller, are shown in black.

This compression is described as fine-grained structured pruning (2:4 non-zero). The NVIDIA whitepaper, NVIDIA A100 Tensor Core GPU Architecture, also describes it as a scheme that “prunes trained weights with a 2-out-of-4 non-zero pattern.”

There is no further explanation in the whitepaper, but I presume that dedicated hardware keeps the 2 elements with the largest absolute values out of each group of 4 and prunes the other 2. The scale of the dedicated hardware that does this processing would be insignificant compared to the entire GPU.
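The selection rule presumed above (keep the 2 largest-magnitude elements in every group of 4 columns) can be sketched in NumPy as follows. The function name `prune_2_of_4` and the sample row are assumptions for illustration; the sketch only mirrors the rule, not how any hardware or NVIDIA tool actually implements it.

```python
import numpy as np

def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Keep the 2 largest-magnitude elements in each group of 4 columns."""
    rows, cols = weights.shape
    assert cols % 4 == 0, "column count must be a multiple of 4"
    groups = weights.reshape(rows, cols // 4, 4)
    # Positions of the 2 smallest-magnitude elements in every group.
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=-1)
    return pruned.reshape(rows, cols)

# One hypothetical 8-column row: columns 0-3 form one group, columns 4-7 the next.
w = np.array([[0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]])
print(prune_2_of_4(w))
```

In the first group 0.9 and -0.7 survive, and in the second group 0.3 and -0.8 survive, so exactly half of every group becomes zero.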

It seems that columns 0-3 of a row are compressed first, then columns 4-7. The results seem to be written to (D) Non-zero data (in light blue) and (E) Non-zero indices (in purple).

The non-zero indices record, in 2 bits, the positions of the 2 remaining elements within the original 4 elements. Similarly, 2 bits are required for columns 4 to 7, so a total of 4 bits are required per row.

The green frame (F) shows the 4 elements compressed from one row of (A) together with the 4 bits of non-zero indices.

Figure 5 shows the weight matrix image before and after compression. The white elements of the 8×8 matrix on the left are pruned, and the matrix on the right holds the compressed result: the remaining elements (dark green, light green) and the non-zero indices (purple).
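To make the split into non-zero data and non-zero indices concrete, here is a minimal sketch that compresses an already 2:4-pruned matrix and restores it again. The names `compress_2_of_4` and `decompress_2_of_4` are hypothetical, and the indices are kept as plain in-group positions (0 to 3) rather than packed into bits, since the bit-level encoding is not described in the sources.

```python
import numpy as np

def compress_2_of_4(pruned: np.ndarray):
    """Split a 2:4-pruned matrix into non-zero data and in-group positions."""
    rows, cols = pruned.shape
    groups = pruned.reshape(rows, cols // 4, 4)
    # Positions (0..3) of the 2 largest-magnitude elements, restored to column order.
    order = np.argsort(-np.abs(groups), axis=-1, kind="stable")[..., :2]
    idx = np.sort(order, axis=-1)
    values = np.take_along_axis(groups, idx, axis=-1)
    return values.reshape(rows, cols // 2), idx.reshape(rows, cols // 2)

def decompress_2_of_4(values: np.ndarray, idx: np.ndarray) -> np.ndarray:
    """Rebuild the pruned matrix from non-zero data and positions."""
    rows, half = values.shape
    groups = np.zeros((rows, half // 2, 4), dtype=values.dtype)
    np.put_along_axis(groups, idx.reshape(rows, half // 2, 2),
                      values.reshape(rows, half // 2, 2), axis=-1)
    return groups.reshape(rows, half * 2)

# A hypothetical row that already follows the 2-out-of-4 pattern.
pruned = np.array([[0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.8, 0.0]])
values, idx = compress_2_of_4(pruned)
print(values)  # non-zero data, half as many columns as the input
print(idx)     # position of each kept element inside its group of 4
restored = decompress_2_of_4(values, idx)
```

The round trip through `decompress_2_of_4` returns the pruned matrix unchanged, which shows that the data and the indices together carry everything needed.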

Figure 4 Sparse Tensor Core of NVIDIA Ampere [3] with author’s comments in red and blue.

Figure 5 Compressed Matrix format [3]

Figure 6 shows the compression ratio of the weight matrix. The weight matrix is compressed to about half its original size.

| Bit width of weight (precision) | Before compression (8 columns per row), bits | After compression (8 columns per row), bits | Compression rate |
| --- | --- | --- | --- |
| 8 bit | 8 × 8 | 4 × 8 + 4 | 56.25% |
| 16 bit | 8 × 16 | 4 × 16 + 4 | 53.125% |

Figure 6 Weight matrix compression ratio
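The arithmetic behind Figure 6 can be checked with a few lines of Python. The function `compression_rate` is a hypothetical helper, and the default of 4 index bits per 8-column row is taken from the description above; it is left as a parameter so that other index layouts can be tried.

```python
def compression_rate(weight_bits: int, columns: int = 8, index_bits: int = 4) -> float:
    """Size after 2:4 compression divided by size before, for one row."""
    before = columns * weight_bits                    # all weights stored
    after = (columns // 2) * weight_bits + index_bits # half the weights plus indices
    return after / before

for bits in (8, 16):
    print(f"{bits}-bit weights: {compression_rate(bits):.3%}")
```

The index bits are a fixed cost per row, so their share shrinks as the weight precision grows, which is why the 16-bit case is closer to 50% than the 8-bit case.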

### Calculation with the compressed weight matrix and activation values

Figure 4 (H) shows the Sparse Tensor Core, which is a block that performs a matrix dot product operation between a compressed weight matrix and an input activation matrix.

First, let’s bring the 8 elements of one column of the input activation matrix (M) to (I). The non-zero indices of the compressed weight matrix (G), shown in the green frame, are decoded, and then the MUX (J) selects, from the eight input activation values (I), the four values (K) that correspond to the positions of the remaining weights.

The multiply-accumulate operation is repeated four times, stepping through the elements of (K) and (G), to obtain the result for one column of input activation values.

This process is then repeated for each column of the input activation matrix to obtain the Output Activations (N). Since (N) is an activation matrix of the same size as the input (M), it can proceed to the next process as it is. Retraining is performed after weight compression to recover the accuracy that was lost through pruning.
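The data flow described for Figure 4 can be imitated in software as a sketch: the indices pick the matching activation values (the role of the MUX), and only the kept weights are multiplied and accumulated. The function name `sparse_matmul`, the array layout, and the sample numbers are assumptions for illustration, not a description of the actual circuit.

```python
import numpy as np

def sparse_matmul(values: np.ndarray, idx: np.ndarray, activations: np.ndarray) -> np.ndarray:
    """values, idx: (rows, K/2) compressed weights; activations: (K, N)."""
    rows, half = values.shape
    k, n = activations.shape
    assert k == 2 * half, "activations must have twice as many rows as kept weights"
    # Convert in-group positions (0..3) to absolute positions along K.
    group_base = (np.arange(half) // 2) * 4
    cols = idx + group_base
    out = np.zeros((rows, n), dtype=np.result_type(values, activations))
    for r in range(rows):
        selected = activations[cols[r]]  # MUX role: pick K/2 of the K activation rows
        out[r] = values[r] @ selected    # K/2 multiply-accumulates per output element
    return out

# Hypothetical compressed row (non-zero data and in-group positions).
values = np.array([[0.9, -0.7, 0.3, -0.8]])
idx = np.array([[0, 3, 1, 2]])
# The same row written out densely, used only for comparison.
dense_weights = np.array([[0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.8, 0.0]])
activations = np.arange(16, dtype=float).reshape(8, 2)

sparse_out = sparse_matmul(values, idx, activations)
dense_out = dense_weights @ activations
print(np.allclose(sparse_out, dense_out))
```

Both paths give the same output, but the sparse path touches only four of the eight activation values per output element.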

### The advantages of hardware compression

I think there are two advantages to hardware compression.

One is that the number of multiply-accumulate (product-sum) operations is halved compared to before the weight matrix compression, because the operations for weights that have been pruned to zero can be skipped. This way, the product-sum calculation performance can be doubled.

The other advantage is that the amount of data transferred can be reduced, because the weight matrix shrinks to about half its size (Figure 6). The weight matrix is usually located in external DRAM rather than inside the Ampere chip. By using the compressed weight matrix ((D) and (E) in Figure 4), the footprint in DRAM is reduced almost by half. The transfer bandwidth needed between the external DRAM and the Ampere chip can also be almost halved.
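The halving can be seen from a simple operation count. The helper `mac_count` and the matrix sizes below are hypothetical, chosen only to show the ratio for a weight matrix of shape (m, k) multiplied by an activation matrix of shape (k, n).

```python
def mac_count(m: int, k: int, n: int, kept_per_group: int = 4, group: int = 4) -> int:
    """Multiply-accumulate operations when `kept_per_group` of every `group` weights are used."""
    assert k % group == 0, "k must be a multiple of the group size"
    return m * n * (k // group) * kept_per_group

dense_macs = mac_count(64, 128, 32)                     # every weight is multiplied
sparse_macs = mac_count(64, 128, 32, kept_per_group=2)  # 2-out-of-4 pattern
print(dense_macs, sparse_macs, dense_macs / sparse_macs)
```

Because the 2-out-of-4 pattern is fixed, the factor of two holds for every row and every group, which is what lets hardware schedule the reduced work without load imbalance.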

In each group of four, the Sparse Tensor Core approach keeps the two weight matrix elements with the largest absolute values and sets the remaining two to zero, regardless of how large those discarded weights are. It may sound a little heavy-handed as an approach.

However, by making good use of the sparseness of the weight matrix (Figures 2 and 3), the product-sum calculation performance can be doubled. This is a huge advantage that cannot be neglected.

### Before we go

In deep learning, which is currently the mainstream of AI methods, it is important to reduce the size of deep learning networks because inference is executed by edge devices.

The NVIDIA Ampere architecture for data centers introduced the Sparse Tensor Core. This is a mechanism that uses the sparsity of deep learning weight matrices: by adding relatively simple hardware, it compresses the weight matrix to about half its size and doubles the product-sum calculation performance. It is safe to conclude that the use of sparsity expands the application field of the technology.

**Reference**

- NVIDIA A100 Tensor Core GPU
  https://www.nvidia.com/ja-jp/data-center/a100/
- How sparsity powers up AI inference
  https://blogs.nvidia.co.jp/2020/05/27/sparsity-ai-inference/
- NVIDIA A100 Tensor Core GPU Architecture
  https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf
- S. Han, et al., “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”, ICLR 2016
  https://arxiv.org/pdf/1510.00149.pdf
- X. Liu, et al., “Efficient Sparse-Winograd Convolutional Neural Networks”, ICLR 2018
  https://arxiv.org/abs/1802.06367
- https://docs.scipy.org/doc/scipy/reference/sparse.html
- Hiroshige Goto’s Weekly Overseas News – Features of NVIDIA Ampere’s Pruning Support
  https://pc.watch.impress.co.jp/docs/column/kaigai/1265707.html
- Hiroshige Goto’s Weekly Overseas News – Model compression technology is the key to 3rd generation deep learning processors
  https://pc.watch.impress.co.jp/docs/column/kaigai/1265701.html