Hello everyone, I’m Haruyuki Tago, Edge Evangelist at HACARUS Tokyo R&D center.
In this series of articles, I will share some insights from my decades of experience in the semiconductor industry and I will comment on various AI industry-related topics from my unique perspective.
In the previous volume, I introduced Xilinx’s new ACAP, Versal. The article focused on the systolic array architecture that provides fast computing with low power consumption. I also covered some of the Versal platform’s system and hardware specifications. If you haven’t read it yet, you can find it here: https://hacarus.com/ai-lab/20210514-versal/
In the 10th article of this web series, I will continue talking about the Versal platform and its hardware. Xilinx has been focused on overcoming several of the issues commonly found in conventional chip designs. To overcome these challenges, the engineers at Xilinx have had to think outside of the box and break away from conventional architecture. The Versal’s unique structure is redefining the way designers are approaching wire management and information transmission.
Versal’s Position in the Industry
To begin, let’s revisit the background of the Versal product line that was discussed in the previous article.
The concept of an Adaptive Compute Acceleration Platform isn’t new. It was first introduced back in 2018 and the first ACAP product was launched in June of 2019. Xilinx has stated that their Versal product is an intelligent and adaptive platform that has endless potential for applications in a variety of different fields.
This platform is commonly used in data centers to assist with both wired networks and wireless 5G tasks. One area where the ACAP really shines is assisted driving computing, where it performs around 20 times faster than current FPGA-based solutions. The gains are even more impressive against CPUs, where an ACAP can be up to 100 times faster. The ACAP platform also sits above the Zynq UltraScale+ RFSoC, the most cutting-edge SoC technology.
Looking ahead to the next generation of FPGAs, let’s make a comparison to the Zynq UltraScale+ MPSoC. The Zynq UltraScale+ is fabricated using TSMC’s 16nm FinFET process, while the Versal uses a 7nm FinFET process. Although miniaturization has its advantages, scaling down the components has caused several issues, including congested interconnects and an increase in routing delays.
In order to eliminate these problems, Xilinx has adopted a new approach for its chip design. Below, we will discuss this new approach and how it deals with the increase in wiring delay from miniaturization. We will also cover how Xilinx is introducing Network on Chip (NoC) subsystems into their product line.
Increases in Wiring Delay from Scaling
With the current market trends, one might think that smaller components are always better. This way of thinking also applies to semiconductors, where many think that a 7nm semiconductor is always better than the 16nm variant. In reality, this is not true and I will explain why.
The issues that surface during the manufacturing process are related to the normalized resource delay for different-sized semiconductors. As shown in Figure 1, this delay is broken down into two components: transistor delay and metal delay. These delays are displayed for a range of different process sizes, where the Zynq UltraScale+ MPSoC uses the 16nm process and the Versal uses the 7nm process.
Looking at the results shown in Figure 1, we can see three trends. The first is that as the semiconductor size decreases, so does the transistor delay. The second is that as the size decreases, the metal delay actually increases. Finally, when these two components are combined, the total resource delay follows a roughly U-shaped curve: the 16nm process actually has the lowest resource delay, while the 7nm process has the greatest. If we only look at this graph, it would be easy to assume that the 7nm process is inferior to the 16nm process.
Although smaller semiconductors have a greater resource delay, they have many strengths that can offset this inefficiency. One advantage of the 7nm process is its cost-effectiveness. By downscaling the semiconductors, TSMC has stated that the overall chip size can be reduced to around 33~43% of its original size.
This size reduction is significant because it allows TSMC to increase the number of chips that can be manufactured from a single silicon wafer by 2.2~3 times. Although the cost of processing each wafer increases due to the complexity of the manufacturing process, the cost per chip decreases. Along with its cost-effectiveness, this method also increases the manufacturing capacity for computer chips.
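As a rough check on these figures, the number of chips per wafer scales, to first order, inversely with chip area. The Python sketch below treats 33~43% as the fraction of the original area that remains (an assumption about how to read the figure) and ignores wafer-edge losses and yield. The function name is hypothetical.

```python
def chips_per_wafer_multiplier(area_fraction: float) -> float:
    """First-order estimate: chips per wafer scale as 1 / chip area.

    area_fraction is the new chip area divided by the old chip area.
    Edge losses and yield are ignored (simplifying assumption).
    """
    if not 0.0 < area_fraction <= 1.0:
        raise ValueError("area_fraction must be in (0, 1]")
    return 1.0 / area_fraction


for fraction in (0.33, 0.43):
    multiplier = chips_per_wafer_multiplier(fraction)
    print(f"area x{fraction:.2f} -> about {multiplier:.1f}x chips per wafer")
```

Under these assumptions the multiplier comes out at roughly 2.3~3 times, which is close to the 2.2~3 times quoted above.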
These benefits are exceptional, but they are not new. They have been known for several decades. As far back as the 1970s, companies realized that they could increase their production, reduce costs, and improve chip performance by miniaturizing their chips.
Until around 2010, this approach was effective and miniaturization had minimal drawbacks. Looking back at Figure 1, it seems that the resource delay will continue to increase as semiconductor size decreases. Unlike in the 1970s, today's performance gains are being overtaken by the downsides of miniaturization, such as resource delay. Many researchers think that miniaturization has reached its limit. However, many others are still working to improve the process today.
Circuit Resistance Model
Now that we have covered the negative effects of resource delay, let’s discuss why this happens within the actual circuit. As a signal propagates through the circuit, it is affected by the transistor and metal delays described above. The magnitude of these delays depends on the configuration of the circuit, as shown below.
Figure 2 (left) shows a simple circuit with two inverters (Inv1 and Inv2) which act as simple logic gates. The voltage waveforms are represented using n1, n2, and n3. In between the inverters, the waveform is affected by the metal delay, RC, where R stands for resistance and C stands for capacitance. For a short-wire circuit, the transistor and RC delays are illustrated using the blue and green arrows respectively (Figure 2, right). Although larger, the RC delay is acceptable when compared to the transistor delay.
While short-wire circuits appear acceptable, a major issue arises when the circuit uses a long wire. As shown in Figure 2 (bottom), both the resistance and the capacitance are directly proportional to the wire length, so the RC delay, which is their product, is proportional to the square of the wire length. This means that if the wire is seven times longer than in the short-wire circuit, the RC delay is 49 times higher! Increasing the wire length can therefore lead to serious problems.
This problem is also illustrated in Figure 2 (right), where the waveform for the long-wire model is shown in red. Similar to the short-wire model, a wire is connected between Inv1 (n1) and Inv2 (n4). Unlike the previous case, however, the waveform rises far more gradually, leading to a drastic increase in RC delay.
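The quadratic growth of the wire delay can be sketched in a few lines of Python. This is a minimal lumped model, assuming the delay is simply the product R x C with both terms linear in wire length. The per-unit-length values are arbitrary placeholders, and a distributed wire model would change the constant factor but not the square-law scaling.

```python
def wire_rc_delay(length: float,
                  r_per_unit: float = 1.0,
                  c_per_unit: float = 1.0) -> float:
    """Lumped RC product of a wire (arbitrary units).

    Resistance and capacitance each grow linearly with length,
    so their product grows with the square of the length.
    """
    resistance = r_per_unit * length
    capacitance = c_per_unit * length
    return resistance * capacitance


short_wire = wire_rc_delay(1.0)
long_wire = wire_rc_delay(7.0)   # seven times longer
print(long_wire / short_wire)    # prints 49.0
```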
Wire Architecture & Material Properties
Although the issue of RC delay is well known, it is difficult to find a solution with the current chip architecture. Figure 3 shows a cross-section of the wiring stack used in TSMC’s 7nm SRAM chips. Starting from the bottom, the cross-section is split into layers from M0 to M12. Below M0 is a component labeled ‘fin’, which is the transistor. The black figures between M10~M12 are the metal interconnects.
Moving up from the transistor, the layers from M0~M4 are designated for short-range wiring. Here, there is minimal concern for RC delay so the number of wires is high and each wire is as thin as possible.
Next, the layers M5~M10 are laid out for mid-range wiring. Here there are considerably fewer wires than in the short-range section. However, this is where wire length starts to become an issue. To offset some of the wire resistance that causes the delay, the wires are made wider.
The last section of the chip, M11~M12, is reserved for long-range wires. Here, RC delay is significant, so these wires are made as wide as possible. Typically, these layers don’t require a large number of wires, so this is achievable. The top layer (large black section) is a thick metal layer that is used for power supply interconnects.
Another element that affects the operation of the chip is the material properties of the wires. Back in 1997, IBM pioneered the use of copper interconnects in semiconductor chips, and the industry quickly adopted them. Over the years, several issues have been discovered when using copper. One downside is that copper atoms diffuse into the semiconductor. It was later found that encasing the copper in a barrier metal, such as cobalt, was an effective solution.
Figure 4 shows two cross-sections for a copper wire wrapped in cobalt. As we can see, when the manufacturing process changes from 16nm to 7nm, the dimensions of the wire drastically change.
Using a simple rectangular model for the two wire cross-sections, the resistance ratio was calculated for the 16nm and 7nm processes. According to this calculation, the 7nm wire has about 3.3 times the resistance of the 16nm wire.
There appear to be two main factors that lead to this result. First, the bottom wire in Figure 4 has a width requirement of 18nm, half that of the upper wire. Since the copper has to be surrounded by a uniform coating of cobalt, the cross-sectional area of the copper itself is greatly reduced. A thinner wire, as mentioned above, has a higher resistance. This naturally produces a higher value for the resistance ratio.
The second factor is the difference in resistivity between copper and cobalt. The resistivity of cobalt is about 3.5 times that of copper. This means that current passing through the wire flows mostly through the thin copper core, and the cobalt coating contributes little to conduction. These two factors combine to give us the result shown in Figure 4.
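The two factors above can be combined in a small Python sketch of a rectangular wire model. It assumes a copper core surrounded on all four sides by a uniform cobalt barrier, with the core and barrier conducting in parallel. Only the 18nm width, the 2:1 width ratio, and the 3.5x resistivity ratio come from the text. The wire heights and barrier thickness are hypothetical placeholders, so the output is not meant to reproduce the 3.3 figure.

```python
def resistance_per_length(width_nm: float, height_nm: float,
                          barrier_nm: float,
                          rho_core: float = 1.0,
                          rho_barrier: float = 3.5) -> float:
    """Relative resistance per unit length of a barrier-wrapped wire.

    Resistivities are relative to copper (copper = 1.0, cobalt = 3.5).
    The barrier is assumed to coat all four sides uniformly.
    """
    core_w = width_nm - 2 * barrier_nm
    core_h = height_nm - 2 * barrier_nm
    if core_w <= 0 or core_h <= 0:
        raise ValueError("barrier leaves no room for the copper core")
    core_area = core_w * core_h
    barrier_area = width_nm * height_nm - core_area
    # Core and barrier conduct in parallel, so their conductances add.
    conductance = core_area / rho_core + barrier_area / rho_barrier
    return 1.0 / conductance


# Hypothetical dimensions: height = 2 x width, 2nm barrier on both wires.
wire_16nm = resistance_per_length(width_nm=36, height_nm=72, barrier_nm=2)
wire_7nm = resistance_per_length(width_nm=18, height_nm=36, barrier_nm=2)
print(f"resistance ratio (7nm / 16nm): {wire_7nm / wire_16nm:.2f}")
```

With no barrier at all, halving both dimensions would give exactly four times the resistance. Because the barrier thickness does not shrink with the wire, it takes up a larger share of the smaller cross-section and pushes the ratio higher in this model.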
The results in Figure 5 reinforce the notion that copper wire encased in a barrier metal will increase its resistance ratio when miniaturized. These figures were presented at a scientific conference by TSMC in 2017.
Implementing Network on Chip (NoC)
As previously outlined, wiring architecture and chip efficiency are serious problems facing FPGAs. Unlike microprocessors such as x86 CPUs or mobile SoCs (e.g. Apple A14), the circuits implemented on an FPGA are designed by the user. This means that the FPGA vendor cannot oversee the design process from beginning to end. While working on the Versal, Xilinx has been trying to find new ways to break through these barriers, and one of these innovations is the introduction of NoC.
Traditionally, the architecture for FPGAs involves gathering function blocks, PL (also known as Fabric), image processors, memory controllers, and various I/O. All of these components are then integrated into a large FPGA device using interconnects, such as AXI4, to connect them.
Compared to previous generations, the Versal is making significant changes to its interconnect architecture. While previous generations used a crossbar-based interconnect architecture, the Versal uses an NoC approach. Figure 6 shows a layout of the Versal VC1902 chip, where the orange sections indicate the NoC architecture.
Going into further detail, NoC is a packet-based communication network within the chip. The main idea is that the sender divides the data into packets before sending them into the communication channel.
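The idea of dividing data into packets can be illustrated with a minimal Python sketch. This is a generic illustration only: the packet fields, the payload size, and the function names are all hypothetical and do not represent the actual Versal NoC packet format.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Packet:
    src: int        # hypothetical ID of the sending block
    dst: int        # hypothetical ID of the receiving block
    seq: int        # position of this packet within the transfer
    payload: bytes  # slice of the original data


def packetize(data: bytes, src: int, dst: int,
              payload_size: int = 16) -> List[Packet]:
    """Sender side: split data into fixed-size packets."""
    if payload_size <= 0:
        raise ValueError("payload_size must be positive")
    return [
        Packet(src, dst, seq, data[offset:offset + payload_size])
        for seq, offset in enumerate(range(0, len(data), payload_size))
    ]


def reassemble(packets: Iterable[Packet]) -> bytes:
    """Receiver side: restore the data, even if packets arrive out of order."""
    ordered = sorted(packets, key=lambda p: p.seq)
    return b"".join(p.payload for p in ordered)


data = bytes(range(40))
packets = packetize(data, src=0, dst=3)
print(len(packets))                          # prints 3
print(reassemble(reversed(packets)) == data)  # prints True
```

Because each packet carries its own destination, many senders can share the same physical channel instead of each needing a dedicated wire.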
NoC was first introduced by Arteris back in 2000 as IP for SoC design. Even though this technology has existed for over two decades, this is the first time that it has been implemented into an FPGA. It can be argued that this type of communication network wasn’t needed in the past, but with the increasing number of issues mentioned above, it might be the solution FPGAs desperately need.
Until the last generation, the user-designed logic circuits and the interconnects shared the same programmable logic (PL). This caused issues with interference and made place-and-route design difficult.
Looking at the NoC architecture shown in Figure 7, we can see that the on-chip network consists of two long and narrow horizontal regions as well as four vertical regions. Each of these regions acts as a physical channel, with a bi-directional bandwidth of 2Tbps for the vertical and horizontal regions separately.
To better understand how the NoC works, Figure 8 shows a conceptual diagram and provides an explanation of its benefits. In the NoC, inputs come in from the left, and then the path is selected by the switch in the center. At the same time, outputs enter from the right side and pass through the switch as well. PL and other blocks are also connected to this on-chip network.
With this new architecture and communication network, the wiring is built into the NoC region in advance. This makes communication times and power consumption more predictable. Another advantage of this layout is that it uses PL resources more efficiently. Both of these strengths may allow FPGAs to overcome their wiring problems.
To finish this article, I will quickly summarize what we have discussed. Continuing from volume nine of this series, we finished our introduction to Xilinx’s Versal product line. The Versal platform is a powerful piece of technology and today I went into detail about the chip technology behind it. Below are the two main topics that were mentioned today:
- The Versal uses newer semiconductor manufacturing techniques to reduce the component size from 16nm to 7nm. While this has its advantages, it also leads to wire resistance and delay issues which make chip design increasingly difficult.
- Versal is the first FPGA to introduce NoC in order to solve the issues associated with scaling FPGA devices and wiring resistance.
I would like to thank you for taking the time to read the 10th volume of the ‘Edge AI Evangelist’s Thoughts’ series. I hope that you enjoyed today’s topic and I hope you will continue to read my articles in the future.
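Xilinx Versal
Xilinx's next-generation FPGA unveiled; CEO and others give talks (2018/8/27)
B. Gaide et al., Xilinx Adaptive Compute Acceleration Platform: Versal™ Architecture
On to the 7nm-generation manufacturing process: announcements from TSMC, IBM, and others (2016/12/7)
Hiroshige Goto's Weekly Overseas News: Xilinx's new category "Versal", which embeds AI accelerator cores in an FPGA (2019/9/30)
Hiroshige Goto's Weekly Overseas News: Why the transistor density of the 7nm 3rd-generation Ryzen is low (2020/1/29)
Xilinx's "Versal FPGA" with AI Engines
Focusing on automotive semiconductors as they move toward miniaturization: Arteris, provider of SoC interconnect IP cores (2016/1/8)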