Edge AI Evangelist’s Thoughts Vol.6: New Wave computers utilizing FPGA

Hello everyone, I’m Haruyuki Tago, Edge Evangelist at HACARUS Tokyo R&D center.

In this series of articles, I will share some insights from my decades of experience in the semiconductor industry and I will comment on various AI industry related topics from my unique perspective.

In volume five of this article series, we talked about the basic structures of FPGAs and about the new non-volatile FPGA devices. Finally, we concluded with a short case study that illustrated the potential of FPGA technology in the Japanese space industry.

In today’s article, we will continue our discussion by looking at the NV-FPGA Initiative’s public symposium, held on January 8th, 2021.

 

NV-FPGA Initiative – Public Symposium

On January 8th, 2021, a public symposium was held by the Non-Volatile Field-Programmable Gate Array (NV-FPGA) Initiative, one of the themes of the AIST Consortium [1], a thematic study group run by the National Institute of Advanced Industrial Science and Technology (AIST).

I am not a member of the study group, but I will briefly report the highlights of the symposium from the perspective of an attendee.

Presentation program

  1. “CMOS annealing machine that accelerates combinatorial optimization processing and its FPGA implementation” by Mr. Masanao Yamaoka from Hitachi, Ltd.
  2. “About the realization of RISC-V based on Microchip’s non-volatile FPGA” by Mr. Nobuhisa Ikeda from Macnica
  3. “Latest Trends in Atomic Switch FPGA Technology” by Mr. Toshiji Sakamoto from NanoBridge Semiconductor
  4. “FPGA Component Technology for Intelligent Robot Systems” by Mr. Takeshi Okawa, Tokai University
  5. Panel discussion “The future of non-volatile switch FPGA”

Moderator: NV-FPGA Initiative Secretariat Takeo Matsumoto

Today, I would like to focus on lectures 1 and 4, which introduced new computer systems built on FPGAs. Below, we will go into further detail about the topics discussed by Mr. Masanao Yamaoka and Mr. Takeshi Okawa.

Applying annealing concepts to real world issues

Before diving into the technical capabilities of FPGA technology, let’s first take a look at why this technology is being developed and how it can be applied to real-world issues.

Mr. Masanao Yamaoka, from Hitachi, Ltd., explains this nicely in lecture 1, which covers the CMOS annealing machine for accelerating combinatorial optimization processing and its FPGA implementation.

Lecture Overview

The optimization of social systems is essential for the realization of social innovation. A few examples include transportation systems, distribution systems and electric power grids. For these systems, it is necessary to optimize the flow of traffic, delivery routes and the distribution of electricity.

For such optimization, it is necessary to solve what’s known as a combinatorial optimization problem. However, it is difficult to solve these problems using a conventional computer. To bypass this issue, Hitachi has developed a new computing technology in hopes of realizing social innovation [2].

The Traveling Salesperson Problem

A classic example of a combinatorial optimization problem is the “traveling salesperson” problem shown in figure 1. In this problem, you are a salesperson who must visit each of a number of cities once and then return to the city where you started.

Assuming you know the cost of traveling between each pair of cities, you must find the cheapest (or shortest) route. The problem may appear simple when the number of cities is small, but the number of possible routes grows explosively as cities are added.

For example, with 30 cities the number of candidate routes is on the order of 10^30, and checking them all would take approximately 14 million years even on a supercomputer.

As we can see, obtaining an exact solution by exhaustive search is unrealistic due to the volume of computation required, and the same is true for most combinatorial optimization problems [4].
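To make the scale concrete, here is a minimal Python sketch. It assumes a symmetric problem in which a route and its reverse count as the same tour, so n cities give (n - 1)!/2 distinct round trips. It also brute-forces a tiny four-city instance. The function names and the distance table are invented purely for illustration and are not taken from the lecture.

```python
from itertools import permutations
from math import factorial


def count_tours(n: int) -> int:
    """Number of distinct round trips through n cities (symmetric costs)."""
    if n < 3:
        return 1
    return factorial(n - 1) // 2


def brute_force_tsp(dist):
    """Try every tour that starts and ends at city 0; return (cost, tour)."""
    n = len(dist)
    best_cost, best_tour = float("inf"), None
    for perm in permutations(range(1, n)):
        tour = (0, *perm, 0)
        cost = sum(dist[a][b] for a, b in zip(tour, tour[1:]))
        if cost < best_cost:
            best_cost, best_tour = cost, tour
    return best_cost, best_tour


# Hypothetical 4-city distance table (illustrative values only).
DIST = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]

if __name__ == "__main__":
    print(f"{count_tours(30):.2e} tours for 30 cities")
    print(brute_force_tsp(DIST))
```

Under this counting, 30 cities give roughly 4.4 x 10^30 tours, which is consistent with the order of magnitude quoted above. Exhaustive search like `brute_force_tsp` is only practical for a handful of cities.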

Figure 1: The traveling salesperson problem (delivery route problem). This problem aims to find the shortest route between a group of cities given the distance between each city.

In the 2000s, annealing machines were introduced, including semiconductor-based machines (figure 2).

Figure 2: The transition between optimization methods and their main characteristics. Both conventional computers and problem-solving algorithms are restricted by the time required to compute optimization problems. In addition, annealing machines that must operate at -273 °C are very costly.

Using the Ising Model

Instead of attacking the optimization problem directly with the previous methods, the new approach maps it onto the Ising model and searches for the state in which the energy, H, of the system is minimized. This is done through a process known as annealing, in which the system is heated to a high temperature and then cooled gradually.
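For reference, the energy of an Ising model is commonly written in the form below. The symbols are the standard textbook ones and are an assumption here, not notation taken from the lecture slides: σ_i is the spin at lattice point i and takes the value -1 or +1, J_ij is the interaction coefficient between spins i and j, and h_i is an external field acting on spin i.

```latex
H = -\sum_{\langle i,j \rangle} J_{ij}\,\sigma_i \sigma_j \;-\; \sum_{i} h_i\,\sigma_i,
\qquad \sigma_i \in \{-1, +1\}
```

Solving the optimization problem then amounts to finding the assignment of spins that makes H as small as possible.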

While cooling, the values of the lattice points (spins) are read to determine the state of the system. To avoid getting stuck in a local solution, spin values are randomly flipped so that the search keeps closing in on the optimal solution.

This process repeats until the system has cooled sufficiently, at which point the spin state of each lattice point is read out, giving the combination that minimizes the system’s energy.

With this method, it is not always possible to reach the optimal solution, but the result is typically very close to it. For conventional computers, the program response depends on the coupling and strength of the lattices.
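The annealing procedure described above can be sketched in software as simulated annealing. The following Python example is a generic textbook version, not Hitachi’s CMOS annealing circuit: it uses the Metropolis acceptance rule and a geometric cooling schedule, and every name and parameter value in it (`simulated_annealing`, `t_start`, `t_end`, the ring-shaped test model) is an assumption made for illustration.

```python
import math
import random


def ising_energy(spins, couplings):
    """H = -sum of J_ij * s_i * s_j over the listed pairs (no external field)."""
    return -sum(j * spins[a] * spins[b] for (a, b), j in couplings.items())


def simulated_annealing(couplings, n, steps=20000, t_start=5.0, t_end=0.01, seed=0):
    """Anneal n spins; return the lowest-energy state seen as (spins, energy)."""
    rng = random.Random(seed)
    spins = [rng.choice((-1, 1)) for _ in range(n)]
    energy = ising_energy(spins, couplings)
    best_spins, best_energy = spins[:], energy
    for step in range(steps):
        # Geometric cooling schedule from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        i = rng.randrange(n)
        spins[i] = -spins[i]  # propose flipping one randomly chosen spin
        new_energy = ising_energy(spins, couplings)
        delta = new_energy - energy
        # Always accept improvements; accept a worse state with a
        # probability that shrinks as the temperature falls.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            energy = new_energy
            if energy < best_energy:
                best_spins, best_energy = spins[:], energy
        else:
            spins[i] = -spins[i]  # rejected: undo the flip
    return best_spins, best_energy


if __name__ == "__main__":
    n = 16
    # Hypothetical test model: a ring of spins with ferromagnetic coupling J = 1.
    ring = {(i, (i + 1) % n): 1.0 for i in range(n)}
    state, energy = simulated_annealing(ring, n)
    print(state, energy)
```

For this ring model the lowest possible energy is -n, reached when all spins point the same way, which gives a simple reference point for judging how close a run gets. Accepting some uphill moves at high temperature is what lets the search escape local solutions.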

Figure 3: The Ising model, a statistical model describing the properties of ferromagnets. It is composed of lattice points (spins), each taking one of two states, that are coupled to their neighbors. The system is in a stable state when its energy, determined by the interaction coefficients between adjacent lattice points, is at its lowest [6].

Figure 4 (left) shows the traditional von Neumann architecture, which solves a problem by executing a sequential algorithm. The proposed architecture (figure 4, right) instead maps the problem onto an Ising model and lets the model converge to reach a solution.

Figure 4. Comparison between the von Neumann architecture (left) and the proposed architecture (right).

Comparing the different annealing methods available

Figure 5 shows benchmarks for a variety of special-purpose computers that use annealing methods to solve combinatorial problems. Two advantages of the CMOS annealing technique are that it supports a large number of lattice points and that it can operate at room temperature.

Figure 5. Benchmarks for various annealing techniques.

FPGA implementation

Hitachi has developed an FPGA-based machine that operates at a frequency of 82.5 MHz, using Xilinx UltraScale for the FPGA devices and Xilinx Aurora for the connections between FPGAs.

Twenty-five of these FPGA devices were mounted in a rack and connected in a torus topology to build a system that can handle up to 100,000 parameters. To evaluate its performance, the time required to solve a randomly generated Ising model was compared between the CMOS annealing machine (A) and a conventional computer running a simulated annealing program (B).

With all 100,000 parameters in use, A completed the evaluation in 10.527 ms while B took 1648.9 ms, making B about 156 times slower than the CMOS machine. This technology is available through the Annealing Cloud Web so that users can easily access it and try it for themselves [7].

Although the technology is powerful, applying it to real-world applications can be tricky. First, the problem at hand must be converted into an appropriate Ising model. In addition, this conversion requires extensive knowledge of both the business domain and advanced annealing technology.
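As a small illustration of what converting a problem into an Ising model can look like, the Python sketch below maps number partitioning (splitting a list of numbers into two groups with equal sums) onto spins. This is a well-known textbook mapping chosen for brevity. It is not Hitachi’s method and not the work-schedule model, and the function names and sample numbers are hypothetical. Each number a_i gets a spin s_i of +1 or -1 that selects its group, so the squared difference between the group sums is (sum of a_i * s_i)^2. Expanding the square gives an Ising energy with couplings J_ij = -2 * a_i * a_j plus a constant.

```python
from itertools import combinations, product


def partition_to_ising(numbers):
    """Return (couplings, constant) with (sum a_i*s_i)^2 == H + constant."""
    couplings = {
        (i, j): -2 * numbers[i] * numbers[j]
        for i, j in combinations(range(len(numbers)), 2)
    }
    constant = sum(a * a for a in numbers)
    return couplings, constant


def ising_energy(spins, couplings):
    """H = -sum over pairs i<j of J_ij * s_i * s_j."""
    return -sum(j * spins[a] * spins[b] for (a, b), j in couplings.items())


def ground_state(couplings, n):
    """Exhaustive search over all 2^n spin assignments (tiny n only)."""
    return min(product((-1, 1), repeat=n),
               key=lambda s: ising_energy(s, couplings))


NUMBERS = [4, 7, 1, 5, 3]  # hypothetical input; the total is 20

if __name__ == "__main__":
    J, const = partition_to_ising(NUMBERS)
    spins = ground_state(J, len(NUMBERS))
    group_a = [a for a, s in zip(NUMBERS, spins) if s == 1]
    group_b = [a for a, s in zip(NUMBERS, spins) if s == -1]
    # H + const is the squared difference between the two group sums.
    print(group_a, group_b, ising_energy(spins, J) + const)
```

Here the exhaustive `ground_state` search merely stands in for the annealing machine. The part that needs problem-specific insight is `partition_to_ising`, that is, deciding what the spins mean and deriving the couplings, which is the step the lecture identifies as the hard one.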

 

A practical example using the CMOS system

One potential application for the CMOS system is in the generation of work schedules for dozens or even hundreds of employees. During the COVID-19 pandemic, concerns regarding customer and employee safety have become more prevalent.

These concerns have led to restrictions on the number of employees working at the same time in order to avoid congestion. At the Hitachi Central Research Laboratory, this technology was used to create a weekly work schedule for 360 researchers with exceptional results [8]. Hitachi has also started to offer its “work schedule optimization solution” software [9].

Figure 6: The preliminary outline for the work schedule model [5].

Now, I would like to share my personal thoughts about this topic. From the article, we can see that Hitachi has developed analysis software that is shared with its customers online through its cloud-based network.

Hitachi has also begun offering similar application-based solutions as part of its business. I think that the bottleneck in this process is the conversion of an issue from a combinatorial problem into an Ising model.

Creating an automated process for this conversion using conventional programs is difficult, leading many experts in this field to believe that consultations are still necessary at this point.

 

FPGA component implementation and further robotic development

So far, we have explored the potential of FPGA technology for solving real-world optimization problems. Another field that has greatly benefited from this technology is the robotics industry.

In lecture 4, Mr. Takeshi Okawa, from Tokai University, talks about FPGA component technology being developed for intelligent robot systems.

 

Lecture Overview

Beginning in 2019, the Japan Science and Technology Agency (JST) decided to expand its research into Multi-access Edge Computing (MEC) through its Strategic Creative Research Promotion Project (CREST).

Under the Ministry of Education, Culture, Sports, Science, and Technology (MEXT), CREST would focus on developing a multi-node system to integrate into various systems [9] [10] [11].

With the introduction of 5G, the time it takes for a signal to go from a 5G station to another terminal is under 0.5 ms. The main idea is to place edge devices in the area surrounding the base station in order to perform the time-sensitive processing required for IoT (figure 7).

This plan aims to take advantage of the low delay times made possible by 5G, which are not possible using other methods. While edge terminals and cloud processing are both available, edge terminals are limited by their small power budget and low computing power, while cloud computing suffers from significant delay times and large power consumption.

In the future, MEC will hopefully be integrated into many quasi-real-time processes such as factory, traffic and power control as well as full-scale-real-time processing such as automated driving.

Figure 7: Images of MEC (Multi-Access Edge Computing) [12]

The main reasons to use an FPGA component are its network transparency and its high portability. Regarding network transparency, the FPGA design allows processing to continue regardless of the edge device’s position within the range of the base station.

In addition, if the device travels outside the range of the base station, the processing will instantaneously switch over to the FPGA accelerator inside the device.

In terms of portability, the FPGA accelerator can be used both in base stations and inside edge equipment. Base stations and edge devices require accelerators of different sizes.

However, an accelerator that incorporates a scalable mechanism is able to fit both. Figure 8 shows a system known as a FiC cluster, which is able to run multiple FPGA boards at once using M-KUBOS. FPGA boards that use M-KUBOS have already been commercialized and are shown in figures 9 and 10 [13].

Figure 8: M-KUBOS/PYNQ Cluster [12]

Figure 9: PALTEK M-KUBOS/PYNQ [12]

Figure 10: (Left) PALTEK’s M-KUBOS board & (Right) 19” desktop rack storage unit (panel mount and other options available) [13]

Figure 11: M-KUBOS PYNQ Cluster [11]

ROS2 and its performance benefits

Mr. Okawa’s research team is currently studying robots equipped with ROS2 (Robot Operating System 2). This software is being developed with the intention of being applied in the medical field.

Currently, the ROS2 software contains a wide variety of libraries and tools necessary for the further development of robot technology. The software platform also has an open community that conveniently connects users and developers [14]. Also, to clarify, ROS and ROS2 are actually middleware, not operating systems.

Figure 12 shows how the FPGA accelerator compares to an Intel CPU as well as an ARM CPU with a built-in FPGA chip. All three devices were tested using an image processing program and their latencies were recorded.

As shown below, the FPGA accelerator outperformed both CPUs in terms of latency, with a delay of only 270.7 ms, while also keeping power consumption to a minimum. Comparing latencies, the FPGA accelerator performed 36.3 times faster than the Intel CPU and 263.2 times faster than the ARM processor.

Figure 12: Examples of image processing performance using ROS2

Once again, I would like to include my personal thoughts about this lecture. These days, it is common to try and simulate the performance of emerging software, especially in the fields of computer systems and computer architecture research.

For developers, creating hardware takes a lot of time and money. On the other hand, if your primary goal is research, creating a software model and publishing your findings takes significantly less time and is relatively easy.

However, it is difficult to determine the extent to which the model of hardware affects the performance levels of the software. These discrepancies in the research findings often lead to scrutiny from other researchers who claim that they were written purely as fluff pieces.

For this reason, the research team at the Amano Laboratory at Keio University is committed to creating hardware alongside its software.

Today, this concept may be considered rare, but it has become a common practice for MEC research where hardware is first developed using an FPGA board and then software is integrated into it. I am eagerly awaiting the results of their research and how it can be implemented into 5G systems to benefit society.

 

References

[1] 産総研コンソーシアム (AIST Consortium)
https://unit.aist.go.jp/colpla/iuao2020/consortium.html

Note from the author: Please reference No. 35 from the table about the Non-Volatile Field-Programmable Gate Array (NV-FPGA) Initiative

[2] M. Yamaoka et al., “Advanced Research into AI Ising Computer”, Hitachi Review, Vol.65 (2016) No.6, p.156-160
https://www.hitachi.com/rev/archive/2016/r2016_06/pdf/r2016_06_110.pdf

[3] Travelling salesman problem
https://en.wikipedia.org/wiki/Travelling_salesman_problem

[4] Annealing Cloud Web “WHAT IS THE COMBINATORIAL OPTIMIZATION PROBLEM”
https://annealing-cloud.com/en/knowledge/1.html

[5] CMOSアニーリングの顧客適用に向けた量子コンピュータ技術の応用 (Applying quantum computer technology toward customer adoption of CMOS annealing)
https://www.hitachihyoron.com/jp/archive/2020s/2020/03/03b09/index.html

[6] M. Yamaoka et al., “20k-spin Ising chip for combinational optimization problem with CMOS annealing”, ISSCC 2015.

[7] Annealing Cloud Web “THE ISING MODEL AND THE ANNEALING MACHINE”
https://annealing-cloud.com/en/knowledge/2.html

[8] Annealing Cloud Web
https://annealing-cloud.com/en/index.html

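[9] 日立製作所ニュースリリース 数十人,数百人規模の最適な勤務シフトを作成するソリューションを提供開始 (Hitachi, Ltd. news release: launch of a solution that creates optimal work shifts for dozens to hundreds of people)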
https://www.hitachi.co.jp/New/cnews/month/2020/10/1019a.html

[10] CREST プログラムの概要 (CREST program overview)
https://www.jst.go.jp/kisoken/crest/about/index.html

[11] [コンピューティング基盤]令和元年度採択課題 ([Computing Foundation] Projects adopted in FY2019)
https://www.jst.go.jp/kisoken/crest/project/1111102/1111102_2019.html

[12] Challenge to Multi-access Edge Computing CANDAR 2020 Special Session (2020.11.25)
http://www.am.ics.keio.ac.jp/crest/wp-content/uploads/2020/12/CANDAR2020.pdf

[13] 5G時代に注目されるエッジコンピューティング (Edge computing in the spotlight in the 5G era)
https://www.paltek.co.jp/techblog/productinfo/200602-1

[14] ROS 2 Documentation
https://index.ros.org/doc/ros2/
