
Hello everyone, this is Haruyuki Tago, Edge Evangelist at HACARUS’ Tokyo R&D center.
In this series of articles, I will share some insights from my decades of experience in the semiconductor industry and I will comment on various AI industry-related topics from my unique perspective.
In today’s volume, I will try my best to cover three topics. The first involves examining the transition of computational power during machine learning training. The second explores why the power consumption and CO2 emissions for modern deep learning models are so large during training. The final topic shows how sparse modeling has a much smaller carbon footprint when compared to deep learning.
Changing the Number of Operations during Model Training
To start this discussion, let’s look at Figure 1, which shows the arithmetic operations used to train machine learning models, from the Perceptron, the originator of machine learning, in 1960 up to AlphaGo Zero in 2020 [1]. Training compute is measured as the number of calculations performed during the training (learning) period of a machine learning model. For example, the expression ‘3.1×0.45+9.7E-3’ counts as two floating-point operations (FLOP: FLoating-point OPeration): one multiplication and one addition.
Looking at Figure 1, the horizontal axis shows the year of publication and the vertical axis represents the number of operations used for training, expressed in petaflop/s-days (explained below). This amount has increased by a factor of about 10^18 (1e+4 / 1e-14) over the past 60 years. For the year of publication, the years before 2012 are classified as the First Era and the years from 2012 onward as the Modern Era.
Starting from 2012, the number of operations used for training doubled every 3.4 months (about 11.5 times per year). This rapid increase coincided with the use of convolutional neural networks (CNNs) to significantly reduce the misrecognition rate on benchmark images, which marked the beginning of the deep learning boom.
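As a quick sanity check on these figures, the short Python snippet below (my own illustrative sketch, not code from [1]) reproduces the two growth numbers quoted above.

```python
# Illustrative sketch of the growth figures quoted above (not code from [1]).

# Figure 1: the vertical axis spans roughly 1e-14 to 1e+4 petaflop/s-days,
# an increase of about 10^18 over the 60-year period.
overall_factor = 1e4 / 1e-14
print(f"Increase from 1960 to 2020: {overall_factor:.0e}x")  # 1e+18x

# Since 2012: doubling every 3.4 months corresponds to roughly 11.5x per year.
annual_growth = 2 ** (12 / 3.4)
print(f"Annual growth since 2012: {annual_growth:.1f}x")     # ~11.5x
```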

Figure 1. Changes in the number of operations for machine learning models during training (1960-2020)[1]
For example, the total number of operations used to train a model can be estimated from the number of GPUs, their peak speed, the execution time, and an assumed GPU utilization:
The_number_of_GPUs × GPU_FLOP/s × Execution_time × GPU_estimated_utilization
= 2 × 1.58E+12 FLOP/s × (5.5 days × 24 hours/day × 3600 s/hour) × 0.33
= 496E+15 FLOP
This total of 496E+15 FLOP can also be expressed in terms of a computation rate per day. Dividing by the number of seconds in a day (86,400 s) gives the equivalent sustained rate in petaflop/s-days, which corresponds to point A in Figure 2.
496E+15 FLOP = (496E+15 FLOP / 86,400 s) × 86,400 s
= 5.741E-3 petaflop/s × 1 day
= 5.741E-3 petaflop/s-days
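The same estimate can be written as a small helper function. This is a minimal sketch of the calculation above; the function and argument names are my own rather than anything defined in [1].

```python
# Minimal sketch of the training-compute estimate above; names are my own.
def training_flop(num_gpus, gpu_flop_per_s, days, utilization):
    """Estimated total number of floating-point operations used for training."""
    seconds = days * 24 * 3600
    return num_gpus * gpu_flop_per_s * seconds * utilization

def to_petaflop_s_days(total_flop):
    """Convert a FLOP total to the petaflop/s-days unit used in Figures 1 and 2."""
    return total_flop / (1e15 * 86400)

total = training_flop(num_gpus=2, gpu_flop_per_s=1.58e12, days=5.5, utilization=0.33)
print(f"{total:.3e} FLOP")                           # ~4.96e+17 FLOP (496E+15)
print(f"{to_petaflop_s_days(total):.3e} pf/s-days")  # ~5.74e-3 petaflop/s-days
```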

Figure 2. Increase in the number of operations used for training during the deep learning era (2012-2018) [1]
Next, let’s look at the number of parameters in deep learning models. Looking at Figure 3, the number of parameters for natural language processing (NLP) models (left) and image recognition models (right) both appear to double approximately every 3.4 months.

Figure 3. Trends in the number of parameters for NLP models (left) and for vision (image recognition) models (right)
Estimated Power Consumption during Model Training for Natural Language Processing
In article [2], the power consumption of eight different machine learning models for natural language processing was compared. The comparison covered total power consumption including cooling, estimated CO2 emissions (CO2e), and cloud computing fees. The experiment described below was conducted to estimate the power consumption during training because these values are not typically published.
Below, I have summarized the general setup used for the experiment in [2].
- Operating machines: A PC with three Nvidia GTX 1080 Ti GPUs for the ELMo model, and another PC with one Nvidia Titan X GPU for the other models.
- Machine Learning Software: The software released with each publication was run with its default settings.
- Run time: Up to one day of operation
- Power Consumption Measurements: CPU power consumption was sampled repeatedly using Intel’s Running Average Power Limit (RAPL) interface, and GPU power consumption was sampled repeatedly using the NVIDIA System Management Interface (nvidia-smi) to obtain average power values (a sketch of this sampling approach is shown below).
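For reference, the GPU side of this measurement can be approximated by repeatedly querying nvidia-smi and averaging the readings. The sketch below is my own illustration of that sampling approach, not the actual measurement script used in [2].

```python
# Illustrative sketch of sampling GPU power draw and averaging it.
# This is NOT the measurement script from [2], just the general approach described above.
import subprocess
import time

def gpu_power_watts():
    """Query the instantaneous power draw (W) of each GPU via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        text=True,
    )
    return [float(line) for line in out.strip().splitlines()]

def average_gpu_power(duration_s=60.0, interval_s=1.0):
    """Average the total GPU power draw over a measurement window."""
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        samples.append(sum(gpu_power_watts()))
        time.sleep(interval_s)
    return sum(samples) / len(samples)

if __name__ == "__main__":
    print(f"Average GPU power: {average_gpu_power():.1f} W")
```

CPU power could be tracked in a similar loop by reading the RAPL energy counters (e.g. under /sys/class/powercap/intel-rapl on Linux).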
Moving on to the results, the parameters and estimates for each model are shown in Figure 4. The Model, Hardware, and Hours columns are taken from previously published data, while the Power, kWh-PUE, CO2e, and Cloud Compute Cost columns are estimated values.
For example, Transformer (base) uses eight P100s (probably Nvidia’s Tesla P100 [5]). Here, the GPU model and the number of GPUs differ from the single Nvidia Titan X used in the experiment [2]. The author of this article assumes that the measured power consumption was multiplied by some conversion factor to arrive at the estimated power value of 1415.78 W; however, the paper does not state the specific method.
Another thing to point out is that Tensor Processing Units (TPUs) are deep learning accelerators developed by Google.

Figure 4. Hardware configuration of various NLP deep learning models, computer power consumption, execution time, energy consumption including cooling (kWh-PUE), estimated CO2 emissions (lbs), and cloud compute costs [2][4]
Another concept to look at is Power Usage Effectiveness (PUE), a measure of a data center’s power usage efficiency. PUE is the total facility power, i.e. the computing equipment’s power consumption plus the ancillary equipment’s power consumption, divided by the computing equipment’s power consumption. Most of the power consumed by the ancillary equipment goes to cooling. The 2018 global average PUE value of 1.58 is used in the following equations.
Multiplying by PUE and converting the unit to kWh gives the total energy consumption (Pt).

Equation 1. The amount of energy (kWh) consumed by the cloud computing based on the hardware configuration and execution time [2]

Equation 2. Converting electric energy to estimated CO2 emissions [2]
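Putting the two equations together, the sketch below shows how the kWh-PUE and CO2e values could be computed in Python. The function names, the 12-hour run time, and the 0.954 lbs/kWh grid-intensity factor are my own illustrative assumptions, not values quoted in this article.

```python
# Sketch of the Equation 1 / Equation 2 calculation; parameter names and the
# example values marked "assumed" below are my own, not values quoted here.
PUE = 1.58  # 2018 global average power usage effectiveness (see above)

def total_energy_kwh(avg_power_watts, hours, pue=PUE):
    """Equation 1: measured computer power x run time, scaled by PUE, in kWh."""
    return pue * avg_power_watts * hours / 1000.0

def co2e_lbs(energy_kwh, lbs_co2_per_kwh):
    """Equation 2: convert electric energy into estimated CO2 emissions (lbs)."""
    return lbs_co2_per_kwh * energy_kwh

# Example with the Transformer (base) power figure quoted earlier (1415.78 W),
# an assumed 12-hour run time, and an assumed grid intensity of 0.954 lbs/kWh.
energy = total_energy_kwh(avg_power_watts=1415.78, hours=12)
print(f"{energy:.1f} kWh-PUE, {co2e_lbs(energy, 0.954):.1f} lbs CO2e")
```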
Estimated Carbon Footprint during Training – A Large Issue
Starting with Figure 5, we can see a comparison of the estimated CO2e values for training NLP models against the CO2e values of familiar activities. For example, Transformer (big), data label G, emits 192 lbs of CO2e per training session. In practice, deep learning training is repeated many times with different hyperparameters to obtain better results, which multiplies these emissions accordingly.
As a comparison, the estimated CO2 emissions for familiar activities are shown using data labels A, B, and D. A is the CO2e per passenger on a round-trip flight between New York and San Francisco, which is 1,984 lbs. B represents the CO2e emitted by one person (most likely the global average) in one year, which is 11,023 lbs. Last, D is the CO2e emitted throughout the lifecycle of a passenger car, including fuel, which is 126,000 lbs.
Comparing these values to data label H, the carbon dioxide emitted while training a deep learning model for NLP could be equivalent to the CO2 emissions of five standard passenger cars throughout their lifecycle [2][3].
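As a rough check of the five-car figure, dividing the total emissions for data label H (626,155 lbs, quoted again in the conclusion below) by the lifetime figure for one car (data label D) gives:
626,155 lbs ÷ 126,000 lbs per car ≈ 5.0 cars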

Figure 5. Comparison of estimated CO2e emissions from the training of NLP deep learning models with CO2e emissions from familiar activities [2] (data labels added to Table 1)
Author’s Recommendations [2]
- The author suggests reporting training time and sensitivity to hyperparameters so that different models can be compared directly and an accurate cost-benefit analysis can be performed.
- The author also suggests that academic researchers should have equitable access to computational resources. Recent advances in computing have driven up prices and made these tools too expensive for most researchers. For example, an off-the-shelf GPU server with eight Nvidia 1080 Ti GPUs and supporting hardware can cost as much as $20,000 USD. In the author’s case, developing the model and running it for 172 days required 58 GPUs and hardware costing around $145,000 USD. This doesn’t even include the cost to power the unit, which is about half the cost of a GPU in a commercial cloud. The proposed solution is a government-funded academic computing cloud for researchers.
- The author recommends that industry and academia work together to promote research into more computationally efficient algorithms and hardware that require less energy.
Energy Consumption Comparison between Sparse Modeling & Deep Learning
Compared to conventional deep learning models, HACARUS’ sparse modeling-based AI can produce highly accurate results even with small amounts of data. In some cases, sparse models train 4 to 5 times faster than deep learning models [6][7]. Energy consumption during training is also very low, at only 1% of that required for deep learning, giving sparse modeling a very small carbon footprint.

Figure 6. Comparison of energy consumption and training time for sparse modeling and deep learning training [7].
After training, both the sparse modeling and deep learning models achieved comparable accuracy. However, HACARUS’ sparse modeling reached that accuracy with only 1% of the energy consumption. Training was also four to five times faster, taking only 20 minutes compared to 100 minutes for deep learning.

Figure 7. Experimental conditions and results [7]
Conclusion
- The number of operations used to train machine learning models increased by a factor of about 10^18 over the 60-year period from 1960 to 2020. Since 2012, the amount of computation used for training has doubled every 3.4 months (roughly 11.5 times per year).
- In this experiment, popular off-the-shelf natural language processing (NLP) deep learning models were compared in terms of power consumption, total energy consumption (including cooling), CO2 emissions, and cloud computing costs. For example, the CO2 emissions for all of Transformer (big)’s training processes were found to be 626,155 pounds, an amount equivalent to the CO2 emissions of 5 standard passenger cars over their lifetime of use.
- On the other hand, HACARUS’ sparse modeling-based AI can produce highly accurate results even with small amounts of data. In some cases, training can also be up to 5 times faster than for deep learning models. Energy consumption during the training period is as low as 1% of that of deep learning, resulting in a very small carbon footprint.
References
[1] AI and Compute
https://openai.com/blog/ai-and-compute/
[2] E. Strubell et al., “Energy and Policy Considerations for Deep Learning in NLP”
https://arxiv.org/abs/1906.02243
[3] K. Hao, “Training a single AI model can emit as much carbon as five cars in their lifetimes”, MIT Technology Review, June 6, 2019
[4] J. Dean, “The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design”, Google Research
https://arxiv.org/ftp/arxiv/papers/1911/1911.05289.pdf
[5] NVIDIA TESLA P100 The World’s First AI Supercomputing Data Center GPU
https://www.nvidia.com/en-us/data-center/tesla-p100/
[6] What is Sparse Modeling, HACARUS’ Proprietary AI Technology? (in Japanese)
https://hacarus.com/ja/sparse-modeling-benefits/
[7] LESS IS MORE, Sparse Modeling: Slim but Powerful AI for Embedded Systems (in Japanese)