Hello everyone, this is Haruyuki Tago, Edge Evangelist at HACARUS’ Tokyo R&D center.
In this series of articles, I will share some insights from my decades of experience in the semiconductor industry and I will comment on various AI industry-related topics from my unique perspective.
In today’s volume, I will try my best to cover three topics. The first involves examining the transition of computational power during machine learning training. The second explores why the power consumption and CO2 emissions for modern deep learning models are so large during training. The final topic shows how sparse modeling has a much smaller carbon footprint when compared to deep learning.
Changing the Number of Operations during Model Training
To start this discussion, let’s look at Figure 1, which shows the arithmetic operations used to train machine learning models, starting back in 1960 with the Perceptron, the originator of machine learning, and continuing up to AlphaGoZero on a time axis that extends to 2020. The training cost is expressed as the number of calculations performed during the training (learning) period of a machine learning model. For example, the expression ‘3.1×0.45+9.7E-3’ takes two floating-point operations (FLOP: FLoating-point OPeration), one multiplication and one addition.
In Figure 1, the horizontal axis shows the year of publication and the vertical axis shows the number of operations required for training, expressed in terms of a per-day computation rate that is explained below. The number of operations has increased by a factor of 10^18 (1e+4 / 1e-14) over the past 60 years. Models published before 2012 are classified as the First Era, and those published from 2012 onward as the Modern Era.
Starting from 2012, the number of operations doubled every 3.4 months (about 11.5 times per year). This rapid increase coincided with the use of the Convolutional Neural Network (CNN) to significantly reduce the misrecognition rate on benchmark images, which marked the beginning of the deep learning boom.
Figure 2 shows the arithmetic volume trends for the Modern Era from 2012 to 2018. As an example, let’s find the computational volume of AlexNet (noted as A in Figure 2), which was published in 2012. According to the AlexNet paper, the network takes 5-6 days to train using two GTX 580 3GB GPUs. Taking the rated performance of Nvidia’s GTX 580 to be 1.58E+12 FLOP/s and its average GPU utilization to be 0.33 (33%), the number of operations required to train AlexNet comes to roughly 496E+15 FLOPs (FLoating-point OPerations) for a 5.5-day run.
This figure of 496E+15 FLOPs can also be rewritten in a form that contains a computation rate sustained over a number of days.
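The AlexNet estimate can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the original calculation: the function names are hypothetical, the 5.5-day training time (the midpoint of the quoted 5-6 days) and the 1.58E+12 FLOP/s rating are assumptions chosen because they reproduce the 496E+15 figure, and the petaflop/s-day unit is assumed to be the one used on the vertical axis of the figures.

```python
SECONDS_PER_DAY = 86_400


def training_flops(num_gpus, days, peak_flops_per_s, utilization):
    """Total floating-point operations performed over one training run."""
    return num_gpus * days * SECONDS_PER_DAY * peak_flops_per_s * utilization


def to_petaflop_s_days(flops):
    """Express a FLOP count as days of sustained 1E+15 FLOP/s computation."""
    return flops / (1e15 * SECONDS_PER_DAY)


# Assumed inputs: 2 GPUs, 5.5 days, 1.58E+12 FLOP/s rated, 33% utilization.
alexnet_flops = training_flops(
    num_gpus=2, days=5.5, peak_flops_per_s=1.58e12, utilization=0.33
)

print(f"{alexnet_flops:.3e} FLOPs")                        # about 4.96e+17
print(f"{to_petaflop_s_days(alexnet_flops):.4f} petaflop/s-days")  # about 0.0057
```

The utilization factor matters as much as the hardware rating here: at 100% utilization the same run would count as roughly three times more operations.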
The conversion uses the number of seconds in a day, and the resulting value corresponds to point A in Figure 2.
Figure 2 covers machine learning models for image recognition such as AlexNet, VGG, and Xception, models for natural language processing such as DeepSpeech2 and Neural Machine Translation, and models for the game of Go, including AlphaZero and AlphaGoZero. It is interesting to note that, regardless of the field, the computation used roughly doubled every 3.4 months.
Next, let’s look at the number of parameters in deep learning models. In Figure 3, the number of parameters for both the natural language processing (NLP) models (left) and the image recognition models (right) appears to double approximately every 3.4 months.
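The growth figures quoted in this section are related by simple exponent arithmetic. The sketch below, with hypothetical function names, converts a doubling time into a yearly growth factor and counts how many doublings a factor of 10^18 represents. The 60-year span is taken from the text, and treating growth as a smooth exponential is an assumption.

```python
import math


def annual_growth_factor(doubling_months):
    """Growth per year for a quantity that doubles every `doubling_months`."""
    return 2 ** (12 / doubling_months)


def doublings_in(growth_factor):
    """Number of doublings needed to reach the given overall growth factor."""
    return math.log2(growth_factor)


modern_era = annual_growth_factor(3.4)
total_doublings = doublings_in(1e18)
# Average doubling time over the whole 60-year period, in months.
average_doubling_months = 60 * 12 / total_doublings

print(f"{modern_era:.1f}x per year")                 # about 11.5
print(f"{total_doublings:.1f} doublings")            # about 59.8
print(f"{average_doubling_months:.1f} months each")  # about 12.0
```

Averaged over all 60 years the doubling time is about one year, which shows how much faster the 3.4-month pace of the Modern Era is.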
Estimated Power Consumption during Model Training for Natural Language Processing
In the article by Strubell et al., the power consumption of eight different machine learning models for natural language processing was compared. The comparison covered the total power consumption including cooling, the estimated CO2 emissions (CO2e), and cloud computing fees. The experiment described below was conducted to estimate power consumption during training because these values are not typically published.
Below is a summary of the general setup of the experiment.
- Operating machine: A PC running three Nvidia GTX 1080 Ti GPUs for the ELMo model. Another PC with one Nvidia Titan X GPU was used for the other models.
- Machine Learning Software: The software used in each publication was run with their default settings.
- Run time: Up to one day of operation
- Power Consumption Measurements: CPU power consumption was sampled repeatedly using Intel’s Running Average Power Limit (RAPL) interface, and GPU power consumption was sampled repeatedly using the NVIDIA System Management Interface (nvidia-smi), to obtain the average power.
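To make the GPU side of the measurement concrete, here is a minimal sketch of averaging repeated nvidia-smi readings. It is not the script used in the experiment: the helper names are hypothetical, the canned readings are made-up illustration values, and CPU sampling through RAPL is left out. The query itself uses the standard `--query-gpu=power.draw` option of nvidia-smi.

```python
import subprocess
from statistics import mean

# Prints one power reading in watts per line, one line per GPU.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=power.draw",
    "--format=csv,noheader,nounits",
]


def parse_power_draw(output):
    """Turn nvidia-smi text output into a list of per-GPU wattages."""
    return [float(line) for line in output.splitlines() if line.strip()]


def read_gpu_power_watts():
    """Take one reading per GPU (needs an NVIDIA driver and nvidia-smi)."""
    result = subprocess.run(
        NVIDIA_SMI_CMD, capture_output=True, text=True, check=True
    )
    return parse_power_draw(result.stdout)


def average_total_power_watts(samples):
    """Average, over all samples, of the power summed across GPUs."""
    return mean([sum(per_gpu) for per_gpu in samples])


# Made-up readings for a two-GPU machine, standing in for live sampling.
canned = [parse_power_draw("250.0\n245.5\n"), parse_power_draw("260.0\n255.5\n")]
print(average_total_power_watts(canned))  # 505.5
```

On a real machine, `read_gpu_power_watts()` would be called in a loop with a fixed sleep interval while training runs, and the collected samples passed to `average_total_power_watts`.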
Moving on to the results, the parameters and estimated results for each model are shown in Figure 4. The Model, Hardware, and Hours columns are taken from previously published data. The Power, kWh·PUE, CO2e, and Cloud Compute Cost columns are estimates.
For example, Transformer (base) uses eight P100s (probably Nvidia’s Tesla P100). Here, the GPU model and the number of GPUs differ from the single Nvidia Titan X used in their experiment. I assume that the measured power consumption was multiplied by some conversion factor to arrive at the estimated power value of 1415.78 W. However, the paper doesn’t state the specific method.
Note also that Tensor Processing Units (TPUs) are deep learning accelerators developed by Google.
Digging a bit deeper, Power (W) is an estimate of the power consumption of the cloud computing hardware, which corresponds to the value in parentheses in the numerator of equation 1.
Another concept to look at is Power Usage Effectiveness (PUE), a measure of the power usage efficiency of a data center. PUE is the sum of the computer power consumption and the ancillary equipment power consumption, divided by the computer power consumption. Most of the power consumed by the ancillary equipment goes to cooling. The 2018 global average PUE of 1.58 was used in equation 1.
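Written out, the PUE definition and equation 1 take roughly the following form. This is a reconstruction based on my reading of the Strubell et al. paper, not a copy of the equation as it appeared here. The symbols are assumptions not defined in the text above: t is the training time in hours, p_c, p_r and p_g are the average power draw in watts of the CPU, the memory and a single GPU, and g is the number of GPUs.

```latex
\mathrm{PUE} = \frac{P_{\mathrm{computer}} + P_{\mathrm{ancillary}}}{P_{\mathrm{computer}}},
\qquad
p_t = \frac{1.58 \, t \, \left( p_c + p_r + g \, p_g \right)}{1000}
```

The term in parentheses is the Power (W) value, and dividing by 1000 converts watt-hours into kWh.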
Multiplying by PUE and converting the unit to kWh gives the total energy (Pt).
Next, the total energy (Pt) is converted into estimated CO2 emissions (CO2e), measured in pounds (1 lb = 0.454 kg). In the United States, the Environmental Protection Agency (EPA) publishes the average CO2 produced per unit of electricity consumed, and equation 2 uses this value to convert the total electricity into estimated CO2 emissions: 1 kWh of electricity translates into an estimated 0.954 lbs of CO2.
The article by Strubell et al. also points to the trend of paying exorbitant cloud computing fees and emitting large amounts of carbon dioxide for insignificant performance gains. The NAS model (So et al., 2019) reported a new state-of-the-art BLEU score of 29.7 in English-to-German machine translation. In this case, at least $150,000 USD in computing costs, plus the associated carbon emissions, was spent for an increase of only 0.1 BLEU. The author also notes that the power and emissions figures for this case are unknown because TPUs were used.
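The two conversions, from average power to total energy and from energy to CO2e, can be sketched as follows. The function names are hypothetical, and the 12-hour training time in the example is an illustrative assumption that does not come from the text above. The 1415.78 W power value, the PUE of 1.58 and the 0.954 lbs/kWh factor are the ones quoted in this section.

```python
PUE_2018_GLOBAL_AVERAGE = 1.58
LBS_CO2E_PER_KWH = 0.954   # EPA average quoted in the text
KG_PER_LB = 0.454


def total_energy_kwh(power_watts, hours, pue=PUE_2018_GLOBAL_AVERAGE):
    """Energy including data-center overhead: PUE x hours x watts / 1000."""
    return pue * hours * power_watts / 1000.0


def co2e_lbs(energy_kwh):
    """Estimated CO2 emissions in pounds for the given energy."""
    return LBS_CO2E_PER_KWH * energy_kwh


# Power value quoted for Transformer (base); 12 hours is an assumed run time.
energy = total_energy_kwh(power_watts=1415.78, hours=12.0)
emissions = co2e_lbs(energy)

print(f"{energy:.1f} kWh")                 # about 26.8
print(f"{emissions:.1f} lbs CO2e")         # about 25.6
print(f"{emissions * KG_PER_LB:.1f} kg CO2e")  # about 11.6
```

Because both steps are linear, doubling either the training time or the average power doubles the estimated emissions.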
Estimated Carbon Footprint during Training – A Large Issue
Starting with Figure 5, we can compare the estimated CO2e values during training of NLP models with the CO2e values of familiar activities. For example, Transformer (big), data label G, emits 192 lbs of CO2e per training session. In deep learning, training is repeated many times with different hyperparameters to obtain better results, and Figure 5 shows the results of this analysis.
For comparison, the estimated CO2 emissions of familiar activities are shown under data labels A through D. A is the CO2e per passenger on a round-trip flight between New York and San Francisco, which is 1,984 lbs. B is the CO2e emitted by one person (global average) in one year, which is 11,023 lbs. C is the CO2e emitted by an average American in one year, which is 36,156 lbs. Last, D is the CO2e emitted throughout the lifecycle of a passenger car, including fuel, which is 126,000 lbs.
Comparing these values to H (626,155 lbs of CO2e), the carbon dioxide emitted while training a deep learning model for NLP can be equivalent to the lifecycle CO2 emissions of 5 standard passenger cars.
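The "five cars" comparison is a simple division, sketched below with the figures quoted in this article (626,155 lbs is the total given in the summary for all of the Transformer (big) training processes). The dictionary keys and the function name are hypothetical.

```python
# CO2e of familiar activities, in pounds, as quoted in this article.
REFERENCE_LBS = {
    "round-trip flight NY-SF (one passenger)": 1_984,
    "passenger car lifecycle (incl. fuel)": 126_000,
}

SINGLE_TRAINING_RUN_LBS = 192       # Transformer (big), one training session
ALL_TRAINING_RUNS_LBS = 626_155     # total over all training processes


def equivalents(emitted_lbs, reference_lbs):
    """How many units of a reference activity the emissions correspond to."""
    return emitted_lbs / reference_lbs


for name, lbs in REFERENCE_LBS.items():
    print(f"{equivalents(ALL_TRAINING_RUNS_LBS, lbs):.1f} x {name}")

# Ratio between the full search and a single training session.
print(f"{ALL_TRAINING_RUNS_LBS / SINGLE_TRAINING_RUN_LBS:.0f} x one session")
```

The last ratio makes the point of this section: almost all of the footprint comes from repeating the training many times, not from a single run.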
Author’s Recommendations 
- To allow different models to be compared directly and an accurate cost-benefit analysis to be performed, the author suggests reporting training time and sensitivity to hyperparameters.
- The author also suggests that academic researchers should have equal access to computational resources. Recent advances in computing come at a price that puts these tools out of reach for many people. For example, an off-the-shelf GPU server with eight Nvidia 1080 Ti GPUs and supporting hardware costs approximately $20,000 USD. In the author’s case, the hardware required to develop the model (58 GPUs running for 172 days) would cost around $145,000 USD plus electricity, which is about half the estimated cost of using GPUs in a commercial cloud. The proposed solution is a government-funded academic compute cloud for academics and researchers.
- The author recommends that industry and academia work together to promote research into more computationally efficient algorithms and into hardware that requires less energy.
Energy Consumption Comparison between Sparse Modeling & Deep Learning
Compared to conventional deep learning models, HACARUS’ AI and sparse modeling can produce highly accurate results even with small amounts of data. In some cases, training with sparse modeling is 4 to 5 times faster than with deep learning. The energy consumption during training is also very low, only about 1% of the energy required for deep learning. This gives sparse modeling a very small carbon footprint.
Figure: Energy consumption and training time comparison between sparse modeling and deep learning
After training both the sparse modeling and deep learning models, the accuracy levels obtained were comparable. However, HACARUS’ sparse modeling achieved the same results with only 1% of the energy consumption. Training was also four to five times faster, taking only 20 minutes compared to 100 minutes for deep learning.
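The quoted figures can be combined in a short calculation. This is only a sketch under a stated assumption, namely that energy equals average power multiplied by training time. The implied power ratio it prints is an inference from the numbers above, not a measured value, and the variable names are hypothetical.

```python
SPARSE_MINUTES = 20.0          # training time quoted for sparse modeling
DEEP_LEARNING_MINUTES = 100.0  # training time quoted for deep learning
ENERGY_RATIO = 0.01            # sparse modeling energy / deep learning energy

# How many times faster the sparse model trains.
speedup = DEEP_LEARNING_MINUTES / SPARSE_MINUTES

# If energy = average power x time, the ratio of average power draw follows
# from dividing the energy ratio by the time ratio.
time_ratio = SPARSE_MINUTES / DEEP_LEARNING_MINUTES
implied_power_ratio = ENERGY_RATIO / time_ratio

print(f"speedup: {speedup:.0f}x")
print(f"implied average power ratio: {implied_power_ratio:.0%}")
```

Under that assumption, the shorter training time alone accounts for only a factor of five; the rest of the saving would have to come from a much lower average power draw during training.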
Summary
- The number of operations used to train machine learning models increased by a factor of 10^18 in the 60 years from 1960 to 2020. Since 2012, the amount of computation has doubled every 3.4 months (about 11.5 times per year).
- In the experiment discussed above, popular off-the-shelf natural language processing (NLP) deep learning models were compared in terms of power consumption, total energy consumption (including cooling), CO2 emissions, and cloud computing costs. For example, the CO2 emissions for all of Transformer (big)’s training processes were found to be 626,155 pounds. This amount is equivalent to the CO2 emissions of 5 standard passenger cars over their lifetime of use.
- On the other hand, HACARUS’ AI and sparse modeling can produce highly accurate results even with small amounts of data. In some cases, the training speeds can also be up to 5 times faster than deep learning models. Energy consumption during the training period is as low as 1% of that of deep learning, resulting in a very small carbon footprint.
“AI and Compute”, OpenAI
E. Strubell et al., “Energy and Policy Considerations for Deep Learning in NLP”
K. Hao, “Training a single AI model can emit as much carbon as five cars in their lifetimes”, MIT Technology Review, June 6, 2019
J. Dean, “The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design”, Google Research
“NVIDIA TESLA P100: The World’s First AI Supercomputing Data Center GPU”
“LESS IS MORE Sparse Modeling: Slim but Powerful AI for Embedded Systems” (in Japanese)