Hello! My name is Kishimoto, and I’m an engineering intern at Hacarus. In this post I try out NVIDIA’s **Jetson Nano**, a well-reviewed Edge AI board computer that has recently been a hot topic in the tech world. The Jetson Nano is a high-performance edge device with a quad-core Arm CPU (Central Processing Unit) and 128 CUDA cores. Using the CUDA cores, I’m going to try out Lasso regression.

**1. Jetson Nano’s Development Environment**

Developed by NVIDIA, the Jetson Nano is equipped with 128 Maxwell CUDA cores and can perform a large amount of parallel computation.

CUDA is NVIDIA’s parallel computing platform for GPUs (Graphics Processing Units), and it is easy to use if you have an NVIDIA GPU.

## 1.1 Development and Building Environment

Jetson Nano/Ubuntu 18.04 LTS

Click here for OS installation instructions on Jetson Nano.

To use NVCC (NVIDIA’s CUDA compiler), /usr/local/cuda-10.0/bin must be on your PATH. Add export PATH=/usr/local/cuda-10.0/bin:$PATH to .bashrc. I compiled the code with the following command.

nvcc main.cu -O3 --generate-code arch=compute_53,code=sm_53 -lcuda -lcublas_static -lcusolver_static -lcusparse_static

**2. Implementation of ADMM**

I’ll be using *ADMM* (Alternating Direction Method of Multipliers), an algorithm for solving *Lasso*. Because Lasso penalizes the L1 norm of the weights, the weights of the resulting linear regression model become sparse. You can view this implementation on the Hacarus GitHub. The implementation is based on the open-source library *spm-image*, which is maintained by Hacarus.
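For reference, the problem and the iteration can be written out as follows. This is a sketch of the standard scaled-form ADMM for the generalized Lasso, not a transcription of the code. The $1/(2n)$ scaling of the loss and the symbols $\alpha$ (regularization strength), $\rho$ (ADMM penalty parameter), $D$ (transform matrix) and $S_\kappa$ (soft-thresholding) are assumptions made for this sketch. Plain Lasso corresponds to $D = I$.

$$
\begin{aligned}
&\min_{w}\ \frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha\lVert Dw\rVert_1 \\
w^{k+1} &= \left(\tfrac{1}{n}X^\top X + \rho D^\top D\right)^{-1}\left(\tfrac{1}{n}X^\top y + \rho D^\top (z^k - h^k)\right) \\
z^{k+1} &= S_{\alpha/\rho}\left(Dw^{k+1} + h^k\right) \\
h^{k+1} &= h^k + Dw^{k+1} - z^{k+1} \\
S_\kappa(x) &= \operatorname{sign}(x)\,\max(\lvert x\rvert - \kappa,\ 0)
\end{aligned}
$$

The matrix inverse in the $w$ update does not change between iterations, so it only needs to be computed once before the loop.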

General linear algebra routines such as matrix inversion and matrix products are already provided by the CUDA libraries, so I used those. I used the following functions:

- cuBLAS
  - cublasSnrm2: computes the L2 norm of a vector.
  - cublasSasum: computes the sum of the absolute values of a vector’s elements.
  - cublasSgemv: computes the product of a two-dimensional matrix and a one-dimensional vector.
  - cublasSgemm: computes the product of two two-dimensional matrices.
- cuSOLVER
  - cusolverDnSgetrf: performs LU decomposition.
  - cusolverDnSgetrs: solves a linear system using the LU-decomposed matrix; solving against the identity matrix gives the inverse of the original matrix.
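To make the factor-then-solve idea concrete, here is a minimal CPU sketch in plain C++ of how an inverse can be obtained from an LU decomposition. It does not call cuSOLVER. The function name `invert_via_lu`, the row-major layout and the use of double are assumptions made for this illustration (cuSOLVER itself works on column-major device memory).

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Row-major dense n x n matrix stored in a flat vector.
using Matrix = std::vector<double>;

// Invert A by LU decomposition with partial pivoting, then solving A X = I.
// Mirrors the factor (getrf) + solve (getrs) split, but runs on the CPU.
Matrix invert_via_lu(Matrix a, std::size_t n) {
    std::vector<std::size_t> perm(n);
    for (std::size_t i = 0; i < n; ++i) perm[i] = i;

    // Factor step: P A = L U, stored in place in `a`.
    for (std::size_t k = 0; k < n; ++k) {
        std::size_t pivot = k;
        for (std::size_t i = k + 1; i < n; ++i)
            if (std::fabs(a[i * n + k]) > std::fabs(a[pivot * n + k])) pivot = i;
        if (a[pivot * n + k] == 0.0) throw std::runtime_error("singular matrix");
        if (pivot != k) {
            for (std::size_t j = 0; j < n; ++j) std::swap(a[k * n + j], a[pivot * n + j]);
            std::swap(perm[k], perm[pivot]);
        }
        for (std::size_t i = k + 1; i < n; ++i) {
            a[i * n + k] /= a[k * n + k];
            for (std::size_t j = k + 1; j < n; ++j)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
        }
    }

    // Solve step, one column of the identity matrix at a time.
    Matrix inv(n * n, 0.0);
    std::vector<double> x(n);
    for (std::size_t col = 0; col < n; ++col) {
        // Forward substitution with unit-lower L on the permuted unit vector.
        for (std::size_t i = 0; i < n; ++i) {
            double s = (perm[i] == col) ? 1.0 : 0.0;
            for (std::size_t j = 0; j < i; ++j) s -= a[i * n + j] * x[j];
            x[i] = s;
        }
        // Back substitution with U.
        for (std::size_t i = n; i-- > 0;) {
            double s = x[i];
            for (std::size_t j = i + 1; j < n; ++j) s -= a[i * n + j] * x[j];
            x[i] = s / a[i * n + i];
        }
        for (std::size_t i = 0; i < n; ++i) inv[i * n + col] = x[i];
    }
    return inv;
}
```

For example, `invert_via_lu({4.0, 3.0, 6.0, 3.0}, 2)` inverts the 2 x 2 matrix with rows (4, 3) and (6, 3). A small reference like this can also be handy for checking GPU results on tiny inputs.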

For the training loop, I used admm.py in *spm-image* as a reference. In *spm-image*, however, the z_k and h_k updates are separate operations, which uses memory bandwidth inefficiently. To save memory bandwidth, I combined them into a single CUDA kernel.

The code looks like this:

```
// Soft-thresholding: shrink x toward zero by thresh.
__device__ float _soft_threshold(float x, float thresh){
    if(x > thresh){
        return x - thresh;
    }else if(x < -thresh){
        return x + thresh;
    }else{
        return 0.0f;
    }
}

// Fused z_k / h_k update: one thread per element.
__global__ void _uzh(float *Dw_t, float *h_k, float *z_k, float *sub_z_h_k, float threshold, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if(index >= n) return;
    float dw = Dw_t[index];
    float h = h_k[index];
    float z = _soft_threshold(dw + h, threshold);
    z_k[index] = z;
    h += dw - z;
    h_k[index] = h;
    sub_z_h_k[index] = z - h;
}

void _update_z_h(float *Dw_t, float *h_k, float *z_k, float *sub_z_h_k, float threshold, int n)
{
    const int threads = 256;  // Launch configuration: 256 threads per block is an assumed value.
    const int blocks = (n + threads - 1) / threads;  // enough blocks to cover all n elements
    _uzh<<<blocks, threads>>>(Dw_t, h_k, z_k, sub_z_h_k, threshold, n);
}
```
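A CPU version of the same update is useful for checking the kernel on small inputs. The following is a minimal sketch in plain C++ that follows the kernel line by line; the names `soft_threshold_ref` and `update_z_h_ref` are hypothetical and are not part of the GPU code.

```cpp
#include <cstddef>
#include <vector>

// Soft-thresholding operator, same branches as the device function.
float soft_threshold_ref(float x, float thresh) {
    if (x > thresh) return x - thresh;
    if (x < -thresh) return x + thresh;
    return 0.0f;
}

// CPU reference of the fused update:
//   z = S(Dw + h), h += Dw - z, sub_z_h = z - h (using the updated h).
// All four vectors must already have the same length.
void update_z_h_ref(const std::vector<float>& Dw_t, std::vector<float>& h_k,
                    std::vector<float>& z_k, std::vector<float>& sub_z_h_k,
                    float threshold) {
    for (std::size_t i = 0; i < Dw_t.size(); ++i) {
        const float dw = Dw_t[i];
        const float z = soft_threshold_ref(dw + h_k[i], threshold);
        z_k[i] = z;
        h_k[i] += dw - z;
        sub_z_h_k[i] = z - h_k[i];
    }
}
```

Each element is read once and written once, which is the same access pattern the fused kernel aims for. Copying the device arrays back to the host and comparing them with this reference, within a small float tolerance, is one way to validate the kernel.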

**3. Running ADMM**

## 3.1 Data set used for learning

I used the Boston house prices data set. The first 13 attributes listed below are the inputs, and a model is trained to predict the 14th (MEDV).

- CRIM – *per capita crime rate by town*
- ZN – *proportion of residential land zoned for lots over 25,000 sq.ft.*
- INDUS – *proportion of non-retail business acres per town*
- CHAS – *Charles River dummy variable (1 if tract bounds river; 0 otherwise)*
- NOX – *nitric oxides concentration (parts per 10 million)*
- RM – *average number of rooms per dwelling*
- AGE – *proportion of owner-occupied units built prior to 1940*
- DIS – *weighted distances to five Boston employment centers*
- RAD – *index of accessibility to radial highways*
- TAX – *full-value property-tax rate per $10,000*
- PTRATIO – *pupil-teacher ratio by town*
- B – *1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town*
- LSTAT – *% lower status of the population*
- MEDV – *median value of owner-occupied homes in $1000’s*

**3.2 Execute the Learning Process**

The commands used to compile and run the program are:

nvcc lasso.cu -O3 --generate-code arch=compute_53,code=sm_53 -lcuda -lcublas_static -lcusolver_static -lcusparse_static

./a.out

When the data is fed to the learner and training is run, the learned weight vector is printed to the console. The execution result is shown in the image below. From the displayed vector, you can see that the learned vector is sparse.

## 3.3 Comparing the Execution Time

On my laptop (which has a Core i7-6500U CPU), training took 87 ms. On the Jetson Nano, it took 16 ms. Even so, this is still quite slow for a GPU. The data set used for training is small, so the strengths of the GPU cannot be fully utilized. The next section deals with larger data.
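How the timings were taken is not described here, so the following is only a sketch of one common way to measure wall-clock time in C++, assuming `std::chrono`. The helper name `elapsed_ms` is hypothetical.

```cpp
#include <chrono>

// Wall-clock time of one call to `work`, in milliseconds.
template <typename F>
double elapsed_ms(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

When timing GPU code this way, the callable should end with `cudaDeviceSynchronize()`, because kernel launches return before the kernel has finished. For data this small, fixed costs such as kernel launches and host-device copies can take up a large share of the measured time.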

**4. Implementing Fused Lasso**

By penalizing the differences between adjacent variables, *Fused Lasso* suppresses changes between those variables. I implemented the operations involving D as a struct and passed it to *ADMM*, so that the code can easily be extended from *Generalized Lasso* to *Fused Lasso* and *Trend Filtering*.

There are three operations involving D in the *spm-image* implementation: D.dot(w_k), D.T.dot(D), and inv_matrix.dot(rho * D.T). D is a bidiagonal matrix and therefore sparse, so using a general dense matrix product is inefficient. I implemented these operations myself.

A presentation (in Japanese) by Mr. Masui, Hacarus’s sparse modeling evangelist, on how to extend *Generalized Lasso* is available here.
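To illustrate why a bidiagonal D does not need a dense matrix product, here is a minimal CPU sketch in plain C++. It assumes that D is the (n-1) x n forward-difference matrix with -1 and +1 on each row; the actual D in this implementation may use a different shape or sign convention. The function names `apply_D` and `apply_Dt` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// y = D w for the (n-1) x n forward-difference matrix: y[i] = w[i+1] - w[i].
std::vector<float> apply_D(const std::vector<float>& w) {
    std::vector<float> y(w.empty() ? 0 : w.size() - 1);
    for (std::size_t i = 0; i < y.size(); ++i) y[i] = w[i + 1] - w[i];
    return y;
}

// x = D^T v, mapping n-1 values back to n values.
std::vector<float> apply_Dt(const std::vector<float>& v) {
    std::vector<float> x(v.size() + 1, 0.0f);
    for (std::size_t i = 0; i < v.size(); ++i) {
        x[i] -= v[i];      // row i of D has -1 in column i
        x[i + 1] += v[i];  // and +1 in column i + 1
    }
    return x;
}
```

Under this assumption, `apply_Dt(apply_D(w))` evaluates D^T D w in O(n) operations, while a dense matrix-vector product needs O(n^2). On the GPU, each output element of `apply_D` can be computed by its own thread.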

## 4.1 Data used for learning

I added noise to a rectangular wave (pulse wave) and set up the task of removing that noise.
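Test data of this kind can be generated along the following lines. This is a sketch in plain C++, not the code used for the experiment. The struct `NoisyWave`, the function `make_noisy_square_wave`, the 0/1 levels, the Gaussian noise model and all parameter values in the usage note are assumptions.

```cpp
#include <cstddef>
#include <random>
#include <vector>

struct NoisyWave {
    std::vector<float> clean;  // ideal pulse wave
    std::vector<float> noisy;  // clean + Gaussian noise
};

// Build a pulse wave of `n` samples that alternates between 0 and 1 every
// `half_period` samples (half_period must be > 0), then add zero-mean
// Gaussian noise with standard deviation `sigma` (sigma must be > 0).
NoisyWave make_noisy_square_wave(std::size_t n, std::size_t half_period,
                                 float sigma, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<float> noise(0.0f, sigma);
    NoisyWave wave;
    wave.clean.resize(n);
    wave.noisy.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        wave.clean[i] = ((i / half_period) % 2 == 0) ? 0.0f : 1.0f;
        wave.noisy[i] = wave.clean[i] + noise(rng);
    }
    return wave;
}
```

For example, `make_noisy_square_wave(1080, 90, 0.1f, 42)` produces 1080 samples, the vector length used in section 4.2. Fixing the seed makes a run reproducible on the same platform, which helps when comparing two implementations on identical input.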

**4.2 Executing Machine Learning**

The commands used to compile and run the program are:

nvcc fused_lasso.cu -O3 --generate-code arch=compute_53,code=sm_53 -lcuda -lcublas_static -lcusolver_static -lcusparse_static

./a.out

The learning result, plotted with *matplotlib*, is shown below. Noise was removed from a noisy 1080-dimensional square wave vector (blue) with *spm-image* (orange) and with this implementation (green), both using Fused Lasso. The left side is slightly out of sync, but I think this is because *spm-image* calculates in *double* and the CUDA implementation in *float*.
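As a rough illustration of how float and double results can drift apart over many operations, the sketch below accumulates the same value in both precisions and reports the gap. It is not a reproduction of the discrepancy in the plot, and the function name `float_double_gap` is hypothetical.

```cpp
#include <cmath>

// Absolute gap between accumulating `step` `n` times in float and in double.
double float_double_gap(float step, int n) {
    float sum_f = 0.0f;
    double sum_d = 0.0;
    for (int i = 0; i < n; ++i) {
        sum_f += step;                       // rounded to float at every addition
        sum_d += static_cast<double>(step);  // same input value, double accumulator
    }
    return std::fabs(static_cast<double>(sum_f) - sum_d);
}
```

Calling `float_double_gap(0.1f, 1000000)` gives a clearly non-zero gap, while a value such as 0.5f, which stays exactly representable along the way, gives none for small counts. An iterative solver like ADMM repeats many such operations, so small per-step rounding differences between the two precisions are plausible.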

## 4.3 Comparing Execution Time

It took 12 seconds to run *spm-image* on my laptop and 185 seconds to run it on the Jetson Nano. With this implementation, the Jetson Nano completed the calculation using CUDA in eight seconds, four seconds less than the laptop.

**5. Conclusion**

Implementing sparse modeling algorithms on the Jetson Nano led me to the following conclusions:

- The Jetson Nano’s computing power is very high.
- Implementing the algorithm was easy, and the result outperformed my laptop.

By using CUDA, I was able to write a program that uses the GPU without much difficulty. I was not able to use nvprof (a profiler that collects and displays profiling data from the command line), but it seems that performance can be improved further.