**What Challenges Does Sparse Modeling Solve? **

*The Chronicles of the “LASSO” Representative Algorithm Development & Usage*

*The Chronicles of the “LASSO” Representative Algorithm Development & Usage*

This series introduces a technique called “Sparse Modeling” which can produce good analysis results even if the data set is “small”. The article was written for engineers who want to get started working on machine learning projects and for those who already have experience with deep learning (DL).

In this section, we will explain **LASSO**, a regression analysis algorithm that uses variable selection for prediction accuracy and minimizes the sum of errors. Along with this, we will be looking at the history of Sparse Modeling.

*Note:* *The original version of this article was published in Japanese on Codezine.jp* (by Naoki Kitora, CDO at Hacarus): スパースモデリングはなぜ生まれたか？ 代表的なアルゴリズム「LASSO」の登場

**The Basic Concept of Sparse Modeling**

### Occam’s Razor

In the early 2010s, Big Data was a huge buzzword. Through the enhancement of hardware and available cloud services, by 2019, it has become commonplace to use applications and services built on Big Data. Although there is no clear definition of Big Data, it is typically characterized by a quantitative aspect – generally describing its sheer size – massive amounts of data that are difficult to handle with common databases/RDBMS (Relational Database Management System) – and a qualitative aspect – meaning that the data types are usually diverse.

However, making sense of Big Data is a labor-intensive task. Creation of a single AI model based on Big Data is time-consuming, especially when unsupervised, and massive amounts of energy is needed to support the multiple computers needed for model creation.

Further, in many cases, the available data is not of good enough quality or is difficult and to collect.

The alternative to big data is small data, where we take samples from a larger population. Small data sets are often thought to misrepresent the total data by design.

It’s a bit of an extreme example, but to illustrate, below, we will compare a case where big data is available with a case where we take several smaller samples. When sufficient data is available, we are able to draw one straight line to represent the dense data, as shown in the figure below:

On the other hand, when the data set is small, there are many possibilities for functions that approximate the linear distribution. In an exceptional case, as shown in the following figure where all points are passed through, one could even define a very complex function with a small data set. Defining such a complicated function is obviously a misrepresentation of the actual data set. Clearly, the straight line derived from big data is closer to the truth.

Why does this happen? Is it always difficult to derive the correct answer if there is not enough data?

In the real world, various factors influence the results.

Let’s say you drop a ball and then measure the time it takes for it to reach the ground. Factors that have a big impact on its speed are things like gravity and distance to the ground. If it is a light ball, it might even be the weight and diameter of the ball. But there are probably more factors at play. The air resistance that the ball receives may change due to temperature or humidity changes or may be influenced by wind. If you measure very strictly, it may even be affected by the change in the position of a very heavy object nearby. Sometimes there are also factors that can affect the measurement system such as magnetic energy.

In this way, the data varies depending on many factors, some with bigger and others with a smaller impact. If the available data set is small, these variations will make it difficult to draw accurate conclusions as to the chances that factors of lesser importance become over-represented.

This is where Occam’s razor comes in handy – originally presented by 14th-century theologian Occam. He said that we** “ should not assume more factors than necessary to explain something”**. Following this principle, we should eliminate factors that cause variation (In the preceding example factors such as wind influence caused by things passing by) and create a model with only important factors.

Sparse Modeling is built with the same fundamental principle. Using sparse modeling we can make sense of small data sets by focusing on the key features.

### The History of Sparse Modeling

In practice, there are several approaches to Sparse Modeling, for data science its gained popularity as a result of **LASSO**, advocated by Robert Tibshirani (Stanford University), originally presented in 1996 (*More details about LASSO later*).

Another approach, **Compressed Sensing **(*also known as Sparse Sensing*) is used in medical fields and in astronomy. The methodology revolutionized measurement technologies such as microscopes and telescopes, and was proposed by David L. Donoho (Stanford University) in 2006. In short, Compression Sensing makes it possible to obtain highly accurate measurement results far faster than techniques that preceded it.

On the other hand, the **Compressed Encoding** method represents the original data through a base vector called a dictionary and a sparse vector. This method has been advocated since the late 2000s in various application fields, such as removing noise from images or super-resolution, a way to increase the resolution of images.

Although widely researched in the academic world, there are still few cases of Sparse Modeling applied to solve business problems – Hacarus is pioneering its commercial use.

Compared with Deep Learning-based approaches Sparse Modeling offers a series of benefits. Deep Learning requires massive amounts of data to create accurate predictions, the bigger the data set the better. However,this is both time and energy consuming. Recent studies have indicated that a single AI model based on Deep Learning technology pollutes the environment as much as 5 cars for their entire lifecycle.

Furthermore, Deep Learning also lacks an understanding of what data it actually handles. This opaque way of operating is also known as the “black box” where an AI’s decision-making is not accompanied with reasoning, therefore, lacking transparency and inciting bias from collected algorithms – in short, DL can only produce results, not the method that was used to derive the conclusion, this means it’s impossible for humans to understand the choices the machine took. With Sparse Modeling the Features that lead to the conclusion are clearly understood by the human user – this is called explainable White box ai.

**Linear Regression and Sparse Modeling**

From here, we will introduce LASSO (**least absolute shrinkage and selection operator**), a type of linear regression that uses shrinkage and variable selection method, which is the method of Sparse Modeling used at Hacarus, along with concrete examples including the philosophy behind it. The sample code is available via GitHub.

- hacarus/codezine-sparse-modeling（GitHub）

### Linear Regression Models and Optimization Problems

In the case of “two unknowns, but one equation”, which normally can not be solved, we earlier explained how Sparse Modeling offers a way to find a solution – by assuming that involved variables in an equation are sparse even if the equation may be unknown. In other words, Sparse Modeling can be used even when the number of explanatory variables is fewer than the number of data, assuming that many of the explanatory variables are sparse.

For practical use, it is more likely that sparse modeling is used to select this feature value.

As a concrete example, consider the sales data of a store (sales.csv). This data contains the following five explanatory variables and sales data.

- Shop size
- Distance from station
- Population around
- Number of competitors
- Number of store items

Normally, we would look at basic statistical data and correlation coefficients and select variables according to the analysis needs. Instead, this time we will make a multiple regression analysis.

When you execute this you get the following results:

If the sales is y, the shop size is x_{1}, the distance from the station x_{2}, the population around x_{3}, the number of competing stores x_{4}, and the number of items for sale x_{5}, the regression formula is as follows:

Since this is a value obtained from the data used for learning, it tends to show a slightly higher numerical value, but the coefficient of determination (R square) is 0.95, which is quite high.

However, if we assume the solution is sparse, the fact that all the variables are used deviates from the concept of Occam’s razor. Also, in this example, it may be difficult to realize because explanation variables are few, but as the number of explanatory variables to be input increases, it is hard to think that all variables contribute more than a certain amount to find the objective variable. Depending on the precision you are looking for, in most cases, it is enough for practical use with a simple model.

### L_{0} Norm Optimization

By using Sparse Modeling you can select explanatory variables and calculate their coefficients at the same time. There are several methods for Sparse Modeling – first, we introduce L_{0} optimization, which is known to lead to a true solution, if the solution is sufficiently sparse.

Let y be the vector of sales, Φ the matrix of the data which is the explanatory variable, and x the vector of the coefficient to be determined. Then we have to solve the following equation:

L_{0} optimization means to minimize the L0 norm^{ [1]} of x. The L_{0} norm of vector x is the number of nonzero elements of x. In other words, the goal of L_{0} optimization is to find as few nonzero elements as possible of x. The most straightforward way to solve this problem is the brute force method described below. Here, let x be the most sparse solution and L_{0} norm at that time be k.

- If y=0 end with x*=0
- Find out if y = Φx has a solution for all x where and if it has a solution it ends as a solution
- Return to 3. as

Although the algorithm itself is very simple, this is a combinatorial optimization. As the number of dimensions of the vector x increases, the number of possible combination explodes, which increases the possibility that the calculation cannot be completed within a realistic time.

### L_{1} Norm Optimization and LASSO

The combination explosion occurs in L_{0} optimization because we use the L_{0} norm. The L_{0} norm can be approximated by the L_{1} norm defined by the following equation:

** **

As can be seen from the formula, the L_{1} norm is the sum of the absolute values of the elements of *x*.

By approximation, the L_{0} norm minimization problem is a problem where you find the point of the L_{1} norm minimum among the points on the hyperplane* y = Φx* in n-dimensional space. When x is two-dimensional, it becomes as follows when it represents it in the figure.

When connecting the points where the L_{1} norm is constant, the vertex becomes the rhombus above the coordinate axis. The point at which the straight line first collides with the straight line after increasing the diamond gradually from the center point where the area becomes zero is the optimal solution. In most cases, because it hits the vertex of the rhombus on the coordinate axis, one of the coordinates becomes zero and a sparse solution is obtained.

Up to this point, we have explained on the assumption that noise is not mixed in the data. However, in the real world, it is normal that noise is mixed in the data. Using a least-squares method to compute approximate curves is a well-known method. However, in data containing noise, over-learning may occur as explained at the beginning (although I will omit detailed explanation). We can use the L_{1} norm as a regularization term to prevent over-learning. The problem is an optimization problem that minimizes the value of the following expression.

This optimization problem is called LASSO. LASSO is an algorithm that triggered widespread use of sparse modeling. Implementation of LASSO is in *scikit learn*, so we will try linear regression analysis using LASSO for data subjected to multiple regression analysis.

The execution result is as follows:

It is clear that the importance of the fifth variable becomes zero, which indicates selecting variables is one of the features of sparse modeling. The coefficient of determination is also almost the same as in the case of multiple regression using all variables.

### Solving L_{0} Norm Optimization Using the Greedy Algorithm – Orthogonal Matching Pursuit

We explained that using the brute force algorithm in the L_{0} optimization problem caused a combined explosion and would not be practical for use. The algorithm known as the *greedy algorithm* is effective in such cases. The greedy algorithm divides the problem into several subproblems and selects the best solution among the subproblems.

Orthogonal Matching Pursuit (OMP) is an iterative greedy algorithm used to solve L_{0} optimization. The OMP algorithm finds index set S that points to an element of a nonzero vector x. This index set of nonzero coefficients is called support. Suppose that this index set, support, is an empty set. We add a new base to the support set one-by-one so as to minimize the residual (also known as residual reduction) when x is approximated by the linear combination of basis Φ, and are included in support. It stops the residual when x is approximated with only the basis that becomes the threshold value or less. The greedy algorithm conclusively selects bases that contribute to the reduction of residuals. Optimality of the solution is not guaranteed, but in many cases, an excellent approximation is obtained. The following code that applies OMP to sales data is shown below.

When implemented, the result is the following output:

In Orthogonal Matching Pursuit, the number of features whose coefficients are non-zero is not automatically calculated but passed as constructor arguments like n_nonzero_coefs = 4. If not specified, 10% of the number of features is used as a default value. It is better to adjust while looking at the coefficient of determination.

^{[1]} *The L _{0} norm does not satisfy homogeneity mathematically, so in a strict sense, it is not a norm. However, according to the convention, here it is referred to as a norm.*

**Development of LASSO**

If there are some relationships such as similarity among multiple explanatory variables included in the data to be handled, imposing constraints that take these relationships into consideration makes it possible to obtain better results. In research on sparse modeling so far, we devised the regularization term of LASSO, expressing restrictions on the data and creating various evolutions. Here, let us briefly introduce a part of it.

### Time Series and Sequence Data are Applicable to Fusing LASSO

In order to analyze data, such as time series data and some adjacent explanatory variables which have the same degree of contribution to the objective variable, using a method called Fused LASSO is effective. By applying Fused LASSO, it is possible to identify explanatory variables with the same contribution rate to the target variable.

Fused LASSO has no implementation in the scikit learn mentioned earlier, but there is implementation in spm-image, which can be found in the open source sparse modeling library written by contributors. If you are interested, please see this sample.

### Category Data Applicable to Group LASSO

It is common to analyze categorical variables as dummy variables in data analysis. For example, when dealing with gender, you may come up with the following data.

ID | Male | Female |
---|---|---|

0001 | 1 | 0 |

0002 | 0 | 1 |

If sparse modeling is applied to a regression model including dummy variables, one of the coefficients of the dummy variable may become zero. In the above example, one of the variables of male or female may be chosen. This is problematic. Since the variable is gender, it is necessary to treat each of the dummy variables of male and female as a one gender group and then estimate whether these coefficients are zero or non-zero collectively. In such a case, you can apply sparse modeling by using group LASSO (Group LASSO).

*Each method will not be explained in detail in this article, but if you are interested please see the following papers*:

*Sparsity and smoothness via the fused lasso*by Robert Tibshirani and Michael Saunders*Model selection and estimation in regression with grouped variables*by Ming Yuan

*To sum up this article*, we’ve explained what Sparse Modeling is, its benefits as opposed to deep learning, and LASSO- the algorithm that popularized sparse modeling. Sparse modeling is a technique that Hacarus pioneers, where its usage is scarce in the business industry. We’ve given some example demonstrations through graph illustration to explain the difference in data sets between having sufficient data points (big data) and a few data points (small data). Deep learning consumes a large amount of time and energy, and overall, is a laborious task. Deep learning is incapable of understanding data and therefore, its workings are considered to be a “black box”. Without understanding how these AI based choices are handled, its data is “unexplainable” and lacks transparency.

Additionally, we explained linear regression, and more specifically LASSO- how it allows sparse solutions as well as how sparse modeling helps find solutions to equations. We give a case in the instant where sparse modeling is used to select feature values. Sparse modeling (an explainable, “white box” approach) uses LASSO, which is an algorithm that allows minimizing errors with variable selection for accuracy. Therefore, this allows small data sets, which originally have variations that lead to inaccuracies, to gather important factors and eliminate the unimportant factors.

We explain several methods for sparse modeling, one of them, L_{0} optimization which can lead to a true solution granted the solution is sufficiently sparse. Second, we explain the L_{1} norm which can approximate the L_{0} norm. The L_{1} norm can be used to prevent overlearning. The Orthogonal Matching Pursuit, which is a greedy algorithm used to solve L_{0} optimization, finds the most optimal solution to a group of subproblems.

Lastly, we explained two types of LASSO methods called Fused LASSO and Group Lasso. Fused Lasso allowed for data to be analyzed in time series and to identify explanatory features with the same contribution rate and the target variable. Group LASSO allows us to incorporate additional information with dummy variables and select groups.

Covering the topics of sparse modeling, norm optimization, and LASSO can give insight to engineers or tech enthusiasts interested in sparse modeling and its future. With Hacarus in the lead, we believe sparse modeling can pave the way for innovation.