This post introduces a technique called “Sparse Modeling” that can produce good analysis results even when only a small amount of data is available. The article was written for engineers who want to get started with machine learning and for those who already have experience with deep learning.

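**Note:** The original version of this article was published in Japanese on Codezine.jp: 機械学習プロジェクトにおける課題と、スパースモデリングに期待が高まる背景 (“Challenges in machine learning projects and the background behind growing expectations for sparse modeling”)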

**Introduction**

The field of machine learning, and deep learning in particular, has gained popularity, driven by the acquisition and collection of data through the cloud and IoT. At the same time, consumers, businesses, and whole economies place high expectations on machine learning and AI technology. Because of these developments, the profession of Data Scientist is becoming very popular too; it has been called the “sexiest occupation of the 21st century”.

Not long ago, machine learning could not be used without a deep understanding of mathematics and statistics, but open source frameworks and libraries now make a variety of algorithms available that can be applied with a few lines of code.

As more software engineers wish to learn about machine learning and basic statistical analysis, to extend their careers as Data Scientists and machine learning engineers, we have created this series of articles for readers who

- Want to start learning about machine learning or
- Have experience with deep learning and machine learning

The technique introduced in this article is called “Sparse Modeling”, which most machine learning beginners are not familiar with. We have tried to make this series fun: instead of going through all the details of the theory, we demonstrate it with example code. To follow this series, it helps to have some development experience with a programming language such as Ruby or Python.

In this part we look at the potential of sparse modeling.

**What is Machine Learning?**

As the name suggests, we talk about machine learning when a computer learns from input data and creates a model that can be used to execute tasks.

For example, imagine you want to forecast sales for a retail chain based on the number of visitors, the weather, and the location. This information, together with previous sales figures, is the learning data, while predicting future sales is the task.

As another example, imagine you want to classify the products coming off a factory production line into good items and defective items, based on pictures taken of them. Identifying good or defective items is the task, and the learning data would be images of previously produced products.

The type of data used for learning, such as the number of visitors, weather, image or voice data, can vary widely and depends heavily on the task the model is meant to solve. However, the tasks to be executed are likely to be either prediction tasks (e.g. the sales forecast for the next day) or classification tasks (e.g. deciding whether the product in a picture is defective).

Other potential tasks include topic modeling, recommendation, clustering, etc., but in a business context, these are less common.

If you want to create a rule-based system that predicts sales based on people’s intuition and experience, you will find it practically impossible, because the human evaluation process cannot be articulated as a set of clearly defined rules.

Using machine learning for this task enables you to create a system that can perform predictions and classifications based on relationships that are hidden within the input data and results. In other words, instead of creating a system by explicitly describing the exact rules, you let the machine learn and understand the patterns based on the data itself. This is what is called Artificial Intelligence.

When we speak about a system that learns, there are actually various ways to learn, that is, various kinds of models. Examples are simple linear models, which calculate a result from a weighted combination of input values, and decision trees, which express decisions as branches in a tree structure according to the input information. There are countless other modeling methods, with new ones constantly being developed.
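To make the difference between these two kinds of model more tangible, here is a minimal sketch in Python. The weights, thresholds, and return values are hypothetical and chosen only for illustration; in a real project they would be learned from data.

```python
# Hypothetical weights and thresholds, chosen only for illustration.

def linear_model(visitors: float, temperature: float) -> float:
    """Linear model: the result is a weighted sum of the input values."""
    visitor_weight = 2.0       # assumed weight per visitor
    temperature_weight = 3.0   # assumed weight per degree
    return visitors * visitor_weight + temperature * temperature_weight


def decision_tree(visitors: float, temperature: float) -> float:
    """Decision tree: the result is reached by following branches."""
    if visitors < 50:
        if temperature < 15:
            return 80.0
        return 120.0
    if temperature < 15:
        return 200.0
    return 260.0


print(linear_model(10, 20))   # weighted sum of both inputs
print(decision_tree(10, 20))  # follows the "few visitors, warm day" branch
```

Both functions answer the same question, but they represent the relationship between inputs and sales in very different ways.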

It is the work of Data Scientists to try out different methods to see which one performs best at a given task. This is generally done through trial and error.

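CRISP-DM, a process model established in the context of data mining, can be used to explain this flow in an easy-to-understand way. It describes the work as an iterative cycle of business understanding, data understanding, data preparation, modeling, evaluation, and deployment.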

Recently, automated services like DataRobot have stepped into the area of hypothesis verification and model selection. But in general, establishing and improving the entire decision process remains the domain of Data Scientists.

An important aspect of machine learning is that the results of a machine learning system will never be 100% accurate. In the sales forecasting example, the actual sales figures will usually differ from the predicted ones. In visual inspection of images, good products may be classified as defective and vice versa.

Since in machine learning, the learning is based on limited data, the resulting models are based on assumptions and simplifications. Eventually, it is necessary to verify how precise the results are, and it is almost impossible to reach 100% accuracy. This “mistakable” aspect is an important point to consider when using machine learning in a business context.
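As a rough sketch of what verifying the precision of results can look like for the inspection example, the snippet below compares predictions with the true outcomes. The eight labels are made up for illustration and do not come from a real production line.

```python
# Hypothetical inspection results: True = defective, False = good.
actual    = [False, False, True, False, True,  False, False, True]
predicted = [False, True,  True, False, False, False, False, True]

pairs = list(zip(actual, predicted))
correct = sum(a == p for a, p in pairs)
false_alarms = sum((not a) and p for a, p in pairs)    # good item flagged as defective
missed_defects = sum(a and (not p) for a, p in pairs)  # defective item passed as good

accuracy = correct / len(pairs)
print(f"accuracy: {accuracy:.2f}")
print(f"false alarms: {false_alarms}, missed defects: {missed_defects}")
```

Counting the two kinds of mistake separately matters in practice, because a missed defect and a false alarm usually have very different costs for the business.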

**Explainability required for business**

Probably most people reading this article have heard about deep learning before. As Artificial Intelligence (AI) is currently booming, the majority of new methods that are currently being actively researched and developed are based on deep learning.

Many websites and books cover deep learning, but one point that is not mentioned very often is that it almost eliminates the hypothesis validation process mentioned above. A number of frameworks such as TensorFlow, Caffe, and Chainer have become available and enable developers to easily automate the learning and execution process.

If you are interested in becoming a Data Scientist, should you look only at deep learning? Which problems can be solved with it? At least for now, deep learning is not a silver bullet, and there are many scenarios for which it is not suitable.

One limiting factor is the cost and time involved in collecting sufficient data for learning. Deep learning requires a large amount of data to achieve a good level of accuracy. There are several methods (or rather tricks) for training a deep learning model with a small amount of data, like data augmentation or transfer learning. But generally, these methods don’t lead to high-quality results.

Cost can be addressed by paying money, but the time required to collect big data can be a real problem. For example, suppose it takes a whole year to collect 1,000 images of defective parts from a production line. If the production process changes within this year and the nature of possible defects changes with it, the data gathered before these changes becomes practically useless.

Another problem is the explainability of results. In deep learning, input data is transformed through several layers of the model before a result is returned. Usually, it is impossible to explain why the model returned a certain result, even for the Data Scientist who trained the model. And, as mentioned before, any machine learning system is likely to generate some false results, no matter how much the precision goes up.

Let’s say, for example, you are in charge of a quality control system. If a defective product could cause significant damage, would you want to introduce a system that cannot explain why it wrongly identified that product as good?

As another example, let’s say you are responsible for a retail store. We assume that the sales forecasting system, predicting a sales decrease of 10%, works with high precision. You think about ways to prevent this loss of sales from happening, such as additional advertising to increase the number of visits. But, since the system does not tell you why it expects a 10% drop, your planned advertising might be just the wrong way to address the problem.

#### Black box problem

This problem is also referred to as the “black box problem”, and it is a concern within academia and among businesses working on machine learning solutions. In particular, the trade-off between the interpretability and the performance of machine learning models is an often-discussed topic.

In general, simple linear models, logistic regression, and decision trees are easy to understand, but their prediction and classification performance on complex tasks is often not good. On the other hand, ensemble learning approaches (which combine multiple models) such as random forest, as well as deep learning, do perform well, but the whole process that leads to the results is a black box.
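To illustrate why a linear model counts as easy to understand, here is a small sketch that breaks a prediction down into the contribution of each input. The weights and the helper name `explain` are hypothetical; they stand in for whatever a trained model would provide.

```python
# Hypothetical learned weights of a linear sales model.
weights = {"visitors": 2.0, "temperature": 3.0}


def explain(inputs: dict) -> dict:
    """Return how much each input contributes to the prediction."""
    return {name: weights[name] * value for name, value in inputs.items()}


contributions = explain({"visitors": 10, "temperature": 20})
prediction = sum(contributions.values())

print(contributions)  # contribution of each input to the result
print(prediction)     # the prediction is simply the sum of the contributions
```

Because the prediction is nothing more than the sum of these contributions, a person in charge can see directly which input drives the result. A deep network offers no such direct breakdown.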

In an attempt to bring a degree of explainability to complex models, new approaches like LIME or the more comprehensive SHAP have recently been discussed in the academic field. In addition, methods such as Grad-CAM have been proposed as a way to interpret the decisions of deep learning CNNs (convolutional neural networks) in the context of image processing.

There are currently only a few companies in the business world, like NEC, that prioritize the ability to explain the rationale leading to a conclusion as a differentiation factor – the so-called White Box AI.

In his book “The Strong Data Analysis Organization”, Professor Kaoru Kawamoto, the first recipient of the “Data Scientist of the Year” award by “Nikkei Information Strategy” and also an advisor to Hacarus, writes that no matter how accurate an analysis is, an organization will only use it if it is convincing.

Hence the explainability of a machine learning system becomes an important factor in convincing the people in charge, who are usually used to managing the business based on their intuition and experience.

To create machine learning systems that can be used for business purposes, it is important to look beyond the currently very popular deep learning approach and to find a method for interpreting and evaluating the results of various machine learning methods.

**What is Sparse Modeling?**

In this series, we introduce “Sparse Modeling” as a machine learning approach that can be applied

- with small amounts of data and
- when explainability is required.

Sparse modeling is better thought of as an approach to data analysis than as a single algorithm such as deep learning or Random Forest. If this explanation sounds confusing, don’t worry: we will look at concrete examples in future articles of this series.

In this first article, we explain why sparse modeling works, using an example from junior high school mathematics.

Earlier we said “small amounts of data”. To make this more concrete, let’s assume that the number of data samples available for learning is smaller than the number of attributes (such as temperature, number of visitors, etc.) in the input data.

Looking at the earlier example of a sales forecast, we want to predict sales from the number of visitors and the temperature.

For this example, we will use a linear model to express the sales numbers:

`Sales = number of visitors × (visitor weight factor) + air temperature × (temperature weight factor)`

With this equation, if we know the weight factors for visitors and temperature, we can predict sales from any combination of visitor numbers and temperature. So, the first step in machine learning is to learn these two factors from the data.

The learning data we can plug into the equation are the number of visitors, the temperature, and the sales. Let’s assume the number of visitors is 10, the temperature is 20 degrees, and the sales at that time are $100.

`100 = 10 x1 + 20 x2`

(x1: visitor weight factor, x2: temperature weight factor)

But how do we solve an equation with 2 unknowns? Haven’t we learned in middle school that a single equation with two unknowns has no unique solution?
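To see the difficulty concretely, the following sketch checks a handful of weight pairs against the single observation. The candidate pairs are picked by hand for illustration, and the helper name `fits_observation` is hypothetical.

```python
def fits_observation(x1: float, x2: float) -> bool:
    """Check whether the weights reproduce the observation 100 = 10*x1 + 20*x2."""
    return 10 * x1 + 20 * x2 == 100


# Hand-picked candidate pairs (visitor weight, temperature weight).
candidates = [(10, 0), (0, 5), (2, 4), (4, 3), (-10, 10)]
for x1, x2 in candidates:
    print((x1, x2), fits_observation(x1, x2))
```

Every one of these pairs fits the data perfectly, so the observation alone cannot tell us which weights are the right ones. Some additional assumption is needed to pick one.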

This is how sparse modeling looks at the equation:

- Either temperature or number of visitors is irrelevant to sales
- The smaller the weight factor, the better

Why would these conditions make sense? Have a look at the graph below.

The general assumption in sparse modeling is that much of the data will be zero. In other words, the amount of relevant data is considered to be small or “sparse”.

By taking this sparsity assumption into account, you can obtain results even with a small amount of data. In the previous example, it was found that only the temperature affected sales, which made the result easy to predict and understand.
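The following sketch shows one way this result can come about. It assumes that “the smaller the weight factor, the better” is measured by the sum of the absolute weights (the L1 norm). That criterion is our assumption here, since the article has not yet named one. The simple grid walk is only for illustration and is not how a real solver works.

```python
def l1_norm(x1: float, x2: float) -> float:
    """Sum of the absolute weights; smaller is preferred in this sketch."""
    return abs(x1) + abs(x2)


# Walk along the line 100 = 10*x1 + 20*x2 in steps of 0.5 for x1,
# so that every candidate fits the single observation.
candidates = []
for step in range(-40, 61):
    x1 = step * 0.5
    x2 = (100 - 10 * x1) / 20
    candidates.append((x1, x2))

# Among all fitting candidates, pick the one with the smallest weights.
best = min(candidates, key=lambda w: l1_norm(*w))
print(best)
```

Under this assumption the selected pair has a visitor weight of zero, so only the temperature remains in the model. This matches the statement above that only the temperature affected sales.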

We will talk more about the sparsity assumptions that are made in sparse modeling in a later article, so stay tuned.

Although this is just a simple example, it illustrates how sparse modeling can derive highly explainable results even when only a small amount of data is available, by making use of the “sparseness” in the data and focusing on the relevant parts. Since this approach has been shown to produce good results in such situations, aspiring Data Scientists should add it to their portfolio and consider using it in future applications of AI.

In the next article, we will look at the history of sparse modeling and introduce LASSO, a representative algorithm.