Nice to meet you!

I’m Ippei. Since March, I have been a data scientist in training under the collaborative data scientist training program run by Hacarus and Datamix. I study every day, and as part of the program I recently attended the Machine Learning/Data Learning Spring Camp at Osaka University’s Center for Mathematical Modeling and Data Science.

Let me tell you a little bit about what I learned below:

## About the Machine Learning/Data Learning Spring Camp

The Machine Learning/Data Scientist Spring Camp (2019) was an event organized and held at the Osaka University Center for Mathematical Modeling and Data Science. The camp was aimed not only at researchers but also at students and young adults who work with data analysis.

The event ran for two days. The first day was a class titled “Introduction to Gaussian Process and Machine Learning”. The second day offered two lectures held at the same time: “What is the current Bayesian statistics” and “Mathematics of Machine Learning: First Hand”. I only participated on the second day, in “Mathematics of Machine Learning”.

As far as I know, both days were great successes, especially the first day’s lecture, “Introduction to Gaussian Process and Machine Learning”. A book with the same name had been published just 4 days before the lecture.

There was a lot of mathematical debate on the Slack channel that was set up solely for questions on that day. It was a difficult task for the lecturers to keep up with the discussions in real time.

The second-day lecture, “Mathematics of Machine Learning: First Hand”, packed all kinds of topics into a single day, from simple linear regression to classification, information criteria, decision trees, SVM, and more.

Although the explanations were given with mathematical formulas, there was also time for implementation in R, and the lecture focused on using machine learning models while working through the mathematics behind them.

Both lectures packed in 6 hours of information. They were very fast-paced and high-level, so you had to understand everything on the spot! That was a pretty difficult task for me.

On the bright side, I can say that in both lectures the teachers understood their participants well and wanted them to be interested in the lecture topics. In addition, they responded kindly to our questions on Slack and were enthusiastic about all of us stepping into a new realm of knowledge.

Next, let me tell you what I learned about the Gaussian process and machine learning.

## Advantages of the Gaussian Process Regression

What is the merit of using Gaussian process regression instead of simple linear regression in the first place?

The first is that you can “know how much you don’t know”.

In a regression model using a Gaussian process, the output is obtained as a probability distribution, so the variance of the prediction is known. In an ordinary linear regression model, the function is uniquely determined, so the uncertainty of the prediction cannot be expressed.

Moreover, since the distribution is visible, data can be collected efficiently: gather more data where the predictive variance is large, and skip data collection where it is small.

The second is that “model selection and feature selection are possible”.

I won’t go into details in this post, but you can select the model that generates the right function to fit the data and select the important features by comparing each set of functions.

As a specific application example, the reference book (Gaussian Process and Machine Learning) describes a diagnostic system for medical examination information. If the system, along with its answer, reports that it has little data to go on (an unusual case, missing examination information, etc.), the patient can make an informed decision about seeking a second opinion.

The author expresses this as “honest artificial intelligence”.

## Implementation of Gaussian Process Regression

Next, let’s create a model from the data using Gaussian process regression.

If you would like to see the detailed mathematical expressions and derivations that are omitted here, please refer to “Gaussian Process and Machine Learning” in the MLP series.

All the source code can be found here.

First of all, let’s take a look at the sample data below:

The data are generated from

y = 0.1x + 0.3 sin(x) + ε

where x is a uniform random number drawn from the range [-5, 5]. 30 points are sampled and plotted. ε is a random number drawn from the standard normal distribution and multiplied by 0.5.
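The sampling procedure described above can be sketched in NumPy as follows. This is a minimal sketch, not the original notebook code: the function name `make_sample_data` and the fixed seed are assumptions added for reproducibility.

```python
import numpy as np

def make_sample_data(n=30, seed=0):
    """Draw n points from y = 0.1x + 0.3 sin(x) + eps with x uniform on [-5, 5]."""
    rng = np.random.default_rng(seed)       # seed is an assumption, not from the post
    x = rng.uniform(-5.0, 5.0, size=n)      # uniform inputs in [-5, 5]
    eps = 0.5 * rng.standard_normal(n)      # standard normal noise scaled by 0.5
    y = 0.1 * x + 0.3 * np.sin(x) + eps
    return x, y

x, y = make_sample_data()
```

A scatter plot of `x` against `y` (for example with matplotlib) should give a picture comparable to the sample data figure, although the exact points depend on the seed.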

Let’s start with a simple linear regression.

The weight vector w of the linear regression is obtained as follows:

w = (ΦᵀΦ)⁻¹ Φᵀ y

Φ is the design matrix of basis functions: each row holds the basis function values for one data point.

*Incidentally, this is the solution that minimizes the squared-error loss (the least squares method).

Here, solving with the basis functions φ0(x) = x and φ1(x) = sin(x) produced the following nice curve.
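The least squares fit with these two basis functions can be sketched as below. This is a minimal NumPy sketch rather than the original code: the data generation (with an assumed seed) is repeated so the block runs on its own, and `design_matrix` is a hypothetical helper name.

```python
import numpy as np

# Sample data as described earlier (the seed is an assumption).
rng = np.random.default_rng(0)
x = rng.uniform(-5.0, 5.0, size=30)
y = 0.1 * x + 0.3 * np.sin(x) + 0.5 * rng.standard_normal(30)

def design_matrix(x):
    """Stack the basis functions phi0(x) = x and phi1(x) = sin(x) as columns."""
    return np.column_stack([x, np.sin(x)])

Phi = design_matrix(x)

# Least squares solution; equivalent to (Phi^T Phi)^{-1} Phi^T y,
# but lstsq avoids forming the inverse explicitly.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Evaluate the fitted curve on a grid for plotting.
x_grid = np.linspace(-5.0, 5.0, 200)
y_fit = design_matrix(x_grid) @ w
```

With enough data, `w` should land near the true coefficients (0.1, 0.3), though with only 30 noisy points some deviation is expected.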

Next is Gaussian process regression.

### What is Gaussian process regression?

My general understanding is that it fits and models functions of complex shapes with the sum and product of kernels (RBF kernels, etc.).

Such a method is necessary because, when basis functions are laid out for a simple linear regression model, the number of basis functions grows exponentially with the number of input dimensions, and the amount of calculation becomes enormous. (This is called the *curse of dimensionality*.)

To avoid this, take the expectation over the weight vector so that it is integrated out. You can then see that the output y follows a multivariate Gaussian distribution whose covariance matrix is determined by the inputs.

A function that gives the elements of this covariance matrix is called a *kernel function*. The kernel function is actually an inner product of feature vectors, but if you know the kernel function you know the distribution of y, so you no longer need to know the feature vectors themselves.

In other words, you don’t have to know each basis function. This is called the *kernel trick*. Various kernels can be used as kernel functions, and sums and products of kernels are also kernels.
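Written out, the argument looks roughly like this. It is a sketch under a standard assumption that the post does not state explicitly: the weights have a zero-mean Gaussian prior with variance λ², where the symbol λ is notation introduced here.

```latex
\begin{aligned}
\mathbf{y} &= \Phi \mathbf{w}, \qquad \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \lambda^2 I) \\
\mathbb{E}[\mathbf{y}] &= \Phi\, \mathbb{E}[\mathbf{w}] = \mathbf{0} \\
\operatorname{Cov}[\mathbf{y}] &= \mathbb{E}[\mathbf{y}\mathbf{y}^\top]
  = \Phi\, \mathbb{E}[\mathbf{w}\mathbf{w}^\top]\, \Phi^\top
  = \lambda^2 \Phi \Phi^\top = K \\
K_{nn'} &= k(x_n, x_{n'}) = \lambda^2\, \boldsymbol{\phi}(x_n)^\top \boldsymbol{\phi}(x_{n'})
\end{aligned}
```

Since y is a linear transformation of the Gaussian vector w, it is itself Gaussian: y ~ N(0, K). Only the values k(x_n, x_n') are needed, never the feature vectors φ(x) themselves.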

*By the way, I first heard the term “kernel trick” in my second year at university. After reading the reference book, I finally understood its meaning and significance.*

Let’s get back to it.

Now let’s create a model for the previous data using GPy, a library that makes Gaussian process regression easy to use.

The kernel uses only RBF.
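To make the computation behind such a model visible, here is a minimal NumPy sketch of Gaussian process regression with an RBF kernel. It is not the GPy code used for the figures; with GPy the model would be built along the lines of `GPy.models.GPRegression(X, Y, GPy.kern.RBF(input_dim=1))`. The helper names `rbf_kernel` and `gp_predict`, the seed, and the hyperparameter values of 1.0 (no optimization) are assumptions for illustration.

```python
import numpy as np

def rbf_kernel(a, b, variance=1.0, lengthscale=1.0):
    """RBF kernel k(a, b) = variance * exp(-(a - b)^2 / (2 * lengthscale^2))."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_predict(x_train, y_train, x_test, variance=1.0, lengthscale=1.0, noise=1.0):
    """Posterior mean and variance of the latent function at x_test (zero-mean GP)."""
    n = len(x_train)
    K = rbf_kernel(x_train, x_train, variance, lengthscale) + noise * np.eye(n)
    K_s = rbf_kernel(x_train, x_test, variance, lengthscale)
    # Cholesky factorization instead of an explicit inverse, for numerical stability.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha
    var = variance - np.sum(v**2, axis=0)   # k(x*, x*) equals variance for RBF
    return mean, var

# Sample data as described earlier (the seed is an assumption).
rng = np.random.default_rng(0)
x = rng.uniform(-5.0, 5.0, size=30)
y = 0.1 * x + 0.3 * np.sin(x) + 0.5 * rng.standard_normal(30)

x_grid = np.linspace(-5.0, 5.0, 200)
mean, var = gp_predict(x, y, x_grid)
std = np.sqrt(var)
```

Plotting `mean` together with a band such as `mean ± 2 * std` gives the usual mean line with an uncertainty band around it.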

In this graph, the blue line is the expected value of y for each x (only x = [-5, 5] is shown here), and the light blue band shows the spread of the posterior distribution.

This band indicates the error range.

Even when the prediction is off, the true value is expected to fall within this band in most cases.

You may be thinking, “Isn’t the band too wide?” This is because the kernel was used as-is, without optimizing its parameters. Since GPy also has an optimization function, let’s try parameter optimization, which takes very little code.

By performing optimization, the error range can be reduced.

There are three optimization methods: SCG, L-BFGS, and TNC.

If you read the source, the default is only described as the “preferred optimizer”, which leaves open the question of which one is actually chosen.

This time we optimized with the default settings, and L-BFGS was the one selected.
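What the optimization is doing can be sketched without GPy. My understanding is that the hyperparameters are chosen to maximize the log marginal likelihood of the data. The sketch below swaps the gradient-based optimizer (L-BFGS) for a coarse grid search to keep it short, so it illustrates the objective rather than the actual optimizer. The grid values, the seed, and the function names are assumptions.

```python
import numpy as np
from itertools import product

def rbf_kernel(a, b, variance, lengthscale):
    """RBF kernel k(a, b) = variance * exp(-(a - b)^2 / (2 * lengthscale^2))."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def log_marginal_likelihood(x, y, variance, lengthscale, noise):
    """log p(y | x, hyperparameters) for a zero-mean GP with Gaussian noise."""
    n = len(x)
    K = rbf_kernel(x, x, variance, lengthscale) + noise * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -0.5 * log|K| equals -sum(log(diag(L))) because K = L L^T.
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2 * np.pi)

# Sample data as described earlier (the seed is an assumption).
rng = np.random.default_rng(0)
x = rng.uniform(-5.0, 5.0, size=30)
y = 0.1 * x + 0.3 * np.sin(x) + 0.5 * rng.standard_normal(30)

grid = [0.1, 0.3, 1.0, 3.0, 10.0]   # hypothetical candidate values
default = (1.0, 1.0, 1.0)           # unoptimized (variance, lengthscale, noise)

# Pick the (variance, lengthscale, noise) triple with the highest likelihood.
best = max(product(grid, repeat=3),
           key=lambda p: log_marginal_likelihood(x, y, *p))

default_ll = log_marginal_likelihood(x, y, *default)
best_ll = log_marginal_likelihood(x, y, *best)
```

Because the unoptimized setting is itself one of the grid points, `best_ll` can never be lower than `default_ll`. A gradient-based optimizer searches the same objective over continuous values instead of a fixed grid.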

This graph plots the predictions of linear regression and Gaussian process regression, with y = 0.1x + 0.3 sin(x) drawn as the yellow line.

Both linear regression and Gaussian process regression appear to give reasonable predictions, but the major difference between the two is whether or not a model has to be set in advance.

For example, let’s look at the result of linear regression when the model is specified badly.

Here we perform linear regression with the basis functions φ0(x) = x^2 and φ1(x) = x^3.

The resulting prediction is shown in the figure.
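The effect of the basis choice can be checked numerically with a small sketch like the one below. It fits both basis sets by least squares and measures how far each fitted curve is from the noise-free function on a grid. The helper names (`truth`, `fit_and_rmse`, `good_basis`, `bad_basis`) and the seed are assumptions for illustration.

```python
import numpy as np

def truth(x):
    """Noise-free generating function y = 0.1x + 0.3 sin(x)."""
    return 0.1 * x + 0.3 * np.sin(x)

def good_basis(x):
    """phi0(x) = x, phi1(x) = sin(x): matches the generating function."""
    return np.column_stack([x, np.sin(x)])

def bad_basis(x):
    """phi0(x) = x^2, phi1(x) = x^3: a poorly chosen model."""
    return np.column_stack([x**2, x**3])

def fit_and_rmse(basis, x, y):
    """Least squares fit on (x, y), then RMSE against the truth on a grid."""
    w, *_ = np.linalg.lstsq(basis(x), y, rcond=None)
    x_grid = np.linspace(-5.0, 5.0, 200)
    return np.sqrt(np.mean((basis(x_grid) @ w - truth(x_grid)) ** 2))

# Sample data as described earlier (the seed is an assumption).
rng = np.random.default_rng(0)
x = rng.uniform(-5.0, 5.0, size=30)
y = truth(x) + 0.5 * rng.standard_normal(30)

rmse_good = fit_and_rmse(good_basis, x, y)
rmse_bad = fit_and_rmse(bad_basis, x, y)
```

With the matching basis the only error comes from the noise in the data, while the x^2, x^3 basis cannot represent the sin(x) term however the weights are chosen.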

In the linear regression above, I knew the generating function in advance, so I could prepare basis functions (here φ0(x) = x, φ1(x) = sin(x)) that express it nicely as a linear sum.

If different basis functions are prepared, the predictions may deviate greatly.

Gaussian process regression is very interesting because it can make such predictions without a model being set up in advance.

## Conclusion

We used the Gaussian process regression learned at the spring camp and compared the result with simple linear regression.

In addition to not having to set up a model, it was very interesting that the reliability of the result can be seen directly, because both the mean and the variance of the output are obtained.

Nonetheless, Gaussian process regression has the problem that the inverse matrix calculation takes O(N^3) time, where N is the number of data points.

One solution is the inducing variable method (also translated as the auxiliary variable method). It is not covered in this post, but I’ll talk about it in a future post.

That’s it!

Thanks for sticking around!
