Hello everyone, thank you for reading my article today. My name is Yushiro, a data scientist at HACARUS.
At HACARUS, we work with various machine learning models and we also write blogs about learning and interpretation methods for AI.
In my previous article, I introduced partial dependence plots (PDP), individual conditional expectation (ICE) plots, and derivative ICE (d-ICE) plots. These are model-agnostic interpretation methods, so they can be applied to any machine learning model. While PDP and ICE plots are useful for visualizing the effect that each feature has on the model’s prediction, they do not work well when there are interactions between features. d-ICE, on the other hand, can tell you whether an interaction exists, but not what type of interaction it is.
For my final article in this short series, I want to introduce RuleFit, a modeling method for automatically discovering interactions between features.
Explanation of RuleFit
One of the most basic machine learning models is the linear regression model. In linear regression, the relationship between each feature and the prediction is assumed to be linear, so the model is fully described by a slope (weight) for each feature and an intercept.
The advantage of linear regression models is that they are simple and highly interpretable. However, they rely on assumptions such as linearity and independence. This means that, on their own, they cannot account for nonlinear effects or interactions between features.
Conversely, one model that is able to account for these effects is the decision tree. A decision tree captures interactions through rules that refer to multiple features, but it does not model linear relationships. Its predictions are piecewise constant: even if the value of a feature changes slightly, the prediction may stay the same, and it may change drastically once the value crosses a certain threshold. For these reasons, a decision tree may not be appropriate when the prediction should change smoothly.
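The contrast between the two model families can be seen in a small sketch. The threshold, leaf values, slope, and intercept below are made-up numbers for illustration, not values from any fitted model.

```python
def tree_predict(x, threshold=5.0, left_value=1.0, right_value=3.0):
    """A one-split decision tree (stump): constant on each side of the threshold."""
    return left_value if x <= threshold else right_value


def linear_predict(x, slope=0.4, intercept=0.0):
    """A linear model: the prediction changes smoothly with x."""
    return intercept + slope * x


# Moving from 4.0 to 4.9 does not change the tree's output at all,
# while moving from 4.9 to 5.1 makes it jump to the other leaf value.
for x in (4.0, 4.9, 5.1, 6.0):
    print(f"x={x}: tree={tree_predict(x)}, linear={linear_predict(x):.2f}")
```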
When creating a linear model that involves interactions, a common method is to add the interaction term as a new feature. Can this process be automated? It can, using the RuleFit method I will introduce below. By combining decision trees and linear models, RuleFit automatically discovers interactions from the data and builds a linear model that takes them into account.
To learn more about this topic, please see the RuleFit chapter of Interpretable Machine Learning, a book written by Christoph Molnar that is also the subject of a translation project.
RuleFit works by first training a decision tree model on the dataset and extracting decision rules from it. Next, it adds new features to each instance based on these rules. For example, if the decision tree has learned the rule ‘170cm or taller and 70kg or less’, we add a new feature that is ‘1’ if the instance satisfies the condition and ‘0’ otherwise.
By training a linear regression model that includes these new features, we can perform a linear regression that also accounts for interactions between features. With the interpretability of the model in mind, it is recommended to keep the decision trees shallow, with a depth of about three. It is also recommended to train the linear model with LASSO, which performs feature selection and keeps only the interactions that are actually important.
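Adding an interaction term by hand looks roughly like the sketch below. It uses NumPy and synthetic data whose coefficients I chose arbitrarily, so it only illustrates the idea: a plain linear model cannot fit the product term, while a model with the extra column recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, size=200)
x2 = rng.uniform(-1, 1, size=200)
# Synthetic target with a true interaction effect (coefficients are arbitrary).
y = 1.0 + 2.0 * x1 + 3.0 * x2 + 4.0 * x1 * x2

ones = np.ones_like(x1)
X_plain = np.column_stack([ones, x1, x2])           # no interaction term
X_inter = np.column_stack([ones, x1, x2, x1 * x2])  # interaction added by hand

# Ordinary least squares for both design matrices.
coef_plain, *_ = np.linalg.lstsq(X_plain, y, rcond=None)
coef_inter, *_ = np.linalg.lstsq(X_inter, y, rcond=None)

rmse_plain = np.sqrt(np.mean((X_plain @ coef_plain - y) ** 2))
rmse_inter = np.sqrt(np.mean((X_inter @ coef_inter - y) ** 2))
print(f"without interaction term: RMSE = {rmse_plain:.3f}")
print(f"with interaction term:    RMSE = {rmse_inter:.3f}")
```

The drawback of this manual approach is that we have to know in advance which features interact.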
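The height and weight example can be sketched as follows. The instances and the column names `height_cm`, `weight_kg`, and `rule_tall_and_light` are hypothetical and only serve to show how a rule becomes a 0/1 feature.

```python
def tall_and_light(instance):
    """Return 1 if the rule '170 cm or taller AND 70 kg or less' holds, else 0."""
    return int(instance["height_cm"] >= 170 and instance["weight_kg"] <= 70)


# Hypothetical instances, for illustration only.
instances = [
    {"height_cm": 175, "weight_kg": 65},
    {"height_cm": 175, "weight_kg": 80},
    {"height_cm": 160, "weight_kg": 55},
]

# Append the rule as a new 0/1 feature next to the original features.
augmented = [{**row, "rule_tall_and_light": tall_and_light(row)} for row in instances]
for row in augmented:
    print(row)
```

Only the first instance satisfies both conditions, so only it receives a ‘1’ for the new feature.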
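To give a feeling for how LASSO performs feature selection, here is a minimal sketch of a simplified special case. When the features are orthonormal, the LASSO solution is known to be a "soft-thresholded" version of the least-squares weights: small weights are set exactly to zero and the rest are shrunk. Real rule features are not orthonormal, and the weights and penalty below are made up, so this is only an illustration of the mechanism.

```python
def soft_threshold(coef, penalty):
    """LASSO solution for one coefficient when the features are orthonormal."""
    if coef > penalty:
        return coef - penalty
    if coef < -penalty:
        return coef + penalty
    return 0.0


# Hypothetical least-squares weights for two original features and two rules.
ols_weights = {
    "height": 0.80,
    "weight": 0.05,
    "rule_tall_and_light": -0.40,
    "rule_other": 0.02,
}

penalty = 0.10  # hypothetical regularization strength
lasso_weights = {name: soft_threshold(w, penalty) for name, w in ols_weights.items()}
selected = [name for name, w in lasso_weights.items() if w != 0.0]
print(lasso_weights)
print("kept:", selected)
```

Features whose weights end up at zero drop out of the model, which is what keeps the final list of rules short.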
Next, we can move on to some use cases for RuleFit using real data.
Christoph Molnar, the author of the previously mentioned book, Interpretable Machine Learning, released a Python package under the MIT license that implements RuleFit. In this article, we will use that package to demonstrate the capabilities of RuleFit.
If you are interested, the code used in this article can also be found on Google Colab.
Before starting the experiment, first, we need to install the package.
Note that the command ‘pip install rulefit’ will install a different package.
The next step is to load the dataset. For this experiment, we will use the California housing dataset from scikit-learn.
After loading the data, let’s start by taking a quick look at the data itself.
The data contains 20,640 instances and 8 features. Note that the instances are not individual houses, but rather districts within California. The 8 features are as follows:
- MedInc: Median income in the district
- HouseAge: Median house age in the district
- AveRooms: Average number of rooms per household
- AveBedrms: Average number of bedrooms per household
- Population: Population of the district
- AveOccup: Average occupancy (number of household members)
- Latitude: Latitude of the district
- Longitude: Longitude of the district
The target variable is the median house value in each district (expressed in units of $100,000), so this is a regression problem. Let’s start by training RuleFit using this data.
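A training step along these lines might look like the sketch below. The function name, the 80/20 split, and the `random_state` value are my own assumptions. The `RuleFit` class, its `max_rules` and `random_state` arguments, and the `feature_names` argument of `fit` reflect the package's interface as I understand it, so please check them against the version you install.

```python
def train_rulefit_on_california(max_rules=2000, test_size=0.2, random_state=0):
    """Fit RuleFit on the California housing data; return the model and test split."""
    # Imports are local so the function can be defined without the packages installed.
    from rulefit import RuleFit  # the package discussed above
    from sklearn.datasets import fetch_california_housing
    from sklearn.model_selection import train_test_split

    data = fetch_california_housing()  # downloads the data on first use
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, test_size=test_size, random_state=random_state
    )

    model = RuleFit(max_rules=max_rules, random_state=random_state)
    # The model is fitted on NumPy arrays; feature names are passed separately.
    model.fit(X_train, y_train, feature_names=list(data.feature_names))
    return model, X_test, y_test
```

Calling `model, X_test, y_test = train_rulefit_on_california()` and then `model.predict(X_test)` would give predictions for the held-out districts. Because the split here is an assumption, the scores it produces will not necessarily match the ones reported in this article.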
The trained model is then used to make predictions on the test data, and these predictions are evaluated by the root mean squared error (RMSE).
The RMSE for RuleFit was 0.487. For comparison, the RMSE was 0.728 for the linear regression model (LassoCV) and 0.732 for the decision tree model (DecisionTreeRegressor).
From these results, it would seem that RuleFit outperforms the linear regression model and the decision tree because it can consider both linearity and interactions.
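For reference, RMSE is the square root of the average squared difference between the true values and the predictions. A minimal pure-Python sketch is shown below; the numbers in the example call are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical targets and predictions, in the dataset's units.
print(rmse([2.0, 3.0, 4.0], [2.5, 3.0, 3.5]))
```

Lower values are better, and the result is in the same units as the target variable.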
Continuing our experiment, let’s review the interactions obtained with RuleFit.
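One way to review the result is to keep only the terms with a non-zero weight and sort them by importance. The sketch below works on a plain list of dictionaries so that it runs without any extra packages. The rows are hypothetical and only mimic the shape of the rule table (‘rule’, ‘type’, ‘coef’, ‘support’, ‘importance’).

```python
def top_rules(records, n=5):
    """Keep terms with a non-zero weight and return the n most important ones."""
    kept = [r for r in records if r["coef"] != 0]
    return sorted(kept, key=lambda r: r["importance"], reverse=True)[:n]


# Hypothetical rows shaped like the rule table (not actual results).
records = [
    {"rule": "feature_a", "type": "linear",
     "coef": 0.5, "support": 1.0, "importance": 0.30},
    {"rule": "feature_a > 1.0 & feature_b <= 2.0", "type": "rule",
     "coef": 0.2, "support": 0.25, "importance": 0.0866},
    {"rule": "feature_b > 3.0", "type": "rule",
     "coef": 0.0, "support": 0.40, "importance": 0.0},
]

for row in top_rules(records):
    print(row["type"], "|", row["rule"], "|", row["importance"])
```

With the package, the rule table comes back as a pandas DataFrame from `get_rules()`, so something like `model.get_rules().to_dict("records")` should produce input in this shape.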
In the resulting list of rules, ‘rule’ is the name of the feature and ‘type’ is the type of the feature. The type ‘linear’ refers to features that were originally included in the dataset, while ‘rule’ refers to features that were newly created from decision rules.
There are a few more terms that need to be defined as well. ‘coef’ is the weight in the linear regression model. The larger its absolute value, the greater the influence the feature has on the prediction. ‘support’ is the fraction of instances that satisfy the decision rule. For ‘linear’ features, this value is set to 1, while it ranges from 0 to 1 for ‘rule’ features.
The final term is ‘importance’, which represents the feature importance. This value is calculated based on the values for ‘coef’ and ‘support’. To get a better understanding of how this value is calculated, please refer again to the Interpretable Machine Learning book.
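As I understand the definition used in the book (which follows the original RuleFit paper), the importance of a rule is the absolute weight multiplied by the standard deviation of the 0/1 rule feature, and the importance of a linear term is the absolute weight multiplied by the standard deviation of the feature. The sketch below uses made-up weights, supports, and standard deviations.

```python
import math


def rule_importance(coef, support):
    """Importance of a 0/1 rule feature: |coef| * sqrt(support * (1 - support))."""
    return abs(coef) * math.sqrt(support * (1.0 - support))


def linear_importance(coef, feature_std):
    """Importance of an original feature: |coef| * standard deviation of the feature."""
    return abs(coef) * feature_std


# Hypothetical inputs: the same weight with different supports.
print(rule_importance(0.3, 0.5))   # rule that applies to half of the instances
print(rule_importance(0.3, 0.99))  # rule that applies to almost every instance
print(linear_importance(-0.4, 2.0))
```

A rule that applies to almost all (or almost no) instances barely varies across the data, so it receives a low importance even if its weight is large.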
Going back to the ‘rule’ column, if you look at its contents, you will see that there are duplicate conditions and too many digits in the thresholds. This makes the rules difficult to read, so we will tidy them up (see the Python code for more details).
When looking at the features that are considered important, we observe that many of them are rules that split on latitude and longitude. It looks like the model ends up memorizing and reproducing the prices in each district. This is primarily due to the default setting for the maximum number of rules (‘max_rules=2000’), which seems to create too many detailed rules.
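A cleanup step of this kind could look like the sketch below. It is my own reconstruction rather than the exact code from the notebook. It assumes rules are strings of conditions joined by ‘ & ’, that each condition has the form ‘feature operator threshold’ with ‘<=’ or ‘>’ as the operator, and that feature names contain no spaces. The raw rule in the example is hypothetical.

```python
def simplify_rule(rule, ndigits=3):
    """Merge duplicate conditions per feature and round the thresholds."""
    lower, upper, order = {}, {}, []
    for condition in rule.split(" & "):
        feature, op, value = condition.split()
        value = float(value)
        if feature not in order:
            order.append(feature)
        if op == "<=":
            # Keep the tightest (smallest) upper bound.
            upper[feature] = min(value, upper.get(feature, value))
        elif op == ">":
            # Keep the tightest (largest) lower bound.
            lower[feature] = max(value, lower.get(feature, value))
        else:
            raise ValueError(f"unsupported operator: {op}")
    parts = []
    for feature in order:
        if feature in lower:
            parts.append(f"{feature} > {round(lower[feature], ndigits)}")
        if feature in upper:
            parts.append(f"{feature} <= {round(upper[feature], ndigits)}")
    return " & ".join(parts)


raw = "MedInc <= 5.5 & MedInc <= 4.57919979 & MedInc > 2.83200002 & AveOccup <= 2.04000001"
print(simplify_rule(raw))
```

The redundant condition ‘MedInc <= 5.5’ disappears because the tighter bound already implies it.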
On one hand, it is good that we are able to extract features where the interaction between latitude and longitude is meaningful, but too many interaction terms will reduce the interpretability of the model. In order to create a more reasonable number of rules, we will limit it to 50.
After reducing the maximum number of rules from 2000 to 50, we recomputed the RMSE and obtained a value of 0.636. The predictions are less accurate, but the model is a lot easier to interpret.
To finish this article, let’s take a closer look at the features that have been identified as important.
First, we can look at all of the features whose ‘type’ is ‘linear’. The data shows that the ‘Latitude’, ‘Longitude’, and ‘AveOccup’ features all have a negative impact on house prices. Conversely, ‘AveBedrms’, ‘MedInc’, and ‘HouseAge’ appear to have a positive impact on the price.
The weights for ‘Latitude’ and ‘Longitude’ also reveal a geographic trend: housing prices appear to be higher toward the south and west. These are the areas in California that are more developed or have access to the ocean. It is also natural that housing prices go up in areas where the median income is higher.
Next, we will look at the features whose ‘type’ is ‘rule’. To begin, we will look at the rule that combines ‘MedInc’ values between 2.832 and 4.579 with ‘AveOccup’ values less than 2.04. With a weight of 0.333869, we can say that when a district satisfies both conditions, the predicted housing price is 0.333869 higher than when it does not, all else being equal. This is further illustrated by the figure below.
For areas with moderate income, housing prices appear to be higher in areas with lower occupancy than in areas with an occupancy above 2.04. This is most likely because prices rise when there is a train station or a community center (e.g. a shopping mall) near the house. We can observe this pattern because we are using RuleFit.
While RuleFit is a convenient tool, interpreting it is a bit more complex than interpreting linear regression models or decision trees. However, it reveals interactions in the data, which can lead to surprising discoveries that would be difficult to make by intuition alone.
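The effect of this rule on a single prediction can be sketched as follows. The thresholds and the weight are the ones quoted in this article. The exact boundary handling (‘>’ versus ‘>=’) and the two example districts are my own assumptions.

```python
RULE_WEIGHT = 0.333869  # weight of the rule quoted in this article


def rule_fires(med_inc, ave_occup):
    """1 if 2.832 < MedInc <= 4.579 and AveOccup <= 2.04, else 0 (boundaries assumed)."""
    return int(2.832 < med_inc <= 4.579 and ave_occup <= 2.04)


def rule_contribution(med_inc, ave_occup):
    """Amount this single rule adds to the prediction, in the target's units."""
    return RULE_WEIGHT * rule_fires(med_inc, ave_occup)


# Hypothetical districts: same moderate income, different occupancy.
print(rule_contribution(3.5, 1.8))  # rule applies
print(rule_contribution(3.5, 3.0))  # occupancy too high, rule does not apply
```

Since the target in the scikit-learn dataset is expressed in units of $100,000, a contribution of 0.333869 corresponds to roughly $33,400.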
To quickly recap, in today’s article I introduced another useful tool called RuleFit. Using RuleFit, we can automatically discover interactions from data and build a model that accounts for both linear effects and interactions. The resulting model remains highly interpretable, which is often difficult to achieve in machine learning.
One trade-off to keep in mind is that while increasing the number of decision rules improves prediction performance, it also decreases the interpretability of the model. This is where RuleFit shines, because limiting the number of rules lets us balance performance and interpretability.
Thank you for reading the final article of my short series. I look forward to releasing more articles in the future.