⚠️ Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
| Property | Value |
|---|---|
| URL | https://towardsdatascience.com/catboost-regression-in-6-minutes-3487f3e5b329/ |
| Last Crawled | 2026-04-11 10:16:40 (20 hours ago) |
| First Indexed | 2025-02-21 16:53:20 (1 year ago) |
| HTTP Status Code | 200 |
| Meta Title | CatBoost regression in 6 minutes \| Towards Data Science |
| Meta Description | null |
| Meta Canonical | null |
| Markdown | # CatBoost regression in 6 minutes
A brief hands-on introduction to CatBoost regression analysis in Python
[Simon Thiesen](https://towardsdatascience.com/author/simonthiesen/)
Feb 18, 2021
7 min read

Photo by [Markus Spiske](https://unsplash.com/@markusspiske?utm_source=medium&utm_medium=referral) on [Unsplash](https://unsplash.com/?utm_source=medium&utm_medium=referral)
> This article aims to provide a hands-on tutorial using the CatBoost Regressor on the Boston Housing dataset from the Sci-Kit Learn library.
## Table of Contents
1. Introduction to CatBoost
2. Application
3. Final notes
## Introduction
CatBoost is a relatively new open-source machine learning algorithm, developed in 2017 by Yandex, a Russian counterpart to Google that works in search and information services \[1\].
One of CatBoost's core strengths is its ability to integrate a variety of data types, such as images, audio, or text features, into one framework. CatBoost also offers an idiosyncratic way of handling categorical data, requiring only minimal categorical feature transformation, as opposed to the majority of other machine learning algorithms, which cannot handle non-numeric values. From a feature engineering perspective, transforming non-numeric values into numeric ones can be a non-trivial and tedious task, and CatBoost makes this step largely obsolete.
CatBoost builds upon the theory of decision trees and gradient boosting. The main idea of boosting is to sequentially combine many weak models (models performing only slightly better than random chance) and thus, through greedy search, create a strong predictive model. Because gradient boosting fits the decision trees sequentially, each fitted tree learns from the mistakes of the former trees, reducing the error. This process of adding a new function to the existing ones continues until the selected loss function can no longer be reduced.
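To make the sequential idea concrete, here is a toy sketch of boosting with squared-error loss, using shallow scikit-learn trees as the weak learners. It illustrates the principle only; it is not CatBoost's actual training procedure.
```
# Each new stump fits the residuals (the negative gradient of squared-error
# loss) left by the ensemble built so far.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())  # start from the mean
for _ in range(100):
    residuals = y_toy - prediction
    stump = DecisionTreeRegressor(max_depth=2).fit(X_toy, residuals)
    prediction += learning_rate * stump.predict(X_toy)

print('training MSE:', np.mean((y_toy - prediction) ** 2))
```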
In the tree-growing procedure, CatBoost does not follow the pattern of similar gradient boosting models. Instead, CatBoost grows oblivious trees: trees grown under the rule that all nodes at the same level test the same predictor with the same condition, so the index of a leaf can be calculated with bitwise operations. The oblivious-tree procedure allows for a simple fitting scheme and efficiency on CPUs, while the tree structure acts as regularization, helping find an optimal solution and avoid overfitting.
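Because every level of an oblivious tree applies one shared (feature, threshold) test, a tree of depth d has 2^d leaves, and a sample's leaf index is simply its d yes/no answers packed into an integer. A small sketch with hypothetical splits (this mirrors the idea, not CatBoost's internals):
```
# One shared (feature, threshold) test per level of the oblivious tree.
splits = [('RM', 6.2), ('LSTAT', 12.0), ('DIS', 3.5)]  # hypothetical splits

def leaf_index(sample, splits):
    index = 0
    for level, (feature, threshold) in enumerate(splits):
        bit = int(sample[feature] > threshold)
        index |= bit << level  # one answer bit per level
    return index               # in range [0, 2**len(splits))

print(leaf_index({'RM': 6.5, 'LSTAT': 15.0, 'DIS': 2.0}, splits))  # -> 3
```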
> A comparison of computational efficiency:
![Learning speed, Yandex \[2\]](https://towardsdatascience.com/wp-content/uploads/2021/02/1hHGzxN8sm2GEZQhbiT7vtg.png)
Learning speed, Yandex \[2\]
> According to Google Trends, CatBoost remains relatively unknown in terms of search popularity compared to the much more popular XGBoost algorithm.
![Google Trends (2021) \[3\]](https://towardsdatascience.com/wp-content/uploads/2021/02/1DOS1in0CjBaiLMu_gwuQcQ.png)
Google Trends (2021) \[3\]
CatBoost still remains fairly unknown, but the algorithm offers immense flexibility with its approach to handling heterogeneous, sparse, and categorical data, while still supporting fast training times and well-chosen default hyperparameters.
***
## Application
The objective of this tutorial is to provide hands-on experience with CatBoost regression in Python. In this simple exercise, we will use the Boston Housing dataset to predict Boston house prices, but the same logic applies to more complex datasets.
So let's get started.
First, we need to import the required libraries along with the dataset:
```
import catboost as cb
import numpy as np
import pandas as pd
import seaborn as sns
import shap
from matplotlib import pyplot as plt
# note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.inspection import permutation_importance
```
```
boston = load_boston()
# keep the raw Bunch as `boston` so boston.feature_names remains available later
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
```
### Data exploration
It is always considered good practice to check for NA values in your dataset, as they can confuse or, at worst, hurt the performance of the algorithm.
```
boston_df.isnull().sum()
```
However, this dataset does not contain any NAs.
Data exploration and feature engineering are among the most crucial (and time-consuming) phases of a data science project. In this context, however, the main emphasis is on introducing the CatBoost algorithm, so if you want to dive deeper into the descriptive analysis, please visit [EDA & Boston House Cost Prediction](https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155) \[4\].
### Training
Next, we need to split our data into an 80% training and 20% test set.
The target variable is 'MEDV': the median value of owner-occupied homes in \$1000s.
```
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
```
In order to train and optimize our model, we use the CatBoost library's integrated tool for combining features and target variables into training and test datasets. This pooling lets you pinpoint the target variable, the predictors, and the list of categorical features, while the Pool constructor combines those inputs and passes them to the model.
```
train_dataset = cb.Pool(X_train, y_train)
test_dataset = cb.Pool(X_test, y_test)
```
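The Boston data is purely numeric, so no categorical features are declared above. As a sketch of how they would be declared, using a hypothetical string column, CatBoost accepts raw categories directly through the cat_features argument, with no manual encoding required:
```
# Hypothetical example (not part of the Boston exercise): declaring a raw
# string column as categorical lets CatBoost encode it internally.
df = pd.DataFrame({'neighborhood': ['A', 'B', 'A', 'C'],  # categorical
                   'rooms': [4, 6, 5, 3]})                # numeric
prices = [210.0, 340.0, 280.0, 190.0]
cat_pool = cb.Pool(df, prices, cat_features=['neighborhood'])
cb.CatBoostRegressor(iterations=10, depth=2, verbose=False).fit(cat_pool)
```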
Next, we define our model.
```
model = cb.CatBoostRegressor(loss_function='RMSE')
```
We will use RMSE as our loss function since this is a regression task.
When the algorithm is tailored to a specific task, it can benefit from parameter tuning. The CatBoost library offers a flexible interface for built-in grid search, and if you already know scikit-learn's GridSearchCV, you will be familiar with this procedure.
In this tutorial, only the most common parameters are included: the number of iterations, the learning rate, L2 leaf regularization, and tree depth. If you want to discover more hyperparameter tuning possibilities, check out the CatBoost documentation [here](https://catboost.ai/docs/concepts/parameter-tuning.html).
```
grid = {'iterations': [100, 150, 200],
        'learning_rate': [0.03, 0.1],
        'depth': [2, 4, 6, 8],
        'l2_leaf_reg': [0.2, 0.5, 1, 3]}

model.grid_search(grid, train_dataset)
```
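The call above discards the return value of grid_search. Capturing it gives direct access to the winning parameter combination; according to the CatBoost documentation, grid_search also refits the model on the best parameters by default, which is why we can call predict below without an explicit fit. The exact result structure may vary between versions:
```
# Same call as above, but keeping the result.
result = model.grid_search(grid, train_dataset)
print(result['params'])  # best parameter combination found by the search
```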
### Performance evaluation
We have now trained our model, and we can finally proceed to evaluating it on the test data.
Let's see how the model performs.
```
pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)

print("Testing performance")
print('RMSE: {:.2f}'.format(rmse))
print('R2: {:.2f}'.format(r2))
```

Test performance
As depicted above, we achieve an R-squared of 90% on our test set, which is quite good considering the minimal feature engineering.
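As a quick sanity check on that number, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares, so it can be recomputed by hand:
```
# R^2 = 1 - SS_res / SS_tot; should match r2_score(y_test, pred) above
ss_res = np.sum((y_test - pred) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
print('R2: {:.2f}'.format(1 - ss_res / ss_tot))
```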
***
For inference, CatBoost also offers the possibility of extracting variable importance plots, which can reveal underlying data structures that might not be visible to the human eye.
In this example, we are sorting the array in ascending order and making a horizontal bar plot of the features with the ***least*** important features at the bottom and ***most*** important features at the top of the plot.
```
sorted_feature_importance = model.feature_importances_.argsort()
plt.barh(boston.feature_names[sorted_feature_importance],
         model.feature_importances_[sorted_feature_importance],
         color='turquoise')
plt.xlabel("CatBoost Feature Importance")
```

Variable Importance Plot
According to the illustration, the features listed above hold valuable information for predicting Boston house prices. The most influential variables are the average number of rooms per dwelling (RM) and the percentage of lower-status population (LSTAT).
SHapley Additive exPlanations (SHAP) plots are also a convenient tool for explaining the output of our machine learning model by assigning an importance value to each feature for a given prediction. SHAP values allow us to interpret which features drive the prediction of our target variable.
```
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# pass the unpermuted feature names; summary_plot orders the features itself
shap.summary_plot(shap_values, X_test, feature_names=boston.feature_names)
```

SHAP Plot
In the SHAP plot, the features are ranked by their mean absolute SHAP value, and the colors represent the feature values (red high, blue low). The higher the SHAP value, the larger the predictor's attribution. In other words, SHAP values represent a predictor's responsibility for a change in the model output, i.e., the predicted Boston house price. This reveals, for example, that larger RM values are associated with increasing house prices, while higher LSTAT values are linked with decreasing house prices, which also makes intuitive sense.
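A useful property to verify here is additivity: for any single row, the explainer's expected value plus that row's SHAP values reconstructs the model's actual prediction, up to numerical tolerance. A minimal check using the explainer and shap_values from above:
```
# base value + per-feature SHAP contributions ~ the raw model prediction
i = 0  # any test-set row
reconstructed = explainer.expected_value + shap_values[i].sum()
print(reconstructed, model.predict(X_test)[i])  # the two should match closely
```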
If you want to know more about SHAP plots and CatBoost, you will find the documentation [here](https://shap.readthedocs.io/en/latest/example_notebooks/tabular_examples/tree_based_models/Catboost%20tutorial.html).
## Final notes
So, in this tutorial, we have successfully built a CatBoost regressor in Python that explains 90% of the variability in Boston house prices with an average error (RMSE) of about \$2,830. Additionally, we have looked at variable importance plots and the features associated with Boston house price predictions. If you want to learn more, I recommend trying out other datasets as well and delving further into the many approaches to customizing and evaluating your model.
Thanks for reading!
### Sources
\[1\] Yandex, Company description (2020), <https://yandex.com/company/>
\[2\] CatBoost, CatBoost overview (2017), <https://catboost.ai/>
\[3\] Google Trends (2021), [https://trends.google.com/trends/explore?date=2017-04-01%202021-02-18\&q=CatBoost,XGBoost](https://trends.google.com/trends/explore?date=2017-04-01%202021-02-18&q=CatBoost,XGBoost)
\[4\] A. Bajaj, EDA & Boston House Cost Prediction (2019), <https://medium.com/@akashbajaj0149/eda-boston-house-cost-prediction-5fc1bd662673> |
| Shard | 79 (laksa) |
| Root Hash | 12035788063718406279 |
| Unparsed URL | com,towardsdatascience!/catboost-regression-in-6-minutes-3487f3e5b329/ s443 |