⚠️ Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.1 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
| Property | Value |
|---|---|
| URL | https://forecastegy.com/posts/catboost-binary-classification-python/ |
| Last Crawled | 2026-04-10 04:53:18 (2 days ago) |
| First Indexed | 2023-09-13 00:13:26 (2 years ago) |
| HTTP Status Code | 200 |
| Meta Title | How To Use CatBoost For Binary Classification In Python \| Forecastegy |
| Meta Description | Many people find the initial setup of CatBoost a bit daunting. Perhaps you've heard about its ability to work with categorical features without any preprocessing, but you're feeling stuck on how to take the first step. In this step-by-step tutorial, I'm going to simplify things for you. After all, it's just another gradient boosting library to have in your toolbox. We'll walk you through the process of installing CatBoost, loading your data, and setting up a CatBoost classifier. |
| Meta Canonical | null |
| Boilerpipe Text | Many people find the initial setup of CatBoost a bit daunting.
Perhaps you've heard about its ability to work with categorical features without any preprocessing, but you're feeling stuck on how to take the first step.
In this step-by-step tutorial, I'm going to simplify things for you.
After all, it's just another gradient boosting library to have in your toolbox.
We'll walk you through the process of installing CatBoost, loading your data, and setting up a CatBoost classifier.
Along the journey, we'll also cover how to divide your data into a training set and a test set, how to manage imbalanced data, and how to train your model on a GPU.
By the end of this guide, you'll be ready and confident to use CatBoost for your own binary classification projects. So, let's get started!
Installing CatBoost In Python
There are two main ways to install CatBoost in Python: using pip and conda.
If you prefer using pip, you can install CatBoost by running the following command in your terminal:
pip install catboost
If you prefer using conda, you can install it by running:
conda install -c conda-forge catboost
Make sure you have either pip or conda installed in your Python environment before running these commands.
Once you've successfully installed CatBoost, you're ready to move on to the next step: loading your data.
Loading The Data
We'll be using the Adult Dataset. You can download it from Kaggle.
This is a well-known dataset that contains demographic information about the US population.
The goal is to predict whether a person makes over $50,000 a year.
CatBoost's claim to fame is that it can handle categorical features without any preprocessing, so I picked this dataset because it has a mix of categorical and numerical features.
We'll use the pandas library to load our data.
First, let's import the necessary libraries:
import pandas as pd
from sklearn.model_selection import train_test_split
Next, let's load the data.
We'll use the pandas read_csv function to load the data.
data = pd.read_csv(data_path)
| age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 90 | ? | 77053 | HS-grad | 9 | Widowed | ? | Not-in-family | White | Female | 0 | 4356 | 40 | United-States | <=50K |
| 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
| 66 | ? | 186061 | Some-college | 10 | Widowed | ? | Unmarried | Black | Female | 0 | 4356 | 40 | United-States | <=50K |
| 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
| 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |
Now we have our data loaded and ready to go.
The next step is to split our data into a training set and a test set.
Let's use the train_test_split function from the sklearn.model_selection module.
We will split the data into 80% for training and 20% for testing.
X = data.drop('income', axis=1)
y = data['income']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In the code above, X is our feature set and y is our target variable, which is "income" in this case.
We now have our data split into training and testing sets, and we're ready to train our CatBoost classifier.
Notice how we don't have to do any preprocessing for numerical or categorical features.
Training A CatBoost Classifier
The CatBoost library provides a class CatBoostClassifier for binary and multiclass classification tasks.
By default, it uses hyperparameter values that are generally effective for a wide range of datasets.
Still, I recommend that you tune the hyperparameters for your specific dataset to get the best performance.
Let's import the required library and create an instance of the CatBoostClassifier:
from catboost import CatBoostClassifier
cat_features = X_train.select_dtypes(include=['object']).columns.tolist()
model = CatBoostClassifier(cat_features=cat_features)
The classifier optimizes the Logloss function, also known as cross-entropy loss.
It's the most common loss function used for binary classification tasks.
Encoding Categorical Features
By default, CatBoost uses one-hot encoding for categorical features with a small number of different values.
For features with high cardinality (like ZIP codes), CatBoost uses a more complex encoding, including likelihood encoding and categorical level counts.
We just need to specify the categorical feature column names in the cat_features parameter when initializing the CatBoostClassifier.
I selected all the non-numerical features as categorical features in the code above.
Handling Imbalanced Data
CatBoost provides the scale_pos_weight parameter.
This parameter adjusts the cost of misclassifying positive examples. A good default value is the ratio of negative to positive samples in the training dataset.
Just divide the number of negative samples by the number of positive samples and pass the result to the scale_pos_weight parameter:
scale_pos_weight = len(y_train[y_train=='<=50K']) / len(y_train[y_train=='>50K'])
model = CatBoostClassifier(cat_features=cat_features, scale_pos_weight=scale_pos_weight)
Be careful though, as this usually breaks the calibration of the model (how close its predicted probabilities are to the true occurrences of the class).
Our data is not terribly imbalanced, so I won't use this parameter in this example.
Training On A GPU
Finally, CatBoost allows you to train your model on a GPU.
This can significantly speed up the training process, especially for large datasets.
To enable GPU training, set the task_type parameter to 'GPU' when initializing the CatBoostClassifier:
model = CatBoostClassifier(cat_features=cat_features, task_type='GPU')
Now, let's train our model on the training data:
model.fit(X_train, y_train)
With this, our CatBoost classifier is trained and ready to make predictions.
Making Predictions With CatBoost
There are two types of predictions we can make: class predictions and probability predictions.
Class Predictions
In this case, the model predicts the class label directly. It does this by setting all examples where the probability of the positive class is greater than 0.5 to 1 and the rest to 0.
class_predictions = model.predict(X_test)
In the above code, model.predict() is used to make class predictions on the test data X_test.
Predicting Probabilities
Sometimes, we might be interested in the probabilities of each class rather than the class itself.
This is useful when we want to have a measure of the confidence of the model in its predictions, use it in a downstream task, or select a custom threshold for the positive class.
CatBoost allows us to predict these probabilities using the predict_proba method:
probability_predictions = model.predict_proba(X_test)
In the above code, model.predict_proba() is used to predict class probabilities.
The output is a 2D array, where the first element of each pair is the probability of the negative class (0) and the second element is the probability of the positive class (1).
Now that we know how to make predictions, let's move on to evaluating the performance of our model.
Evaluating Model Performance
We'll use several metrics for this: Log Loss, ROC AUC, and a classification report.
First, let's import the necessary functions from the sklearn.metrics module:
from sklearn.metrics import log_loss, roc_auc_score, classification_report
Log Loss
Log Loss measures how well a model can guess the true probability of each class.
As it's the loss directly optimized by CatBoost, it tends to be a good measure of the model's performance.
Lower log loss means better predictions.
It's not adequate if you are using scale_pos_weight to handle imbalanced data. In that case, you should use ROC AUC instead.
log_loss_value = log_loss(y_test, probability_predictions[:,1])
print(f'Log Loss: {log_loss_value}')
We use probability_predictions[:,1] to get the probability of the positive class (1), as log loss works with probabilities, not class labels.
ROC AUC
ROC AUC (Receiver Operating Characteristic Area Under the Curve) is a performance measurement for classification problems where you are more interested in how well the model predicts the positive examples.
Higher AUC means a better model.
roc_auc = roc_auc_score(y_test, probability_predictions[:,1])
print(f'ROC AUC: {roc_auc}')
Just like before, we need to pass the probability of the positive class (1) to the roc_auc_score function.
Classification Report
A classification report displays the precision, recall, F1, and support scores for each class and for the whole model.
class_report = classification_report(y_test, class_predictions)
print(f'Classification Report:\n {class_report}')
By evaluating these metrics, you can get a good understanding of how well your CatBoost classifier performs on your binary classification task. |
| Markdown | 
# How To Use CatBoost For Binary Classification In Python
September 12, 2023 · 7 min · Mario Filho

Table of Contents
- [Installing CatBoost In Python](https://forecastegy.com/posts/catboost-binary-classification-python/#installing-catboost-in-python)
- [Loading The Data](https://forecastegy.com/posts/catboost-binary-classification-python/#loading-the-data)
- [Training A CatBoost Classifier](https://forecastegy.com/posts/catboost-binary-classification-python/#training-a-catboost-classifier)
- [Encoding Categorical Features](https://forecastegy.com/posts/catboost-binary-classification-python/#encoding-categorical-features)
- [Handling Imbalanced Data](https://forecastegy.com/posts/catboost-binary-classification-python/#handling-imbalanced-data)
- [Training On A GPU](https://forecastegy.com/posts/catboost-binary-classification-python/#training-on-a-gpu)
- [Making Predictions With CatBoost](https://forecastegy.com/posts/catboost-binary-classification-python/#making-predictions-with-catboost)
- [Class Predictions](https://forecastegy.com/posts/catboost-binary-classification-python/#class-predictions)
- [Predicting Probabilities](https://forecastegy.com/posts/catboost-binary-classification-python/#predicting-probabilities)
- [Evaluating Model Performance](https://forecastegy.com/posts/catboost-binary-classification-python/#evaluating-model-performance)
- [Log Loss](https://forecastegy.com/posts/catboost-binary-classification-python/#log-loss)
- [ROC AUC](https://forecastegy.com/posts/catboost-binary-classification-python/#roc-auc)
- [Classification Report](https://forecastegy.com/posts/catboost-binary-classification-python/#classification-report)
Many people find the initial setup of [CatBoost](https://catboost.ai/) a bit daunting.
Perhaps you've heard about its ability to work with categorical features without any preprocessing, but you're feeling stuck on how to take the first step.
In this step-by-step tutorial, I'm going to simplify things for you.
After all, it's just another [gradient boosting](https://forecastegy.com/posts/can-gradient-boosting-learn-simple-arithmetic/) library to have in your toolbox.
We'll walk you through the process of installing CatBoost, loading your data, and setting up a CatBoost classifier.
Along the journey, we'll also cover how to divide your data into a training set and a test set, how to manage imbalanced data, and how to train your model on a GPU.
By the end of this guide, you'll be ready and confident to use CatBoost for your own binary classification projects. So, let's get started!
## Installing CatBoost In Python[\#](https://forecastegy.com/posts/catboost-binary-classification-python/#installing-catboost-in-python)
There are two main ways to install CatBoost in Python: using pip and conda.
If you prefer using pip, you can install CatBoost by running the following command in your terminal:
```
pip install catboost
```
If you prefer using conda, you can install it by running:
```
conda install -c conda-forge catboost
```
Make sure you have either pip or conda installed in your Python environment before running these commands.
Once you've successfully installed CatBoost, you're ready to move on to the next step: loading your data.
## Loading The Data[\#](https://forecastegy.com/posts/catboost-binary-classification-python/#loading-the-data)
Weâll be using the Adult Dataset. You can [download it from Kaggle](https://www.kaggle.com/datasets/wenruliu/adult-income-dataset).
This is a well-known dataset that contains demographic information about the US population.
The goal is to predict whether a person makes over \$50,000 a year.
CatBoost's claim to fame is that it can handle categorical features without any preprocessing, so I picked this dataset because it has a mix of categorical and numerical features.
We'll use the pandas library to load our data.
First, let's import the necessary libraries:
```
import pandas as pd
from sklearn.model_selection import train_test_split
```
Next, let's load the data.
We'll use the pandas `read_csv` function to load the data.
```
data = pd.read_csv(data_path)
```
| age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 90 | ? | 77053 | HS-grad | 9 | Widowed | ? | Not-in-family | White | Female | 0 | 4356 | 40 | United-States | \<=50K |
| 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | \<=50K |
| 66 | ? | 186061 | Some-college | 10 | Widowed | ? | Unmarried | Black | Female | 0 | 4356 | 40 | United-States | \<=50K |
| 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | \<=50K |
| 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | \<=50K |
Now we have our data loaded and ready to go.
The next step is to split our data into a training set and a test set.
Let's use the `train_test_split` function from the `sklearn.model_selection` module.
We will split the data into 80% for training and 20% for testing.
```
X = data.drop('income', axis=1)
y = data['income']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
In the code above, `X` is our feature set and `y` is our target variable, which is "income" in this case.
We now have our data split into training and testing sets, and we're ready to train our CatBoost classifier.
Notice how we don't have to do any preprocessing for numerical or categorical features.
## Training A CatBoost Classifier[\#](https://forecastegy.com/posts/catboost-binary-classification-python/#training-a-catboost-classifier)
The CatBoost library provides a class `CatBoostClassifier` for binary and [multiclass classification](https://forecastegy.com/posts/catboost-multiclass-classification-python/) tasks.
By default, it uses hyperparameter values that are generally effective for a wide range of datasets.
Still, I recommend that you [tune the hyperparameters](https://forecastegy.com/posts/catboost-hyperparameter-tuning-guide-with-optuna/) for your specific dataset to get the best performance.
Let's import the required library and create an instance of the `CatBoostClassifier`:
```
from catboost import CatBoostClassifier
cat_features = X_train.select_dtypes(include=['object']).columns.tolist()
model = CatBoostClassifier(cat_features=cat_features)
```
The classifier optimizes the Logloss function, also known as [cross-entropy loss](https://en.wikipedia.org/wiki/Cross-entropy).
It's the most common loss function used for binary classification tasks.
### Encoding Categorical Features[\#](https://forecastegy.com/posts/catboost-binary-classification-python/#encoding-categorical-features)
By default, CatBoost uses one-hot encoding for categorical features with a small number of different values.
For features with high cardinality (like ZIP codes), CatBoost uses a more complex encoding, including likelihood encoding and categorical level counts.
We just need to specify the categorical feature column names in the `cat_features` parameter when initializing the `CatBoostClassifier`.
I selected all the non-numerical features as categorical features in the code above.
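To get a feel for which features would fall under each encoding regime, a quick cardinality check with pandas can help. The tiny frame below is purely illustrative (made-up rows, not the real Adult data); the selection rule mirrors the `select_dtypes` call above:

```python
import pandas as pd

# Toy frame standing in for the Adult data (rows and values are made up).
df = pd.DataFrame({
    "age": [90, 82, 66],
    "workclass": ["?", "Private", "?"],
    "native.country": ["United-States", "United-States", "United-States"],
})

# Same selection rule as in the article: non-numeric columns are categorical.
cat_features = df.select_dtypes(include=["object"]).columns.tolist()

# Number of distinct values per categorical feature; CatBoost one-hot encodes
# low-cardinality features and uses target statistics for high-cardinality ones.
cardinality = {col: df[col].nunique() for col in cat_features}
print(cat_features)  # ['workclass', 'native.country']
print(cardinality)   # {'workclass': 2, 'native.country': 1}
```

On the real dataset, a feature like `native.country` has dozens of levels, while `sex` has two, so they would be handled differently under the hood.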
### Handling Imbalanced Data[\#](https://forecastegy.com/posts/catboost-binary-classification-python/#handling-imbalanced-data)
CatBoost provides the `scale_pos_weight` parameter.
This parameter adjusts the cost of misclassifying positive examples. A good default value is the ratio of negative to positive samples in the training dataset.
Just divide the number of negative samples by the number of positive samples and pass the result to the `scale_pos_weight` parameter:
```
scale_pos_weight = len(y_train[y_train=='<=50K']) / len(y_train[y_train=='>50K'])
model = CatBoostClassifier(cat_features=cat_features, scale_pos_weight=scale_pos_weight)
```
Be careful though, as this usually breaks the calibration of the model (how close its predicted probabilities are to the true occurrences of the class).
Our data is not terribly imbalanced, so I won't use this parameter in this example.
### Training On A GPU[\#](https://forecastegy.com/posts/catboost-binary-classification-python/#training-on-a-gpu)
Finally, CatBoost allows you to train your model on a GPU.
This can significantly speed up the training process, especially for large datasets.
To enable GPU training, set the `task_type` parameter to 'GPU' when initializing the `CatBoostClassifier`:
```
model = CatBoostClassifier(cat_features=cat_features, task_type='GPU')
```
Now, let's train our model on the training data:
```
model.fit(X_train, y_train)
```
With this, our CatBoost classifier is trained and ready to make predictions.
## Making Predictions With CatBoost[\#](https://forecastegy.com/posts/catboost-binary-classification-python/#making-predictions-with-catboost)
There are two types of predictions we can make: class predictions and probability predictions.
### Class Predictions[\#](https://forecastegy.com/posts/catboost-binary-classification-python/#class-predictions)
In this case, the model predicts the class label directly. It does this by setting all examples where the probability of the positive class is greater than 0.5 to 1 and the rest to 0.
```
class_predictions = model.predict(X_test)
```
In the above code, `model.predict()` is used to make class predictions on the test data `X_test`.
### Predicting Probabilities[\#](https://forecastegy.com/posts/catboost-binary-classification-python/#predicting-probabilities)
Sometimes, we might be interested in the probabilities of each class rather than the class itself.
This is useful when we want to have a measure of the confidence of the model in its predictions, use it in a downstream task, or select a custom threshold for the positive class.
CatBoost allows us to predict these probabilities using the `predict_proba` method:
```
probability_predictions = model.predict_proba(X_test)
```
In the above code, `model.predict_proba()` is used to predict class probabilities.
The output is a 2D array, where the first element of each pair is the probability of the negative class (0) and the second element is the probability of the positive class (1).
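A small sketch with made-up numbers (not real model output) shows how the two columns relate, and how the 0.5 rule from the previous section falls out of column 1:

```python
import numpy as np

# Illustrative array shaped like predict_proba's output, one row per example:
# column 0 = P(negative class), column 1 = P(positive class).
probability_predictions = np.array([
    [0.9, 0.1],
    [0.3, 0.7],
    [0.6, 0.4],
])

# Each row is a probability distribution over the two classes, so it sums to 1.
assert np.allclose(probability_predictions.sum(axis=1), 1.0)

# Thresholding the positive-class column at 0.5 mirrors what predict() does;
# a custom threshold would just replace the 0.5 here.
class_labels = (probability_predictions[:, 1] > 0.5).astype(int)
print(class_labels)  # [0 1 0]
```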
Now that we know how to make predictions, let's move on to evaluating the performance of our model.
## Evaluating Model Performance[\#](https://forecastegy.com/posts/catboost-binary-classification-python/#evaluating-model-performance)
We'll use several metrics for this: Log Loss, ROC AUC, and a classification report.
First, let's import the necessary functions from the `sklearn.metrics` module:
```
from sklearn.metrics import log_loss, roc_auc_score, classification_report
```
### Log Loss[\#](https://forecastegy.com/posts/catboost-binary-classification-python/#log-loss)
Log Loss measures how well a model can guess the true probability of each class.
As it's the loss directly optimized by CatBoost, it tends to be a good measure of the model's performance.
Lower log loss means better predictions.
It's not adequate if you are using `scale_pos_weight` to handle imbalanced data. In that case, you should use ROC AUC instead.
```
log_loss_value = log_loss(y_test, probability_predictions[:,1])
print(f'Log Loss: {log_loss_value}')
```
We use `probability_predictions[:,1]` to get the probability of the positive class (1), as log loss works with probabilities, not class labels.
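For intuition, log loss can also be computed by hand. The helper below is a from-scratch sketch of the binary formula, not sklearn's `log_loss` implementation; note how a confidently wrong prediction is penalized far more than a confidently right one:

```python
import numpy as np

def binary_log_loss(y_true, p_pos):
    """Mean negative log-likelihood for binary labels.
    A from-scratch sketch of the binary cross-entropy formula."""
    y_true = np.asarray(y_true, dtype=float)
    p_pos = np.asarray(p_pos, dtype=float)
    return float(-np.mean(y_true * np.log(p_pos)
                          + (1 - y_true) * np.log(1 - p_pos)))

# Confident and correct: small loss.
print(binary_log_loss([1, 0], [0.9, 0.1]))  # ~0.105
# Confident and wrong: large loss.
print(binary_log_loss([1, 0], [0.1, 0.9]))  # ~2.303
```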
### ROC AUC[\#](https://forecastegy.com/posts/catboost-binary-classification-python/#roc-auc)
[ROC AUC (Receiver Operating Characteristic Area Under the Curve)](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) is a performance measurement for classification problems where you are more interested in how well the model predicts the positive examples.
Higher AUC means a better model.
```
roc_auc = roc_auc_score(y_test, probability_predictions[:,1])
print(f'ROC AUC: {roc_auc}')
```
Just like before, we need to pass the probability of the positive class (1) to the `roc_auc_score` function.
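One way to build intuition for this metric: ROC AUC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The helper below is a from-scratch sketch of that pairwise definition (ties counted as half), not the algorithm `roc_auc_score` actually uses:

```python
from itertools import product

def roc_auc_pairwise(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs where the
    positive example gets the higher score; ties count as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Three of the four positive/negative pairs are ranked correctly.
print(roc_auc_pairwise([1, 1, 0, 0], [0.8, 0.4, 0.3, 0.5]))  # 0.75
```

This pairwise view also explains why AUC only cares about the *ranking* of scores, which is what makes it robust to the calibration damage caused by `scale_pos_weight`.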
### Classification Report[\#](https://forecastegy.com/posts/catboost-binary-classification-python/#classification-report)
A classification report displays the precision, recall, F1, and support scores for each class and for the whole model.
```
class_report = classification_report(y_test, class_predictions)
print(f'Classification Report:\n {class_report}')
```
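As a reminder of what those numbers mean, precision, recall, and F1 can each be derived from confusion-matrix counts. The helper below is an illustrative from-scratch sketch of the per-class arithmetic behind `classification_report`, not sklearn's implementation:

```python
def prf_from_counts(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts for one class."""
    precision = tp / (tp + fp)          # of predicted positives, how many were right
    recall = tp / (tp + fn)             # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# e.g. 3 true positives, 1 false positive, 2 false negatives:
print(prf_from_counts(3, 1, 2))  # (0.75, 0.6, 0.666...)
```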

By evaluating these metrics, you can get a good understanding of how well your CatBoost classifier performs on your binary classification task.
- [machine learning](https://forecastegy.com/tags/machine-learning/)
- [python](https://forecastegy.com/tags/python/)
- [catboost](https://forecastegy.com/tags/catboost/)
© 2026 [Forecastegy](https://forecastegy.com/) Powered by [Hugo](https://gohugo.io/) & [PaperMod](https://git.io/hugopapermod) - [Privacy Policy](https://forecastegy.com/privacy-policy/) - [Medium](https://medium.com/@forecastegy) - This page might contain affiliate links |
| Shard | 129 (laksa) |
| Root Hash | 1095914986031676529 |
| Unparsed URL | com,forecastegy!/posts/catboost-binary-classification-python/ s443 |