🕷️ Crawler Inspector

URL Lookup

Direct Parameter Lookup

Raw Queries and Responses

1. Shard Calculation

Query:
Response:
Calculated Shard: 169 (from laksa116)

2. Crawled Status Check

Query:
Response:

3. Robots.txt Check

Query:
Response:

4. Spam/Ban Check

Query:
Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄 INDEXABLE · CRAWLED (1 day ago)
🤖 ROBOTS ALLOWED

Page Info Filters

| Filter | Status | Condition | Details |
| --- | --- | --- | --- |
| HTTP status | PASS | `download_http_code = 200` | HTTP 200 |
| Age cutoff | PASS | `download_stamp > now() - 6 MONTH` | 0.1 months ago |
| History drop | PASS | `isNull(history_drop_reason)` | No drop reason |
| Spam/ban | PASS | `fh_dont_index != 1 AND ml_spam_score = 0` | ml_spam_score=0 |
| Canonical | PASS | `meta_canonical IS NULL OR = '' OR = src_unparsed` | Not set |
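The filter chain above can be sketched in Python. The field names are taken from the Condition column; the record layout, the helper name, and the exact drop semantics are assumptions for illustration only:

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the indexability filters above. Field names come from
# the Condition column; the dict-based record layout is an assumption.
def is_indexable(page: dict, now: datetime) -> bool:
    return (
        page["download_http_code"] == 200                         # HTTP status
        and page["download_stamp"] > now - timedelta(days=183)    # age cutoff (~6 months)
        and page.get("history_drop_reason") is None               # history drop
        and page["fh_dont_index"] != 1                            # spam/ban
        and page["ml_spam_score"] == 0
        and page.get("meta_canonical") in (None, "", page["src_unparsed"])  # canonical
    )
```

A page passing all five checks, like the one inspected here, is reported as INDEXABLE; any single failing condition short-circuits the chain.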

Page Details

| Property | Value |
| --- | --- |
| URL | https://catboost.ai/docs/en/concepts/algorithm-score-functions |
| Last Crawled | 2026-04-09 19:24:20 (1 day ago) |
| First Indexed | 2024-11-18 17:11:24 (1 year ago) |
| HTTP Status Code | 200 |
| Meta Title | Score functions \| CatBoost |
| Meta Description | The common approach to solve supervised learning tasks is to minimize the loss function L: |
| Meta Canonical | null |
Boilerpipe Text
The common approach to solving supervised learning tasks is to minimize the loss function $L$:

$$L\left(f(x), y\right) = \sum\limits_{i} w_{i} \cdot l\left(f(x_{i}), y_{i}\right) + J(f)$$

where:

- $l\left(f(x), y\right)$ is the value of the loss function at the point $(x, y)$
- $w_{i}$ is the weight of the $i$-th object
- $J(f)$ is the regularization term

For example, these formulas take the following form for linear regression:

- $l\left(f(x), y\right) = w_{i} \left((\theta, x) - y\right)^{2}$ (mean squared error)
- $J(f) = \lambda ||\theta||_{l2}$ (L2 regularization)

## Gradient boosting

Boosting is a method that builds a prediction model $F^{T}$ as an ensemble of weak learners $F^{T} = \sum\limits_{t=1}^{T} f^{t}$. In our case, $f^{t}$ is a decision tree. Trees are built sequentially, and each next tree is built to approximate the negative gradients $g_{i}$ of the loss function $l$ at the predictions of the current ensemble:

$$g_{i} = -\frac{\partial l(a, y_{i})}{\partial a} \Bigr|_{a = F^{T-1}(x_{i})}$$

Thus, it performs a gradient descent optimization of the function $L$. The quality of the gradient approximation is measured by a score function $Score(a, g) = S(a, g)$.

## Types of score functions

Suppose that a new tree must be added to the ensemble. A score function is required in order to choose between candidate trees. Given a candidate tree $f$, let $a_{i}$ denote $f(x_{i})$, $w_{i}$ the weight of the $i$-th object, and $g_{i}$ the corresponding gradient of $l$.
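As a concrete instance of the gradient formula above: for the squared loss $l(a, y) = \frac{1}{2}(a - y)^{2}$ (a loss chosen here for illustration, not fixed by the page), the negative gradient at the current predictions is simply the residual:

```python
import numpy as np

# Negative gradients g_i = -dl/da evaluated at the current ensemble's
# predictions, for the squared loss l(a, y) = 0.5 * (a - y)^2: g_i = y_i - a_i.
def negative_gradients(preds: np.ndarray, targets: np.ndarray) -> np.ndarray:
    return targets - preds

print(negative_gradients(np.array([0.0, 1.0, 2.0]), np.array([1.0, 1.0, 1.0])))
```

The next tree is then fit to approximate these residuals, which is exactly the gradient-descent step described above.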
Let's consider the following score functions:

- $L2 = -\sum\limits_{i} w_{i} \cdot (a_{i} - g_{i})^{2}$
- $Cosine = \displaystyle\frac{\sum w_{i} \cdot a_{i} \cdot g_{i}}{\sqrt{\sum w_{i} a_{i}^{2}} \cdot \sqrt{\sum w_{i} g_{i}^{2}}}$

## Finding the optimal tree structure

Suppose that the structure for a tree $f$ of depth 1 must be found. The structure of such a tree is determined by the index $j$ of some feature and a border value $c$. Let $x_{i,j}$ be the value of the $j$-th feature on the $i$-th object, and let $a_{left}$ and $a_{right}$ be the values at the leaves of $f$. Then $f(x_{i})$ equals $a_{left}$ if $x_{i,j} \leq c$ and $a_{right}$ if $x_{i,j} > c$. The goal is to find the best $j$ and $c$ in terms of the chosen score function. For the L2 score function the formula takes the following form:

$$S(a, g) = -\sum\limits_{i} w_{i} (a_{i} - g_{i})^{2} = -\left( \sum\limits_{i: x_{i,j} \leq c} w_{i} (a_{left} - g_{i})^{2} + \sum\limits_{i: x_{i,j} > c} w_{i} (a_{right} - g_{i})^{2} \right)$$

Let's denote $W_{left} = \displaystyle\sum_{i: x_{i,j} \leq c} w_{i}$ and $W_{right} = \displaystyle\sum_{i: x_{i,j} > c} w_{i}$.
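The depth-1 search just described can be sketched by brute force. The page derives next that the optimal leaf values are the weighted mean gradients, which this sketch uses directly; the function and variable names are mine:

```python
import numpy as np

# Brute-force depth-1 split search: for every feature j and border c, set the
# leaf values to the weighted mean gradients and evaluate the L2 score
#   S(a, g) = -sum_i w_i * (a_i - g_i)^2.
def best_depth1_split(X, g, w):
    best_j, best_c, best_score = None, None, -np.inf
    for j in range(X.shape[1]):
        for c in np.unique(X[:, j])[:-1]:   # borders between distinct values
            left = X[:, j] <= c
            a_left = np.average(g[left], weights=w[left])
            a_right = np.average(g[~left], weights=w[~left])
            score = -((w[left] * (a_left - g[left]) ** 2).sum()
                      + (w[~left] * (a_right - g[~left]) ** 2).sum())
            if score > best_score:
                best_j, best_c, best_score = j, c, score
    return best_j, best_c, best_score
```

For example, with gradients `[0, 0, 1]` on a single feature `[0, 1, 2]`, the search picks the border that isolates the object with gradient 1, where the L2 score reaches its maximum of 0.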
The optimal values for $a_{left}$ and $a_{right}$ are the weighted averages:

- $a^{*}_{left} = \displaystyle\frac{\sum\limits_{i: x_{i,j} \leq c} w_{i} g_{i}}{W_{left}}$
- $a^{*}_{right} = \displaystyle\frac{\sum\limits_{i: x_{i,j} > c} w_{i} g_{i}}{W_{right}}$

After expanding the brackets and removing terms that are constant in the optimization:

$$j^{*}, c^{*} = \operatorname{argmax}_{j, c} \; W_{left} \cdot (a^{*}_{left})^{2} + W_{right} \cdot (a^{*}_{right})^{2}$$

The latter argmax can be calculated by brute-force search. The situation is slightly more complex when the tree depth is greater than 1:

- L2 score function: $S$ is converted into a sum over leaves, $S(a, g) = \sum_{leaf} S(a_{leaf}, g_{leaf})$. The next step is to find $j^{*}, c^{*} = \operatorname{argmax}_{j, c} S(\bar{a}, g)$, where $\bar{a}$ are the optimal values in the leaves after the $j^{*}, c^{*}$ split.
- Depthwise and Lossguide methods: $j, c$ are sets $\{j_k\}, \{c_k\}$, where $k$ is the index of the leaf, so the score function $S$ takes the following form: $S(\bar{a}, g) = \sum_{l = leaf} S(\bar{a}(j_l, c_l), g_l)$. Since $S(leaf)$ is a convex function, different $j_{k1}, c_{k1}$ and $j_{k2}, c_{k2}$ (splits for different leaves) can be searched separately by finding the optimal $j^{*}, c^{*} = \operatorname{argmax}_{j, c} \{S(leaf_{left}) + S(leaf_{right}) - S(leaf_{before\_split})\}$.
- SymmetricTree method: the same $j, c$ are sought for every leaf, so the total sum over all leaves, $S(a, g) = \sum_{leaf} S(leaf)$, must be optimized.

## Second-order leaf estimation method

Let's apply the Taylor expansion to the loss function at the point $a^{t-1} = F^{t-1}(x)$:

$$L(a^{t-1}_{i} + \phi, y) \approx \sum w_{i} \left[ l_{i} + l'_{i} \phi + \frac{1}{2} l''_{i} \phi^{2} \right] + \frac{1}{2} \lambda ||\phi||_{2}$$

where:

- $l_{i} = l(a^{t-1}_{i}, y_{i})$
- $l'_{i} = -\frac{\partial l(a, y_{i})}{\partial a}\Bigr|_{a = a^{t-1}_{i}}$
- $l''_{i} = -\frac{\partial^{2} l(a, y_{i})}{\partial a^{2}}\Bigr|_{a = a^{t-1}_{i}}$
- $\lambda$ is the L2 regularization parameter

Since the first term is constant in the optimization, the formula takes the following form after regrouping by leaves:

$$\sum\limits_{leaf=1}^{L} \left( \sum\limits_{i \in leaf} w_{i} \left[ l_{i} + l'_{i} \phi_{leaf} + \frac{1}{2} l''_{i} \phi_{leaf}^{2} \right] + \frac{1}{2} \lambda \phi_{leaf}^{2} \right) \to \min$$

Then let's minimize this expression for each leaf independently:

$$\sum\limits_{i \in leaf} w_{i} \left[ l_{i} + l'_{i} \phi_{leaf} + \frac{1}{2} l''_{i} \phi_{leaf}^{2} \right] + \frac{1}{2} \lambda \phi_{leaf}^{2} \to \min$$

Differentiating with respect to the leaf value $\phi_{leaf}$:

$$\sum\limits_{i \in leaf} w_{i} \left[ l'_{i} + l''_{i} \phi_{leaf} \right] + \lambda \phi_{leaf} = 0$$

So the optimal value of $\phi_{leaf}$ is:

$$\phi^{*}_{leaf} = -\frac{\sum_{i} w_{i} l'_{i}}{\sum_{i} w_{i} l''_{i} + \lambda}$$

The summation is over $i$ such that the object $x_{i}$ falls into the considered leaf. These optimal values of $\phi_{leaf}$ can then be used instead of the weighted averages of the gradients ($a^{*}_{left}$ and $a^{*}_{right}$ in the example above) in the same score functions.

## CatBoost score functions

CatBoost provides the following score functions:

- **L2**: uses the first derivatives during the calculation.
- **Cosine**: cannot be used with the Lossguide tree growing policy.
- **NewtonL2**: uses the second derivatives during the calculation. This may improve the resulting quality of the model.
- **NewtonCosine**: cannot be used with the Lossguide tree growing policy.

## Per-object and per-feature penalties

CatBoost provides the following methods to affect the score with penalties:

- Per-feature penalties for the first occurrence of the feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model.
- Per-object penalties for the first use of the feature for the object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time.
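The optimal second-order leaf value $\phi^{*}_{leaf}$ from the derivation above reduces to a one-line computation; in this sketch `grad` and `hess` hold the $l'_{i}$ and $l''_{i}$ values for the objects in one leaf (the function and parameter names are mine):

```python
import numpy as np

# Optimal second-order (Newton) leaf value from the derivation above:
#   phi*_leaf = - sum_i(w_i * l'_i) / (sum_i(w_i * l''_i) + lambda)
def newton_leaf_value(grad, hess, w, l2_reg=0.0):
    return -(w * grad).sum() / ((w * hess).sum() + l2_reg)

# Two objects with unit weights and curvature, lambda = 2:
# -(1 + 1) / (1 + 1 + 2) = -0.5
print(newton_leaf_value(np.array([1.0, 1.0]), np.array([1.0, 1.0]), np.ones(2), l2_reg=2.0))
```

Note how $\lambda$ only enters the denominator, shrinking the leaf value toward zero, which is the usual effect of L2 regularization.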
The final score is calculated as follows:

$$Score' = Score \cdot \prod_{f \in S} W_{f} - \sum_{f \in S} P_{f} \cdot U(f) - \sum_{f \in S} \sum_{x \in L} EP_{f} \cdot U(f, x)$$

where:

- $W_{f}$ is the feature weight
- $P_{f}$ is the per-feature penalty
- $EP_{f}$ is the per-object penalty
- $S$ is the current split
- $L$ is the current leaf
- $U(f) = \begin{cases} 0, & \text{if } f \text{ was used in the model already} \\ 1, & \text{otherwise} \end{cases}$
- $U(f, x) = \begin{cases} 0, & \text{if } f \text{ was used already for object } x \\ 1, & \text{otherwise} \end{cases}$

## Usage

Use the corresponding parameter to set the score function during training:

- Python package: `score_function`
- R package: `score_function`
- Command-line interface: `--score-function`

**Alert:** the supported score functions vary depending on the processing unit type:

- GPU: all score types
- CPU: Cosine, L2

The score type is used to select the next split during tree construction. Possible values:

- Cosine (do not use this score type with the Lossguide tree growing policy)
- L2
- NewtonCosine (do not use this score type with the Lossguide tree growing policy)
- NewtonL2
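A hedged sketch of setting the parameter from the Python package; only the parameter names come from this page, while the estimator class and the training data handling are illustrative assumptions:

```python
# Illustrative only: the score_function parameter named above, wired into a
# typical CatBoost training setup (requires the catboost package to run).
params = {
    "iterations": 100,             # hypothetical training length
    "score_function": "NewtonL2",  # use second derivatives when scoring splits
}
# from catboost import CatBoostRegressor
# model = CatBoostRegressor(**params)
# model.fit(X_train, y_train)
print(params["score_function"])
```

On CPU, per the alert above, only `Cosine` and `L2` are available, so `"NewtonL2"` here implicitly assumes GPU training.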
Readable Markdown
The common approach to solve supervised learning tasks is to minimize the loss function L L: L ( f ( x ) , y ) \= ∑ i w i ⋅ l ( f ( x i ) , y i ) \+ J ( f ) , w h e r e L\\left(f(x), y\\right) = \\sum\\limits\_{i} w\_{i} \\cdot l \\left(f(x\_{i}), y\_{i}\\right) + J(f){ , where} - l ( f ( x ) , y ) l\\left( f(x), y\\right) is the value of the loss function at the point ( x , y ) (x, y) - w i w\_{i} is the weight of the i i \-th object - J ( f ) J(f) is the regularization term. For example, these formulas take the following form for linear regression: - l ( f ( x ) , y ) \= w i ( ( θ , x ) − y ) 2 l\\left( f(x), y\\right) = w\_{i} \\left( (\\theta, x) - y \\right)^{2} (mean squared error) - J ( f ) \= λ ∣ ∣ θ ∣ ∣ l 2 J(f) = \\lambda \\left\| \| \\theta \| \\right\|\_{l2} (L2 regularization) ## Gradient boosting Boosting is a method which builds a prediction model F T F^{T} as an ensemble of weak learners F T \= ∑ t \= 1 T f t F^{T} = \\sum\\limits\_{t=1}^{T} f^{t}. In our case, f t f^{t} is a decision tree. Trees are built sequentially and each next tree is built to approximate negative gradients g i g\_{i} of the loss function l l at predictions of the current ensemble: g i \= − ∂ l ( a , y i ) ∂ a ∣ a \= F T − 1 ( x i ) g\_{i} = -\\frac{\\partial l(a, y\_{i})}{\\partial a} \\Bigr\|\_{a = F^{T-1}(x\_{i})} Thus, it performs a gradient descent optimization of the function L L. The quality of the gradient approximation is measured by a score function S c o r e ( a , g ) \= S ( a , g ) Score(a, g) = S(a, g). ## Types of score functions Let's suppose that it is required to add a new tree to the ensemble. A score function is required in order to choose between candidate trees. Given a candidate tree f f let a i a\_{i} denote f ( x i ) f(x\_{i}), w i w\_{i} — the weight of i i\-th object, and g i g\_{i} – the corresponding gradient of l l. 
Let’s consider the following score functions: - L 2 \= − ∑ i w i ⋅ ( a i − g i ) 2 L2 = - \\sum\\limits\_{i} w\_{i} \\cdot (a\_{i} - g\_{i})^{2} - C o s i n e \= ∑ w i ⋅ a i ⋅ g i ∑ w i a i 2 ⋅ ∑ w i g i 2 Cosine = \\displaystyle\\frac{\\sum w\_{i} \\cdot a\_{i} \\cdot g\_{i}}{\\sqrt{\\sum w\_{i}a\_{i}^{2}} \\cdot \\sqrt{\\sum w\_{i}g\_{i}^{2}}} ## Finding the optimal tree structure Let's suppose that it is required to find the structure for the tree f f of depth 1. The structure of such tree is determined by the index j j of some feature and a border value c c. Let x i , j x\_{i, j} be the value of the j j\-th feature on the i i\-th object and a l e f t a\_{left} and a r i g h t a\_{right} be the values at leafs of f f. Then, f ( x i ) f(x\_{i}) equals to a l e f t a\_{left} if x i , j ≤ c x\_{i,j} \\leq c and a r i g h t a\_{right} if x i , j \> c x\_{i,j} \> c. Now the goal is to find the best j j and c c in terms of the chosen score function. For the L2 score function the formula takes the following form: S ( a , g ) \= − ∑ i w i ( a i − g i ) 2 \= − ( ∑ i : x i , j ≤ c w i ( a l e f t − g i ) 2 \+ ∑ i : x i , j \> c w i ( a r i g h t − g i ) 2 ) S(a, g) = -\\sum\\limits\_{i} w\_{i} (a\_{i} - g\_{i})^{2} = - \\left( \\displaystyle\\sum\\limits\_{i:x\_{i,j}\\leq c} w\_{i}(a\_{left} - g\_{i})^{2} + \\sum\\limits\_{i: x\_{i,j}\>c} w\_{i}(a\_{right} - g\_{i})^{2} \\right) Let's denote W l e f t \= ∑ i : x I , j ≤ c w i W\_{left} = \\displaystyle\\sum\_{i: x\_{I,j} \\leq c} w\_{i} and W r i g h t \= ∑ i : x i , j \> c w i W\_{right} = \\displaystyle\\sum\_{i: x\_{i,j} \>c} w\_{i}. 
The optimal values for a l e f t a\_{left} and a r i g h t a\_{right} are the weighted averages: - a l e f t ∗ \= ∑ i : x i , j ≤ c w i g i W l e f t a^{\*}\_{left} =\\displaystyle\\frac{\\sum\\limits\_{i: x\_{i,j} \\leq c} w\_{i} g\_{i}}{W\_{left}} - a r i g h t ∗ \= ∑ i : x i , j \> c w i g i W r i g h t a^{\*}\_{right} =\\displaystyle\\frac{\\sum\\limits\_{i: x\_{i,j} \> c} w\_{i} g\_{i}}{W\_{right}} After expanding brackets and removing terms, which are constant in the optimization: j ∗ , c ∗ \= a r g m a x j , c W l e f t ⋅ ( a l e f t ∗ ) 2 \+ W r i g h t ⋅ ( a r i g h t ∗ ) 2 j^{\*}, c^{\*} = argmax\_{j, c} W\_{left} \\cdot (a^{\*}\_{left})^{2} + W\_{right} \\cdot (a^{\*}\_{right})^{2} The latter argmax can be calculated by brute force search. The situation is slightly more complex when the tree depth is bigger than 1: - L2 score function: S is converted into a sum over leaves S ( a , g ) \= ∑ l e a f S ( a l e a f , g l e a f ) S(a,g) = \\sum\_{leaf} S(a\_{leaf}, g\_{leaf}) . The next step is to find j ∗ , c ∗ \= a r g m a x j , c S ( a ˉ , g ) j\*, c\* = argmax\_{j,c}{S(\\bar a, g)} , where a ˉ \\bar a are the optimal values in leaves after the j ∗ , c ∗ j\*, c\* split. - Depthwise and Lossguide methods: j , c j, c are sets of { j k } , { c k } \\{j\_k\\}, \\{c\_k\\} . k k stands for the index of the leaf, therefore the score function S S takes the following form: S ( a ˉ , g ) \= ∑ l \= l e a f S ( a ˉ ( j l , c l ) , g l ) S(\\bar a, g) = \\sum\_{l = leaf}S(\\bar a(j\_l, c\_l), g\_l) . Since S ( l e a f ) S(leaf) is a convex function, different j k 1 , c k 1 j\_{k1}, c\_{k1} and j k 2 , c k 2 j\_{k2}, c\_{k2} (splits for different leaves) can be searched separately by finding the optimal j ∗ , c ∗ \= a r g m a x j , c { S ( l e a f l e f t ) \+ S ( l e a f r i g h t ) − S ( l e a f b e f o r e \_ s p l i t ) } j\*, c\* = argmax\_{j,c}\\{S(leaf\_{left}) + S(leaf\_{right}) - S(leaf\_{before\\\_split})\\} . 
- SymmetricTree method: the same $j, c$ are searched for every leaf at once, so it is the total sum over all leaves, $S(a,g) = \sum_{leaf} S(leaf)$, that is optimized.

## Second-order leaf estimation method

Let's apply the Taylor expansion to the loss function at the point $a^{t-1} = F^{t-1}(x)$:

$$L(a^{t-1} + \phi, y) \approx \sum_{i} w_{i} \left[ l_{i} + l'_{i} \phi_{i} + \frac{1}{2} l''_{i} \phi_{i}^{2} \right] + \frac{1}{2} \lambda \|\phi\|_{2}^{2}\text{, where:}$$

- $\phi_{i}$ is the value of the new tree at $x_{i}$, i.e. the value of the leaf the object falls into
- $l_{i} = l(a^{t-1}_{i}, y_{i})$
- $l'_{i} = \frac{\partial l(a, y_{i})}{\partial a}\Bigr|_{a=a^{t-1}_{i}}$
- $l''_{i} = \frac{\partial^{2} l(a, y_{i})}{\partial a^{2}}\Bigr|_{a=a^{t-1}_{i}}$
- $\lambda$ is the L2 regularization parameter

Since the first term is constant in the optimization, the formula takes the following form after regrouping by leaves:

$$\sum\limits_{leaf=1}^{L} \left( \sum\limits_{i \in leaf} w_{i} \left[ l_{i} + l'_{i} \phi_{leaf} + \frac{1}{2} l''_{i} \phi_{leaf}^{2} \right] + \frac{1}{2} \lambda \phi_{leaf}^{2} \right) \to min$$

Then let's minimize this expression for each leaf independently:

$$\sum\limits_{i \in leaf} w_{i} \left[ l_{i} + l'_{i} \phi_{leaf} + \frac{1}{2} l''_{i} \phi^{2}_{leaf} \right] + \frac{1}{2} \lambda \phi_{leaf}^{2} \to min$$

Differentiating with respect to the leaf value $\phi_{leaf}$ and setting the derivative to zero:

$$\sum\limits_{i \in leaf} w_{i} \left[ l'_{i} + l''_{i} \phi_{leaf} \right] + \lambda \phi_{leaf} = 0$$

So the optimal value of $\phi_{leaf}$ is:

$$\phi^{*}_{leaf} = - \frac{\sum_{i} w_{i} l'_{i}}{\sum_{i} w_{i} l''_{i} + \lambda}$$

The summation is over all $i$ such that the object $x_{i}$ falls into the considered leaf. These optimal values of $\phi_{leaf}$ can then be used instead of the weighted averages of the gradients ($a^{*}_{left}$ and $a^{*}_{right}$ in the example above) in the same score functions.

## CatBoost score functions

CatBoost provides the following score functions:

| Score function | Description |
|----------------|-------------|
| L2 | Uses the first derivatives during the calculation. |
| Cosine | Uses the first derivatives during the calculation. Cannot be used with the Lossguide tree growing policy. |
| NewtonL2 | Uses the second derivatives during the calculation. This may improve the resulting quality of the model. |
| NewtonCosine | Uses the second derivatives during the calculation. Cannot be used with the Lossguide tree growing policy. |

## Per-object and per-feature penalties

CatBoost provides the following methods of affecting the score with penalties:

- Per-feature penalties for the first occurrence of a feature in the model. The given value is subtracted from the score if the current candidate is the first one to include the feature in the model.
- Per-object penalties for the first use of a feature for an object. The given value is multiplied by the number of objects that are divided by the current split and use the feature for the first time.
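The mechanism of these two penalties can be sketched as follows (an illustrative sketch only; all names are invented, and the precise combination with feature weights is given by the final score formula that follows):

```python
def penalized_score(score, split_features, feature_weight, feature_penalty,
                    object_penalty, used_in_model, first_use_count):
    """Apply feature weights and first-use penalties to a raw split score.

    Sketch of the mechanism described above:
      * multiply the score by the weight of every feature in the split,
      * subtract the per-feature penalty for features new to the model,
      * subtract the per-object penalty once per object that would use
        the feature for the first time (first_use_count[f]).
    """
    for f in split_features:
        score *= feature_weight.get(f, 1.0)
    for f in split_features:
        if f not in used_in_model:
            score -= feature_penalty.get(f, 0.0)
        score -= object_penalty.get(f, 0.0) * first_use_count.get(f, 0)
    return score

# A split on feature "age": the feature is new to the model and 3 objects
# in the current leaf would use it for the first time.
s = penalized_score(
    score=10.0,
    split_features=["age"],
    feature_weight={"age": 0.5},
    feature_penalty={"age": 1.0},
    object_penalty={"age": 0.1},
    used_in_model=set(),
    first_use_count={"age": 3},
)
print(s)  # 10 * 0.5 - 1.0 - 0.1 * 3, i.e. approximately 3.7
```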
The final score is calculated as follows:

$$Score' = Score \cdot \prod_{f\in S} W_{f} - \sum_{f\in S} P_{f} \cdot U(f) - \sum_{f\in S}\sum_{x \in L} EP_{f} \cdot U(f, x)$$

where:

- $W_{f}$ is the feature weight
- $P_{f}$ is the per-feature penalty
- $EP_{f}$ is the per-object penalty
- $S$ is the current split
- $L$ is the current leaf
- $U(f) = \begin{cases} 0, & \text{if } f \text{ was already used in the model}\\ 1, & \text{otherwise} \end{cases}$
- $U(f, x) = \begin{cases} 0, & \text{if } f \text{ was already used for object } x\\ 1, & \text{otherwise} \end{cases}$

## Usage

Use the corresponding parameter to set the score function during training.

> **Alert.** The supported score functions vary depending on the processing unit type:
>
> - GPU: all score types
> - CPU: Cosine, L2

**Python package:** `score_function`
**R package:** `score_function`
**Command-line interface:** `--score-function`

#### Description

The [score type](https://catboost.ai/docs/en/concepts/algorithm-score-functions) used to select the next split during tree construction.

Possible values:

- Cosine (do not use this score type with the Lossguide tree growing policy)
- L2
- NewtonCosine (do not use this score type with the Lossguide tree growing policy)
- NewtonL2
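As an illustration, the parameter could be set like this in the Python package (a sketch, assuming the `catboost` package is installed; the surrounding training code is commented out and the data names are invented):

```python
# Sketch: selecting the score function via the `score_function` parameter.
# NewtonL2 uses second derivatives; per the note above, score types other
# than Cosine and L2 are supported on GPU only.
params = {
    "iterations": 100,
    "depth": 6,
    "task_type": "GPU",
    "score_function": "NewtonL2",
}

# With the catboost package available, these parameters would be passed
# to the estimator (X_train and y_train are hypothetical):
#   from catboost import CatBoostRegressor
#   model = CatBoostRegressor(**params)
#   model.fit(X_train, y_train)
print(params["score_function"])  # NewtonL2
```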