# 5 Linear Regression

Before entering supervised learning territory, we want to discuss the general framework of linear regression. We will introduce this topic from a pure model-fitting point of view. In other words, we will postpone the learning aspect (the prediction of new data) until after the chapter on the theory of learning. The reason to cover linear regression in this way is for us to have something to work with when we start talking about the theory of supervised learning.

## 5.1 Motivation

Consider, again, the NBA dataset example from previous chapters. Suppose we want to use this data to predict the salary of NBA players in terms of certain variables such as the player's team, height, weight, position, years of professional experience, number of 2-pointers, number of 3-pointers, number of blocks, etc. Of course, we need information on the salaries of some current NBA players.

[Data table sketch: one row per player, one column per measurement.]

As usual, we use the symbol $\mathbf{x}_i$ to denote the vector of measurements of player $i$'s statistics (e.g. height, weight, etc.); in turn, the salary of the $i$-th player is represented with $y_i$. Ideally, we assume the existence of some function $f: \mathcal{X} \to y$ (i.e. a function that takes values from the $\mathcal{X}$ space and maps them to a single value $y$). We will refer to this function as the ideal "formula" for salary. Here we are using the word "formula" in a very loose sense, and not necessarily in the mathematical sense.

We now seek a hypothesized model (which we call $\hat{f}: \mathcal{X} \to y$), which we select from some set of candidate functions $h_1, h_2, \dots, h_m$. Our task is to obtain $\hat{f}$ in such a way that we can claim it is a good approximation of the (unknown) function $f$.

## 5.2 The Idea/Intuition of Regression

Let's go back to our example with NBA players. Recall that $y_i$ denotes the salary of the $i$-th player. For simplicity's sake, let's not worry about inflation. Say we now have a new prospective player from Europe, and we are tasked with predicting their salary, denoted by $y_0$. Let's review a couple of scenarios to get a high-level intuition for this task.

**Scenario 1.** Suppose we have no information on this new player. How would we compute $y_0$ (i.e. the salary of this new player)? One possibility is to guesstimate $y_0$ using the historical average salary $\bar{y}$ of NBA players. In other words, we would simply calculate:

$$\hat{y}_0 = \bar{y}$$

In this case we are using $\bar{y}$, a typical score (i.e. a measure of center), as a plausible guesstimate for $y_0$. We could also look at the median of the existing salaries if we are concerned about outliers or a skewed distribution of salaries.

**Scenario 2.** Now, suppose we know that this new player will sign with the LA Lakers. Compared to scenario 1, we now have a new bit of information, since we know which team will hire this player. Therefore, we can use this fact to make a more educated guess for $y_0$. How? Instead of using the salaries of all players, we can focus on the salaries of Lakers players. We could then use

$$\hat{y}_0 = \text{avg}(\text{Lakers' salaries})$$

that is, the average salary of all Lakers players. It is reasonable to expect that $\hat{y}_0$ is "closer" to the average Lakers salary than to the overall average salary of all NBA players.
Figure 5.1: Average salary by team
**Scenario 3.** Similarly, if we know this new player's years of experience (e.g. 6 years), we would look at the average of the salaries of players with the same level of experience.
Figure 5.2: Average salary by years of experience
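The three scenarios can be sketched as plain averages over subsets of the data. A minimal sketch with hypothetical salary figures (all numbers are made up for illustration; this is not real NBA data):

```python
# Hypothetical data: salaries (in millions), with team and years of experience.
players = [
    {"team": "Lakers", "years": 6, "salary": 12.0},
    {"team": "Lakers", "years": 3, "salary": 8.0},
    {"team": "Celtics", "years": 6, "salary": 10.0},
    {"team": "Celtics", "years": 1, "salary": 4.0},
]

def avg(values):
    return sum(values) / len(values)

# Scenario 1: no information -> overall average salary.
y0_overall = avg([p["salary"] for p in players])

# Scenario 2: we know the team -> average among Lakers players.
y0_team = avg([p["salary"] for p in players if p["team"] == "Lakers"])

# Scenario 3: we know the experience -> average among players with 6 years.
y0_exp = avg([p["salary"] for p in players if p["years"] == 6])

print(y0_overall, y0_team, y0_exp)  # 8.5 10.0 11.0
```

Each guess is an average of $y_i$ over the subset of players matching what we know about the newcomer, which is exactly the conditional-mean idea formalized next.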
What do the three previous scenarios correspond to? In all of these examples, the prediction is basically a conditional mean:

$$\hat{y}_0 = \text{ave}(y_i \mid x_i = x_0) \tag{5.1}$$

Of course, the previous strategy only makes sense when we have data points $x_i$ that are equal to the query point $x_0$. But what if none of the available $x_i$ values are equal to $x_0$? We'll talk about this later.

The previous hypothetical scenarios illustrate the core idea of regression: we obtain predictions $\hat{y}_0$ using quantities of the form $\text{ave}(y_i \mid x_i = x_0)$, which can be formalized, under some assumptions, into the notion of conditional expectations of the form:

$$E(y_i \mid x_{i1}^*, x_{i2}^*, \dots, x_{ip}^*) \longrightarrow \hat{y} \tag{5.2}$$

where $x_{ij}^*$ represents the $i$-th measurement of the $j$-th variable.
The above equation is what we call the *regression function*; note that the regression function is nothing more than a conditional expectation!

## 5.3 The Linear Regression Model

In a regression model we use one or more features $X$ to say something about the response $Y$. In turn, a linear regression model tells us to combine our features in a linear way in order to approximate the response. In the univariate case, we have a linear equation:

$$\hat{Y} = b_0 + b_1 X \tag{5.3}$$

In pointwise form, that is, for a given individual $i$, we have:

$$\hat{y}_i = b_0 + b_1 x_i \tag{5.4}$$

In vector notation:

$$\hat{\mathbf{y}} = b_0 + b_1 \mathbf{x} \tag{5.5}$$

To simplify notation, sometimes we prefer to add an auxiliary constant feature in the form of a vector of 1's with $n$ elements, and then use matrix notation with the following elements:

$$
\mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad
\hat{\mathbf{y}} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{bmatrix}, \quad
\mathbf{b} = \begin{bmatrix} b_0 \\ b_1 \end{bmatrix}
$$

In the multidimensional case, when we have $p > 1$ predictors:

$$
\mathbf{X} = \begin{bmatrix} 1 & x_{11} & \dots & x_{1p} \\ 1 & x_{21} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \dots & x_{np} \end{bmatrix}, \quad
\hat{\mathbf{y}} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{bmatrix}, \quad
\mathbf{b} = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_p \end{bmatrix}
$$

With the matrix of features, the response, and the coefficients, we have a compact expression for the predicted outcomes:

$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{b} \tag{5.6}$$

In path diagram form, the linear model looks like this:
Figure 5.3: Linear combination with constant term
If we assume that the predictors and the response are mean-centered, then we don't have to worry about the constant term $x_0$:
Figure 5.4: Linear combination without constant term
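The matrix form of equation (5.6) is easy to verify numerically. A minimal sketch with NumPy, using made-up data and an arbitrary coefficient vector (all values are assumptions for illustration):

```python
import numpy as np

# Made-up data: n = 4 individuals, p = 2 predictors (values are arbitrary).
x = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0]])
n = x.shape[0]

# Add the auxiliary constant feature: a leading column of 1's.
X = np.column_stack([np.ones(n), x])   # shape (n, p + 1)

# Some coefficient vector b = (b0, b1, b2), chosen arbitrarily.
b = np.array([0.5, 1.0, -2.0])

# Compact prediction y_hat = X b, identical to b0 + b1*x1 + b2*x2 row by row.
y_hat = X @ b
pointwise = b[0] + x @ b[1:]
print(np.allclose(y_hat, pointwise))  # True
```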
Obviously the question becomes: how do we obtain the vector of coefficients $\mathbf{b}$?

## 5.4 The Error Measure

We would like $\hat{y}_i$ to be "as close as" possible to $y_i$. This requires coming up with some type of *measure of closeness*. Among the various functions that we can use to measure how close $\hat{y}_i$ and $y_i$ are, the most common option is the squared distance between such values:

$$d^2(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2 = (\hat{y}_i - y_i)^2 \tag{5.7}$$

Replacing $\hat{y}_i$ with $\mathbf{b}^\mathsf{T} \vec{x}_i$ we have:

$$d^2(y_i, \hat{y}_i) = (\mathbf{b}^\mathsf{T} \vec{x}_i - y_i)^2 \tag{5.8}$$

Notice that $d^2(y_i, \hat{y}_i)$ is a pointwise error measure, which we can generally denote as $\text{err}_i$. But we also need to define a global measure of error. This is typically done by adding all the pointwise error measures $\text{err}_i$.

There are two flavors of overall error measures based on squared pointwise differences: the sum of squared errors, or $\text{SSE}$, and the mean squared error, or $\text{MSE}$. The sum of squared errors is defined as:

$$\text{SSE} = \sum_{i=1}^{n} \text{err}_i \tag{5.9}$$

The mean squared error is defined as:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \text{err}_i \tag{5.10}$$

As you can tell, $\text{SSE} = n\,\text{MSE}$ and, vice versa, $\text{MSE} = \text{SSE}/n$. Throughout this book, unless mentioned otherwise, when dealing with regression problems we will consider the $\text{MSE}$ as the default overall error function to be minimized (you could also take the $\text{SSE}$ instead).
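As a quick sketch, the pointwise and overall error measures of equations (5.7) to (5.10), computed on made-up observed and predicted values:

```python
import numpy as np

# Made-up observed and predicted values (for illustration only).
y = np.array([3.0, -1.0, 2.0, 5.0])
y_hat = np.array([2.5, 0.0, 2.0, 4.0])

err = (y - y_hat) ** 2          # pointwise squared errors, eq (5.7)
sse = err.sum()                 # sum of squared errors, eq (5.9)
mse = err.mean()                # mean squared error, eq (5.10)

n = len(y)
print(sse, mse, np.isclose(sse, n * mse))  # 2.25 0.5625 True
```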
Let $e_i$ be:

$$e_i = (y_i - \hat{y}_i) \quad \rightarrow \quad e_i^2 = (y_i - \hat{y}_i)^2 = \text{err}_i \tag{5.11}$$

Doing some algebra, it's easy to see that:

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} e_i^2 = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{b}^\mathsf{T} \vec{x}_i - y_i)^2 = \frac{1}{n} (\mathbf{X}\mathbf{b} - \mathbf{y})^\mathsf{T}(\mathbf{X}\mathbf{b} - \mathbf{y}) = \frac{1}{n} \|\mathbf{X}\mathbf{b} - \mathbf{y}\|^2 = \frac{1}{n} \|\hat{\mathbf{y}} - \mathbf{y}\|^2 \tag{5.12}
$$

As you can tell, the mean squared error is proportional to the squared norm of the residual vector $\mathbf{e} = \hat{\mathbf{y}} - \mathbf{y}$:

$$\text{MSE} = \frac{1}{n} \|\mathbf{e}\|^2 = \frac{1}{n} (\hat{\mathbf{y}} - \mathbf{y})^\mathsf{T}(\hat{\mathbf{y}} - \mathbf{y}) \tag{5.13}$$

## 5.5 The Least Squares Algorithm

In (ordinary) least squares regression, we want to minimize the mean of squared errors ($\text{MSE}$). This minimization problem involves computing the derivative of $\text{MSE}$ with respect to $\mathbf{b}$. In other words, we compute the gradient of $\text{MSE}$, denoted $\nabla \text{MSE}(\mathbf{b})$, which is the vector of partial derivatives of $\text{MSE}$ with respect to each parameter $b_0, b_1, \dots, b_p$:

$$
\nabla \text{MSE}(\mathbf{b}) = \frac{\partial}{\partial \mathbf{b}} \text{MSE}(\mathbf{b}) = \frac{\partial}{\partial \mathbf{b}} \left( \frac{1}{n} \mathbf{b}^\mathsf{T} \mathbf{X}^\mathsf{T} \mathbf{X} \mathbf{b} - \frac{2}{n} \mathbf{b}^\mathsf{T} \mathbf{X}^\mathsf{T} \mathbf{y} + \frac{1}{n} \mathbf{y}^\mathsf{T} \mathbf{y} \right) \tag{5.14}
$$

which becomes:

$$\nabla \text{MSE}(\mathbf{b}) = \frac{2}{n} \mathbf{X}^\mathsf{T} \mathbf{X} \mathbf{b} - \frac{2}{n} \mathbf{X}^\mathsf{T} \mathbf{y} \tag{5.15}$$

Equating this to zero we get:

$$\mathbf{X}^\mathsf{T} \mathbf{X} \mathbf{b} = \mathbf{X}^\mathsf{T} \mathbf{y} \quad (\textit{normal equations}) \tag{5.16}$$

The above equation defines the system that most authors refer to as the *normal equations*: a system of $p + 1$ equations with $p + 1$ unknowns (assuming we have a constant term $b_0$). If the cross-product matrix $\mathbf{X}^\mathsf{T}\mathbf{X}$ is invertible, which is not a minor assumption, then the vector of regression coefficients $\mathbf{b}$ that we are looking for is given by:

$$\mathbf{b} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1} \mathbf{X}^\mathsf{T} \mathbf{y} \tag{5.17}$$

Having obtained $\mathbf{b}$, we can easily compute the vector of predicted responses:

$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{b} = \mathbf{X}(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1} \mathbf{X}^\mathsf{T} \mathbf{y} \tag{5.18}$$

If we denote $\mathbf{H} = \mathbf{X}(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1} \mathbf{X}^\mathsf{T}$, then the predicted response is:

$$\hat{\mathbf{y}} = \mathbf{H}\mathbf{y} \tag{5.19}$$

The matrix $\mathbf{H}$ is better known as the *hat matrix*, because it puts the hat on the response. More importantly, the matrix $\mathbf{H}$ is an *orthogonal projector*. From linear algebra, orthogonal projectors have very interesting properties:

- they are symmetric
- they are idempotent
- their eigenvalues are either 0 or 1

## 5.6 Geometries of OLS

Now that we've seen the algebra, it's time to look at the geometric interpretation of all the action that goes on within linear regression via OLS. We will discuss three geometric perspectives:

- OLS from the individuals point of view (i.e. rows of the data matrix).
- OLS from the variables point of view (i.e. columns of the data matrix).
- OLS from the parameters point of view, and the error surface.

### 5.6.1 Rows Perspective

This is probably the most popular perspective, covered in most textbooks. For illustration purposes, let's assume that our data has just $p = 1$ predictor.
In other words, we have the response $Y$ and one predictor $X$. We can depict individuals as points in this space:
Figure 5.5: Scatterplot of individuals
In linear regression, we want to predict $y_i$ by linearly mixing the inputs: $\hat{y}_i = b_0 + b_1 x_i$. In two dimensions, the fitted model corresponds to a line. In three dimensions it corresponds to a plane. And in higher dimensions it corresponds to a hyperplane.
Figure 5.6: Scatterplot with regression line
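In this $p = 1$ setting, the fitted line can be computed exactly as in equations (5.16) to (5.19). A minimal sketch with made-up data points. (In practice one would use `np.linalg.lstsq` rather than forming the inverse explicitly; the explicit formulas are mirrored here only to follow the derivation.)

```python
import numpy as np

# Made-up data points (x_i, y_i), chosen to be roughly linear.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix with the constant column: each row is (1, x_i).
X = np.column_stack([np.ones_like(x), x])

# Normal equations (5.16): solve X'X b = X'y for b = (b0, b1).
b = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X'X)^{-1} X' "puts the hat" on y: y_hat = H y, eq (5.19).
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y

print(b)                          # fitted intercept b0 and slope b1
print(np.allclose(y_hat, X @ b))  # True
print(np.allclose(H, H.T))        # True: H is symmetric
print(np.allclose(H @ H, H))      # True: H is idempotent
```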
With a fitted line, we obtain predicted values $\hat{y}_i$. Some predicted values may be equal to the observed values, others will be greater than the observed values, and some will be smaller than the observed values.
Figure 5.7: Observed values and predicted values
As you can imagine, given a set of data points, you can (in general) fit an infinite number of lines. So which line are we looking for? We want to obtain the line that minimizes the squared errors $e_i = \hat{y}_i - y_i$. In the figure below, these errors (also known as *residuals*) are represented by the vertical difference between the observed values $y_i$ and the predicted values $\hat{y}_i$.
Figure 5.8: OLS focuses on minimizing the squared errors
Combining all residuals, we want to obtain parameters $b_0, \dots, b_p$ that minimize the squared norm of the vector of residuals:

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \sum_{i=1}^{n} (b_0 + b_1 x_i - y_i)^2 \tag{5.20}$$

In vector-matrix form we have:

$$\|\mathbf{e}\|^2 = \|\hat{\mathbf{y}} - \mathbf{y}\|^2 = \|\mathbf{X}\mathbf{b} - \mathbf{y}\|^2 = (\mathbf{X}\mathbf{b} - \mathbf{y})^\mathsf{T}(\mathbf{X}\mathbf{b} - \mathbf{y}) \propto \text{MSE} \tag{5.21}$$

As you can tell, minimizing the squared norm of the vector of residuals is equivalent to minimizing the mean squared error.

### 5.6.2 Columns Perspective

We can also look at the geometry of OLS from the columns perspective. This is less common than the rows perspective, but still very enlightening. Imagine variables, both the response and the predictors, living in an $n$-dimensional space.
In this space, the $X$ variables span some subspace $S_X$. This subspace does not, in general, contain the response, unless $Y$ happens to be a linear combination of $X_1, \dots, X_p$.
Figure 5.9: Features and Response view in n-dim space
What are we looking for? We're looking for a linear combination $\mathbf{X}\mathbf{b}$ that gives us a good approximation to $\mathbf{y}$. As you can tell, there is an infinite number of linear combinations that can be formed with $X_1, \dots, X_p$.
Figure 5.10: Linear combination of features
The mix of features that we are interested in, $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}$, is the one that gives us the closest approximation to $\mathbf{y}$.
Figure 5.11: Linear combination to be as close as possible to response
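The picture above suggests that the best $\hat{\mathbf{y}}$ is the orthogonal projection of $\mathbf{y}$ onto the subspace spanned by the columns of $\mathbf{X}$. A quick numerical sketch of this view, with made-up data: the residual of the OLS fit is orthogonal to every column of $\mathbf{X}$, which is exactly what the normal equations state.

```python
import numpy as np

# Made-up data for illustration: n = 6 observations, constant + 2 predictors.
X = np.array([[1.0, 0.5, 2.0],
              [1.0, 1.5, 0.0],
              [1.0, 2.5, 1.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.5, 2.0],
              [1.0, 5.0, 4.0]])
y = np.array([1.0, 2.0, 2.5, 4.0, 5.5, 6.0])

# OLS fit via the normal equations: X'X b = X'y.
b = np.linalg.solve(X.T @ X, X.T @ y)
e = X @ b - y                   # residual vector e = y_hat - y

# The normal equations say exactly that X'e = 0: the residual is orthogonal
# to every column of X, so y_hat = X b is the orthogonal projection of y
# onto span(X).
print(np.allclose(X.T @ e, 0))  # True
```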
Now, what do we mean by *closest approximation*? How do we determine the closeness between $\hat{\mathbf{y}}$ and $\mathbf{y}$? By looking at the difference, which results in a vector $\mathbf{e} = \hat{\mathbf{y}} - \mathbf{y}$, and then measuring the size, or norm, of this vector (the squared norm, to be precise). In other words, we want to obtain $\hat{\mathbf{y}}$ such that the squared norm $\|\mathbf{e}\|^2$ is as small as possible:

$$\text{Minimize} \quad \|\mathbf{e}\|^2 = \|\hat{\mathbf{y}} - \mathbf{y}\|^2 \tag{5.22}$$

Minimizing the squared norm of $\mathbf{e}$ amounts to minimizing the mean squared error.

### 5.6.3 Parameters Perspective

In addition to the two previously discussed perspectives (rows and columns), we can also visualize the regression problem from the point of view of the parameters $\mathbf{b}$ and the error surface of the $\text{MSE}$. This is the least common perspective in the general linear regression literature, although it is not that uncommon within the Statistical Learning literature.

For illustration purposes, assume that we have only two predictors, $X_1$ and $X_2$. Recall that the Mean Squared Error (MSE) is:

$$E(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{n} \left( \mathbf{b}^\mathsf{T} \mathbf{X}^\mathsf{T} \mathbf{X} \mathbf{b} - 2\, \mathbf{b}^\mathsf{T} \mathbf{X}^\mathsf{T} \mathbf{y} + \mathbf{y}^\mathsf{T} \mathbf{y} \right) \tag{5.23}$$

Now, from the point of view of $\mathbf{b} = (b_1, b_2)$, we can classify the order of each term:

$$E(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{n} \Big( \underbrace{\mathbf{b}^\mathsf{T} \mathbf{X}^\mathsf{T} \mathbf{X} \mathbf{b}}_{\text{quadratic form}} \;-\; \underbrace{2\, \mathbf{b}^\mathsf{T} \mathbf{X}^\mathsf{T} \mathbf{y}}_{\text{linear}} \;+\; \underbrace{\mathbf{y}^\mathsf{T} \mathbf{y}}_{\text{constant}} \Big) \tag{5.24}$$

Since $\mathbf{X}^\mathsf{T}\mathbf{X}$ is positive semidefinite, we know that $\mathbf{b}^\mathsf{T} \mathbf{X}^\mathsf{T} \mathbf{X} \mathbf{b} \geq 0$. Furthermore, we know (from vector calculus) that the error surface will be a paraboloid (a bowl-shaped surface) in the $(E, b_1, b_2)$ space. The following diagram depicts this situation.
Figure 5.12: Error Surface
Imagine taking horizontal slices of the error surface. Each slice can be projected onto the plane spanned by the parameters $b_1$ and $b_2$. The resulting projections form something like a topographic map, with error contours on this plane. In general, those contours will be ellipses.
Figure 5.13: Error Surface with slices, and their projections
Quadratic error surfaces like this have a minimum value, so we are guaranteed the existence of $\mathbf{b}^* = (b_1^*, b_2^*)$ such that the error $E$ is minimized. This is a powerful result! Consider, by contrast, a *parabolic cylinder*. Such a shape has no unique minimum; rather, it has an infinite number of points (all lying on a line) that minimize the function. This is exactly what happens when $\mathbf{X}^\mathsf{T}\mathbf{X}$ is positive semidefinite but singular; as long as $\mathbf{X}^\mathsf{T}\mathbf{X}$ is positive definite (i.e. invertible), we never run into this latter case.
Figure 5.14: Error Surface with contour errors and the minimum
The minimum of the error surface occurs at the point $(b_1^*, b_2^*)$. This is precisely the OLS solution.
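This perspective can also be checked numerically. A minimal sketch with made-up, mean-centered data: evaluate the error function of equation (5.23) on a grid of parameter values and verify that no grid point does better than the OLS solution.

```python
import numpy as np

# Made-up mean-centered data with two predictors (no intercept term).
X = np.array([[ 1.0,  0.5],
              [-1.0,  1.5],
              [ 2.0, -0.5],
              [-2.0, -1.5]])
y = np.array([1.5, -0.5, 2.0, -3.0])
n = len(y)

def E(b):
    """Quadratic error surface, eq (5.23): (1/n)(b'X'Xb - 2 b'X'y + y'y)."""
    return (b @ X.T @ X @ b - 2 * b @ X.T @ y + y @ y) / n

# OLS minimizer from the normal equations.
b_star = np.linalg.solve(X.T @ X, X.T @ y)

# Evaluate E on a grid around b_star: every grid point is at least E(b_star).
grid = np.linspace(-3, 3, 31)
values = [E(b_star + np.array([d1, d2])) for d1 in grid for d2 in grid]
print(min(values) >= E(b_star) - 1e-12)  # True: the OLS point is the minimum
```

Because $\mathbf{X}^\mathsf{T}\mathbf{X}$ is positive definite here, the surface is a bowl and the grid minimum coincides with $\mathbf{b}^*$.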
- [**22\.2.3** How to find k k?](https://allmodelsarewrong.github.io/knn.html#how-to-find-k)
- [**23** Kernel Smoothers](https://allmodelsarewrong.github.io/kernel-smoothers.html)
- [**23\.1** Introduction](https://allmodelsarewrong.github.io/kernel-smoothers.html#introduction-4)
- [**23\.2** Kernel Smoothers](https://allmodelsarewrong.github.io/kernel-smoothers.html#kernel-smoothers-1)
- [**23\.2.1** Kernel Functions](https://allmodelsarewrong.github.io/kernel-smoothers.html#kernel-functions)
- [**23\.2.2** Weights from Kernels](https://allmodelsarewrong.github.io/kernel-smoothers.html#weights-from-kernels)
- [**23\.2.3** Kernel Estimator](https://allmodelsarewrong.github.io/kernel-smoothers.html#kernel-estimator)
- [**23\.3** Local Polynomial Estimators](https://allmodelsarewrong.github.io/kernel-smoothers.html#local-polynomial-estimators)
- **VIII Classification**
- [**24** Classification](https://allmodelsarewrong.github.io/classif.html)
- [**24\.1** Introduction](https://allmodelsarewrong.github.io/classif.html#introduction-5)
- [**24\.1.1** Credit Score Example](https://allmodelsarewrong.github.io/classif.html#credit-score-example)
- [**24\.1.2** Toy Example](https://allmodelsarewrong.github.io/classif.html#toy-example-1)
- [**24\.1.3** Two-class Problem](https://allmodelsarewrong.github.io/classif.html#two-class-problem)
- [**24\.1.4** Bayes’ Rule Reminder](https://allmodelsarewrong.github.io/classif.html#bayes-rule-reminder)
- [**24\.2** Bayes Classifier](https://allmodelsarewrong.github.io/classif.html#bayes-classifier)
- [**25** Logistic Regression](https://allmodelsarewrong.github.io/logistic.html)
- [**25\.1** Motivation](https://allmodelsarewrong.github.io/logistic.html#motivation-1)
- [**25\.1.1** First Approach: Fitting a Line](https://allmodelsarewrong.github.io/logistic.html#first-approach-fitting-a-line)
- [**25\.1.2** Second Approach: Harsh Thresholding](https://allmodelsarewrong.github.io/logistic.html#second-approach-harsh-thresholding)
- [**25\.1.3** Third Approach: Conditional Means](https://allmodelsarewrong.github.io/logistic.html#third-approach-conditional-means)
- [**25\.2** Logistic Regression Model](https://allmodelsarewrong.github.io/logistic.html#logistic-regression-model)
- [**25\.2.1** The Criterion Being Optimized](https://allmodelsarewrong.github.io/logistic.html#the-criterion-being-optimized)
- [**25\.2.2** Another Way to Solve Logistic Regression](https://allmodelsarewrong.github.io/logistic.html#another-way-to-solve-logistic-regression)
- [**26** Preamble for Discriminant Analysis](https://allmodelsarewrong.github.io/discrim.html "26 Preamble for Discriminant Analysis")
- [**26\.1** Motivation](https://allmodelsarewrong.github.io/discrim.html#motivation-2)
- [**26\.1.1** Distinguishing Species](https://allmodelsarewrong.github.io/discrim.html#distinguishing-species)
- [**26\.1.2** Sum of Squares Decomposition](https://allmodelsarewrong.github.io/discrim.html#sum-of-squares-decomposition)
- [**26\.2** Derived Ratios from Sum-of-Squares](https://allmodelsarewrong.github.io/discrim.html#derived-ratios-from-sum-of-squares)
- [**26\.2.1** Correlation Ratio](https://allmodelsarewrong.github.io/discrim.html#correlation-ratio)
- [**26\.2.2** F-Ratio](https://allmodelsarewrong.github.io/discrim.html#f-ratio)
- [**26\.3** Geometric Perspective](https://allmodelsarewrong.github.io/discrim.html#geometric-perspective)
- [**26\.3.1** Clouds from Class Structure](https://allmodelsarewrong.github.io/discrim.html#clouds-from-class-structure)
- [**26\.3.2** Dispersion Decomposition](https://allmodelsarewrong.github.io/discrim.html#dispersion-decomposition)
- [**27** Canonical Discriminant Analysis](https://allmodelsarewrong.github.io/cda.html)
- [**27\.1** CDA: Semi-Supervised Aspect](https://allmodelsarewrong.github.io/cda.html#cda-semi-supervised-aspect)
- [**27\.2** Looking for a discriminant axis](https://allmodelsarewrong.github.io/cda.html#looking-for-a-discriminant-axis)
- [**27\.3** Looking for a Compromise Criterion](https://allmodelsarewrong.github.io/cda.html#looking-for-a-compromise-criterion)
- [**27\.3.1** Correlation Ratio Criterion](https://allmodelsarewrong.github.io/cda.html#correlation-ratio-criterion)
- [**27\.3.2** F-ratio Criterion](https://allmodelsarewrong.github.io/cda.html#f-ratio-criterion)
- [**27\.3.3** A Special PCA](https://allmodelsarewrong.github.io/cda.html#a-special-pca)
- [**27\.4** CDA: Supervised Aspect](https://allmodelsarewrong.github.io/cda.html#cda-supervised-aspect)
- [**27\.4.1** Distance behind CDA](https://allmodelsarewrong.github.io/cda.html#distance-behind-cda)
- [**27\.4.2** Predictive Idea](https://allmodelsarewrong.github.io/cda.html#predictive-idea)
- [**27\.4.3** CDA Classifier](https://allmodelsarewrong.github.io/cda.html#cda-classifier)
- [**27\.4.4** Limitations of CDA classifier](https://allmodelsarewrong.github.io/cda.html#limitations-of-cda-classifier)
- [**28** Discriminant Analysis](https://allmodelsarewrong.github.io/discanalysis.html)
- [**28\.1** Probabilistic DA](https://allmodelsarewrong.github.io/discanalysis.html#probabilistic-da)
- [**28\.1.1** Normal Distributions](https://allmodelsarewrong.github.io/discanalysis.html#normal-distributions)
- [**28\.1.2** Estimating Parameters of Normal Distributions](https://allmodelsarewrong.github.io/discanalysis.html#estimating-parameters-of-normal-distributions)
- [**28\.2** Discriminant Functions](https://allmodelsarewrong.github.io/discanalysis.html#discriminant-functions)
- [**28\.3** Quadratic Discriminant Analysis (QDA)](https://allmodelsarewrong.github.io/discanalysis.html#quadratic-discriminant-analysis-qda)
- [**28\.4** Linear Discriminant Analysis](https://allmodelsarewrong.github.io/discanalysis.html#linear-discriminant-analysis)
- [**28\.4.1** Canonical Discriminant Analysis](https://allmodelsarewrong.github.io/discanalysis.html#canonical-discriminant-analysis)
- [**28\.4.2** Naive Bayes](https://allmodelsarewrong.github.io/discanalysis.html#naive-bayes)
- [**28\.4.3** Fifth Case](https://allmodelsarewrong.github.io/discanalysis.html#fifth-case)
- [**28\.5** Comparing the Cases](https://allmodelsarewrong.github.io/discanalysis.html#comparing-the-cases)
- [**29** Performance of Classifiers](https://allmodelsarewrong.github.io/classperformance.html)
- [**29\.1** Classification Error Measures](https://allmodelsarewrong.github.io/classperformance.html#classification-error-measures)
- [**29\.1.1** Errors for Binary Response](https://allmodelsarewrong.github.io/classperformance.html#errors-for-binary-response)
- [**29\.1.2** Error for Categorical Response](https://allmodelsarewrong.github.io/classperformance.html#error-for-categorical-response)
- [**29\.2** Confusion Matrices](https://allmodelsarewrong.github.io/classperformance.html#confusion-matrices)
- [**29\.3** Binary Response Example](https://allmodelsarewrong.github.io/classperformance.html#binary-response-example)
- [**29\.3.1** Application for Checking Account](https://allmodelsarewrong.github.io/classperformance.html#application-for-checking-account)
- [**29\.3.2** Application for Loan](https://allmodelsarewrong.github.io/classperformance.html#application-for-loan)
- [**29\.4** Decision Rules and Errors](https://allmodelsarewrong.github.io/classperformance.html#decision-rules-and-errors)
- [**29\.5** ROC Curves](https://allmodelsarewrong.github.io/classperformance.html#roc-curves)
- [**29\.5.1** Graphing ROC curves](https://allmodelsarewrong.github.io/classperformance.html#graphing-roc-curves)
- **IX Unsupervised II: Clustering**
- [**30** Clustering](https://allmodelsarewrong.github.io/clustering.html)
- [**30\.1** About Clustering](https://allmodelsarewrong.github.io/clustering.html#about-clustering)
- [**30\.1.1** Types of Clustering](https://allmodelsarewrong.github.io/clustering.html#types-of-clustering)
- [**30\.1.2** Hard Clustering](https://allmodelsarewrong.github.io/clustering.html#hard-clustering)
- [**30\.2** Dispersion Measures](https://allmodelsarewrong.github.io/clustering.html#dispersion-measures)
- [**30\.3** Complexity in Clustering](https://allmodelsarewrong.github.io/clustering.html#complexity-in-clustering)
- [**31** K-Means](https://allmodelsarewrong.github.io/kmeans.html)
- [**31\.1** Toy Example](https://allmodelsarewrong.github.io/kmeans.html#toy-example-2)
- [**31\.2** What does K-means do?](https://allmodelsarewrong.github.io/kmeans.html#what-does-k-means-do)
- [**31\.3** K-Means Algorithms](https://allmodelsarewrong.github.io/kmeans.html#k-means-algorithms)
- [**31\.3.1** Classic Version](https://allmodelsarewrong.github.io/kmeans.html#classic-version)
- [**31\.3.2** Moving Centers Algorithm](https://allmodelsarewrong.github.io/kmeans.html#moving-centers-algorithm)
- [**31\.3.3** Dynamic Clouds](https://allmodelsarewrong.github.io/kmeans.html#dynamic-clouds)
- [**31\.3.4** Choosing K K](https://allmodelsarewrong.github.io/kmeans.html#choosing-k)
- [**31\.3.5** Comments](https://allmodelsarewrong.github.io/kmeans.html#comments)
- [**32** Hierarchical Clustering](https://allmodelsarewrong.github.io/hclus.html)
- [**32\.1** Agglomerative Methods](https://allmodelsarewrong.github.io/hclus.html#agglomerative-methods)
- [**32\.2** Example: Single Linkage](https://allmodelsarewrong.github.io/hclus.html#example-single-linkage)
- [**32\.2.1** Dendrogram](https://allmodelsarewrong.github.io/hclus.html#dendrogram)
- [**32\.3** Example: Complete Linkage](https://allmodelsarewrong.github.io/hclus.html#example-complete-linkage)
- [**32\.3.1** Cutting Dendograms](https://allmodelsarewrong.github.io/hclus.html#cutting-dendograms)
- [**32\.3.2** Pros and COons](https://allmodelsarewrong.github.io/hclus.html#pros-and-coons)
- **X Tree-based Methods**
- [**33** Intro to Decision Trees](https://allmodelsarewrong.github.io/trees.html)
- [**33\.1** Introduction](https://allmodelsarewrong.github.io/trees.html#introduction-6)
- [**33\.2** Some Terminology](https://allmodelsarewrong.github.io/trees.html#some-terminology)
- [**33\.2.1** Binary Trees](https://allmodelsarewrong.github.io/trees.html#binary-trees)
- [**33\.3** Space Partitions](https://allmodelsarewrong.github.io/trees.html#space-partitions)
- [**33\.3.1** The Process of Building a Tree](https://allmodelsarewrong.github.io/trees.html#the-process-of-building-a-tree)
- [**34** Binary Splits and Impurity](https://allmodelsarewrong.github.io/tree-impurities.html)
- [**34\.1** Binary Partitions](https://allmodelsarewrong.github.io/tree-impurities.html#binary-partitions)
- [**34\.1.1** Splits of Binary variables](https://allmodelsarewrong.github.io/tree-impurities.html#splits-of-binary-variables)
- [**34\.1.2** Splits of Nominal Variables](https://allmodelsarewrong.github.io/tree-impurities.html#splits-of-nominal-variables)
- [**34\.1.3** Splits of Ordinal Variables](https://allmodelsarewrong.github.io/tree-impurities.html#splits-of-ordinal-variables)
- [**34\.1.4** Splits Continuous variables](https://allmodelsarewrong.github.io/tree-impurities.html#splits-continuous-variables)
- [**34\.2** Measures of Impurity](https://allmodelsarewrong.github.io/tree-impurities.html#measures-of-impurity)
- [**34\.2.1** Entropy](https://allmodelsarewrong.github.io/tree-impurities.html#entropy)
- [**34\.2.2** The Math Behind Entropy](https://allmodelsarewrong.github.io/tree-impurities.html#the-math-behind-entropy)
- [**34\.2.3** Gini Impurity](https://allmodelsarewrong.github.io/tree-impurities.html#gini-impurity)
- [**34\.2.4** Variance-based Impurity](https://allmodelsarewrong.github.io/tree-impurities.html#variance-based-impurity)
- [**35** Splitting Nodes](https://allmodelsarewrong.github.io/tree-splits.html)
- [**35\.1** Entropy-based Splits](https://allmodelsarewrong.github.io/tree-splits.html#entropy-based-splits)
- [**35\.2** Gini-index based Splits](https://allmodelsarewrong.github.io/tree-splits.html#gini-index-based-splits)
- [**35\.3** Looking for the best split](https://allmodelsarewrong.github.io/tree-splits.html#looking-for-the-best-split)
- [**36** Building Binary Trees](https://allmodelsarewrong.github.io/tree-basics.html)
- [**36\.1** Node-Splitting Stopping Criteria](https://allmodelsarewrong.github.io/tree-basics.html#node-splitting-stopping-criteria)
- [**36\.2** Issues with Trees](https://allmodelsarewrong.github.io/tree-basics.html#issues-with-trees)
- [**36\.2.1** Bias-Variance of Trees](https://allmodelsarewrong.github.io/tree-basics.html#bias-variance-of-trees)
- [**36\.3** Pruning a Tree](https://allmodelsarewrong.github.io/tree-basics.html#pruning-a-tree)
- [**36\.4** Pros and Cons of Trees](https://allmodelsarewrong.github.io/tree-basics.html#pros-and-cons-of-trees)
- [**36\.4.1** Advantages of Trees](https://allmodelsarewrong.github.io/tree-basics.html#advantages-of-trees)
- [**36\.4.2** Disadvantages of Trees](https://allmodelsarewrong.github.io/tree-basics.html#disadvantages-of-trees)
- [**37** Bagging](https://allmodelsarewrong.github.io/bagging.html)
- [**37\.1** Introduction](https://allmodelsarewrong.github.io/bagging.html#introduction-7)
- [**37\.1.1** Idea of Bagging](https://allmodelsarewrong.github.io/bagging.html#idea-of-bagging)
- [**37\.2** Why Bother Bagging?](https://allmodelsarewrong.github.io/bagging.html#why-bother-bagging)
- [**38** Random Forests](https://allmodelsarewrong.github.io/forest.html)
- [**38\.1** Introduction](https://allmodelsarewrong.github.io/forest.html#introduction-8)
- [**38\.2** Algorithm](https://allmodelsarewrong.github.io/forest.html#algorithm-1)
- [**38\.2.1** Two Sources of Randomness](https://allmodelsarewrong.github.io/forest.html#two-sources-of-randomness)
- [**38\.2.2** Regressions and Classification Forests](https://allmodelsarewrong.github.io/forest.html#regressions-and-classification-forests)
- [**38\.2.3** Key Advantage of Random Forests](https://allmodelsarewrong.github.io/forest.html#key-advantage-of-random-forests)
- [Made with bookdown](https://github.com/rstudio/bookdown)
Facebook
Twitter
LinkedIn
Weibo
Instapaper
A
A
Serif
Sans
White
Sepia
Night
EPUB
# [All Models Are Wrong: Concepts of Statistical Learning](https://allmodelsarewrong.github.io/)
# 5 Linear Regression
Before entering supervised learning territory, we want to discuss the general framework of linear regression. We will introduce this topic from a pure model-fitting point of view. In other words, we will postpone the learning aspect (the prediction of new data) until after the [chapter on the theory of learning](https://allmodelsarewrong.github.io/learning.html#learning).
The reason to cover linear regression in this way is for us to have something to work with when we start talking about the theory of supervised learning.
## 5\.1 Motivation
Consider, again, the NBA dataset example from previous chapters. Suppose we want to use these data to predict the salary of NBA players in terms of certain variables such as the player's team, height, weight, position, years of professional experience, number of 2-point shots, number of 3-point shots, number of blocks, etc. Of course, we need information on the salaries of some current NBA players:
| Player | Height | Weight | Yrs Expr | 2 Points | 3 Points |
|---|---|---|---|---|---|
| 1 | ◯ ◯ | ◯ ◯ | ◯ ◯ | ◯ ◯ | ◯ ◯ |
| 2 | ◯ ◯ | ◯ ◯ | ◯ ◯ | ◯ ◯ | ◯ ◯ |
| 3 | ◯ ◯ | ◯ ◯ | ◯ ◯ | ◯ ◯ | ◯ ◯ |
| … | … | … | … | … | … |
As usual, we use the symbol $x_i$ to denote the vector of measurements of player $i$'s statistics (e.g. height, weight, etc.); in turn, the salary of the $i$-th player is represented with $y_i$.
Ideally, we assume the existence of some function $f: \mathcal{X} \to y$ (i.e. a function that takes values from the $\mathcal{X}$ space and maps them to a single value $y$). We will refer to this function as the ideal "formula" for salary. Here we are using the word *formula* in a very loose sense, not necessarily in the mathematical sense. We now seek a hypothesized model (which we call $\hat{f}: \mathcal{X} \to y$), selected from some set of candidate functions $h_1, h_2, \dots, h_m$. Our task is to obtain $\hat{f}$ in such a way that we can claim it is a good approximation of the (unknown) function $f$.
## 5\.2 The Idea/Intuition of Regression
Let's go back to our example with NBA players. Recall that $y_i$ denotes the salary for the $i$-th player. For simplicity's sake, let's not worry about inflation. Say we now have a new prospective player from Europe, and we are tasked with predicting their salary, denoted by $y_0$. Let's review a couple of scenarios to get a high-level intuition for this task.
**Scenario 1**. Suppose we have **no** information on this new player. How would we compute $y_0$ (i.e. the salary of this new player)? One possibility is to guesstimate $y_0$ using the historical average salary $\bar{y}$ of NBA players. In other words, we would simply calculate $\hat{y}_0 = \bar{y}$. In this case we are using $\bar{y}$, the *typical* salary (i.e. a measure of center), as a plausible guesstimate for $y_0$. We could also look at the median of the existing salaries if we are concerned about outliers or a skewed distribution of salaries.
**Scenario 2**. Now, suppose we know that this new player will sign with the LA Lakers. Compared to scenario 1, we now have a new bit of information, since we know which team will hire this player. Therefore, we can use this fact to make a more educated guess for $y_0$. How? Instead of using the salaries of all players, we can focus on the salaries of the Lakers' players. We could then use $\hat{y}_0 = \text{avg}(\text{Lakers' salaries})$: that is, the average salary of all Lakers players. It is reasonable to expect that $\hat{y}_0$ is "closer" to the average Lakers salary than to the overall average salary of all NBA players.

Figure 5.1: Average salary by team
**Scenario 3**. Similarly, if we know this new player’s years of experience (e.g. 6 years), we would look at the average of salaries corresponding to players with the same level of experience.

Figure 5.2: Average salary by years of experience
What do the three previous scenarios correspond to? In all of these examples, the prediction is basically a conditional mean:

$$\hat{y}_0 = \text{ave}(y_i \mid x_i = x_0) \tag{5.1}$$
Of course, the previous strategy only makes sense when we have data points $x_i$ that are equal to the query point $x_0$. But what if none of the available $x_i$ values are equal to $x_0$? We'll talk about this later.
The previous hypothetical scenarios illustrate the core idea of regression: we obtain predictions $\hat{y}_0$ using quantities of the form $\text{ave}(y_i \mid x_i = x_0)$, which can be formalized, under some assumptions, into conditional expectations of the form:

$$E(y_i \mid x_{i1}^*, x_{i2}^*, \dots, x_{ip}^*) \longrightarrow \hat{y} \tag{5.2}$$

where $x_{ij}^*$ represents the $i$-th measurement of the $j$-th variable. The above equation is what we call the **regression function**; note that the regression function is nothing more than a conditional expectation!
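To make the conditional-average idea concrete, here is a minimal sketch in Python. The salary data and the `predict_salary` helper are made up for illustration; the fallback to the overall average when no player matches the query is one possible choice (Scenario 1), not the book's prescription.

```python
import numpy as np

# Hypothetical salaries (in millions) and years of experience for 8 players.
years  = np.array([2, 2, 6, 6, 6, 10, 10, 2])
salary = np.array([1.5, 2.0, 6.0, 7.0, 6.5, 12.0, 11.0, 1.0])

def predict_salary(x0):
    """Conditional-mean prediction: average y_i over players with x_i == x0."""
    mask = years == x0
    if not mask.any():
        # No exact matches: fall back to the overall average (Scenario 1).
        return salary.mean()
    return salary[mask].mean()

print(predict_salary(6))   # average salary among players with 6 years of experience
```

For a query with 6 years of experience, the prediction is the average of the three matching salaries, exactly the quantity $\text{ave}(y_i \mid x_i = x_0)$ above.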
## 5\.3 The Linear Regression Model
In a regression model we use one or more features $X$ to say something about the response $Y$. In turn, a **linear regression** model tells us to combine our features in a **linear** way in order to approximate the response. In the univariate case, we have a linear equation:

$$\hat{Y} = b_0 + b_1 X \tag{5.3}$$
In pointwise format, that is, for a given individual $i$, we have:

$$\hat{y}_i = b_0 + b_1 x_i \tag{5.4}$$
In vector notation:

$$\hat{\mathbf{y}} = b_0 + b_1 \mathbf{x} \tag{5.5}$$
To simplify notation, sometimes we prefer to add an auxiliary constant feature in the form of a vector of 1's with $n$ elements, and then use matrix notation with the following elements:

$$\mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \qquad \hat{\mathbf{y}} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} b_0 \\ b_1 \end{bmatrix}$$
In the multidimensional case, when we have $p > 1$ predictors:

$$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & \dots & x_{1p} \\ 1 & x_{21} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \dots & x_{np} \end{bmatrix}, \qquad \hat{\mathbf{y}} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_p \end{bmatrix}$$
With the matrix of features, the response, and the coefficients, we have a compact expression for the predicted outcomes:

$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{b} \tag{5.6}$$
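In code, the compact form $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}$ amounts to prepending a column of 1's to the predictor and performing one matrix-vector product. A minimal sketch with arbitrary numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])          # one predictor, n = 4
X = np.column_stack([np.ones_like(x), x])   # design matrix with constant column
b = np.array([0.5, 2.0])                    # coefficients b0, b1

y_hat = X @ b   # same as b0 + b1 * x, computed for all i at once
print(y_hat)    # [2.5 4.5 6.5 8.5]
```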
In path diagram form, the linear model looks like this:

Figure 5.3: Linear combination with constant term
If we assume that the predictors and the response are mean-centered, then we don't have to worry about the constant term $x_0$:

Figure 5.4: Linear combination without constant term
Obviously the question becomes: how do we obtain the vector of coefficients $\mathbf{b}$?
## 5\.4 The Error Measure
We would like $\hat{y}_i$ to be "as close as" possible to $y_i$. This requires coming up with some type of measure of *closeness*. Among the various functions that we can use to measure how close $\hat{y}_i$ and $y_i$ are, the most common option is the squared distance between such values:

$$d^2(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2 = (\hat{y}_i - y_i)^2 \tag{5.7}$$
Replacing $\hat{y}_i$ with $\mathbf{b}^\mathsf{T} \vec{x}_i$ we have:

$$d^2(y_i, \hat{y}_i) = (\mathbf{b}^\mathsf{T} \vec{x}_i - y_i)^2 \tag{5.8}$$
Notice that $d^2(y_i, \hat{y}_i)$ is a pointwise error measure that we can generically denote as $\text{err}_i$. But we also need to define a global measure of error. This is typically done by adding all the pointwise error measures $\text{err}_i$.
There are two flavors of overall error measures based on squared pointwise differences:

1. the sum of squared errors, or SSE, and
2. the mean squared error, or MSE.
The sum of squared errors, SSE, is defined as:

$$\text{SSE} = \sum_{i=1}^{n} \text{err}_i \tag{5.9}$$
The mean squared error, MSE, is defined as:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \text{err}_i \tag{5.10}$$
As you can tell, $\text{SSE} = n \, \text{MSE}$ and, vice versa, $\text{MSE} = \text{SSE}/n$.
Throughout this book, unless mentioned otherwise, when dealing with regression problems we will consider the MSE as the default overall error function to be minimized (you could also take the SSE instead). Let $e_i$ be:

$$e_i = (y_i - \hat{y}_i) \quad \Rightarrow \quad e_i^2 = (y_i - \hat{y}_i)^2 = \text{err}_i \tag{5.11}$$
Doing some algebra, it's easy to see that:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} e_i^2 = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{b}^\mathsf{T} \vec{x}_i - y_i)^2 = \frac{1}{n} (\mathbf{X}\mathbf{b} - \mathbf{y})^\mathsf{T} (\mathbf{X}\mathbf{b} - \mathbf{y}) = \frac{1}{n} \| \mathbf{X}\mathbf{b} - \mathbf{y} \|^2 = \frac{1}{n} \| \hat{\mathbf{y}} - \mathbf{y} \|^2 \tag{5.12}$$
As you can tell, the mean squared error is proportional to the squared norm of the residual vector $\mathbf{e} = \hat{\mathbf{y}} - \mathbf{y}$:

$$\text{MSE} = \frac{1}{n} \| \mathbf{e} \|^2 = \frac{1}{n} (\hat{\mathbf{y}} - \mathbf{y})^\mathsf{T} (\hat{\mathbf{y}} - \mathbf{y}) \tag{5.13}$$
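Identity (5.12) is easy to check numerically: the elementwise sum of squares, the matrix form, and the relation $\text{SSE} = n\,\text{MSE}$ all agree. The data below are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix
y = rng.normal(size=n)
b = rng.normal(size=p + 1)          # an arbitrary coefficient vector

e = X @ b - y                       # residual vector, e = y_hat - y
mse_sum  = np.sum(e**2) / n         # (1/n) * sum of squared errors
mse_norm = (e @ e) / n              # (1/n) * ||Xb - y||^2 in matrix form
sse      = n * mse_sum              # SSE = n * MSE

assert np.isclose(mse_sum, mse_norm)
```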
## 5\.5 The Least Squares Algorithm
In (ordinary) least squares regression, we want to minimize the mean of squared errors (MSE). This minimization problem involves computing the derivative of the MSE with respect to $\mathbf{b}$. In other words, we compute the gradient of the MSE, denoted $\nabla \text{MSE}(\mathbf{b})$, which is the vector of partial derivatives of the MSE with respect to each parameter $b_0, b_1, \dots, b_p$:

$$\nabla \text{MSE}(\mathbf{b}) = \frac{\partial}{\partial \mathbf{b}} \text{MSE}(\mathbf{b}) = \frac{\partial}{\partial \mathbf{b}} \left( \frac{1}{n} \mathbf{b}^\mathsf{T} \mathbf{X}^\mathsf{T} \mathbf{X} \mathbf{b} - \frac{2}{n} \mathbf{b}^\mathsf{T} \mathbf{X}^\mathsf{T} \mathbf{y} + \frac{1}{n} \mathbf{y}^\mathsf{T} \mathbf{y} \right) \tag{5.14}$$
which becomes:

$$\nabla \text{MSE}(\mathbf{b}) = \frac{2}{n} \mathbf{X}^\mathsf{T} \mathbf{X} \mathbf{b} - \frac{2}{n} \mathbf{X}^\mathsf{T} \mathbf{y} \tag{5.15}$$
Equating to zero, we get:

$$\mathbf{X}^\mathsf{T} \mathbf{X} \mathbf{b} = \mathbf{X}^\mathsf{T} \mathbf{y} \qquad \text{(normal equations)} \tag{5.16}$$
The above equation defines a system that most authors refer to as the *normal* equations. It is a system of $p+1$ equations in $p+1$ unknowns (assuming we have a constant term $b_0$).
If the cross-product matrix $\mathbf{X}^\mathsf{T}\mathbf{X}$ is invertible, which is not a minor assumption, then the vector of regression coefficients $\mathbf{b}$ that we are looking for is given by:

$$\mathbf{b} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1} \mathbf{X}^\mathsf{T} \mathbf{y} \tag{5.17}$$
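Formula (5.17) can be sketched and cross-checked against a library least-squares routine. In practice one solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, which is numerically safer; the simulated data here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 10, size=n)
y = 3.0 + 1.5 * x + rng.normal(scale=0.5, size=n)  # true b0=3.0, b1=1.5 plus noise

X = np.column_stack([np.ones(n), x])

# Normal-equations solution: solve (X^T X) b = X^T y
b = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from a dedicated least-squares routine
b_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(b, b_ref)
print(b)   # roughly [3.0, 1.5]
```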
Having obtained $\mathbf{b}$, we can easily compute the vector of predicted responses:

$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{b} = \mathbf{X} (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1} \mathbf{X}^\mathsf{T} \mathbf{y} \tag{5.18}$$
If we denote $\mathbf{H} = \mathbf{X} (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1} \mathbf{X}^\mathsf{T}$, then the predicted response is:

$$\hat{\mathbf{y}} = \mathbf{H}\mathbf{y} \tag{5.19}$$
This matrix $\mathbf{H}$ is better known as the **hat matrix**, because it puts the hat on the response. More importantly, the matrix $\mathbf{H}$ is an **orthogonal projector**.
From linear algebra, orthogonal projectors have very interesting properties:
- they are symmetric
- they are idempotent
- their eigenvalues are either 0 or 1
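These three properties are easy to verify numerically for a hat matrix built from any full-rank design matrix (random data for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 10, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix

assert np.allclose(H, H.T)             # symmetric
assert np.allclose(H @ H, H)           # idempotent
eigvals = np.sort(np.linalg.eigvalsh(H))
# eigenvalues are 0 or 1 (here: p+1 ones and n-p-1 zeros)
assert np.allclose(eigvals, [0] * (n - p - 1) + [1] * (p + 1))
```

A consequence worth noting: since the eigenvalues are 0 or 1, the trace of $\mathbf{H}$ equals the number of fitted parameters, $p+1$.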
## 5\.6 Geometries of OLS
Now that we’ve seen the algebra, it’s time to look at the geometric interpretation of all the action that is going on within linear regression via OLS. We will discuss three geometric perspectives:
1. OLS from the individuals point of view (i.e. rows of the data matrix).
2. OLS from the variables point of view (i.e. columns of the data matrix).
3. OLS from the parameters point of view, and the error surface.
### 5\.6.1 Rows Perspective
This is probably the most popular perspective covered in most textbooks.
For illustration purposes, let's assume that our data has just $p = 1$ predictor. In other words, we have the response $Y$ and one predictor $X$. We can depict individuals as points in this space:

Figure 5.5: Scatterplot of individuals
In linear regression, we want to predict $y_i$ by linearly mixing the inputs: $\hat{y}_i = b_0 + b_1 x_i$. In two dimensions, the fitted model corresponds to a line. In three dimensions, it corresponds to a plane. And in higher dimensions, it corresponds to a hyperplane.

Figure 5.6: Scatterplot with regression line
With a fitted line, we obtain predicted values $\hat{y}_i$. Some predicted values may be equal to the observed values; others will be greater than the observed values; and some will be smaller than the observed values.

Figure 5.7: Observed values and predicted values
As you can imagine, given a set of data points, you can (in general) fit an infinite number of lines. So which line are we looking for? We want the line that minimizes the squares of the errors $e_i = \hat{y}_i - y_i$. In the figure below, these errors (also known as *residuals*) are represented by the vertical differences between the observed values $y_i$ and the predicted values $\hat{y}_i$.

Figure 5.8: OLS focuses on minimizing the squared errors
Combining all residuals, we want to obtain parameters $b_0, \dots, b_p$ that minimize the squared norm of the vector of residuals:

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \sum_{i=1}^{n} (b_0 + b_1 x_i - y_i)^2 \tag{5.20}$$
In vector-matrix form we have:

$$\| \mathbf{e} \|^2 = \| \hat{\mathbf{y}} - \mathbf{y} \|^2 = \| \mathbf{X}\mathbf{b} - \mathbf{y} \|^2 = (\mathbf{X}\mathbf{b} - \mathbf{y})^\mathsf{T} (\mathbf{X}\mathbf{b} - \mathbf{y}) \propto \text{MSE} \tag{5.21}$$

As you can tell, minimizing the squared norm of the vector of residuals is equivalent to minimizing the mean squared error.
### 5\.6.2 Columns Perspective
We can also look at the geometry of OLS from the columns perspective. This is less common than the rows perspective, but still very enlightening.
Imagine the variables, both the response and the predictors, as vectors in an $n$-dimensional space. In this space, the $X$ variables span some subspace $S_X$. This subspace is not supposed to contain the response, unless $Y$ happens to be a linear combination of $X_1, \dots, X_p$.

Figure 5.9: Features and Response view in n-dim space
What are we looking for? We're looking for a linear combination $\mathbf{X}\mathbf{b}$ that gives us a good approximation to $\mathbf{y}$. As you can tell, there is an infinite number of linear combinations that can be formed with $X_1, \dots, X_p$.

Figure 5.10: Linear combination of features
The mix of features that we are interested in, $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}$, is the one that gives us the closest approximation to $\mathbf{y}$.

Figure 5.11: Linear combination to be as close as possible to response
Now, what do we mean by *closest approximation*? How do we determine the closeness between $\hat{\mathbf{y}}$ and $\mathbf{y}$? By looking at the difference, which results in a vector $\mathbf{e} = \hat{\mathbf{y}} - \mathbf{y}$, and then measuring the size, or *norm*, of this vector (the squared norm, to be precise). In other words, we want to obtain $\hat{\mathbf{y}}$ such that the squared norm $\| \mathbf{e} \|^2$ is as small as possible:

$$\text{Minimize} \quad \| \mathbf{e} \|^2 = \| \hat{\mathbf{y}} - \mathbf{y} \|^2 \tag{5.22}$$

Minimizing the squared norm of $\mathbf{e}$ amounts to minimizing the mean squared error.
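The columns perspective can be confirmed in code: the OLS residual $\mathbf{e} = \hat{\mathbf{y}} - \mathbf{y}$ is orthogonal to every column of $\mathbf{X}$, which is exactly what makes $\hat{\mathbf{y}}$ the closest vector to $\mathbf{y}$ within the span of the features. Random data for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)   # OLS coefficients
y_hat = X @ b
e = y_hat - y

# Residual is orthogonal to the column space of X: X^T e = 0
assert np.allclose(X.T @ e, 0)

# Any other linear combination of the columns is farther from y
b_other = b + np.array([0.1, -0.2, 0.3])
assert np.linalg.norm(X @ b_other - y) > np.linalg.norm(e)
```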
### 5\.6.3 Parameters Perspective
In addition to the two previously discussed perspectives (rows and columns), we can also visualize the regression problem from the point of view of the parameters $\mathbf{b}$ and the error surface defined by the MSE. This is the least common perspective in the general linear regression literature, although it is not that uncommon within the Statistical Learning literature.
For illustration purposes, assume that we have only two predictors $X_1$ and $X_2$. Recall that the mean squared error (MSE) is:

$$E(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{n} \left( \mathbf{b}^\mathsf{T} \mathbf{X}^\mathsf{T} \mathbf{X} \mathbf{b} - 2 \mathbf{b}^\mathsf{T} \mathbf{X}^\mathsf{T} \mathbf{y} + \mathbf{y}^\mathsf{T} \mathbf{y} \right) \tag{5.23}$$
Now, from the point of view of $\mathbf{b} = (b_1, b_2)$, we can classify the order of each term:

$$E(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{n} \Big( \underbrace{\mathbf{b}^\mathsf{T} \mathbf{X}^\mathsf{T} \mathbf{X} \mathbf{b}}_{\text{quadratic form}} - \underbrace{2 \mathbf{b}^\mathsf{T} \mathbf{X}^\mathsf{T} \mathbf{y}}_{\text{linear}} + \underbrace{\mathbf{y}^\mathsf{T} \mathbf{y}}_{\text{constant}} \Big) \tag{5.24}$$
Since $\mathbf{X}^\mathsf{T}\mathbf{X}$ is positive semidefinite, we know that $\mathbf{b}^\mathsf{T} \mathbf{X}^\mathsf{T} \mathbf{X} \mathbf{b} \geq 0$. Furthermore, we know from vector calculus that the error surface is a paraboloid (a bowl-shaped surface) in the $(E, b_1, b_2)$ space. The following diagram depicts this situation.

Figure 5.12: Error Surface
Imagine that we get horizontal slices of the error surface. For any of those slices, we can project them onto the plane spanned by the parameters b1 b 1 and b2 b 2. The resulting projections will be like a topographic map, with error contours on this plane. In general, those contours will be ellipses.

Figure 5.13: Error Surface with slices, and their projections
Quadratic error surfaces like this have a minimum value, so we are guaranteed the existence of a point $\mathbf{b}^* = (b_1^*, b_2^*)$ at which the MSE is minimized. This is a powerful result! Consider, for example, a [parabolic cylinder](http://mathworld.wolfram.com/ParabolicCylinder.html). Such a shape has no unique minimum; rather, it has an infinite number of minimizing points, all lying on a line. The point being: when $\mathbf{X}^\mathsf{T}\mathbf{X}$ is positive definite (i.e. invertible), we **never** have this latter case.
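We can probe the bowl-shaped error surface numerically: at the OLS solution $\mathbf{b}^*$ the gradient vanishes, and the MSE at any perturbed coefficient vector is never smaller. Random data for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

def mse(b):
    """Mean squared error E(y, y_hat) as a function of the coefficients."""
    e = X @ b - y
    return (e @ e) / n

b_star = np.linalg.solve(X.T @ X, X.T @ y)   # OLS minimizer

# Moving away from b_star in random directions never decreases the error
for _ in range(100):
    b = b_star + rng.normal(size=3)
    assert mse(b) >= mse(b_star)
```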

Figure 5.14: Error Surface with contour errors and the minimum
The minimum of the error surface occurs at the point $(b_1^*, b_2^*)$. This is precisely the OLS solution.