🕷️ Crawler Inspector

URL Lookup

Direct Parameter Lookup

Raw Queries and Responses

1. Shard Calculation

Query:
Response:
Calculated Shard: 143 (from laksa038)

2. Crawled Status Check

Query:
Response:

3. Robots.txt Check

Query:
Response:

4. Spam/Ban Check

Query:
Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄 INDEXABLE · CRAWLED 2 months ago · 🤖 ROBOTS ALLOWED

Page Info Filters

| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | `download_http_code = 200` | HTTP 200 |
| Age cutoff | PASS | `download_stamp > now() - 6 MONTH` | 2.1 months ago |
| History drop | PASS | `isNull(history_drop_reason)` | No drop reason |
| Spam/ban | PASS | `fh_dont_index != 1 AND ml_spam_score = 0` | ml_spam_score=0 |
| Canonical | PASS | `meta_canonical IS NULL OR = '' OR = src_unparsed` | Not set |
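The pass/fail logic in the table can be sketched in Python. The field names (`download_http_code`, `download_stamp`, `history_drop_reason`, `fh_dont_index`, `ml_spam_score`, `meta_canonical`) are taken directly from the conditions shown above; the record values and the `check_filters` helper are illustrative, not the inspector's actual implementation:

```python
from datetime import datetime, timedelta

def check_filters(page, src_unparsed, now=None):
    """Evaluate the five indexability filters; returns {filter_name: passed}."""
    now = now or datetime.now()
    return {
        "http_status": page["download_http_code"] == 200,
        # approximating "6 MONTH" as 180 days
        "age_cutoff": page["download_stamp"] > now - timedelta(days=180),
        "history_drop": page["history_drop_reason"] is None,
        "spam_ban": page["fh_dont_index"] != 1 and page["ml_spam_score"] == 0,
        "canonical": page["meta_canonical"] in (None, "", src_unparsed),
    }

# Values mirroring the Page Details shown on this report
page = {
    "download_http_code": 200,
    "download_stamp": datetime(2026, 2, 13, 6, 7, 37),
    "history_drop_reason": None,
    "fh_dont_index": 0,
    "ml_spam_score": 0,
    "meta_canonical": None,
}
results = check_filters(page, "https://allmodelsarewrong.github.io/ols.html",
                        now=datetime(2026, 4, 16))
print(all(results.values()))  # every filter passes, so the page is indexable
```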

Page Details

| Property | Value |
|---|---|
| URL | https://allmodelsarewrong.github.io/ols.html |
| Last Crawled | 2026-02-13 06:07:37 (2 months ago) |
| First Indexed | not set |
| HTTP Status Code | 200 |
| Meta Title | 5 Linear Regression \| All Models Are Wrong: Concepts of Statistical Learning |
| Meta Description | This is a text about the fundamental concepts of Statistical Learning Methods. |
| Meta Canonical | null |
Boilerpipe Text
Before entering supervised learning territory, we want to discuss the general framework of linear regression. We will introduce this topic from a pure model-fitting point of view. In other words, we will postpone the learning aspect (the prediction of new data) until after the chapter on the theory of learning. The reason to cover linear regression in this way is to have something to work with when we start talking about the theory of supervised learning.

Motivation

Consider, again, the NBA dataset example from previous chapters. Suppose we want to use this data to predict the salary of NBA players, in terms of certain variables such as the player's team, height, weight, position, years of professional experience, number of 2-point shots, number of 3-point shots, number of blocks, etc. Of course, we need information on the salaries of some current NBA players.

(Table: rows of players with their measured variables and salaries.)

As usual, we use the symbol $\mathbf{x}_i$ to denote the vector of measurements of player $i$'s statistics (e.g. height, weight, etc.); in turn, the salary of the $i$-th player is represented by $y_i$. Ideally, we assume the existence of some function $f: \mathcal{X} \to y$ (i.e. a function that takes values from the $\mathcal{X}$ space and maps them to a single value $y$). We will refer to this function as the ideal "formula" for salary. Here we are using the word formula in a very loose sense, not necessarily in the mathematical sense. We now seek a hypothesized model (which we call $\hat{f}: \mathcal{X} \to y$), selected from some set of candidate functions $h_1, h_2, \dots, h_m$. Our task is to obtain $\hat{f}$ in a way that lets us claim it is a good approximation of the (unknown) function $f$.

The Idea/Intuition of Regression

Let's go back to our example with NBA players. Recall that $y_i$ denotes the salary of the $i$-th player.
For simplicity's sake, let's not worry about inflation. Say we now have a new prospective player from Europe, and we are tasked with predicting their salary, denoted by $y_0$. Let's review a couple of scenarios to get a high-level intuition for this task.

Scenario 1. Suppose we have no information on this new player. How would we compute $y_0$ (i.e. the salary of this new player)? One possibility is to guesstimate $y_0$ using the historical average salary $\bar{y}$ of NBA players. In other words, we would simply calculate $\hat{y}_0 = \bar{y}$. In this case we are using $\bar{y}$, a typical score (i.e. a measure of center), as a plausible guesstimate for $y_0$. We could also use the median of the existing salaries if we are concerned about outliers or a skewed distribution of salaries.

Scenario 2. Now suppose we know that this new player will sign with the LA Lakers. Compared to scenario 1, we have a new bit of information, since we know which team will hire this player. Therefore, we can use this fact to make a more educated guess for $y_0$. How? Instead of using the salaries of all players, we can focus on the salaries of Lakers players and use $\hat{y}_0 = \text{avg}(\text{Lakers' salaries})$: that is, the average salary of all Lakers players. It is reasonable to expect $\hat{y}_0$ to be "closer" to the average Lakers salary than to the overall average salary of all NBA players.

Figure 5.1: Average salary by team

Scenario 3. Similarly, if we know this new player's years of experience (e.g. 6 years), we would look at the average of the salaries of players with the same level of experience.

Figure 5.2: Average salary by years of experience

What do the three previous scenarios correspond to?
In all of these examples, the prediction is basically a conditional mean:

$$\hat{y}_0 = \text{ave}(y_i \mid x_i = x_0) \tag{5.1}$$

Of course, the previous strategy only makes sense when we have data points $x_i$ that are equal to the query point $x_0$. But what if none of the available $x_i$ values are equal to $x_0$? We'll talk about this later. The previous hypothetical scenarios illustrate the core idea of regression: we obtain predictions $\hat{y}_0$ using quantities of the form $\text{ave}(y_i \mid x_i = x_0)$, which can be formalized, under some assumptions, into conditional expectations of the form:

$$E(y_i \mid x^*_{i1}, x^*_{i2}, \dots, x^*_{ip}) \longrightarrow \hat{y} \tag{5.2}$$

where $x^*_{ij}$ represents the $i$-th measurement of the $j$-th variable. The above equation is what we call the regression function; note that the regression function is nothing more than a conditional expectation!

The Linear Regression Model

In a regression model we use one or more features $X$ to say something about the response $Y$.
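Equation (5.1) translates directly into code. A minimal sketch, using made-up team and salary data, of predicting a newcomer's salary via the conditional average $\text{ave}(y_i \mid x_i = x_0)$:

```python
import numpy as np

# Toy data: team label x_i and salary y_i in millions (values are made up)
teams    = np.array(["LAL", "BOS", "LAL", "GSW", "LAL", "BOS"])
salaries = np.array([30.0, 12.0, 24.0, 18.0, 21.0, 15.0])

def conditional_mean(x0, x, y):
    """Predict y0 as the average of y_i over points with x_i = x0 (eq 5.1)."""
    mask = (x == x0)
    if not mask.any():
        # No data points match the query point, so eq (5.1) is undefined here;
        # fall back to Scenario 1: the overall average
        return y.mean()
    return y[mask].mean()

print(conditional_mean("LAL", teams, salaries))  # 25.0 = (30 + 24 + 21) / 3
```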
In turn, a linear regression model tells us to combine our features in a linear way in order to approximate the response. In the univariate case, we have a linear equation:

$$\hat{Y} = b_0 + b_1 X \tag{5.3}$$

In pointwise format, that is, for a given individual $i$:

$$\hat{y}_i = b_0 + b_1 x_i \tag{5.4}$$

In vector notation:

$$\hat{\mathbf{y}} = b_0 + b_1 \mathbf{x} \tag{5.5}$$

To simplify notation, we sometimes prefer to add an auxiliary constant feature in the form of a vector of 1's with $n$ elements, and then use matrix notation with the following elements:

$$\mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad
\hat{\mathbf{y}} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{bmatrix}, \quad
\mathbf{b} = \begin{bmatrix} b_0 \\ b_1 \end{bmatrix}$$

In the multidimensional case, when we have $p > 1$ predictors:

$$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & \dots & x_{1p} \\ 1 & x_{21} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \dots & x_{np} \end{bmatrix}, \quad
\hat{\mathbf{y}} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{bmatrix}, \quad
\mathbf{b} = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_p \end{bmatrix}$$

With the matrix of features, the response, and the coefficients, we have a compact expression for the predicted outcomes:

$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{b} \tag{5.6}$$

In path diagram form, the linear model looks like this:

Figure 5.3: Linear combination with constant term

If we assume that the predictors and the response are mean-centered, then we don't have to worry about the constant term $x_0$:

Figure 5.4: Linear combination without constant term

Obviously, the question becomes: how do we obtain the vector of coefficients $\mathbf{b}$?

The Error Measure

We would like $\hat{y}_i$ to be "as close as" possible to $y_i$. This requires coming up with some type of measure of closeness.
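The matrix notation of equation (5.6) maps one-to-one onto array code. A small sketch, with arbitrary numbers, of building $\mathbf{X}$ with the auxiliary column of 1's and computing $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}$:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])          # one predictor, n = 4
X = np.column_stack([np.ones_like(x), x])   # prepend the constant feature (vector of 1's)
b = np.array([0.5, 2.0])                    # b0 (intercept) and b1 (slope)

y_hat = X @ b                               # eq (5.6): predicted outcomes
print(y_hat)                                # [2.5 4.5 6.5 8.5]
```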
Among the various functions we could use to measure how close $\hat{y}_i$ and $y_i$ are, the most common option is the squared distance between the two values:

$$d^2(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2 = (\hat{y}_i - y_i)^2 \tag{5.7}$$

Replacing $\hat{y}_i$ with $\mathbf{b}^\mathsf{T} \vec{x}_i$ we have:

$$d^2(y_i, \hat{y}_i) = (\mathbf{b}^\mathsf{T} \vec{x}_i - y_i)^2 \tag{5.8}$$

Notice that $d^2(y_i, \hat{y}_i)$ is a pointwise error measure, which we generally denote as $\text{err}_i$. But we also need to define a global measure of error, typically obtained by adding all the pointwise error measures $\text{err}_i$. There are two flavors of overall error measures based on squared pointwise differences: the sum of squared errors, $\text{SSE}$, and the mean squared error, $\text{MSE}$.

The sum of squared errors is defined as:

$$\text{SSE} = \sum_{i=1}^{n} \text{err}_i \tag{5.9}$$

The mean squared error is defined as:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \text{err}_i \tag{5.10}$$

As you can tell, $\text{SSE} = n\,\text{MSE}$ and, vice versa, $\text{MSE} = \text{SSE}/n$. Throughout this book, unless mentioned otherwise, when dealing with regression problems we will consider the MSE as the default overall error function to be minimized (you could also take the SSE instead).
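A quick numeric sketch of equations (5.9) and (5.10), with arbitrary observed and predicted values, confirming that $\text{SSE} = n\,\text{MSE}$:

```python
import numpy as np

y     = np.array([3.0, 5.0, 7.0, 9.0])   # observed values (made up)
y_hat = np.array([2.5, 5.5, 6.0, 9.5])   # predicted values (made up)

err = (y - y_hat) ** 2       # pointwise squared errors, eq (5.7)
sse = err.sum()              # eq (5.9)
mse = err.mean()             # eq (5.10)

print(sse, mse)              # 1.75 0.4375
assert np.isclose(sse, len(y) * mse)   # SSE = n * MSE
```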
Let $e_i$ be:

$$e_i = (y_i - \hat{y}_i) \;\Rightarrow\; e_i^2 = (y_i - \hat{y}_i)^2 = \text{err}_i \tag{5.11}$$

Doing some algebra, it's easy to see that:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} e_i^2
= \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2
= \frac{1}{n}\sum_{i=1}^{n} (\mathbf{b}^\mathsf{T}\vec{x}_i - y_i)^2
= \frac{1}{n}(\mathbf{X}\mathbf{b} - \mathbf{y})^\mathsf{T}(\mathbf{X}\mathbf{b} - \mathbf{y})
= \frac{1}{n}\|\mathbf{X}\mathbf{b} - \mathbf{y}\|^2
= \frac{1}{n}\|\hat{\mathbf{y}} - \mathbf{y}\|^2 \tag{5.12}$$

As you can tell, the Mean Squared Error is proportional to the squared norm of the residual vector $\mathbf{e} = \hat{\mathbf{y}} - \mathbf{y}$:

$$\text{MSE} = \frac{1}{n}\|\mathbf{e}\|^2 = \frac{1}{n}(\hat{\mathbf{y}} - \mathbf{y})^\mathsf{T}(\hat{\mathbf{y}} - \mathbf{y}) \tag{5.13}$$

The Least Squares Algorithm

In (ordinary) least squares regression, we want to minimize the mean of squared errors (MSE). This minimization problem involves computing the derivative of the MSE with respect to $\mathbf{b}$. In other words, we compute the gradient of the MSE, denoted $\nabla \text{MSE}(\mathbf{b})$, which is the vector of partial derivatives of the MSE with respect to each parameter $b_0, b_1, \dots, b_p$:

$$\nabla \text{MSE}(\mathbf{b}) = \frac{\partial}{\partial \mathbf{b}} \text{MSE}(\mathbf{b})
= \frac{\partial}{\partial \mathbf{b}} \left( \frac{1}{n}\mathbf{b}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{X}\mathbf{b} - \frac{2}{n}\mathbf{b}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{y} + \frac{1}{n}\mathbf{y}^\mathsf{T}\mathbf{y} \right) \tag{5.14}$$

which becomes:

$$\nabla \text{MSE}(\mathbf{b}) = \frac{2}{n}\mathbf{X}^\mathsf{T}\mathbf{X}\mathbf{b} - \frac{2}{n}\mathbf{X}^\mathsf{T}\mathbf{y} \tag{5.15}$$

Setting this equal to zero, we get:

$$\mathbf{X}^\mathsf{T}\mathbf{X}\mathbf{b} = \mathbf{X}^\mathsf{T}\mathbf{y} \quad (\text{normal equations}) \tag{5.16}$$

The above equation defines the system that most authors refer to as the normal equations: a system of $p + 1$ equations with $p + 1$ unknowns (assuming we have a constant term $b_0$).
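The closed-form gradient in (5.15) can be checked numerically. A sketch, on random data, comparing $\frac{2}{n}\mathbf{X}^\mathsf{T}\mathbf{X}\mathbf{b} - \frac{2}{n}\mathbf{X}^\mathsf{T}\mathbf{y}$ against central finite differences of the MSE:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # with constant column
y = rng.normal(size=n)
b = rng.normal(size=p + 1)                                  # arbitrary coefficients

mse = lambda b: np.mean((X @ b - y) ** 2)

# Analytic gradient, eq (5.15)
grad = (2 / n) * X.T @ X @ b - (2 / n) * X.T @ y

# Central finite differences of MSE(b), one coordinate at a time
eps = 1e-6
num = np.array([(mse(b + eps * e) - mse(b - eps * e)) / (2 * eps)
                for e in np.eye(p + 1)])

print(np.allclose(grad, num, atol=1e-5))  # the two gradients agree
```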
If the cross-product matrix $\mathbf{X}^\mathsf{T}\mathbf{X}$ is invertible, which is not a minor assumption, then the vector of regression coefficients $\mathbf{b}$ that we are looking for is given by:

$$\mathbf{b} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\mathbf{y} \tag{5.17}$$

Having obtained $\mathbf{b}$, we can easily compute the predicted response vector:

$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{b} = \mathbf{X}(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\mathbf{y} \tag{5.18}$$

If we denote $\mathbf{H} = \mathbf{X}(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}$, then the predicted response is:

$$\hat{\mathbf{y}} = \mathbf{H}\mathbf{y} \tag{5.19}$$

The matrix $\mathbf{H}$ is better known as the hat matrix, because it puts the hat on the response. More importantly, $\mathbf{H}$ is an orthogonal projector. From linear algebra, orthogonal projectors have very interesting properties:

- they are symmetric
- they are idempotent
- their eigenvalues are either 0 or 1

Geometries of OLS

Now that we've seen the algebra, it's time to look at the geometric interpretation of everything going on within linear regression via OLS. We will discuss three geometric perspectives:

- OLS from the individuals point of view (i.e. rows of the data matrix).
- OLS from the variables point of view (i.e. columns of the data matrix).
- OLS from the parameters point of view, and the error surface.

Rows Perspective

This is probably the most popular perspective, covered in most textbooks. For illustration purposes, let's assume that our data has just $p = 1$ predictor. In other words, we have the response $Y$ and one predictor $X$. We can depict individuals as points in this space:

Figure 5.5: Scatterplot of individuals

In linear regression, we want to predict $y_i$ by linearly mixing the inputs: $\hat{y}_i = b_0 + b_1 x_i$. In two dimensions, the fitted model corresponds to a line; in three dimensions it would correspond to a plane; and in higher dimensions, to a hyperplane.

Figure 5.6: Scatterplot with regression line

With a fitted line, we obtain predicted values $\hat{y}_i$.
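Equations (5.17) through (5.19), along with the projector properties of $\mathbf{H}$, can be verified in a few lines. A sketch on random data (note that solving the normal equations with `np.linalg.solve` is preferable in practice to explicitly inverting $\mathbf{X}^\mathsf{T}\mathbf{X}$):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 2.0 + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=n)   # true b0=2, b1=3 plus noise

# eq (5.17): b = (X^T X)^{-1} X^T y, via the normal equations (5.16)
b = np.linalg.solve(X.T @ X, X.T @ y)

# eq (5.19): hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y                             # same as X @ b

# H is an orthogonal projector:
print(np.allclose(H, H.T))                # symmetric
print(np.allclose(H @ H, H))              # idempotent
eigvals = np.linalg.eigvalsh(H)           # eigenvalues are all 0 or 1;
print(np.allclose(np.sort(eigvals)[-2:], 1.0))  # here exactly rank(X) = 2 are 1
```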
Some predicted values may be equal to the observed values; other predicted values will be greater than the observed values; and some will be smaller.

Figure 5.7: Observed values and predicted values

As you can imagine, given a set of data points, you can (in general) fit an infinite number of lines. So which line are we looking for? We want the line that minimizes the squares of the errors $e_i = \hat{y}_i - y_i$. In the figure below, these errors (also known as residuals) are represented by the vertical differences between the observed values $y_i$ and the predicted values $\hat{y}_i$.

Figure 5.8: OLS focuses on minimizing the squared errors

Combining all residuals, we want to obtain parameters $b_0, \dots, b_p$ that minimize the squared norm of the vector of residuals:

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \sum_{i=1}^{n} (b_0 + b_1 x_i - y_i)^2 \tag{5.20}$$

In vector-matrix form we have:

$$\|\mathbf{e}\|^2 = \|\hat{\mathbf{y}} - \mathbf{y}\|^2 = \|\mathbf{X}\mathbf{b} - \mathbf{y}\|^2 = (\mathbf{X}\mathbf{b} - \mathbf{y})^\mathsf{T}(\mathbf{X}\mathbf{b} - \mathbf{y}) \propto \text{MSE} \tag{5.21}$$

As you can tell, minimizing the squared norm of the vector of residuals is equivalent to minimizing the mean squared error.

Columns Perspective

We can also look at the geometry of OLS from the columns perspective. This is less common than the rows perspective, but still very enlightening. Imagine the variables, both the response and the predictors, living in an $n$-dimensional space. In this space, the $X$ variables span some subspace $S_X$. This subspace is not expected to contain the response, unless $Y$ happens to be a linear combination of $X_1, \dots, X_p$.

Figure 5.9: Features and Response view in n-dim space

What are we looking for? We're looking for a linear combination $\mathbf{X}\mathbf{b}$ that gives us a good approximation to $\mathbf{y}$.
As you can tell, there's an infinite number of linear combinations that can be formed with $X_1, \dots, X_p$.

Figure 5.10: Linear combination of features

The mix of features that we are interested in, $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}$, is the one that gives us the closest approximation to $\mathbf{y}$.

Figure 5.11: Linear combination to be as close as possible to response

Now, what do we mean by closest approximation? How do we determine the closeness between $\hat{\mathbf{y}}$ and $\mathbf{y}$? By looking at the difference, which results in a vector $\mathbf{e} = \hat{\mathbf{y}} - \mathbf{y}$, and then measuring the size, or norm, of this vector (the squared norm, to be precise). In other words, we want to obtain $\hat{\mathbf{y}}$ such that the squared norm $\|\mathbf{e}\|^2$ is as small as possible:

$$\text{Minimize } \|\mathbf{e}\|^2 = \|\hat{\mathbf{y}} - \mathbf{y}\|^2 \tag{5.22}$$

Minimizing the squared norm of $\mathbf{e}$ amounts to minimizing the mean squared error.

Parameters Perspective

In addition to the two previously discussed perspectives (rows and columns), we can also visualize the regression problem from the point of view of the parameters $\mathbf{b}$ and the error surface given by the MSE. This is the least common perspective in the general linear regression literature, though it is not that uncommon within the Statistical Learning literature. For illustration purposes, assume that we have only two predictors, $X_1$ and $X_2$.
Recall that the Mean Squared Error (MSE) is:

$$E(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{n}\left( \mathbf{b}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{X}\mathbf{b} - 2\mathbf{b}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{y} + \mathbf{y}^\mathsf{T}\mathbf{y} \right) \tag{5.23}$$

Now, from the point of view of $\mathbf{b} = (b_1, b_2)$, we can classify the order of each term:

$$E(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{n}\Big( \underbrace{\mathbf{b}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{X}\mathbf{b}}_{\text{quadratic form}} - \underbrace{2\mathbf{b}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{y}}_{\text{linear}} + \underbrace{\mathbf{y}^\mathsf{T}\mathbf{y}}_{\text{constant}} \Big) \tag{5.24}$$

Since $\mathbf{X}^\mathsf{T}\mathbf{X}$ is positive semidefinite, we know that $\mathbf{b}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{X}\mathbf{b} \geq 0$. Furthermore, we know from vector calculus that the error surface will be a paraboloid (a bowl-shaped surface) in the $(E, b_1, b_2)$ space. The following diagram depicts this situation.

Figure 5.12: Error Surface

Imagine taking horizontal slices of the error surface. We can project any of those slices onto the plane spanned by the parameters $b_1$ and $b_2$. The resulting projections form something like a topographic map, with error contours on this plane. In general, those contours will be ellipses.

Figure 5.13: Error Surface with slices, and their projections

Quadratic error surfaces like this have a minimum value, so we are guaranteed the existence of $\mathbf{b}^* = (b_1^*, b_2^*)$ such that the error is minimized. This is a powerful result! Consider, for example, a parabolic cylinder: such a shape has no unique minimum; rather, it has an infinite number of minimizing points, all lying on a line. The point being: when $\mathbf{X}^\mathsf{T}\mathbf{X}$ is positive definite (i.e. invertible), we never have this latter case.

Figure 5.14: Error Surface with contour errors and the minimum

The minimum of the error surface occurs at the point $(b_1^*, b_2^*)$. This is precisely the OLS solution.
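The bowl shape described by (5.24) can be probed numerically by evaluating the MSE over a grid of $(b_1, b_2)$ values; the grid minimum should land near the exact OLS solution. A sketch on random, mean-centered data with two predictors and no intercept, as in the illustration above:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 2))                  # two (approximately) centered predictors
y = X @ np.array([1.5, -0.5]) + rng.normal(scale=0.1, size=n)

mse = lambda b1, b2: np.mean((X @ np.array([b1, b2]) - y) ** 2)

# Evaluate the error surface E(b1, b2) over a grid of parameter values
b1s = np.linspace(-1, 3, 81)
b2s = np.linspace(-2, 1, 61)
E = np.array([[mse(b1, b2) for b2 in b2s] for b1 in b1s])

i, j = np.unravel_index(E.argmin(), E.shape)  # location of the grid minimum
b_star = np.linalg.solve(X.T @ X, X.T @ y)    # exact OLS minimum, eq (5.17)
print(b1s[i], b2s[j])                         # grid minimum, close to b_star
print(b_star)                                 # close to the true (1.5, -0.5)
```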
Line](https://allmodelsarewrong.github.io/logistic.html#first-approach-fitting-a-line) - [**25\.1.2** Second Approach: Harsh Thresholding](https://allmodelsarewrong.github.io/logistic.html#second-approach-harsh-thresholding) - [**25\.1.3** Third Approach: Conditional Means](https://allmodelsarewrong.github.io/logistic.html#third-approach-conditional-means) - [**25\.2** Logistic Regression Model](https://allmodelsarewrong.github.io/logistic.html#logistic-regression-model) - [**25\.2.1** The Criterion Being Optimized](https://allmodelsarewrong.github.io/logistic.html#the-criterion-being-optimized) - [**25\.2.2** Another Way to Solve Logistic Regression](https://allmodelsarewrong.github.io/logistic.html#another-way-to-solve-logistic-regression) - [**26** Preamble for Discriminant Analysis](https://allmodelsarewrong.github.io/discrim.html "26 Preamble for Discriminant Analysis") - [**26\.1** Motivation](https://allmodelsarewrong.github.io/discrim.html#motivation-2) - [**26\.1.1** Distinguishing Species](https://allmodelsarewrong.github.io/discrim.html#distinguishing-species) - [**26\.1.2** Sum of Squares Decomposition](https://allmodelsarewrong.github.io/discrim.html#sum-of-squares-decomposition) - [**26\.2** Derived Ratios from Sum-of-Squares](https://allmodelsarewrong.github.io/discrim.html#derived-ratios-from-sum-of-squares) - [**26\.2.1** Correlation Ratio](https://allmodelsarewrong.github.io/discrim.html#correlation-ratio) - [**26\.2.2** F-Ratio](https://allmodelsarewrong.github.io/discrim.html#f-ratio) - [**26\.3** Geometric Perspective](https://allmodelsarewrong.github.io/discrim.html#geometric-perspective) - [**26\.3.1** Clouds from Class Structure](https://allmodelsarewrong.github.io/discrim.html#clouds-from-class-structure) - [**26\.3.2** Dispersion Decomposition](https://allmodelsarewrong.github.io/discrim.html#dispersion-decomposition) - [**27** Canonical Discriminant Analysis](https://allmodelsarewrong.github.io/cda.html) - [**27\.1** CDA: Semi-Supervised 
Aspect](https://allmodelsarewrong.github.io/cda.html#cda-semi-supervised-aspect) - [**27\.2** Looking for a discriminant axis](https://allmodelsarewrong.github.io/cda.html#looking-for-a-discriminant-axis) - [**27\.3** Looking for a Compromise Criterion](https://allmodelsarewrong.github.io/cda.html#looking-for-a-compromise-criterion) - [**27\.3.1** Correlation Ratio Criterion](https://allmodelsarewrong.github.io/cda.html#correlation-ratio-criterion) - [**27\.3.2** F-ratio Criterion](https://allmodelsarewrong.github.io/cda.html#f-ratio-criterion) - [**27\.3.3** A Special PCA](https://allmodelsarewrong.github.io/cda.html#a-special-pca) - [**27\.4** CDA: Supervised Aspect](https://allmodelsarewrong.github.io/cda.html#cda-supervised-aspect) - [**27\.4.1** Distance behind CDA](https://allmodelsarewrong.github.io/cda.html#distance-behind-cda) - [**27\.4.2** Predictive Idea](https://allmodelsarewrong.github.io/cda.html#predictive-idea) - [**27\.4.3** CDA Classifier](https://allmodelsarewrong.github.io/cda.html#cda-classifier) - [**27\.4.4** Limitations of CDA classifier](https://allmodelsarewrong.github.io/cda.html#limitations-of-cda-classifier) - [**28** Discriminant Analysis](https://allmodelsarewrong.github.io/discanalysis.html) - [**28\.1** Probabilistic DA](https://allmodelsarewrong.github.io/discanalysis.html#probabilistic-da) - [**28\.1.1** Normal Distributions](https://allmodelsarewrong.github.io/discanalysis.html#normal-distributions) - [**28\.1.2** Estimating Parameters of Normal Distributions](https://allmodelsarewrong.github.io/discanalysis.html#estimating-parameters-of-normal-distributions) - [**28\.2** Discriminant Functions](https://allmodelsarewrong.github.io/discanalysis.html#discriminant-functions) - [**28\.3** Quadratic Discriminant Analysis (QDA)](https://allmodelsarewrong.github.io/discanalysis.html#quadratic-discriminant-analysis-qda) - [**28\.4** Linear Discriminant 
Analysis](https://allmodelsarewrong.github.io/discanalysis.html#linear-discriminant-analysis) - [**28\.4.1** Canonical Discriminant Analysis](https://allmodelsarewrong.github.io/discanalysis.html#canonical-discriminant-analysis) - [**28\.4.2** Naive Bayes](https://allmodelsarewrong.github.io/discanalysis.html#naive-bayes) - [**28\.4.3** Fifth Case](https://allmodelsarewrong.github.io/discanalysis.html#fifth-case) - [**28\.5** Comparing the Cases](https://allmodelsarewrong.github.io/discanalysis.html#comparing-the-cases) - [**29** Performance of Classifiers](https://allmodelsarewrong.github.io/classperformance.html) - [**29\.1** Classification Error Measures](https://allmodelsarewrong.github.io/classperformance.html#classification-error-measures) - [**29\.1.1** Errors for Binary Response](https://allmodelsarewrong.github.io/classperformance.html#errors-for-binary-response) - [**29\.1.2** Error for Categorical Response](https://allmodelsarewrong.github.io/classperformance.html#error-for-categorical-response) - [**29\.2** Confusion Matrices](https://allmodelsarewrong.github.io/classperformance.html#confusion-matrices) - [**29\.3** Binary Response Example](https://allmodelsarewrong.github.io/classperformance.html#binary-response-example) - [**29\.3.1** Application for Checking Account](https://allmodelsarewrong.github.io/classperformance.html#application-for-checking-account) - [**29\.3.2** Application for Loan](https://allmodelsarewrong.github.io/classperformance.html#application-for-loan) - [**29\.4** Decision Rules and Errors](https://allmodelsarewrong.github.io/classperformance.html#decision-rules-and-errors) - [**29\.5** ROC Curves](https://allmodelsarewrong.github.io/classperformance.html#roc-curves) - [**29\.5.1** Graphing ROC curves](https://allmodelsarewrong.github.io/classperformance.html#graphing-roc-curves) - **IX Unsupervised II: Clustering** - [**30** Clustering](https://allmodelsarewrong.github.io/clustering.html) - [**30\.1** About 
Clustering](https://allmodelsarewrong.github.io/clustering.html#about-clustering) - [**30\.1.1** Types of Clustering](https://allmodelsarewrong.github.io/clustering.html#types-of-clustering) - [**30\.1.2** Hard Clustering](https://allmodelsarewrong.github.io/clustering.html#hard-clustering) - [**30\.2** Dispersion Measures](https://allmodelsarewrong.github.io/clustering.html#dispersion-measures) - [**30\.3** Complexity in Clustering](https://allmodelsarewrong.github.io/clustering.html#complexity-in-clustering) - [**31** K-Means](https://allmodelsarewrong.github.io/kmeans.html) - [**31\.1** Toy Example](https://allmodelsarewrong.github.io/kmeans.html#toy-example-2) - [**31\.2** What does K-means do?](https://allmodelsarewrong.github.io/kmeans.html#what-does-k-means-do) - [**31\.3** K-Means Algorithms](https://allmodelsarewrong.github.io/kmeans.html#k-means-algorithms) - [**31\.3.1** Classic Version](https://allmodelsarewrong.github.io/kmeans.html#classic-version) - [**31\.3.2** Moving Centers Algorithm](https://allmodelsarewrong.github.io/kmeans.html#moving-centers-algorithm) - [**31\.3.3** Dynamic Clouds](https://allmodelsarewrong.github.io/kmeans.html#dynamic-clouds) - [**31\.3.4** Choosing K K](https://allmodelsarewrong.github.io/kmeans.html#choosing-k) - [**31\.3.5** Comments](https://allmodelsarewrong.github.io/kmeans.html#comments) - [**32** Hierarchical Clustering](https://allmodelsarewrong.github.io/hclus.html) - [**32\.1** Agglomerative Methods](https://allmodelsarewrong.github.io/hclus.html#agglomerative-methods) - [**32\.2** Example: Single Linkage](https://allmodelsarewrong.github.io/hclus.html#example-single-linkage) - [**32\.2.1** Dendrogram](https://allmodelsarewrong.github.io/hclus.html#dendrogram) - [**32\.3** Example: Complete Linkage](https://allmodelsarewrong.github.io/hclus.html#example-complete-linkage) - [**32\.3.1** Cutting Dendograms](https://allmodelsarewrong.github.io/hclus.html#cutting-dendograms) - [**32\.3.2** Pros and 
COons](https://allmodelsarewrong.github.io/hclus.html#pros-and-coons) - **X Tree-based Methods** - [**33** Intro to Decision Trees](https://allmodelsarewrong.github.io/trees.html) - [**33\.1** Introduction](https://allmodelsarewrong.github.io/trees.html#introduction-6) - [**33\.2** Some Terminology](https://allmodelsarewrong.github.io/trees.html#some-terminology) - [**33\.2.1** Binary Trees](https://allmodelsarewrong.github.io/trees.html#binary-trees) - [**33\.3** Space Partitions](https://allmodelsarewrong.github.io/trees.html#space-partitions) - [**33\.3.1** The Process of Building a Tree](https://allmodelsarewrong.github.io/trees.html#the-process-of-building-a-tree) - [**34** Binary Splits and Impurity](https://allmodelsarewrong.github.io/tree-impurities.html) - [**34\.1** Binary Partitions](https://allmodelsarewrong.github.io/tree-impurities.html#binary-partitions) - [**34\.1.1** Splits of Binary variables](https://allmodelsarewrong.github.io/tree-impurities.html#splits-of-binary-variables) - [**34\.1.2** Splits of Nominal Variables](https://allmodelsarewrong.github.io/tree-impurities.html#splits-of-nominal-variables) - [**34\.1.3** Splits of Ordinal Variables](https://allmodelsarewrong.github.io/tree-impurities.html#splits-of-ordinal-variables) - [**34\.1.4** Splits Continuous variables](https://allmodelsarewrong.github.io/tree-impurities.html#splits-continuous-variables) - [**34\.2** Measures of Impurity](https://allmodelsarewrong.github.io/tree-impurities.html#measures-of-impurity) - [**34\.2.1** Entropy](https://allmodelsarewrong.github.io/tree-impurities.html#entropy) - [**34\.2.2** The Math Behind Entropy](https://allmodelsarewrong.github.io/tree-impurities.html#the-math-behind-entropy) - [**34\.2.3** Gini Impurity](https://allmodelsarewrong.github.io/tree-impurities.html#gini-impurity) - [**34\.2.4** Variance-based Impurity](https://allmodelsarewrong.github.io/tree-impurities.html#variance-based-impurity) - [**35** Splitting 
Nodes](https://allmodelsarewrong.github.io/tree-splits.html) - [**35\.1** Entropy-based Splits](https://allmodelsarewrong.github.io/tree-splits.html#entropy-based-splits) - [**35\.2** Gini-index based Splits](https://allmodelsarewrong.github.io/tree-splits.html#gini-index-based-splits) - [**35\.3** Looking for the best split](https://allmodelsarewrong.github.io/tree-splits.html#looking-for-the-best-split) - [**36** Building Binary Trees](https://allmodelsarewrong.github.io/tree-basics.html) - [**36\.1** Node-Splitting Stopping Criteria](https://allmodelsarewrong.github.io/tree-basics.html#node-splitting-stopping-criteria) - [**36\.2** Issues with Trees](https://allmodelsarewrong.github.io/tree-basics.html#issues-with-trees) - [**36\.2.1** Bias-Variance of Trees](https://allmodelsarewrong.github.io/tree-basics.html#bias-variance-of-trees) - [**36\.3** Pruning a Tree](https://allmodelsarewrong.github.io/tree-basics.html#pruning-a-tree) - [**36\.4** Pros and Cons of Trees](https://allmodelsarewrong.github.io/tree-basics.html#pros-and-cons-of-trees) - [**36\.4.1** Advantages of Trees](https://allmodelsarewrong.github.io/tree-basics.html#advantages-of-trees) - [**36\.4.2** Disadvantages of Trees](https://allmodelsarewrong.github.io/tree-basics.html#disadvantages-of-trees) - [**37** Bagging](https://allmodelsarewrong.github.io/bagging.html) - [**37\.1** Introduction](https://allmodelsarewrong.github.io/bagging.html#introduction-7) - [**37\.1.1** Idea of Bagging](https://allmodelsarewrong.github.io/bagging.html#idea-of-bagging) - [**37\.2** Why Bother Bagging?](https://allmodelsarewrong.github.io/bagging.html#why-bother-bagging) - [**38** Random Forests](https://allmodelsarewrong.github.io/forest.html) - [**38\.1** Introduction](https://allmodelsarewrong.github.io/forest.html#introduction-8) - [**38\.2** Algorithm](https://allmodelsarewrong.github.io/forest.html#algorithm-1) - [**38\.2.1** Two Sources of 
# [All Models Are Wrong: Concepts of Statistical Learning](https://allmodelsarewrong.github.io/)

# 5 Linear Regression

Before entering supervised learning territory, we want to discuss the general framework of linear regression. We will introduce this topic from a pure model-fitting point of view. In other words, we will postpone the learning aspect (the prediction of new data) until after the [chapter on the theory of learning](https://allmodelsarewrong.github.io/learning.html#learning). The reason to cover linear regression in this way is to have something to work with when we start talking about the theory of supervised learning.

## 5.1 Motivation

Consider, again, the NBA dataset example from previous chapters. Suppose we want to use these data to predict the salary of NBA players in terms of certain variables such as the player's team, height, weight, position, years of professional experience, number of 2-point shots, number of 3-point shots, number of blocks, etc. Of course, we need information on the salaries of some current NBA players:

| Player | Height | Weight | Yrs Expr | 2 Points | 3 Points |
|---|---|---|---|---|---|
| 1 | ◯ | ◯ | ◯ | ◯ | ◯ |
| 2 | ◯ | ◯ | ◯ | ◯ | ◯ |
| 3 | ◯ | ◯ | ◯ | ◯ | ◯ |
| … | … | … | … | … | … |

As usual, we use the symbol \(x_i\) to denote the vector of measurements of player \(i\)'s statistics (e.g.
height, weight, etc.); in turn, the salary of the \(i\)-th player is represented by \(y_i\).

Ideally, we assume the existence of some function \(f: \mathcal{X} \to y\) (i.e. a function that takes values from the \(\mathcal{X}\) space and maps them to a single value \(y\)). We will refer to this function as the ideal "formula" for salary. Here we are using the word *formula* in a very loose sense, and not necessarily in the mathematical sense.

We now seek a hypothesized model (which we call \(\hat{f}: \mathcal{X} \to y\)), selected from some set of candidate functions \(h_1, h_2, \dots, h_m\). Our task is to obtain \(\hat{f}\) in a way that lets us claim it is a good approximation of the (unknown) function \(f\).

## 5.2 The Idea/Intuition of Regression

Let's go back to our example with NBA players. Recall that \(y_i\) denotes the salary of the \(i\)-th player. For simplicity's sake, let's not worry about inflation. Say we now have a new prospective player from Europe, and we are tasked with predicting their salary, denoted by \(y_0\). Let's review a couple of scenarios to get a high-level intuition for this task.

**Scenario 1**. Suppose we have **no** information on this new player. How would we compute \(y_0\) (i.e. the salary of this new player)? One possibility is to guesstimate \(y_0\) using the historical average salary \(\bar{y}\) of NBA players. In other words, we would simply calculate \(\hat{y}_0 = \bar{y}\). In this case we are using \(\bar{y}\), the *typical* score (i.e. a measure of center), as a plausible guesstimate for \(y_0\). We could also look at the median of the existing salaries if we are concerned about outliers or a skewed distribution of salaries.

**Scenario 2**. Now, suppose we know that this new player will sign on to the LA Lakers. Compared to scenario 1, we have a new bit of information, since we know which team will hire this player. Therefore, we can use this fact to make a more educated guess for \(y_0\). How?
Instead of using the salaries of all players, we can focus on the salaries of Lakers players. We could then use \(\hat{y}_0 = \text{avg}(\text{Lakers salaries})\): that is, the average salary of all Lakers players. It is reasonable to expect \(\hat{y}_0\) to be "closer" to the average Lakers salary than to the overall average salary of all NBA players.

![Average salary by team](https://allmodelsarewrong.github.io/images/ols/ols-avg-salary-team.svg)

Figure 5.1: Average salary by team

**Scenario 3**. Similarly, if we know this new player's years of experience (e.g. 6 years), we would look at the average of the salaries of players with the same level of experience.

![Average salary by years of experience](https://allmodelsarewrong.github.io/images/ols/ols-avg-salary-experience.svg)

Figure 5.2: Average salary by years of experience

What do the three previous scenarios correspond to? In all of these examples, the prediction is basically a conditional mean:

\[\hat{y}_0 = \text{ave}(y_i \mid x_i = x_0) \tag{5.1}\]

Of course, the previous strategy only makes sense when we have data points \(x_i\) that are equal to the query point \(x_0\). But what if none of the available \(x_i\) values are equal to \(x_0\)? We'll talk about this later.

The previous hypothetical scenarios illustrate the core idea of regression: we obtain predictions \(\hat{y}_0\) using quantities of the form \(\text{ave}(y_i \mid x_i = x_0)\), which, under some assumptions, can be formalized into conditional expectations of the form:

\[E(y_i \mid x_{i1}^*, x_{i2}^*, \dots, x_{ip}^*) \longrightarrow \hat{y} \tag{5.2}\]

where \(x_{ij}^*\) represents the \(i\)-th measurement of the \(j\)-th variable. The above equation is what we call the **regression function**; note that the regression function is nothing more than a conditional expectation!

## 5.3 The Linear Regression Model

In a regression model we use one or more features \(X\) to say something about the response \(Y\).
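Before formalizing the linear model, the conditional-mean idea behind the three scenarios above can be sketched in a few lines of Python. The teams and salary figures below are invented purely for illustration:

```python
# Hypothetical (team, salary) records; the figures are invented for illustration.
salaries = [
    ("Lakers", 8.5), ("Lakers", 12.0), ("Lakers", 4.5),
    ("Warriors", 10.0), ("Warriors", 6.0),
]

def conditional_mean(records, team):
    """Guesstimate y0 as ave(y_i | team_i = team): the mean salary on that team."""
    values = [y for (t, y) in records if t == team]
    return sum(values) / len(values)

# Scenario 1: no information on the new player, so use the overall average.
overall_mean = sum(y for (_, y) in salaries) / len(salaries)

# Scenario 2: we know the team, so we condition on it.
lakers_mean = conditional_mean(salaries, "Lakers")

print(overall_mean)           # 8.2
print(round(lakers_mean, 2))  # 8.33
```

Conditioning on more information (as in Scenario 2) simply restricts the averaging to a smaller, more relevant subset of players.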
In turn, a **linear regression** model tells us to combine our features in a **linear** way in order to approximate the response. In the univariate case, we have a linear equation:

\[\hat{Y} = b_0 + b_1 X \tag{5.3}\]

In pointwise format, that is, for a given individual \(i\), we have:

\[\hat{y}_i = b_0 + b_1 x_i \tag{5.4}\]

In vector notation:

\[\hat{\mathbf{y}} = b_0 + b_1 \mathbf{x} \tag{5.5}\]

To simplify notation, we sometimes prefer to add an auxiliary constant feature in the form of a vector of \(n\) ones, and then use matrix notation with the following elements:

\[\mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \qquad \hat{\mathbf{y}} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} b_0 \\ b_1 \end{bmatrix}\]

In the multidimensional case, when we have \(p > 1\) predictors:

\[\mathbf{X} = \begin{bmatrix} 1 & x_{11} & \dots & x_{1p} \\ 1 & x_{21} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \dots & x_{np} \end{bmatrix}, \qquad \hat{\mathbf{y}} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_p \end{bmatrix}\]

With the matrix of features, the response, and the coefficients, we have a compact expression for the predicted outcomes:

\[\hat{\mathbf{y}} = \mathbf{X}\mathbf{b} \tag{5.6}\]

In path diagram form, the linear model looks like this:

![Linear combination with constant term](https://allmodelsarewrong.github.io/images/ols/ols-path-diagram-betas0.svg)

Figure 5.3: Linear combination with constant term

If we assume that the predictors and the response are mean-centered, then we don't have to worry about the constant feature \(x_0\):

![Linear combination without constant term](https://allmodelsarewrong.github.io/images/ols/ols-path-diagram-betas1.svg)

Figure 5.4: Linear combination without constant term

Obviously, the question becomes: how do we obtain the vector of coefficients \(\mathbf{b}\)?

## 5.4 The Error Measure

We would like \(\hat{y}_i\) to be "as close as" possible to \(y_i\). This requires coming up with some measure of *closeness*.
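Returning to the compact form \(\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}\) of the previous section, here is a minimal pure-Python sketch of how the constant feature and the matrix-vector product fit together. The data and the coefficients are invented, not fitted:

```python
# Two hypothetical predictors measured on three individuals (invented numbers).
rows = [
    [2.0, 1.0],
    [0.5, 3.0],
    [1.5, 2.0],
]

# Prepend the constant feature: each row of X is (1, x_i1, x_i2).
X = [[1.0] + r for r in rows]

# Hypothetical coefficient vector b = (b0, b1, b2); assumed, not estimated.
b = [0.5, 2.0, -1.0]

# y_hat = X b, computed row by row as the dot product b^T x_i.
y_hat = [sum(x_ij * b_j for x_ij, b_j in zip(x_i, b)) for x_i in X]

print(y_hat)  # [3.5, -1.5, 1.5]
```

The column of ones is what lets the intercept \(b_0\) ride along in the same dot product as the other coefficients.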
Among the various functions that we can use to measure how close \(\hat{y}_i\) and \(y_i\) are, the most common option is the squared distance between the two values:

\[d^2(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2 = (\hat{y}_i - y_i)^2 \tag{5.7}\]

Replacing \(\hat{y}_i\) with \(\mathbf{b}^\mathsf{T} \vec{x}_i\), we have:

\[d^2(y_i, \hat{y}_i) = (\mathbf{b}^\mathsf{T} \vec{x}_i - y_i)^2 \tag{5.8}\]

Notice that \(d^2(y_i, \hat{y}_i)\) is a pointwise error measure that we can generally denote as \(\text{err}_i\). But we also need to define a global measure of error. This is typically done by adding up all the pointwise error measures \(\text{err}_i\). There are two flavors of overall error measures based on squared pointwise differences: 1) the sum of squared errors, or \(\text{SSE}\), and 2) the mean squared error, or \(\text{MSE}\).

The sum of squared errors, \(\text{SSE}\), is defined as:

\[\text{SSE} = \sum_{i=1}^{n} \text{err}_i \tag{5.9}\]

The mean squared error, \(\text{MSE}\), is defined as:

\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \text{err}_i \tag{5.10}\]

As you can tell, \(\text{SSE} = n \, \text{MSE}\) and, vice versa, \(\text{MSE} = \text{SSE}/n\). Throughout this book, unless mentioned otherwise, when dealing with regression problems we will consider the \(\text{MSE}\) as the default overall error function to be minimized (you could also take the \(\text{SSE}\) instead).
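As a quick numerical illustration of these two error measures (the observed and predicted values below are made up):

```python
# Made-up observed and predicted values, for illustration only.
y     = [3.0, 5.0, 4.0, 7.0]
y_hat = [2.5, 5.5, 4.0, 6.0]

# Pointwise squared errors err_i = (y_i - yhat_i)^2.
errs = [(yi - yh) ** 2 for yi, yh in zip(y, y_hat)]

sse = sum(errs)        # sum of squared errors
mse = sse / len(errs)  # mean squared error: MSE = SSE / n

print(sse)  # 1.5  (0.25 + 0.25 + 0.0 + 1.0)
print(mse)  # 0.375
```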
Let \(e_i\) be:

\[e_i = (y_i - \hat{y}_i) \quad \Rightarrow \quad e_i^2 = (y_i - \hat{y}_i)^2 = \text{err}_i \tag{5.11}\]

Doing some algebra, it's easy to see that:

\[\begin{aligned} \text{MSE} &= \frac{1}{n} \sum_{i=1}^{n} e_i^2 = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{b}^\mathsf{T} \vec{x}_i - y_i)^2 \\ &= \frac{1}{n} (\mathbf{X}\mathbf{b} - \mathbf{y})^\mathsf{T} (\mathbf{X}\mathbf{b} - \mathbf{y}) = \frac{1}{n} \|\mathbf{X}\mathbf{b} - \mathbf{y}\|^2 = \frac{1}{n} \|\hat{\mathbf{y}} - \mathbf{y}\|^2 \end{aligned} \tag{5.12}\]

As you can tell, the mean squared error \(\text{MSE}\) is proportional to the squared norm of the residual vector \(\mathbf{e} = \hat{\mathbf{y}} - \mathbf{y}\):

\[\text{MSE} = \frac{1}{n} \|\mathbf{e}\|^2 = \frac{1}{n} (\hat{\mathbf{y}} - \mathbf{y})^\mathsf{T} (\hat{\mathbf{y}} - \mathbf{y}) \tag{5.13}\]

## 5.5 The Least Squares Algorithm

In (ordinary) least squares regression, we want to minimize the mean of squared errors (\(\text{MSE}\)). This minimization problem involves computing the derivative of \(\text{MSE}\) with respect to \(\mathbf{b}\). In other words, we compute the gradient of \(\text{MSE}\), denoted \(\nabla \text{MSE}(\mathbf{b})\), which is the vector of partial derivatives of \(\text{MSE}\) with respect to each parameter \(b_0, b_1, \dots, b_p\):

\[\nabla \text{MSE}(\mathbf{b}) = \frac{\partial}{\partial \mathbf{b}} \text{MSE}(\mathbf{b}) = \frac{\partial}{\partial \mathbf{b}} \left( \frac{1}{n} \mathbf{b}^\mathsf{T} \mathbf{X}^\mathsf{T} \mathbf{X} \mathbf{b} - \frac{2}{n} \mathbf{b}^\mathsf{T} \mathbf{X}^\mathsf{T} \mathbf{y} + \frac{1}{n} \mathbf{y}^\mathsf{T} \mathbf{y} \right) \tag{5.14}\]

which becomes:

\[\nabla \text{MSE}(\mathbf{b}) = \frac{2}{n} \mathbf{X}^\mathsf{T} \mathbf{X} \mathbf{b} - \frac{2}{n} \mathbf{X}^\mathsf{T} \mathbf{y} \tag{5.15}\]

Equating to zero, we get:

\[\mathbf{X}^\mathsf{T} \mathbf{X} \mathbf{b} = \mathbf{X}^\mathsf{T} \mathbf{y} \qquad (\text{normal equations}) \tag{5.16}\]

The above equation defines the system that most authors refer to as the *normal equations*. It is a system of \(p+1\) equations with \(p+1\) unknowns (assuming we have a constant term \(b_0\)).
If the cross-product matrix \(\mathbf{X}^\mathsf{T}\mathbf{X}\) is invertible (which is not a minor assumption), then the vector of regression coefficients \(\mathbf{b}\) that we are looking for is given by:

\[\mathbf{b} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1} \mathbf{X}^\mathsf{T} \mathbf{y} \tag{5.17}\]

Having obtained \(\mathbf{b}\), we can easily compute the vector of predicted responses:

\[\hat{\mathbf{y}} = \mathbf{X}\mathbf{b} = \mathbf{X} (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1} \mathbf{X}^\mathsf{T} \mathbf{y} \tag{5.18}\]

If we denote \(\mathbf{H} = \mathbf{X} (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1} \mathbf{X}^\mathsf{T}\), then the predicted response is:

\[\hat{\mathbf{y}} = \mathbf{H}\mathbf{y} \tag{5.19}\]

The matrix \(\mathbf{H}\) is better known as the **hat matrix**, because it puts the hat on the response. More importantly, \(\mathbf{H}\) is an **orthogonal projector**. From linear algebra, orthogonal projectors have very interesting properties:

- they are symmetric
- they are idempotent
- their eigenvalues are either 0 or 1

## 5.6 Geometries of OLS

Now that we've seen the algebra, it's time to look at the geometric interpretation of all the action going on within linear regression via OLS. We will discuss three geometric perspectives:

1. OLS from the individuals point of view (i.e. rows of the data matrix).
2. OLS from the variables point of view (i.e. columns of the data matrix).
3. OLS from the parameters point of view, and the error surface.

### 5.6.1 Rows Perspective

This is probably the most popular perspective, covered in most textbooks. For illustration purposes, let's assume that our data has just \(p = 1\) predictor. In other words, we have the response \(Y\) and one predictor \(X\). We can depict individuals as points in this space:

![Scatterplot of individuals](https://allmodelsarewrong.github.io/images/ols/ols-geom-obs0.svg)

Figure 5.5: Scatterplot of individuals

In linear regression, we want to predict \(y_i\) by linearly mixing the inputs: \(\hat{y}_i = b_0 + b_1 x_i\). In two dimensions, the fitted model corresponds to a line. In three dimensions it would correspond to a plane. And in higher dimensions it would correspond to a hyperplane.
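The best-fitting line can be computed with the closed form of Section 5.5. Here is a minimal pure-Python sketch for the \(p = 1\) case, solving the \(2 \times 2\) normal equations directly; the data are invented so that \(y = 1 + 2x\) holds exactly:

```python
# Made-up data: y is exactly 1 + 2x, so OLS should recover b = (1, 2).
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
n = len(x)

# Entries of X^T X and X^T y for the design matrix X = [1 | x].
sx, sxx = sum(x), sum(xi * xi for xi in x)
sy, sxy = sum(y), sum(xi * yi for xi, yi in zip(x, y))

# Solve the 2x2 normal equations [[n, sx], [sx, sxx]] b = [sy, sxy]
# by Cramer's rule; invertibility of X^T X means det != 0.
det = n * sxx - sx * sx
b0 = (sy * sxx - sx * sxy) / det
b1 = (n * sxy - sx * sy) / det

print(b0, b1)  # 1.0 2.0
```

Because the data lie exactly on a line, every residual is zero and OLS recovers the generating coefficients.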
![Scatterplot with regression line](https://allmodelsarewrong.github.io/images/ols/ols-geom-obs1.svg)

Figure 5.6: Scatterplot with regression line

With a fitted line, we obtain predicted values \(\hat{y}_i\). Some predicted values may be equal to the observed values. Other predicted values will be greater than the observed values. And some predicted values will be smaller than the observed values.

![Observed values and predicted values](https://allmodelsarewrong.github.io/images/ols/ols-geom-obs2.svg)

Figure 5.7: Observed values and predicted values

As you can imagine, given a set of data points, you can (in general) fit an infinite number of lines. So which line are we looking for? We want the line that minimizes the squares of the errors \(e_i = \hat{y}_i - y_i\). In the figure below, these errors (also known as *residuals*) are represented by the vertical differences between the observed values \(y_i\) and the predicted values \(\hat{y}_i\).

![OLS focuses on minimizing the squared errors](https://allmodelsarewrong.github.io/images/ols/ols-geom-obs3.svg)

Figure 5.8: OLS focuses on minimizing the squared errors

Combining all residuals, we want to obtain parameters \(b_0, \dots, b_p\) that minimize the squared norm of the vector of residuals:

\[\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \sum_{i=1}^{n} (b_0 + b_1 x_i - y_i)^2 \tag{5.20}\]

In vector-matrix form we have:

\[\|\mathbf{e}\|^2 = \|\hat{\mathbf{y}} - \mathbf{y}\|^2 = \|\mathbf{X}\mathbf{b} - \mathbf{y}\|^2 = (\mathbf{X}\mathbf{b} - \mathbf{y})^\mathsf{T}(\mathbf{X}\mathbf{b} - \mathbf{y}) \propto \text{MSE} \tag{5.21}\]

As you can tell, minimizing the squared norm of the vector of residuals is equivalent to minimizing the mean squared error.

### 5.6.2 Columns Perspective

We can also look at the geometry of OLS from the columns perspective. This is less common than the rows perspective, but still very enlightening. Imagine the variables, both the response and the predictors, living in an \(n\)-dimensional space.
In this space, the \(X\) variables span some subspace \(S_X\). This subspace is not supposed to contain the response, unless \(Y\) happens to be a linear combination of \(X_1, \dots, X_p\).

![Features and Response view in n-dim space](https://allmodelsarewrong.github.io/images/ols/ols-geomvar1.svg)

Figure 5.9: Features and Response view in n-dim space

What are we looking for? We're looking for a linear combination \(\mathbf{X}\mathbf{b}\) that gives us a good approximation to \(\mathbf{y}\). As you can tell, there is an infinite number of linear combinations that can be formed with \(X_1, \dots, X_p\).

![Linear combination of features](https://allmodelsarewrong.github.io/images/ols/ols-geomvar2.svg)

Figure 5.10: Linear combination of features

The mix of features that we are interested in, \(\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}\), is the one that gives us the closest approximation to \(\mathbf{y}\).

![Linear combination to be as close as possible to response](https://allmodelsarewrong.github.io/images/ols/ols-geomvar3.svg)

Figure 5.11: Linear combination to be as close as possible to response

Now, what do we mean by *closest approximation*? How do we determine the closeness between \(\hat{\mathbf{y}}\) and \(\mathbf{y}\)? By looking at the difference, which results in a vector \(\mathbf{e} = \hat{\mathbf{y}} - \mathbf{y}\), and then measuring the size, or *norm*, of this vector. Well, the squared norm, to be precise. In other words, we want to obtain \(\hat{\mathbf{y}}\) such that the squared norm \(\|\mathbf{e}\|^2\) is as small as possible:

\[\text{Minimize} \quad \|\mathbf{e}\|^2 = \|\hat{\mathbf{y}} - \mathbf{y}\|^2 \tag{5.22}\]

Minimizing the squared norm of \(\mathbf{e}\) involves minimizing the mean squared error.

### 5.6.3 Parameters Perspective

In addition to the two previously discussed perspectives (rows and columns), we can also visualize the regression problem from the point of view of the parameters \(\mathbf{b}\) and the error surface given by the \(\text{MSE}\). This is the least common perspective in the general linear regression literature.
However, it is not that uncommon within the Statistical Learning literature. For illustration purposes, assume that we have only two predictors, $X_1$ and $X_2$. Recall that the Mean Squared Error (MSE) is:

$$E(\mathbf{y}, \mathbf{\hat{y}}) = \frac{1}{n} \left( \mathbf{b}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b} - 2\, \mathbf{b}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{y} + \mathbf{y}^{\mathsf{T}} \mathbf{y} \right) \tag{5.23}$$

Now, from the point of view of $\mathbf{b} = (b_1, b_2)$, we can classify the order of each term:

$$E(\mathbf{y}, \mathbf{\hat{y}}) = \frac{1}{n} \Big( \underbrace{\mathbf{b}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b}}_{\text{Quadratic Form}} - \underbrace{2\, \mathbf{b}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{y}}_{\text{Linear}} + \underbrace{\mathbf{y}^{\mathsf{T}} \mathbf{y}}_{\text{Constant}} \Big) \tag{5.24}$$

Since $\mathbf{X}^{\mathsf{T}} \mathbf{X}$ is positive semidefinite, we know that $\mathbf{b}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b} \geq 0$. Furthermore, we know (from vector calculus) that the error surface is a paraboloid (a bowl-shaped surface) in the $(E, b_1, b_2)$ space. The following diagram depicts this situation.

![Error Surface](https://allmodelsarewrong.github.io/images/ols/ols-error-surface0.svg)

Figure 5.12: Error Surface

Imagine taking horizontal slices of the error surface, and projecting each slice onto the plane spanned by the parameters $b_1$ and $b_2$. The resulting projections form something like a topographic map, with error contours on this plane. In general, those contours will be ellipses.

![Error Surface with slices, and their projections](https://allmodelsarewrong.github.io/images/ols/ols-error-surface1.svg)

Figure 5.13: Error Surface with slices, and their projections

Quadratic error surfaces like this have a minimum value, and we are guaranteed the existence of a point $\mathbf{b}^* = (b_1^*, b_2^*)$ at which $E(\mathbf{y}, \mathbf{\hat{y}})$ is minimized. This is a powerful result! Contrast this with a [parabolic cylinder](http://mathworld.wolfram.com/ParabolicCylinder.html): such a shape has no unique minimum; rather, it has an infinite number of points (all lying on a line) that minimize the function.
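The decomposition in (5.23) is straightforward to check numerically: for any parameter vector $\mathbf{b}$, the quadratic/linear/constant expansion gives exactly the same value as computing the MSE directly from the residuals. A small sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data with two predictors, matching b = (b1, b2).
n = 40
X = rng.normal(size=(n, 2))
y = rng.normal(size=n)

def mse_direct(b):
    """MSE computed straight from the residuals: (1/n) ||Xb - y||^2."""
    e = X @ b - y
    return (e @ e) / n

def mse_expanded(b):
    """MSE via the quadratic / linear / constant terms of (5.23)."""
    return (b @ X.T @ X @ b - 2 * b @ X.T @ y + y @ y) / n

b = np.array([0.5, -1.2])   # an arbitrary parameter vector
print(mse_direct(b), mse_expanded(b))   # the two values agree
```

Evaluating `mse_expanded` over a grid of $(b_1, b_2)$ values is also a quick way to draw the elliptical error contours described below.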
The point being: when $\mathbf{X}^{\mathsf{T}} \mathbf{X}$ is positive definite (i.e. when $\mathbf{X}$ has full column rank), we **never** have this latter case.

![Error Surface with contour errors and the minimum](https://allmodelsarewrong.github.io/images/ols/ols-error-surface2.svg)

Figure 5.14: Error Surface with contour errors and the minimum

The minimum of the error surface occurs at the point $(b_1^*, b_2^*)$. This is precisely the OLS solution.
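As a quick numerical check of this claim, the minimizer $\mathbf{b}^*$ can be obtained from the normal equations $\mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{b} = \mathbf{X}^{\mathsf{T}} \mathbf{y}$, and perturbing it in any direction can only increase the error. A sketch with simulated (made-up) data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated full-column-rank design, so X^T X is positive definite
# and the error surface has a unique minimum.
n = 60
X = rng.normal(size=(n, 2))
y = rng.normal(size=n)

def mse(b):
    e = X @ b - y
    return (e @ e) / n

# OLS solution via the normal equations: (X^T X) b* = X^T y.
b_star = np.linalg.solve(X.T @ X, X.T @ y)

# Moving away from b* in random directions never decreases the MSE.
for _ in range(5):
    delta = 0.1 * rng.normal(size=2)
    assert mse(b_star + delta) >= mse(b_star)

print("b* minimizes the MSE:", b_star)
```

The solve step mirrors the geometry above: at $\mathbf{b}^*$ the gradient of the paraboloid vanishes, which is exactly the normal-equations condition.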