🕷️ Crawler Inspector

URL Lookup

Direct Parameter Lookup

Raw Queries and Responses

1. Shard Calculation

Query:
Response:
Calculated Shard: 94 (from laksa015)
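The inspector derives the shard from the URL's root hash. The exact hashing scheme is not shown in this output; below is a minimal hash-mod sketch, where `shard_for` and the shard count are hypothetical stand-ins for whatever the real system uses:

```python
def shard_for(root_hash: int, num_shards: int) -> int:
    """Map a URL's root hash onto one of num_shards crawl shards.

    Hypothetical sketch: the actual scheme behind "Calculated Shard: 94"
    is not visible in the inspector output.
    """
    return root_hash % num_shards

# Example with an arbitrary hash value and an assumed shard count of 128:
print(shard_for(12345, 128))  # prints 57 (12345 mod 128)
```

The real shard count and hash function may differ; the point is only that shard assignment is a pure function of the root hash, so any replica can compute it locally.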

2. Crawled Status Check

Query:
Response:

3. Robots.txt Check

Query:
Response:

4. Spam/Ban Check

Query:
Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄 INDEXABLE
CRAWLED (1 hour ago)
🤖 ROBOTS ALLOWED

Page Info Filters

| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
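The filter conditions above can be evaluated programmatically. A minimal sketch, assuming a hypothetical `page` dict keyed by the column names the inspector displays (the real system evaluates these in its query layer, and "6 MONTH" is approximated here as 182 days):

```python
from datetime import datetime, timedelta

def passes_filters(page: dict, now: datetime) -> bool:
    """Evaluate the five indexability filters from the table above.

    `page` is a hypothetical record keyed by the displayed column names.
    """
    checks = [
        page["download_http_code"] == 200,                           # HTTP status
        page["download_stamp"] > now - timedelta(days=182),          # age cutoff (~6 months)
        page["history_drop_reason"] is None,                         # history drop
        page["fh_dont_index"] != 1 and page["ml_spam_score"] == 0,   # spam/ban
        page["meta_canonical"] in (None, "", page["src_unparsed"]),  # canonical
    ]
    return all(checks)

page = {
    "download_http_code": 200,
    "download_stamp": datetime(2026, 4, 6, 21, 16, 44),
    "history_drop_reason": None,
    "fh_dont_index": 0,
    "ml_spam_score": 0,
    "meta_canonical": None,
    "src_unparsed": "https://online.stat.psu.edu/stat857/node/155/",
}
print(passes_filters(page, datetime(2026, 4, 6, 22, 16, 44)))  # prints True
```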

Page Details

| Property | Value |
|---|---|
| URL | https://online.stat.psu.edu/stat857/node/155/ |
| Last Crawled | 2026-04-06 21:16:44 (1 hour ago) |
| First Indexed | not set |
| HTTP Status Code | 200 |
| Meta Title | 5.1 - Ridge Regression \| STAT 897D |
| Meta Description | null |
| Meta Canonical | null |
Boilerpipe Text
Motivation: too many predictors

It is not unusual to see the number of input variables greatly exceed the number of observations, e.g. micro-array data analysis, environmental pollution studies. With many predictors, fitting the full model without penalization will result in large prediction intervals, and the LS regression estimator may not uniquely exist.

Motivation: ill-conditioned X

Because the LS estimates depend upon $(X'X)^{-1}$, we would have problems computing $\hat{\beta}_{LS}$ if $X'X$ were singular or nearly singular. In those cases, small changes to the elements of $X$ lead to large changes in $(X'X)^{-1}$. The least squares estimator $\hat{\beta}_{LS}$ may provide a good fit to the training data, but it will not fit sufficiently well to the test data.

Ridge Regression

One way out of this situation is to abandon the requirement of an unbiased estimator. We assume only that the X's and Y have been centered, so that we have no need for a constant term in the regression: $X$ is an $n \times p$ matrix with centered columns, and $Y$ is a centered $n$-vector. Hoerl and Kennard (1970) proposed that the potential instability in the LS estimator $\hat{\beta} = (X'X)^{-1}X'Y$ could be improved by adding a small constant value $\lambda$ to the diagonal entries of the matrix $X'X$ before taking its inverse. The result is the ridge regression estimator

$$\hat{\beta}^{ridge} = (X'X + \lambda I_p)^{-1} X'Y.$$

Ridge regression places a particular form of constraint on the parameters (the $\beta$'s): $\hat{\beta}^{ridge}$ is chosen to minimize the penalized sum of squares

$$\sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2,$$

which is equivalent to minimizing $\sum_{i=1}^{n} ( y_i - \sum_{j=1}^{p} x_{ij}\beta_j )^2$ subject to $\sum_{j=1}^{p} \beta_j^2 < c$ for some $c > 0$, i.e. constraining the sum of the squared coefficients. Therefore, ridge regression puts further constraints on the parameters $\beta_j$ of the linear model.
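The ridge estimator has a closed form, so it can be computed with a single linear solve. A minimal numpy sketch (variable names are mine; the data is synthetic):

```python
import numpy as np

def ridge(X: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Ridge estimator: solve (X'X + lambda * I_p) beta = X'y.

    Assumes the columns of X and the vector y are already centered,
    as in the text.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X = X - X.mean(axis=0)                      # center the columns
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)
y = y - y.mean()                            # center the response

b0 = ridge(X, y, 0.0)                       # lambda = 0 recovers least squares
b10 = ridge(X, y, 10.0)                     # penalized fit has smaller norm
print(np.linalg.norm(b10) < np.linalg.norm(b0))  # prints True
```

Using `solve` rather than forming the inverse explicitly is the standard numerically stable way to evaluate the formula.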
In this case, instead of just minimizing the residual sum of squares, we also have a penalty term on the $\beta$'s: $\lambda$ (a pre-chosen constant) times the squared norm of the $\beta$ vector. If the $\beta_j$'s take on large values, the objective is penalized, so we prefer $\beta_j$'s that are small or close to zero, keeping the penalty term small.

Geometric Interpretation of Ridge Regression

The ellipses correspond to the contours of the residual sum of squares (RSS): the inner ellipse has smaller RSS, and RSS is minimized at the ordinary least squares (OLS) estimates. For $p = 2$, the constraint in ridge regression corresponds to a circle, $\sum_{j=1}^{p} \beta_j^2 < c$. Ridge regression minimizes the RSS and the penalty simultaneously; the ridge estimate is given by the point at which the ellipse and the circle touch. There is a trade-off between the penalty term and the RSS: a large $\beta$ might give a better residual sum of squares but push the penalty term higher, which is why you might actually prefer smaller $\beta$'s with a worse residual sum of squares. From an optimization perspective, the penalty term is equivalent to a constraint on the $\beta$'s: the objective is still the residual sum of squares, but the norm of the $\beta_j$'s is now constrained to be smaller than some constant $c$. There is a correspondence between $\lambda$ and $c$: the larger $\lambda$ is, the more you prefer $\beta_j$'s close to zero. In the extreme case $\lambda = 0$, you are simply doing ordinary linear regression; in the other extreme, as $\lambda$ approaches infinity, you set all the $\beta$'s to zero.

Properties of the Ridge Estimator

$\hat{\beta}^{ls}$ is an unbiased estimator of $\beta$; $\hat{\beta}^{ridge}$ is a biased estimator of $\beta$. For orthogonal covariates, $X'X = n I_p$,

$$\hat{\beta}^{ridge} = \frac{n}{n + \lambda} \hat{\beta}^{ls}.$$

Hence, in this case, the ridge estimator always produces shrinkage towards $0$, and $\lambda$ controls the amount of shrinkage.
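The orthogonal-covariates property is easy to verify numerically. A sketch under the stated assumption $X'X = n I_p$ (constructed here via the Q factor of a QR decomposition; the variable names are mine):

```python
import numpy as np

# Build X with X'X = n * I_p using a matrix with orthonormal columns.
rng = np.random.default_rng(1)
n, p, lam = 20, 3, 5.0
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))   # Q has orthonormal columns
X = np.sqrt(n) * Q                             # now X.T @ X == n * I_p
y = rng.normal(size=n)

beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Ridge is a uniform shrinkage of least squares: factor n / (n + lambda).
print(np.allclose(beta_ridge, (n / (n + lam)) * beta_ls))  # prints True
```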
An important concept in shrinkage is the "effective" degrees of freedom associated with a set of parameters. In a ridge regression setting: if we choose $\lambda = 0$, we have $p$ parameters (since there is no penalization); if $\lambda$ is large, the parameters are heavily constrained and the degrees of freedom are effectively lower, tending to $0$ as $\lambda \to \infty$. The effective degrees of freedom associated with $\beta_1, \beta_2, \ldots, \beta_p$ is defined as

$$df(\lambda) = \mathrm{tr}\big( X (X'X + \lambda I_p)^{-1} X' \big) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda},$$

where the $d_j$ are the singular values of $X$. Notice that $\lambda = 0$, which corresponds to no shrinkage, gives $df(\lambda) = p$ (as long as $X'X$ is non-singular), as we would expect. There is a 1:1 mapping between $\lambda$ and the degrees of freedom, so in practice one may simply pick the effective degrees of freedom one would like associated with the fit and solve for $\lambda$. As an alternative to a user-chosen $\lambda$, cross-validation is often used: we select the $\lambda$ that yields the smallest cross-validation prediction error.

The intercept $\beta_0$ has been left out of the penalty term because $Y$ has been centered; penalizing the intercept would make the procedure depend on the origin chosen for $Y$. Since the ridge estimator is linear, it is straightforward to calculate the variance-covariance matrix

$$\mathrm{var}(\hat{\beta}^{ridge}) = \sigma^2 (X'X + \lambda I_p)^{-1} X'X (X'X + \lambda I_p)^{-1}.$$

A Bayesian Formulation

Consider the linear regression model with normal errors: $Y_i = \sum_{j=1}^{p} X_{ij}\beta_j + \epsilon_i$, where the $\epsilon_i$ are i.i.d. normal errors with mean $0$ and known variance $\sigma^2$. Since $\lambda$ is applied to the squared norm of the $\beta$ vector, the covariates are often standardized so that they have a similar scale. Assume $\beta_j$ has the prior distribution $\beta_j \sim_{iid} N(0, \sigma^2/\lambda)$. A large value of $\lambda$ corresponds to a prior that is more tightly concentrated around zero, and hence leads to greater shrinkage towards zero.
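The two expressions for $df(\lambda)$, the trace form and the singular-value form, can be checked against each other numerically. A minimal numpy sketch on synthetic data (names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 4))
X = X - X.mean(axis=0)
p = X.shape[1]
lam = 3.0

# Trace form: df(lambda) = tr( X (X'X + lambda I_p)^{-1} X' )
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
df_trace = np.trace(H)

# Singular-value form: sum_j d_j^2 / (d_j^2 + lambda)
d = np.linalg.svd(X, compute_uv=False)
df_svd = np.sum(d**2 / (d**2 + lam))

print(np.isclose(df_trace, df_svd))  # prints True

# At lambda = 0 the effective degrees of freedom equal p:
H0 = X @ np.linalg.solve(X.T @ X, X.T)
print(np.isclose(np.trace(H0), p))   # prints True
```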
The posterior is

$$\beta \mid Y \sim N\big( \hat{\beta}, \; \sigma^2 (X'X + \lambda I_p)^{-1} X'X (X'X + \lambda I_p)^{-1} \big),$$

where $\hat{\beta} = \hat{\beta}^{ridge} = (X'X + \lambda I_p)^{-1} X'Y$, confirming that the posterior mean (and mode) of the Bayesian linear model corresponds to the ridge regression estimator. Whereas the least squares solution $\hat{\beta}^{ls} = (X'X)^{-1} X'Y$ is unbiased if the model is correctly specified, ridge solutions are biased: $E(\hat{\beta}^{ridge}) \neq \beta$. However, at the cost of bias, ridge regression reduces the variance, and thus may reduce the mean squared error (MSE):

$$MSE = \mathrm{Bias}^2 + \mathrm{Variance}.$$

More Geometric Interpretations (optional)

Inputs are centered first. Consider the fitted response

$$\hat{y} = X\hat{\beta}^{ridge} = X(X^T X + \lambda I)^{-1} X^T y = UD(D^2 + \lambda I)^{-1} D U^T y = \sum_{j=1}^{p} \mathbf{u}_j \frac{d_j^2}{d_j^2 + \lambda} \mathbf{u}_j^T y,$$

where the $\mathbf{u}_j$ are the normalized principal components of $X$. Ridge regression shrinks the coordinates with respect to the orthonormal basis formed by the principal components; coordinates with respect to principal components with smaller variance are shrunk more. Instead of using $X = (X_1, X_2, \ldots, X_p)$ as the predicting variables, use the new input matrix $\tilde{X} = UD$. Then for the new inputs,

$$\hat{\beta}_j^{ridge} = \frac{d_j^2}{d_j^2 + \lambda} \mathbf{u}_j^T y, \qquad \mathrm{Var}(\hat{\beta}_j) = \frac{\sigma^2}{d_j^2},$$

where $\sigma^2$ is the variance of the error term $\epsilon$ in the linear model. The shrinkage factor given by ridge regression is $\frac{d_j^2}{d_j^2 + \lambda}$, as in the previous formula: the larger $\lambda$ is, the more the projection is shrunk in the direction of $\mathbf{u}_j$, and coordinates with respect to the principal components with smaller variance are shrunk more. This interpretation will become convenient when we compare ridge regression to principal components regression, where instead of smoothly shrinking each direction, we either keep it or drop it entirely. We will see this in the "Dimension Reduction Methods" lesson.
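The SVD identity for the fitted response can be verified by computing $\hat{y}$ both ways, directly and via per-direction shrinkage of the principal-component coordinates. A sketch on synthetic data (names are mine):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3))
X = X - X.mean(axis=0)
y = rng.normal(size=30)
lam = 2.0

# Direct route: y_hat = X (X'X + lambda I_p)^{-1} X' y
p = X.shape[1]
yhat_direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# SVD route: shrink each coordinate u_j' y by d_j^2 / (d_j^2 + lambda)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)              # per-direction shrinkage factors
yhat_svd = U @ (shrink * (U.T @ y))

print(np.allclose(yhat_direct, yhat_svd))  # prints True
```

Note that every shrinkage factor lies strictly between 0 and 1 for $\lambda > 0$, which is the "smooth shrinkage" contrasted with principal components regression's keep-or-drop behavior.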
Readable Markdown: null
Shard: 94 (laksa)
Root Hash: 16520191723648810894
Unparsed URL: edu,psu!stat,online,/stat857/node/155/ s443