ℹ️ Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.4 months ago (distributed domain, exempt) |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
| Property | Value |
|---|---|
| URL | https://en.wikipedia.org/wiki/Ordinary_least_squares |
| Last Crawled | 2026-04-13 08:10:06 (11 days ago) |
| First Indexed | 2013-08-24 22:28:28 (12 years ago) |
| HTTP Status Code | 200 |
| Content | |
| Meta Title | Ordinary least squares - Wikipedia |
| Meta Description | null |
| Meta Canonical | null |
## Boilerpipe Text

Figure: Okun's law in macroeconomics states that in an economy the GDP growth should depend linearly on the changes in the unemployment rate. Here the ordinary least squares method is used to construct the regression line describing this law.
In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model (with fixed level-one effects of a linear function of a set of explanatory variables) by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being observed) in the input dataset and the output of the (linear) function of the independent variable. Some sources consider OLS to be linear regression.[1]
Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surface; the smaller the differences, the better the model fits the data. The resulting estimator can be expressed by a simple formula, especially in the case of a simple linear regression, in which there is a single regressor on the right side of the regression equation.

The OLS estimator is consistent for the level-one fixed effects when the regressors are exogenous and there is no perfect collinearity (rank condition), and consistent for the variance estimate of the residuals when the regressors have finite fourth moments.[2] By the Gauss–Markov theorem, it is optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, the method of OLS provides minimum-variance mean-unbiased estimation when the errors have finite variances. Under the additional assumption that the errors are normally distributed with zero mean, OLS is the maximum likelihood estimator that outperforms any non-linear unbiased estimator.
### Linear model

Suppose the data consists of $n$ observations $\{x_i, y_i\}_{i=1}^{n}$. Each observation $i$ includes a scalar response $y_i$ and a column vector $x_i$ of $p$ parameters (regressors), i.e., $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})^T$. In a linear regression model, the response variable, $y_i$, is a linear function of the regressors:

$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i,$$

or in vector form,

$$y_i = x_i^T \beta + \varepsilon_i,$$

where $x_i$, as introduced previously, is a column vector of the $i$-th observation of all the explanatory variables; $\beta$ is a $p \times 1$ vector of unknown parameters; and the scalar $\varepsilon_i$ represents unobserved random variables (errors) of the $i$-th observation. $\varepsilon_i$ accounts for the influences upon the responses $y_i$ from sources other than the explanatory variables $x_i$. This model can also be written in matrix notation as

$$y = X\beta + \varepsilon,$$

where $y$ and $\varepsilon$ are $n \times 1$ vectors of the response variables and the errors of the $n$ observations, and $X$ is an $n \times p$ matrix of regressors, also sometimes called the design matrix, whose row $i$ is $x_i^T$ and contains the $i$-th observations on all the explanatory variables.

Typically, a constant term is included in the set of regressors $X$, say, by taking $x_{i1} = 1$ for all $i = 1, \dots, n$. The coefficient $\beta_1$ corresponding to this regressor is called the intercept. Without the intercept, the fitted line is forced to cross the origin when $x_i = \vec{0}$.

Regressors do not have to be independent for estimation to be consistent; e.g., they may be non-linearly dependent. Short of perfect multicollinearity, parameter estimates may still be consistent; however, as multicollinearity rises the standard error around such estimates increases and reduces the precision of the estimates. When there is perfect multicollinearity, it is no longer possible to obtain unique estimates for the coefficients of the related regressors; estimation for these parameters cannot converge (thus, it cannot be consistent).

As a concrete example where regressors are non-linearly dependent yet estimation may still be consistent, we might suspect the response depends linearly both on a value and its square; in that case we would include one regressor whose value is the square of another regressor. The model would then be quadratic in the second regressor, but is nonetheless still considered a linear model because it is still linear in the parameters ($\beta$).
#### Matrix/vector formulation

Consider an overdetermined system

$$\sum_{j=1}^{p} X_{ij}\beta_j = y_i, \qquad i = 1, \dots, n,$$

of $n$ linear equations in $p$ unknown coefficients $\beta_1, \beta_2, \dots, \beta_p$, with $n > p$. This can be written in matrix form as

$$X\beta = y,$$

where

$$X = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.$$

(Note: for a linear model as above, not all elements in $X$ contain information on the data points. The first column is populated with ones, $X_{i1} = 1$; only the other columns contain actual data. So here $p$ is equal to the number of regressors plus one.)

Such a system usually has no exact solution, so the goal is instead to find the coefficients $\beta$ which fit the equations "best", in the sense of solving the quadratic minimization problem

$$\hat\beta = \underset{\beta}{\operatorname{arg\,min}}\; S(\beta),$$

where the objective function $S$ is given by

$$S(\beta) = \sum_{i=1}^{n} \Bigl| y_i - \sum_{j=1}^{p} X_{ij}\beta_j \Bigr|^2 = \lVert y - X\beta \rVert^2.$$

A justification for choosing this criterion is given in Properties below. This minimization problem has a unique solution, provided that the $p$ columns of the matrix $X$ are linearly independent, given by solving the so-called normal equations:

$$(X^T X)\hat\beta = X^T y.$$

The matrix $X^T X$ is known as the normal matrix or Gram matrix, and the matrix $X^T y$ is known as the moment matrix of regressand by regressors.[3] Finally, $\hat\beta$ is the coefficient vector of the least-squares hyperplane, expressed as

$$\hat\beta = (X^T X)^{-1} X^T y,$$

or

$$\hat\beta = \beta + (X^T X)^{-1} X^T \varepsilon.$$
### Estimation

Suppose $b$ is a "candidate" value for the parameter vector $\beta$. The quantity $y_i - x_i^T b$, called the residual for the $i$-th observation, measures the vertical distance between the data point $(x_i, y_i)$ and the hyperplane $y = x^T b$, and thus assesses the degree of fit between the actual data and the model. The sum of squared residuals (SSR) (also called the error sum of squares (ESS) or residual sum of squares (RSS))[4] is a measure of the overall model fit:

$$S(b) = \sum_{i=1}^{n} (y_i - x_i^T b)^2 = (y - Xb)^T (y - Xb),$$

where $T$ denotes the matrix transpose, and the rows of $X$, denoting the values of all the independent variables associated with a particular value of the dependent variable, are $X_i = x_i^T$. The value of $b$ which minimizes this sum is called the OLS estimator for $\beta$. The function $S(b)$ is quadratic in $b$ with positive-definite Hessian, and therefore possesses a unique global minimum at $b = \hat\beta$, which can be given by the explicit formula[5]

$$\hat\beta = (X^T X)^{-1} X^T y.$$
The product $N = X^T X$ is a Gram matrix, and its inverse, $Q = N^{-1}$, is the cofactor matrix of $\beta$,[6][7][8] closely related to its covariance matrix, $C_\beta$. The matrix $(X^T X)^{-1} X^T = Q X^T$ is called the Moore–Penrose pseudoinverse matrix of $X$. This formulation highlights the point that estimation can be carried out if, and only if, there is no perfect multicollinearity between the explanatory variables (which would cause the Gram matrix to have no inverse).
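As a quick practical illustration, here is a minimal NumPy sketch (the synthetic data and variable names are ours, not from the article); solving the normal equations directly works, while a least-squares solver based on orthogonal factorizations is numerically preferable:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # first column of ones = intercept
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Normal equations: (X'X) beta_hat = X'y
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferable: dedicated least-squares solver (SVD-based)
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(beta_ne, beta_ls)
```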
### Prediction

After we have estimated $\beta$, the fitted values (or predicted values) from the regression will be

$$\hat y = X\hat\beta = Py,$$

where $P = X(X^T X)^{-1} X^T$ is the projection matrix onto the space $V$ spanned by the columns of $X$. This matrix $P$ is also sometimes called the hat matrix because it "puts a hat" onto the variable $y$. Another matrix, closely related to $P$, is the annihilator matrix $M = I_n - P$; this is a projection matrix onto the space orthogonal to $V$. Both matrices $P$ and $M$ are symmetric and idempotent (meaning that $P^2 = P$ and $M^2 = M$), and relate to the data matrix $X$ via the identities $PX = X$ and $MX = 0$.[9] Matrix $M$ creates the residuals from the regression:

$$\hat\varepsilon = y - \hat y = y - X\hat\beta = My = M(X\beta + \varepsilon) = M\varepsilon.$$
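The stated identities are easy to check numerically; a small NumPy sketch (synthetic data, our own naming):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(8), rng.normal(size=(8, 2))])
y = rng.normal(size=8)

P = X @ np.linalg.inv(X.T @ X) @ X.T   # hat (projection) matrix
M = np.eye(len(y)) - P                 # annihilator matrix

y_hat = P @ y    # fitted values
resid = M @ y    # residuals

# Both matrices are symmetric and idempotent, and PX = X, MX = 0.
assert np.allclose(P @ P, P) and np.allclose(M @ M, M)
assert np.allclose(P @ X, X) and np.allclose(M @ X, 0)
```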
The variances of the predicted values are found in the main diagonal of the variance-covariance matrix of predicted values:

$$\operatorname{Var}(\hat y) = s^2 P,$$

where $P$ is the projection matrix and $s^2$ is the sample variance.[10] The full matrix is very large; its diagonal elements can be calculated individually as

$$\operatorname{Var}(\hat y_i) = s^2\, X_i (X^T X)^{-1} X_i^T,$$

where $X_i$ is the $i$-th row of matrix $X$.
### Sample statistics

Using these residuals we can estimate the sample variance $s^2$ using the reduced chi-squared statistic:

$$s^2 = \frac{\hat\varepsilon^T \hat\varepsilon}{n - p} = \frac{S(\hat\beta)}{n - p}, \qquad \hat\sigma^2 = \frac{n - p}{n}\, s^2.$$

The denominator, $n - p$, is the statistical degrees of freedom. The first quantity, $s^2$, is the OLS estimate for $\sigma^2$, whereas the second, $\hat\sigma^2$, is the MLE estimate for $\sigma^2$. The two estimators are quite similar in large samples; the first estimator is always unbiased, while the second estimator is biased but has a smaller mean squared error. In practice $s^2$ is used more often, since it is more convenient for hypothesis testing. The square root of $s^2$ is called the regression standard error,[11] standard error of the regression,[12][13] or standard error of the equation.[9]
It is common to assess the goodness-of-fit of the OLS regression by comparing how much the initial variation in the sample can be reduced by regressing onto $X$. The coefficient of determination $R^2$ is defined as a ratio of "explained" variance to the "total" variance of the dependent variable $y$, in the cases where the regression sum of squares equals the sum of squares of residuals:[14]

$$R^2 = \frac{\sum (\hat y_i - \bar y)^2}{\sum (y_i - \bar y)^2} = 1 - \frac{y^T M y}{y^T L y},$$

where TSS is the total sum of squares for the dependent variable, $L = I_n - \tfrac{1}{n} J$, and $J$ is an $n \times n$ matrix of ones. ($L$ is a centering matrix which is equivalent to regression on a constant; it simply subtracts the mean from a variable.) In order for $R^2$ to be meaningful, the matrix $X$ of data on regressors must contain a column vector of ones to represent the constant whose coefficient is the regression intercept. In that case, $R^2$ will always be a number between 0 and 1, with values close to 1 indicating a good degree of fit.
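A short NumPy sketch computing $s^2$, the MLE variant, and $R^2$ as defined above (the data here are synthetic, assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

s2 = resid @ resid / (n - p)      # unbiased OLS estimate of sigma^2
sigma2_mle = resid @ resid / n    # biased MLE estimate

# Valid because X contains an intercept column of ones
r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
print(s2, sigma2_mle, r2)
```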
#### Simple linear regression model

If the data matrix $X$ contains only two variables, a constant and a scalar regressor $x_i$, then this is called the "simple regression model". This case is often considered in beginner statistics classes, as it provides much simpler formulas even suitable for manual calculation. The parameters are commonly denoted as $(\alpha, \beta)$:

$$y_i = \alpha + \beta x_i + \varepsilon_i.$$

The least squares estimates in this case are given by simple formulas:

$$\hat\beta = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}, \qquad \hat\alpha = \bar y - \hat\beta\, \bar x.$$
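These closed-form estimates agree with the general matrix formula; a minimal sketch with made-up numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

# Matches the general matrix formula with design matrix [1, x]
check, *_ = np.linalg.lstsq(np.column_stack([np.ones_like(x), x]), y, rcond=None)
assert np.allclose([alpha_hat, beta_hat], check)
```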
### Alternative derivations

In the previous section the least squares estimator $\hat\beta$ was obtained as a value that minimizes the sum of squared residuals of the model. However, it is also possible to derive the same estimator from other approaches. In all cases the formula for the OLS estimator remains the same,

$$\hat\beta = (X^T X)^{-1} X^T y;$$

the only difference is in how we interpret this result.
#### Projection

OLS estimation can be viewed as a projection onto the linear space spanned by the regressors. (Here each of $X_1$ and $X_2$ refers to a column of the data matrix.)

Figure: least squares as projection of $y$ onto $\operatorname{col}(X)$ for three observations; $\hat y = X\hat\beta$ gives the fitted values and $y - \hat y$ is the residual.

For mathematicians, OLS is an approximate solution to an overdetermined system of linear equations $X\beta \approx y$, where $\beta$ is the unknown. Assuming the system cannot be solved exactly (the number of equations $n$ is much larger than the number of unknowns $p$), we are looking for a solution that provides the smallest discrepancy between the right- and left-hand sides. In other words, we are looking for the solution that satisfies

$$\hat\beta = \underset{\beta}{\operatorname{arg\,min}}\; \lVert y - X\beta \rVert,$$

where $\lVert \cdot \rVert$ is the standard $L^2$ norm in the $n$-dimensional Euclidean space $\mathbb{R}^n$. The predicted quantity $X\beta$ is just a certain linear combination of the vectors of regressors. Thus, the residual vector $y - X\beta$ will have the smallest length when $y$ is projected orthogonally onto the linear subspace spanned by the columns of $X$. The OLS estimator $\hat\beta$ in this case can be interpreted as the coefficients of the vector decomposition of $\hat y = Py$ along the basis of $X$.
In other words, the gradient equations at the minimum can be written as

$$\left.\frac{\partial S}{\partial \beta}\right|_{\hat\beta} = -2X^T (y - X\hat\beta) = 0.$$

A geometrical interpretation of these equations is that the vector of residuals, $y - X\hat\beta$, is orthogonal to the column space of $X$, since the dot product $(y - X\hat\beta) \cdot Xv$ is equal to zero for any conformal vector, $v$. This means that $y - X\hat\beta$ is the shortest of all possible vectors $y - X\beta$; that is, the variance of the residuals is the minimum possible. This is illustrated at the right.

Introducing $\hat\gamma$ and a matrix $K$ with the assumption that a matrix $[X\ K]$ is non-singular and $K^T X = 0$ (cf. Orthogonal projections), the residual vector should satisfy the equation

$$\hat r = y - X\hat\beta = K\hat\gamma.$$

The equation and solution of linear least squares are thus described as follows:

$$y = \begin{bmatrix} X & K \end{bmatrix} \begin{bmatrix} \hat\beta \\ \hat\gamma \end{bmatrix}, \qquad \begin{bmatrix} \hat\beta \\ \hat\gamma \end{bmatrix} = \begin{bmatrix} X & K \end{bmatrix}^{-1} y.$$
Another way of looking at it is to consider the regression line to be a weighted average of the lines passing through the combination of any two points in the dataset.[15] Although this way of calculation is more computationally expensive, it provides better intuition about OLS.
#### Maximum likelihood

The OLS estimator is identical to the maximum likelihood estimator (MLE) under the normality assumption for the error terms.[16] This normality assumption has historical importance, as it provided the basis for the early work in linear regression analysis by Yule and Pearson. From the properties of MLE, we can infer that the OLS estimator is asymptotically efficient (in the sense of attaining the Cramér–Rao bound for variance) if the normality assumption is satisfied.[17]
#### Generalized method of moments

In the iid case the OLS estimator can also be viewed as a GMM estimator arising from the moment conditions

$$\operatorname{E}\bigl[\, x_i (y_i - x_i^T \beta) \,\bigr] = 0.$$

These moment conditions state that the regressors should be uncorrelated with the errors. Since $x_i$ is a $p$-vector, the number of moment conditions is equal to the dimension of the parameter vector $\beta$, and thus the system is exactly identified. This is the so-called classical GMM case, when the estimator does not depend on the choice of the weighting matrix.

Note that the original strict exogeneity assumption $\operatorname{E}[\varepsilon_i \mid x_i] = 0$ implies a far richer set of moment conditions than stated above. In particular, this assumption implies that for any vector-function $f$, the moment condition $\operatorname{E}[f(x_i)\cdot\varepsilon_i] = 0$ will hold. However, it can be shown using the Gauss–Markov theorem that the optimal choice of function $f$ is to take $f(x) = x$, which results in the moment equation posted above.
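The sample analogue of the moment conditions holds exactly at the OLS solution, which a short sketch (synthetic data, our own naming) demonstrates:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# Sample analogue of E[x_i (y_i - x_i' beta)] = 0: exactly zero at the OLS solution
print(X.T @ resid / n)   # numerically a zero vector
```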
### Assumptions

There are several different frameworks in which the linear regression model can be cast in order to make the OLS technique applicable. Each of these settings produces the same formulas and same results; the only difference is the interpretation and the assumptions which have to be imposed in order for the method to give meaningful results. The choice of the applicable framework depends mostly on the nature of the data in hand, and on the inference task which has to be performed.

One of the lines of difference in interpretation is whether to treat the regressors as random variables, or as predefined constants. In the first case (random design) the regressors $x_i$ are random and sampled together with the $y_i$'s from some population, as in an observational study. This approach allows for a more natural study of the asymptotic properties of the estimators. In the other interpretation (fixed design), the regressors $X$ are treated as known constants set by a design, and $y$ is sampled conditionally on the values of $X$ as in an experiment. For practical purposes, this distinction is often unimportant, since estimation and inference is carried out while conditioning on $X$. All results stated in this article are within the random design framework.

The classical model focuses on the "finite sample" estimation and inference, meaning that the number of observations $n$ is fixed. This contrasts with the other approaches, which study the asymptotic behavior of OLS as the number of observations grows large.
To prove finite sample unbiasedness of the OLS estimator, we require the following assumptions.

Figure: example of a cubic polynomial regression, which is a type of linear regression. Although polynomial regression fits a curve model to the data, as a statistical estimation problem it is linear, in the sense that the conditional expectation function is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.
**Exogeneity.** The regressors do not covary with the error term:

$$\operatorname{E}[\, x_i \varepsilon_i \,] = 0.$$

This requires, for example, that there are no omitted variables that covary with observed variables and affect the response variable. An alternative (but stronger) statement that is often required when explaining linear regression in mathematical statistics is that the predictor variables $x$ can be treated as fixed values, rather than random variables. This stronger form means, for example, that the predictor variables are assumed to be error-free, that is, not contaminated with measurement error. Although this assumption is not realistic in many settings, dropping it leads to more complex errors-in-variables models, instrumental variable models and the like.
**Linearity, or correct specification.** This means that the mean of the response variable is a linear combination of the parameters (regression coefficients) and the predictor variables. Note that this assumption is much less restrictive than it may at first seem. Because the predictor variables are treated as fixed values (see above), linearity is really only a restriction on the parameters. The predictor variables themselves can be arbitrarily transformed, and in fact multiple copies of the same underlying predictor variable can be added, each one transformed differently. This technique is used, for example, in polynomial regression, which uses linear regression to fit the response variable as an arbitrary polynomial function (up to a given degree) of a predictor variable. With this much flexibility, models such as polynomial regression often have "too much power", in that they tend to overfit the data. As a result, some kind of regularization must typically be used to prevent unreasonable solutions coming out of the estimation process. Common examples are ridge regression and lasso regression. Bayesian linear regression can also be used, which by its nature is more or less immune to the problem of overfitting. (In fact, ridge regression and lasso regression can both be viewed as special cases of Bayesian linear regression, with particular types of prior distributions placed on the regression coefficients.)
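For instance, a sketch of polynomial regression as a linear-in-parameters fit (synthetic data assumed for illustration):

```python
import numpy as np

# Fitting y on (1, x, x^2): non-linear in x, but linear in the parameters.
rng = np.random.default_rng(8)
x = rng.uniform(-2, 2, size=40)
y = 1.0 + 0.5 * x - 2.0 * x**2 + 0.1 * rng.normal(size=40)

X = np.column_stack([np.ones_like(x), x, x**2])   # transformed copies of one predictor
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # close to [1.0, 0.5, -2.0]
```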
Figure: visualization of heteroscedasticity in a scatter plot against 100 random fitted values using Matlab.

**Constant variance or homoscedasticity.** This means that the variance of the errors does not depend on the values of the predictor variables:

$$\operatorname{Var}(\varepsilon_i \mid x_i) = \sigma^2.$$

Thus the variability of the responses for given fixed values of the predictors is the same regardless of how large or small the responses are. This is often not the case, as a variable whose mean is large will typically have a greater variance than one whose mean is small. For example, a person whose income is predicted to be $100,000 may easily have an actual income of $80,000 or $120,000 (a standard deviation of around $20,000), while another person with a predicted income of $10,000 is unlikely to have the same $20,000 standard deviation, since that would imply their actual income could vary anywhere between −$10,000 and $30,000. (In fact, as this shows, in many cases, often the same cases where the assumption of normally distributed errors fails, the variance or standard deviation should be predicted to be proportional to the mean, rather than constant.) The absence of homoscedasticity is called heteroscedasticity. In order to check this assumption, a plot of residuals versus predicted values (or the values of each individual predictor) can be examined for a "fanning effect" (i.e., increasing or decreasing vertical spread as one moves left to right on the plot). A plot of the absolute or squared residuals versus the predicted values (or each predictor) can also be examined for a trend or curvature. Formal tests can also be used; see Heteroscedasticity. The presence of heteroscedasticity will result in an overall "average" estimate of variance being used instead of one that takes into account the true variance structure. This leads to less precise (but in the case of ordinary least squares, not biased) parameter estimates and biased standard errors, resulting in misleading tests and interval estimates. The mean squared error for the model will also be wrong. Various estimation techniques, including weighted least squares and the use of heteroscedasticity-consistent standard errors, can handle heteroscedasticity in a quite general way. Bayesian linear regression techniques can also be used when the variance is assumed to be a function of the mean. It is also possible in some cases to fix the problem by applying a transformation to the response variable (e.g., fitting the logarithm of the response variable using a linear regression model, which implies that the response variable itself has a log-normal distribution rather than a normal distribution).
To check for violations of the assumptions of linearity, constant variance, and independence of errors within a linear regression model, the residuals are typically plotted against the predicted values (or each of the individual predictors). An apparently random scatter of points about the horizontal midline at 0 is ideal, but cannot rule out certain kinds of violations, such as autocorrelation in the errors or their correlation with one or more covariates.
**Uncorrelatedness of errors.** This assumes that the errors of the response variables are uncorrelated with each other:

$$\operatorname{cov}(\varepsilon_i, \varepsilon_j) = 0 \quad \text{for } i \neq j.$$

Some methods such as generalized least squares are capable of handling correlated errors, although they typically require significantly more data unless some sort of regularization is used to bias the model towards assuming uncorrelated errors. Bayesian linear regression is a general way of handling this issue. Full statistical independence is a stronger condition than mere lack of correlation and is often not needed, although it implies mean-independence.
**Lack of perfect multicollinearity in the predictors.** For standard least squares estimation methods, the design matrix $X$ must have full column rank $p$:[18]

$$\operatorname{rank}(X) = p.$$

If this assumption is violated, perfect multicollinearity exists in the predictor variables, meaning a linear relationship exists between two or more predictor variables. Multicollinearity can be caused by accidentally duplicating a variable in the data, using a linear transformation of a variable along with the original (e.g., the same temperature measurements expressed in Fahrenheit and Celsius), or including a linear combination of multiple variables in the model, such as their mean. It can also happen if there is too little data available compared to the number of parameters to be estimated (e.g., fewer data points than regression coefficients). Near violations of this assumption, where predictors are highly but not perfectly correlated, can reduce the precision of parameter estimates (see Variance inflation factor). In the case of perfect multicollinearity, the parameter vector $\beta$ will be non-identifiable: it has no unique solution. In such a case, only some of the parameters can be identified (i.e., their values can only be estimated within some linear subspace of the full parameter space $\mathbb{R}^p$). See partial least squares regression. Methods for fitting linear models with multicollinearity have been developed,[19][20][21][22] some of which require additional assumptions such as "effect sparsity", that a large fraction of the effects are exactly zero. Note that the more computationally expensive iterated algorithms for parameter estimation, such as those used in generalized linear models, do not suffer from this problem.
Violations of these assumptions can result in biased estimations of $\beta$, biased standard errors, untrustworthy confidence intervals and significance tests. Beyond these assumptions, several other statistical properties of the data strongly influence the performance of different estimation methods:

- The statistical relationship between the error terms and the regressors plays an important role in determining whether an estimation procedure has desirable sampling properties such as being unbiased and consistent.
- The arrangement, or probability distribution, of the predictor variables $x$ has a major influence on the precision of estimates of $\beta$.
- Sampling and design of experiments are highly developed subfields of statistics that provide guidance for collecting data in such a way as to achieve a precise estimate of $\beta$.
### Properties

#### Finite sample properties

First of all, under the strict exogeneity assumption the OLS estimators $\hat\beta$ and $s^2$ are unbiased, meaning that their expected values coincide with the true values of the parameters:[23]

$$\operatorname{E}[\,\hat\beta \mid X\,] = \beta, \qquad \operatorname{E}[\,s^2 \mid X\,] = \sigma^2.$$

If the strict exogeneity does not hold (as is the case with many time series models, where exogeneity is assumed only with respect to the past shocks but not the future ones), then these estimators will be biased in finite samples.
The variance-covariance matrix (or simply covariance matrix) of $\hat\beta$ is equal to[24]

$$\operatorname{Var}[\,\hat\beta \mid X\,] = \sigma^2 (X^T X)^{-1}.$$

In particular, the standard error of each coefficient $\hat\beta_j$ is equal to the square root of the $j$-th diagonal element of this matrix. The estimate of this standard error is obtained by replacing the unknown quantity $\sigma^2$ with its estimate $s^2$. Thus,

$$\widehat{\operatorname{s.e.}}(\hat\beta_j) = \sqrt{s^2 \bigl[(X^T X)^{-1}\bigr]_{jj}}.$$

It can also be easily shown that the estimator $\hat\beta$ is uncorrelated with the residuals from the model:[24]

$$\operatorname{Cov}[\,\hat\beta, \hat\varepsilon \mid X\,] = 0.$$
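A minimal sketch of estimating these standard errors (synthetic data; the names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, -2.0, 0.3]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)

cov_beta = s2 * np.linalg.inv(X.T @ X)   # estimated Var[beta_hat | X]
std_err = np.sqrt(np.diag(cov_beta))     # standard error of each coefficient
print(std_err)
```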
The Gauss–Markov theorem states that under the spherical errors assumption (that is, the errors should be uncorrelated and homoscedastic) the estimator $\hat\beta$ is efficient in the class of linear unbiased estimators. This is called the best linear unbiased estimator (BLUE). Efficiency should be understood as follows: if we were to find some other estimator $\tilde\beta$ which would be linear in $y$ and unbiased, then[24]

$$\operatorname{Var}[\,\tilde\beta \mid X\,] - \operatorname{Var}[\,\hat\beta \mid X\,] \geq 0,$$

in the sense that this difference is a nonnegative-definite matrix. This theorem establishes optimality only in the class of linear unbiased estimators, which is quite restrictive. Depending on the distribution of the error terms $\varepsilon$, other, non-linear estimators may provide better results than OLS.
##### Assuming normality

The properties listed so far are all valid regardless of the underlying distribution of the error terms. However, if you are willing to assume that the normality assumption holds (that is, that $\varepsilon \sim N(0, \sigma^2 I_n)$), then additional properties of the OLS estimators can be stated.

The estimator $\hat\beta$ is normally distributed, with mean and variance as given before:[25]

$$\hat\beta \sim N\bigl(\beta,\ \sigma^2 (X^T X)^{-1}\bigr).$$

This estimator reaches the Cramér–Rao bound for the model, and thus is optimal in the class of all unbiased estimators.[17] Note that unlike the Gauss–Markov theorem, this result establishes optimality among both linear and non-linear estimators, but only in the case of normally distributed error terms.
The estimator $s^2$ will be proportional to the chi-squared distribution:[26]

$$s^2 \sim \frac{\sigma^2}{n - p} \cdot \chi^2_{n-p}.$$

The variance of this estimator is equal to $2\sigma^4/(n - p)$, which does not attain the Cramér–Rao bound of $2\sigma^4/n$. However, it was shown that there are no unbiased estimators of $\sigma^2$ with variance smaller than that of the estimator $s^2$.[27] If we are willing to allow biased estimators, and consider the class of estimators that are proportional to the sum of squared residuals (SSR) of the model, then the best (in the sense of the mean squared error) estimator in this class will be $\tilde\sigma^2 = \mathrm{SSR}/(n - p + 2)$, which even beats the Cramér–Rao bound in the case when there is only one regressor ($p = 1$).[28]

Moreover, the estimators $\hat\beta$ and $s^2$ are independent,[29] a fact which comes in useful when constructing the t- and F-tests for the regression.
##### Influential observations

As was mentioned before, the estimator $\hat\beta$ is linear in $y$, meaning that it represents a linear combination of the dependent variables $y_i$. The weights in this linear combination are functions of the regressors $X$, and generally are unequal. The observations with high weights are called influential because they have a more pronounced effect on the value of the estimator.

To analyze which observations are influential we remove a specific $j$-th observation and consider how much the estimated quantities are going to change (similarly to the jackknife method). It can be shown that the change in the OLS estimator for $\beta$ will be equal to[30]

$$\hat\beta^{(j)} - \hat\beta = -\frac{1}{1 - h_j}\, (X^T X)^{-1} x_j \hat\varepsilon_j,$$

where $h_j = x_j^T (X^T X)^{-1} x_j$ is the $j$-th diagonal element of the hat matrix $P$, and $x_j$ is the vector of regressors corresponding to the $j$-th observation. Similarly, the change in the predicted value for the $j$-th observation resulting from omitting that observation from the dataset will be equal to[30]

$$\hat y_j^{(j)} - \hat y_j = -\frac{h_j}{1 - h_j}\, \hat\varepsilon_j.$$

From the properties of the hat matrix, $0 \leq h_j \leq 1$, and they sum up to $p$, so that on average $h_j \approx p/n$. These quantities $h_j$ are called the leverages, and observations with high $h_j$ are called leverage points.[31] Usually the observations with high leverage ought to be scrutinized more carefully, in case they are erroneous, or outliers, or in some other way atypical of the rest of the dataset.
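A sketch computing the leverages $h_j$ for a synthetic dataset with one deliberately extreme observation (data and names assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
X[0, 1] = 8.0                      # one unusually extreme regressor value
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

Q = np.linalg.inv(X.T @ X)
h = np.einsum('ij,jk,ik->i', X, Q, X)   # leverages: diagonal of X (X'X)^{-1} X'

assert np.isclose(h.sum(), p)           # leverages sum to p, so h_j ~ p/n on average
print(h.argmax(), h.max())              # observation 0 stands out as a leverage point
```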
##### Partitioned regression

Sometimes the variables and corresponding parameters in the regression can be logically split into two groups, so that the regression takes the form

$$y = X_1 \beta_1 + X_2 \beta_2 + \varepsilon,$$

where $X_1$ and $X_2$ have dimensions $n \times p_1$ and $n \times p_2$, and $\beta_1$, $\beta_2$ are $p_1 \times 1$ and $p_2 \times 1$ vectors, with $p_1 + p_2 = p$.

The Frisch–Waugh–Lovell theorem states that in this regression the residuals $\hat\varepsilon$ and the OLS estimate $\hat\beta_2$ will be numerically identical to the residuals and the OLS estimate for $\beta_2$ in the following regression:[32]

$$M_1 y = M_1 X_2 \beta_2 + \eta,$$

where $M_1$ is the annihilator matrix for the regressors $X_1$.

The theorem can be used to establish a number of theoretical results. For example, having a regression with a constant and another regressor is equivalent to subtracting the means from the dependent variable and the regressor and then running the regression for the de-meaned variables but without the constant term.
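The numerical identity asserted by the theorem can be checked directly; a sketch under assumed synthetic data:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 60
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # first group (includes constant)
X2 = rng.normal(size=(n, 1))                             # second group
y = X1 @ np.array([1.0, 2.0]) + X2[:, 0] * (-0.7) + rng.normal(size=n)

# Full regression on [X1 X2]
X = np.hstack([X1, X2])
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# FWL: regress M1 y on M1 X2, where M1 annihilates X1
M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
beta2, *_ = np.linalg.lstsq(M1 @ X2, M1 @ y, rcond=None)

assert np.allclose(beta_full[-1], beta2[0])   # identical estimate for the X2 coefficient
```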
#### Large sample properties

The least squares estimators are point estimates of the linear regression model parameters $\beta$. However, generally we also want to know how close those estimates might be to the true values of the parameters. In other words, we want to construct the interval estimates.

Since we have not made any assumption about the distribution of the error term $\varepsilon_i$, it is impossible to infer the distribution of the estimators $\hat\beta$ and $\hat\sigma^2$. Nevertheless, we can apply the central limit theorem to derive their asymptotic properties as the sample size $n$ goes to infinity. While the sample size is necessarily finite, it is customary to assume that $n$ is "large enough" so that the true distribution of the OLS estimator is close to its asymptotic limit.

We can show that under the model assumptions, the least squares estimator for $\beta$ is consistent (that is, $\hat\beta$ converges in probability to $\beta$) and asymptotically normal:

$$\sqrt{n}\,(\hat\beta - \beta)\ \xrightarrow{d}\ N\bigl(0,\ \sigma^2 Q_{xx}^{-1}\bigr),$$

where $Q_{xx} = \operatorname{E}[x_i x_i^T]$, which can be estimated by $X^T X / n$.
Using this asymptotic distribution, approximate two-sided confidence intervals for the $j$-th component of the vector $\hat\beta$ can be constructed as

$$\beta_j \in \Bigl[\ \hat\beta_j \pm q_{1-\alpha/2} \sqrt{\tfrac{1}{n}\, s^2 \bigl[Q_{xx}^{-1}\bigr]_{jj}}\ \Bigr]$$

at the $1 - \alpha$ confidence level, where $q$ denotes the quantile function of the standard normal distribution, and $[\cdot]_{jj}$ is the $j$-th diagonal element of a matrix.
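A sketch of these approximate intervals, using $s^2 (X^T X)^{-1} = \tfrac{1}{n} s^2 \hat Q_{xx}^{-1}$ (synthetic data; SciPy is assumed available for the normal quantile):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n, p = 500, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.8]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)

alpha = 0.05
q = norm.ppf(1 - alpha / 2)                        # standard normal quantile
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
lower, upper = beta_hat - q * se, beta_hat + q * se
print(np.column_stack([lower, upper]))             # approximate 95% interval per coefficient
```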
Similarly, the least squares estimator for $\sigma^2$ is also consistent and asymptotically normal (provided that the fourth moment of $\varepsilon_i$ exists) with limiting distribution

$$\sqrt{n}\,(\hat\sigma^2 - \sigma^2)\ \xrightarrow{d}\ N\bigl(0,\ \operatorname{E}[\varepsilon_i^4] - \sigma^4\bigr).$$

These asymptotic distributions can be used for prediction, testing hypotheses, constructing other estimators, etc. As an example consider the problem of prediction. Suppose $x_0$ is some point within the domain of distribution of the regressors, and one wants to know what the response variable would have been at that point. The mean response is the quantity $y_0 = x_0^T \beta$, whereas the predicted response is $\hat y_0 = x_0^T \hat\beta$. Clearly the predicted response is a random variable; its distribution can be derived from that of $\hat\beta$:

$$\sqrt{n}\,(\hat y_0 - y_0)\ \xrightarrow{d}\ N\bigl(0,\ \sigma^2\, x_0^T Q_{xx}^{-1} x_0\bigr),$$

which allows confidence intervals for the mean response $y_0$ to be constructed:

$$y_0 \in \Bigl[\ x_0^T \hat\beta \pm q_{1-\alpha/2} \sqrt{\tfrac{1}{n}\, s^2\, x_0^T Q_{xx}^{-1} x_0}\ \Bigr]$$

at the $1 - \alpha$ confidence level.
Two hypothesis tests are particularly widely used. First, one wants to know if the estimated regression equation is any better than simply predicting that all values of the response variable equal its sample mean (if not, it is said to have no explanatory power). The null hypothesis of no explanatory value of the estimated regression is tested using an F-test. If the calculated F-value is found to be large enough to exceed its critical value for the pre-chosen level of significance, the null hypothesis is rejected and the alternative hypothesis, that the regression has explanatory power, is accepted. Otherwise, the null hypothesis of no explanatory power is accepted.

Second, for each explanatory variable of interest, one wants to know whether its estimated coefficient differs significantly from zero, that is, whether this particular explanatory variable in fact has explanatory power in predicting the response variable. Here the null hypothesis is that the true coefficient is zero. This hypothesis is tested by computing the coefficient's t-statistic, as the ratio of the coefficient estimate to its standard error. If the t-statistic is larger than a predetermined value, the null hypothesis is rejected and the variable is found to have explanatory power, with its coefficient significantly different from zero. Otherwise, the null hypothesis of a zero value of the true coefficient is accepted.

In addition, the Chow test is used to test whether two subsamples both have the same underlying true coefficient values. The sum of squared residuals of regressions on each of the subsets and on the combined data set are compared by computing an F-statistic; if this exceeds a critical value, the null hypothesis of no difference between the two subsets is rejected; otherwise, it is accepted.
#### Violations of assumptions

In a time series model, we require the stochastic process $\{x_i, y_i\}$ to be stationary and ergodic; if $\{x_i, y_i\}$ is nonstationary, OLS results are often biased unless $\{x_i, y_i\}$ is co-integrating.[33]

We still require the regressors to be strictly exogenous: $\operatorname{E}[x_i \varepsilon_i] = 0$ for all $i = 1, \dots, n$. If they are only predetermined, OLS is biased in finite samples.

Finally, the assumptions on the variance take the form of requiring that $\{x_i \varepsilon_i\}$ is a martingale difference sequence, with a finite matrix of second moments

$$Q_{xx\varepsilon^2} = \operatorname{E}\bigl[\, \varepsilon_i^2\, x_i x_i^T \,\bigr].$$
##### Constrained estimation

Suppose it is known that the coefficients in the regression satisfy a system of linear equations

$$A\colon\quad Q^T \beta = c,$$

where $Q$ is a $p \times q$ matrix of full rank, and $c$ is a $q \times 1$ vector of known constants, where $q < p$. In this case least squares estimation is equivalent to minimizing the sum of squared residuals of the model subject to the constraint $A$. The constrained least squares (CLS) estimator can be given by an explicit formula:[34]

$$\hat\beta^c = \hat\beta - (X^T X)^{-1} Q \bigl[\, Q^T (X^T X)^{-1} Q \,\bigr]^{-1} (Q^T \hat\beta - c).$$

This expression for the constrained estimator is valid as long as the matrix $X^T X$ is invertible. It was assumed from the beginning of this article that this matrix is of full rank, and it was noted that when the rank condition fails, $\beta$ will not be identifiable. However, it may happen that adding the restriction $A$ makes $\beta$ identifiable, in which case one would like to find the formula for the estimator. The estimator is equal to[35]

$$\hat\beta^c = R (R^T X^T X R)^{-1} R^T X^T y + \bigl( I_p - R (R^T X^T X R)^{-1} R^T X^T X \bigr) Q (Q^T Q)^{-1} c,$$

where $R$ is a $p \times (p - q)$ matrix such that the matrix $[Q\ R]$ is non-singular, and $R^T Q = 0$. Such a matrix can always be found, although generally it is not unique. The second formula coincides with the first in the case when $X^T X$ is invertible.[35]
### Example with real data

The following data set gives average heights and weights for American women aged 30–39 (source: The World Almanac and Book of Facts, 1975).

| Height (m) | 1.47 | 1.50 | 1.52 | 1.55 | 1.57 | 1.60 | 1.63 | 1.65 | 1.68 | 1.70 | 1.73 | 1.75 | 1.78 | 1.80 | 1.83 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Weight (kg) | 52.21 | 53.12 | 54.48 | 55.84 | 57.20 | 58.57 | 59.93 | 61.29 | 63.11 | 64.47 | 66.28 | 68.10 | 69.92 | 72.19 | 74.46 |

Figure: scatterplot of the data; the relationship is slightly curved but close to linear.
When only one dependent variable is being modeled, a scatterplot will suggest the form and strength of the relationship between the dependent variable and regressors. It might also reveal outliers, heteroscedasticity, and other aspects of the data that may complicate the interpretation of a fitted regression model. The scatterplot suggests that the relationship is strong and can be approximated as a quadratic function. OLS can handle non-linear relationships by introducing the regressor HEIGHT². The regression model then becomes a multiple linear model:

$$w_i = \beta_1 + \beta_2 h_i + \beta_3 h_i^2 + \varepsilon_i.$$

Figure: fitted regression.

The output from most popular statistical packages will look similar to this (method: least squares; dependent variable: WEIGHT; observations: 15):

| Parameter | Value | Std error | t-statistic | p-value |
|---|---|---|---|---|
| $\beta_1$ | 128.8128 | 16.3083 | 7.8986 | 0.0000 |
| $\beta_2$ | −143.1620 | 19.8332 | −7.2183 | 0.0000 |
| $\beta_3$ | 61.9603 | 6.0084 | 10.3122 | 0.0000 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| R² | 0.9989 | S.E. of regression | 0.2516 |
| Adjusted R² | 0.9987 | Model sum-of-sq. | 692.61 |
| Log-likelihood | 1.0890 | Residual sum-of-sq. | 0.7595 |
| Durbin–Watson stat. | 2.1013 | Total sum-of-sq. | 693.37 |
| Akaike criterion | 0.2548 | F-statistic | 5471.2 |
| Schwarz criterion | 0.3964 | p-value (F-stat) | 0.0000 |
In this table:

- The Value column gives the least squares estimates of the parameters $\beta_j$.
- The Std error column shows the standard errors of each coefficient estimate: $\hat\sigma_j = \sqrt{s^2 \bigl[(X^T X)^{-1}\bigr]_{jj}}$.
- The t-statistic and p-value columns test whether any of the coefficients might be equal to zero. The $t$-statistic is calculated simply as $t = \hat\beta_j / \hat\sigma_j$. If the errors $\varepsilon$ follow a normal distribution, $t$ follows a Student-t distribution. Under weaker conditions, $t$ is asymptotically normal. Large values of $t$ indicate that the null hypothesis can be rejected and that the corresponding coefficient is not zero. The p-value expresses the results of the hypothesis test as a significance level. Conventionally, p-values smaller than 0.05 are taken as evidence that the population coefficient is nonzero.
- R-squared is the coefficient of determination indicating goodness-of-fit of the regression. This statistic will be equal to one if fit is perfect, and to zero when regressors $X$ have no explanatory power whatsoever. This is a biased estimate of the population R-squared, and will never decrease if additional regressors are added, even if they are irrelevant.
- Adjusted R-squared is a slightly modified version of $R^2$, designed to penalize for the excess number of regressors which do not add to the explanatory power of the regression. This statistic is always smaller than $R^2$, can decrease as new regressors are added, and can even be negative for poorly fitting models: $\bar R^2 = 1 - \frac{n-1}{n-p}(1 - R^2)$.
- Log-likelihood is calculated under the assumption that errors follow a normal distribution. Even though the assumption is not very reasonable, this statistic may still find its use in conducting LR tests.
- Durbin–Watson statistic tests whether there is any evidence of serial correlation between the residuals. As a rule of thumb, a value smaller than 2 is evidence of positive correlation.
- Akaike information criterion and Schwarz criterion are both used for model selection. Generally, when comparing two alternative models, smaller values of one of these criteria will indicate a better model.[36]
- Standard error of regression is an estimate of $\sigma$, the standard error of the error term.
- Total sum of squares, model sum of squares, and residual sum of squares tell us how much of the initial variation in the sample was explained by the regression.
- F-statistic tests the hypothesis that all coefficients (except the intercept) are equal to zero. This statistic has an $F(p-1,\, n-p)$ distribution under the null hypothesis and normality assumption, and its p-value indicates the probability that the hypothesis is indeed true. Note that when errors are not normal this statistic becomes invalid, and other tests such as the Wald test or LR test should be used.
Figure: residuals plot.

Ordinary least squares analysis often includes the use of diagnostic plots designed to detect departures of the data from the assumed form of the model. These are some of the common diagnostic plots:

- Residuals against the explanatory variables in the model. A non-linear relation between these variables suggests that the linearity of the conditional mean function may not hold. Different levels of variability in the residuals for different levels of the explanatory variables suggests possible heteroscedasticity.
- Residuals against explanatory variables not in the model. Any relation of the residuals to these variables would suggest considering these variables for inclusion in the model.
- Residuals against the fitted values, $\hat y$.
- Residuals against the preceding residual. This plot may identify serial correlations in the residuals.

An important consideration when carrying out statistical inference using regression models is how the data were sampled. In this example, the data are averages rather than measurements on individual women. The fit of the model is very good, but this does not imply that the weight of an individual woman can be predicted with high accuracy based only on her height.
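The coefficients in the table above can be re-derived from the published data; a minimal NumPy sketch (variable names are ours), whose outputs should approximately match the tabulated values:

```python
import numpy as np

height = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                   1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                   63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

# Quadratic in height but linear in the parameters: design matrix [1, h, h^2]
X = np.column_stack([np.ones_like(height), height, height ** 2])
beta_hat, *_ = np.linalg.lstsq(X, weight, rcond=None)
print(beta_hat)   # approximately [128.8128, -143.1620, 61.9603]

resid = weight - X @ beta_hat
n, p = X.shape
print(resid @ resid)                     # residual sum of squares, ~0.7595
print(np.sqrt(resid @ resid / (n - p)))  # S.E. of regression, ~0.2516
```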
#### Sensitivity to rounding

This example also demonstrates that coefficients determined by these calculations are sensitive to how the data is prepared. The heights were originally given rounded to the nearest inch and have been converted and rounded to the nearest centimetre. Since the conversion factor is one inch to 2.54 cm, this is not an exact conversion. The original inches can be recovered by Round(x/0.0254) and then re-converted to metric without rounding. If this is done the results become:

| | Const | Height | Height² |
|---|---|---|---|
| Converted to metric with rounding | 128.8128 | −143.162 | 61.96033 |
| Converted to metric without rounding | 119.0205 | −131.5076 | 58.5046 |

Figure: residuals to a quadratic fit for correctly and incorrectly converted data.

Using either of these equations to predict the weight of a 5' 6" (1.6764 m) woman gives similar values: 62.94 kg with rounding vs. 62.98 kg without rounding. Thus a seemingly small variation in the data has a real effect on the coefficients but a small effect on the results of the equation.

While this may look innocuous in the middle of the data range, it could become significant at the extremes or in the case where the fitted model is used to project outside the data range (extrapolation).

This highlights a common error: this example is an abuse of OLS, which inherently requires that the errors in the independent variable (in this case height) are zero or at least negligible. The initial rounding to the nearest inch plus any actual measurement errors constitute a finite and non-negligible error. As a result, the fitted parameters are not the best estimates they are presumed to be. Though not totally spurious, the error in the estimation will depend upon the relative size of the $x$ and $y$ errors.
### Another example with less real data

#### Problem statement

We can use the least squares mechanism to figure out the equation of a two-body orbit in polar base coordinates. The equation typically used is

$$r(\theta) = \frac{p}{1 - e \cos\theta},$$

where $r(\theta)$ is the radius of how far the object is from one of the bodies. In the equation the parameters $p$ and $e$ are used to determine the path of the orbit. We have measured the following data:

| $\theta$ (in degrees) | 43 | 45 | 52 | 93 | 108 | 116 |
|---|---|---|---|---|---|---|
| $r(\theta)$ | 4.7126 | 4.5542 | 4.0419 | 2.2187 | 1.8910 | 1.7599 |

We need to find the least-squares approximation of $e$ and $p$ for the given data.

#### Solution

First we need to represent $e$ and $p$ in a linear form. So we are going to rewrite the equation $r(\theta)$ as

$$\frac{1}{r(\theta)} = \frac{1}{p} - \frac{e}{p} \cos\theta.$$

Furthermore, one could fit for apsides by expanding $\cos(\theta - \theta_0)$ with an extra parameter, which is linear in both $\cos\theta$ and in the extra basis function $\sin\theta$. We use the original two-parameter form to represent our observational data as

$$A \begin{bmatrix} 1/p \\ e/p \end{bmatrix} \approx b,$$

where $A$ contains the coefficients of $1/p$ in the first column, which are all 1, and the coefficients of $e/p$ in the second column, given by $-\cos\theta_i$; and $b$ is the vector of observed values $1/r(\theta_i)$. On solving we get the least-squares solution $1/p \approx 0.4348$ and $e/p \approx 0.3043$, so $p \approx 2.3000$ and $e \approx 0.7000$.
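The same solution can be reproduced numerically from the linearized model above; a short sketch:

```python
import numpy as np

theta = np.deg2rad(np.array([43, 45, 52, 93, 108, 116]))
r = np.array([4.7126, 4.5542, 4.0419, 2.2187, 1.8910, 1.7599])

# Linearized model: 1/r = (1/p) - (e/p) cos(theta)
A = np.column_stack([np.ones_like(theta), -np.cos(theta)])
b = 1.0 / r

w, *_ = np.linalg.lstsq(A, b, rcond=None)   # w = [1/p, e/p]
p_hat = 1.0 / w[0]
e_hat = w[1] * p_hat
print(p_hat, e_hat)   # approximately p = 2.3000, e = 0.7000
```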
### See also

- Bayesian least squares
- Fama–MacBeth regression
- Nonlinear least squares
- Numerical methods for linear least squares
- Nonlinear system identification
### References

1. "The Origins of Ordinary Least Squares Assumptions". Feature Column. 2022-03-01. Retrieved 2024-05-16.
2. "What is a complete list of the usual assumptions for linear regression?". Cross Validated. Retrieved 2022-09-28.
3. Goldberger, Arthur S. (1964). "Classical Linear Regression". Econometric Theory. New York: John Wiley & Sons. p. 158. ISBN 0-471-31101-4.
4. Hayashi, Fumio (2000). Econometrics. Princeton University Press. p. 15. ISBN 9780691010182.
5. Hayashi (2000, page 18).
6. Ghilani, Charles D.; Wolf, Paul R. (12 June 2006). Adjustment Computations: Spatial Data Analysis. John Wiley & Sons. ISBN 9780471697282.
7. Hofmann-Wellenhof, Bernhard; Lichtenegger, Herbert; Wasle, Elmar (20 November 2007). GNSS – Global Navigation Satellite Systems: GPS, GLONASS, Galileo, and more. Springer. ISBN 9783211730171.
8. Xu, Guochang (5 October 2007). GPS: Theory, Algorithms and Applications. Springer. ISBN 9783540727156.
9. Hayashi (2000, page 19).
10. Hoaglin, David C.; Welsch, Roy E. (1978). "The Hat Matrix in Regression and ANOVA". The American Statistician. 32 (1): 17–22. doi:10.1080/00031305.1978.10479237. hdl:1721.1/1920. ISSN 0003-1305.
11. Julian Faraway (2000), Practical Regression and Anova using R.
12. Kenney, J.; Keeping, E. S. (1963). Mathematics of Statistics. van Nostrand. p. 187.
13. Zwillinger, Daniel (1995). Standard Mathematical Tables and Formulae. Chapman&Hall/CRC. p. 626. ISBN 0-8493-2479-3.
14. Hayashi (2000, page 20).
15. Akbarzadeh, Vahab (7 May 2014). "Line Estimation".
16. Hayashi (2000, page 49).
17. Hayashi (2000, page 52).
18. Hayashi (2000, page 10).
19. Tibshirani, Robert (1996). "Regression Shrinkage and Selection via the Lasso". Journal of the Royal Statistical Society, Series B. 58 (1): 267–288. doi:10.1111/j.2517-6161.1996.tb02080.x. JSTOR 2346178.
20. Efron, Bradley; Hastie, Trevor; Johnstone, Iain; Tibshirani, Robert (2004). "Least Angle Regression". The Annals of Statistics. 32 (2): 407–451. arXiv:math/0406456. doi:10.1214/009053604000000067. JSTOR 3448465. S2CID 204004121.
21. Hawkins, Douglas M. (1973). "On the Investigation of Alternative Regressions by Principal Component Analysis". Journal of the Royal Statistical Society, Series C. 22 (3): 275–286. doi:10.2307/2346776. JSTOR 2346776.
22. Jolliffe, Ian T. (1982). "A Note on the Use of Principal Components in Regression". Journal of the Royal Statistical Society, Series C. 31 (3): 300–303. doi:10.2307/2348005. JSTOR 2348005.
23. Hayashi (2000, pages 27, 30).
24. Hayashi (2000, page 27).
25. Amemiya, Takeshi (1985). Advanced Econometrics. Harvard University Press. p. 13. ISBN 9780674005600.
26. Amemiya (1985, page 14).
27. Rao, C. R. (1973). Linear Statistical Inference and its Applications (Second ed.). New York: J. Wiley & Sons. p. 319. ISBN 0-471-70823-2.
28. Amemiya (1985, page 20).
29. Amemiya (1985, page 27).
30. Davidson, Russell; MacKinnon, James G. (1993). Estimation and Inference in Econometrics. New York: Oxford University Press. p. 33. ISBN 0-19-506011-3.
31. Davidson & MacKinnon (1993, page 36).
32. Davidson & MacKinnon (1993, page 20).
33. "Memento on EViews Output" (PDF). Retrieved 28 December 2020.
34. Amemiya (1985, page 21).
35. Amemiya (1985, page 22).
36. Burnham, Kenneth P.; Anderson, David R. (2002). Model Selection and Multi-Model Inference (2nd ed.). Springer. ISBN 0-387-95364-7.
### Further reading

- Dougherty, Christopher (2002). Introduction to Econometrics (2nd ed.). New York: Oxford University Press. pp. 48–113. ISBN 0-19-877643-8.
- Gujarati, Damodar N.; Porter, Dawn C. (2009). Basic Econometrics (5th ed.). Boston: McGraw-Hill Irwin. pp. 55–96. ISBN 978-0-07-337577-9.
- Heij, Christiaan; Boer, Paul; Franses, Philip H.; Kloek, Teun; van Dijk, Herman K. (2004). Econometric Methods with Applications in Business and Economics (1st ed.). Oxford: Oxford University Press. pp. 76–115. ISBN 978-0-19-926801-6.
- Hill, R. Carter; Griffiths, William E.; Lim, Guay C. (2008). Principles of Econometrics (3rd ed.). Hoboken, NJ: John Wiley & Sons. pp. 8–47. ISBN 978-0-471-72360-8.
- Wooldridge, Jeffrey (2008). "The Simple Regression Model". Introductory Econometrics: A Modern Approach (4th ed.). Mason, OH: Cengage Learning. pp. 22–67. ISBN 978-0-324-58162-1.
| Markdown | [Jump to content](https://en.wikipedia.org/wiki/Ordinary_least_squares#bodyContent)
Main menu
Main menu
move to sidebar
hide
Navigation
- [Main page](https://en.wikipedia.org/wiki/Main_Page "Visit the main page [z]")
- [Contents](https://en.wikipedia.org/wiki/Wikipedia:Contents "Guides to browsing Wikipedia")
- [Current events](https://en.wikipedia.org/wiki/Portal:Current_events "Articles related to current events")
- [Random article](https://en.wikipedia.org/wiki/Special:Random "Visit a randomly selected article [x]")
- [About Wikipedia](https://en.wikipedia.org/wiki/Wikipedia:About "Learn about Wikipedia and how it works")
- [Contact us](https://en.wikipedia.org/wiki/Wikipedia:Contact_us "How to contact Wikipedia")
Contribute
- [Help](https://en.wikipedia.org/wiki/Help:Contents "Guidance on how to use and edit Wikipedia")
- [Learn to edit](https://en.wikipedia.org/wiki/Help:Introduction "Learn how to edit Wikipedia")
- [Community portal](https://en.wikipedia.org/wiki/Wikipedia:Community_portal "The hub for editors")
- [Recent changes](https://en.wikipedia.org/wiki/Special:RecentChanges "A list of recent changes to Wikipedia [r]")
- [Upload file](https://en.wikipedia.org/wiki/Wikipedia:File_upload_wizard "Add images or other media for use on Wikipedia")
- [Special pages](https://en.wikipedia.org/wiki/Special:SpecialPages "A list of all special pages [q]")
[  ](https://en.wikipedia.org/wiki/Main_Page)
[Search](https://en.wikipedia.org/wiki/Special:Search "Search Wikipedia [f]")
Appearance
- [Donate](https://donate.wikimedia.org/?wmf_source=donate&wmf_medium=sidebar&wmf_campaign=en.wikipedia.org&uselang=en)
- [Create account](https://en.wikipedia.org/w/index.php?title=Special:CreateAccount&returnto=Ordinary+least+squares "You are encouraged to create an account and log in; however, it is not mandatory")
- [Log in](https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Ordinary+least+squares "You're encouraged to log in; however, it's not mandatory. [o]")
Personal tools
- [Donate](https://donate.wikimedia.org/?wmf_source=donate&wmf_medium=sidebar&wmf_campaign=en.wikipedia.org&uselang=en)
- [Create account](https://en.wikipedia.org/w/index.php?title=Special:CreateAccount&returnto=Ordinary+least+squares "You are encouraged to create an account and log in; however, it is not mandatory")
- [Log in](https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Ordinary+least+squares "You're encouraged to log in; however, it's not mandatory. [o]")
## Contents
move to sidebar
hide
- [(Top)](https://en.wikipedia.org/wiki/Ordinary_least_squares)
- [1 Linear model](https://en.wikipedia.org/wiki/Ordinary_least_squares#Linear_model)
Toggle Linear model subsection
- [1\.1 Matrix/vector formulation](https://en.wikipedia.org/wiki/Ordinary_least_squares#Matrix/vector_formulation)
- [2 Estimation](https://en.wikipedia.org/wiki/Ordinary_least_squares#Estimation)
- [3 Prediction](https://en.wikipedia.org/wiki/Ordinary_least_squares#Prediction)
- [4 Sample statistics](https://en.wikipedia.org/wiki/Ordinary_least_squares#Sample_statistics)
Toggle Sample statistics subsection
- [4\.1 Simple linear regression model](https://en.wikipedia.org/wiki/Ordinary_least_squares#Simple_linear_regression_model)
- [5 Alternative derivations](https://en.wikipedia.org/wiki/Ordinary_least_squares#Alternative_derivations)
Toggle Alternative derivations subsection
- [5\.1 Projection](https://en.wikipedia.org/wiki/Ordinary_least_squares#Projection)
- [5\.2 Maximum likelihood](https://en.wikipedia.org/wiki/Ordinary_least_squares#Maximum_likelihood)
- [5\.3 Generalized method of moments](https://en.wikipedia.org/wiki/Ordinary_least_squares#Generalized_method_of_moments)
- [6 Assumptions](https://en.wikipedia.org/wiki/Ordinary_least_squares#Assumptions)
- [7 Properties](https://en.wikipedia.org/wiki/Ordinary_least_squares#Properties)
Toggle Properties subsection
- [7\.1 Finite sample properties](https://en.wikipedia.org/wiki/Ordinary_least_squares#Finite_sample_properties)
- [7\.1.1 Assuming normality](https://en.wikipedia.org/wiki/Ordinary_least_squares#Assuming_normality)
- [7\.1.2 Influential observations](https://en.wikipedia.org/wiki/Ordinary_least_squares#Influential_observations)
- [7\.1.3 Partitioned regression](https://en.wikipedia.org/wiki/Ordinary_least_squares#Partitioned_regression)
- [7\.2 Large sample properties](https://en.wikipedia.org/wiki/Ordinary_least_squares#Large_sample_properties)
- [7\.2.1 Inference](https://en.wikipedia.org/wiki/Ordinary_least_squares#Inference)
- [7\.2.2 Hypothesis testing](https://en.wikipedia.org/wiki/Ordinary_least_squares#Hypothesis_testing)
- [7\.3 Violations of assumptions](https://en.wikipedia.org/wiki/Ordinary_least_squares#Violations_of_assumptions)
- [7\.3.1 Time series model](https://en.wikipedia.org/wiki/Ordinary_least_squares#Time_series_model)
- [7\.3.2 Constrained estimation](https://en.wikipedia.org/wiki/Ordinary_least_squares#Constrained_estimation)
- [8 Example with real data](https://en.wikipedia.org/wiki/Ordinary_least_squares#Example_with_real_data)
Toggle Example with real data subsection
- [8\.1 Sensitivity to rounding](https://en.wikipedia.org/wiki/Ordinary_least_squares#Sensitivity_to_rounding)
- [9 Another example with less real data](https://en.wikipedia.org/wiki/Ordinary_least_squares#Another_example_with_less_real_data)
Toggle Another example with less real data subsection
- [9\.1 Problem statement](https://en.wikipedia.org/wiki/Ordinary_least_squares#Problem_statement)
- [9\.2 Solution](https://en.wikipedia.org/wiki/Ordinary_least_squares#Solution)
- [10 See also](https://en.wikipedia.org/wiki/Ordinary_least_squares#See_also)
- [11 References](https://en.wikipedia.org/wiki/Ordinary_least_squares#References)
- [12 Further reading](https://en.wikipedia.org/wiki/Ordinary_least_squares#Further_reading)
Toggle the table of contents
# Ordinary least squares
Method for estimating the unknown parameters in a linear regression model
Figure: [Okun's law](https://en.wikipedia.org/wiki/Okun%27s_law "Okun's law") in [macroeconomics](https://en.wikipedia.org/wiki/Macroeconomics "Macroeconomics") states that in an economy the [GDP](https://en.wikipedia.org/wiki/GDP "GDP") growth should depend linearly on the changes in the unemployment rate. Here the ordinary least squares method is used to construct the regression line describing this law.
In [statistics](https://en.wikipedia.org/wiki/Statistics "Statistics"), **ordinary least squares** (**OLS**) is a type of [linear least squares](https://en.wikipedia.org/wiki/Linear_least_squares "Linear least squares") method for choosing the unknown [parameters](https://en.wikipedia.org/wiki/Statistical_parameter "Statistical parameter") in a [linear regression](https://en.wikipedia.org/wiki/Linear_regression "Linear regression") model (with fixed level-one\[*[clarification needed](https://en.wikipedia.org/wiki/Wikipedia:Please_clarify "Wikipedia:Please clarify")*\] effects of a [linear function](https://en.wikipedia.org/wiki/Linear_function "Linear function") of a set of [explanatory variables](https://en.wikipedia.org/wiki/Explanatory_variable "Explanatory variable")) by the principle of [least squares](https://en.wikipedia.org/wiki/Least_squares "Least squares"): minimizing the sum of the squares of the differences between the observed [dependent variable](https://en.wikipedia.org/wiki/Dependent_variable "Dependent variable") (values of the variable being observed) in the input [dataset](https://en.wikipedia.org/wiki/Dataset "Dataset") and the output of the (linear) function of the [independent variable](https://en.wikipedia.org/wiki/Independent_variable "Independent variable"). Some sources consider OLS to be linear regression.[\[1\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-1)
Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surfaceāthe smaller the differences, the better the model fits the data. The resulting [estimator](https://en.wikipedia.org/wiki/Statistical_estimation "Statistical estimation") can be expressed by a simple formula, especially in the case of a [simple linear regression](https://en.wikipedia.org/wiki/Simple_linear_regression "Simple linear regression"), in which there is a single [regressor](https://en.wikipedia.org/wiki/Regressor "Regressor") on the right side of the regression equation.
The OLS estimator is [consistent](https://en.wikipedia.org/wiki/Consistent_estimator "Consistent estimator") for the level-one fixed effects when the regressors are [exogenous](https://en.wikipedia.org/wiki/Exogenous "Exogenous") and there is no perfect [collinearity](https://en.wikipedia.org/wiki/Collinearity "Collinearity") (rank condition), and consistent for the variance estimate of the residuals when the regressors have finite fourth moments.[\[2\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-2) By the [GaussāMarkov theorem](https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem "GaussāMarkov theorem"), it is [optimal in the class of linear unbiased estimators](https://en.wikipedia.org/wiki/Best_linear_unbiased_estimator "Best linear unbiased estimator") when the [errors](https://en.wikipedia.org/wiki/Statistical_error "Statistical error") are [homoscedastic](https://en.wikipedia.org/wiki/Homoscedastic "Homoscedastic") and [serially uncorrelated](https://en.wikipedia.org/wiki/Autocorrelation "Autocorrelation"). Under these conditions, the method of OLS provides [minimum-variance mean-unbiased](https://en.wikipedia.org/wiki/UMVU "UMVU") estimation when the errors have finite [variances](https://en.wikipedia.org/wiki/Variance "Variance"). Under the additional assumption that the errors are [normally distributed](https://en.wikipedia.org/wiki/Normal_distribution "Normal distribution") with zero mean, OLS coincides with the [maximum likelihood estimator](https://en.wikipedia.org/wiki/Maximum_likelihood_estimator "Maximum likelihood estimator") and outperforms any non-linear unbiased estimator.
## Linear model
Main article: [Linear regression model](https://en.wikipedia.org/wiki/Linear_regression_model "Linear regression model")
Suppose the data consists of $n$ [observations](https://en.wikipedia.org/wiki/Statistical_unit "Statistical unit") $\left\{\mathbf{x}_i, y_i\right\}_{i=1}^{n}$. Each observation $i$ includes a scalar response $y_i$ and a column vector $\mathbf{x}_i$ of $p$ regressors, i.e., $\mathbf{x}_i = \left[x_{i1}, x_{i2}, \dots, x_{ip}\right]^{\operatorname{T}}$. In a [linear regression model](https://en.wikipedia.org/wiki/Linear_regression_model "Linear regression model"), the response variable, $y_i$, is a linear function of the regressors:
$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i,$$
or in [vector](https://en.wikipedia.org/wiki/Row_and_column_vectors "Row and column vectors") form,
$$y_i = \mathbf{x}_i^{\operatorname{T}} \boldsymbol{\beta} + \varepsilon_i,$$
where $\mathbf{x}_i$, as introduced previously, is a column vector of the $i$-th observation of all the explanatory variables; $\boldsymbol{\beta}$ is a $p \times 1$ vector of unknown parameters; and the scalar $\varepsilon_i$ represents unobserved random variables ([errors](https://en.wikipedia.org/wiki/Errors_and_residuals_in_statistics "Errors and residuals in statistics")) of the $i$-th observation. $\varepsilon_i$ accounts for the influences upon the responses $y_i$ from sources other than the explanatory variables $\mathbf{x}_i$. This model can also be written in matrix notation as
$$\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon},$$
where $\mathbf{y}$ and $\boldsymbol{\varepsilon}$ are $n \times 1$ vectors of the response variables and the errors of the $n$ observations, and $\mathbf{X}$ is an $n \times p$ matrix of regressors, also sometimes called the [design matrix](https://en.wikipedia.org/wiki/Design_matrix "Design matrix"), whose row $i$ is $\mathbf{x}_i^{\operatorname{T}}$ and contains the $i$-th observations on all the explanatory variables.
Typically, a constant term is included in the set of regressors $\mathbf{X}$, say, by taking $x_{i1} = 1$ for all $i = 1, \dots, n$. The coefficient $\beta_1$ corresponding to this regressor is called the *intercept*. Without the intercept, the fitted line is forced to cross the origin when $\mathbf{x}_i = \vec{0}$.
Regressors do not have to be independent for estimation to be consistent: for example, they may be non-linearly dependent. Short of perfect multicollinearity, parameter estimates may still be consistent; however, as multicollinearity rises, the standard errors around such estimates increase and their precision falls. When there is perfect multicollinearity, it is no longer possible to obtain unique estimates for the coefficients of the related regressors; estimation for these parameters cannot converge (thus, it cannot be consistent).
As a concrete example where regressors are non-linearly dependent yet estimation may still be consistent, we might suspect the response depends linearly both on a value and on its square; in that case we would include one regressor whose value is the square of another regressor. The model would then be *quadratic* in the second regressor, but it is nonetheless still considered a *linear* model because the model *is* still linear in the parameters ($\boldsymbol{\beta}$).
### Matrix/vector formulation
Consider an [overdetermined system](https://en.wikipedia.org/wiki/Overdetermined_system "Overdetermined system")
$$\sum_{j=1}^{p} x_{ij} \beta_j = y_i, \qquad (i = 1, 2, \dots, n),$$
of $n$ [linear equations](https://en.wikipedia.org/wiki/Linear_equation "Linear equation") in $p$ unknown [coefficients](https://en.wikipedia.org/wiki/Coefficients "Coefficients"), $\beta_1, \beta_2, \dots, \beta_p$, with $n > p$. This can be written in [matrix](https://en.wikipedia.org/wiki/Matrix_\(mathematics\) "Matrix (mathematics)") form as
$$\mathbf{X} \boldsymbol{\beta} = \mathbf{y},$$
where
$$\mathbf{X} = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix}, \qquad \boldsymbol{\beta} = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.$$
(Note: for a linear model as above, not all elements of $\mathbf{X}$ contain information on the data points. The first column is populated with ones, $X_{i1} = 1$; only the other columns contain actual data. So here $p$ is equal to the number of regressors plus one.)
Such a system usually has no exact solution, so the goal is instead to find the coefficients $\boldsymbol{\beta}$ which fit the equations "best", in the sense of solving the [quadratic](https://en.wikipedia.org/wiki/Quadratic_form_\(statistics\) "Quadratic form (statistics)") [minimization](https://en.wikipedia.org/wiki/Mathematical_optimization "Mathematical optimization") problem
$$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\operatorname{arg\,min}}\, S(\boldsymbol{\beta}),$$
where the objective function $S$ is given by
$$S(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left| y_i - \sum_{j=1}^{p} X_{ij} \beta_j \right|^2 = \left\| \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right\|^2.$$
A justification for choosing this criterion is given in [Properties](https://en.wikipedia.org/wiki/Ordinary_least_squares#Properties) below. This minimization problem has a unique solution, provided that the $p$ columns of the matrix $\mathbf{X}$ are [linearly independent](https://en.wikipedia.org/wiki/Linearly_independent "Linearly independent"), given by solving the so-called *normal equations*:
$$\left(\mathbf{X}^{\operatorname{T}} \mathbf{X}\right) \hat{\boldsymbol{\beta}} = \mathbf{X}^{\operatorname{T}} \mathbf{y}.$$
The matrix $\mathbf{X}^{\operatorname{T}} \mathbf{X}$ is known as the *normal matrix* or [Gram matrix](https://en.wikipedia.org/wiki/Gram_matrix "Gram matrix"), and the matrix $\mathbf{X}^{\operatorname{T}} \mathbf{y}$ is known as the [moment matrix](https://en.wikipedia.org/wiki/Moment_matrix "Moment matrix") of regressand by regressors.[\[3\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-3) Finally, $\hat{\boldsymbol{\beta}}$ is the coefficient vector of the least-squares [hyperplane](https://en.wikipedia.org/wiki/Hyperplane "Hyperplane"), expressed as
$$\hat{\boldsymbol{\beta}} = \left(\mathbf{X}^{\top} \mathbf{X}\right)^{-1} \mathbf{X}^{\top} \mathbf{y},$$
or
$$\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + \left(\mathbf{X}^{\top} \mathbf{X}\right)^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon}.$$
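For illustration, the closed-form solution above can be computed directly; the following is a minimal NumPy sketch on synthetic data (the design matrix, true coefficients, and noise level are arbitrary assumptions), not a recommendation to form $(\mathbf{X}^{\top}\mathbf{X})^{-1}$ in production code, where a least-squares solver is numerically safer.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3                         # observations and regressors (arbitrary)
X = np.column_stack([np.ones(n),      # intercept column of ones
                     rng.normal(size=(n, p - 1))])
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Solve the normal equations (X'X) beta_hat = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferable equivalent: a direct least-squares solve
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
print(beta_hat)   # close to beta_true
```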

## Estimation
Suppose *b* is a "candidate" value for the parameter vector *β*. The quantity $y_i - \mathbf{x}_i^{\operatorname{T}} b$, called the *[residual](https://en.wikipedia.org/wiki/Errors_and_residuals_in_statistics "Errors and residuals in statistics")* for the *i*\-th observation, measures the vertical distance between the data point $(\mathbf{x}_i, y_i)$ and the hyperplane $y = \mathbf{x}^{\operatorname{T}} b$, and thus assesses the degree of fit between the actual data and the model. The *[sum of squared residuals](https://en.wikipedia.org/wiki/Sum_of_squared_residuals "Sum of squared residuals")* (*SSR*) (also called the *error sum of squares* (*ESS*) or *residual sum of squares* (*RSS*))[\[4\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-4) is a measure of the overall model fit:
$$S(b) = \sum_{i=1}^{n} \left(y_i - x_i^{\operatorname{T}} b\right)^2 = (y - Xb)^{\operatorname{T}} (y - Xb),$$
where $\operatorname{T}$ denotes the matrix [transpose](https://en.wikipedia.org/wiki/Transpose "Transpose"), and the rows of *X*, denoting the values of all the independent variables associated with a particular value of the dependent variable, are $X_i = x_i^{\operatorname{T}}$. The value of *b* which minimizes this sum is called the **OLS estimator for *β***. The function $S(b)$ is quadratic in *b* with positive-definite [Hessian](https://en.wikipedia.org/wiki/Hessian_matrix "Hessian matrix"), and therefore this function possesses a unique global minimum at $b = \hat{\beta}$, which can be given by the explicit formula[\[5\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-5)[\[proof\]](https://en.wikipedia.org/wiki/Proofs_involving_ordinary_least_squares#Least_squares_estimator_for_.CE.B2 "Proofs involving ordinary least squares")
$$\hat{\beta} = \underset{b \in \mathbb{R}^p}{\operatorname{argmin}}\, S(b) = (X^{\operatorname{T}} X)^{-1} X^{\operatorname{T}} y.$$
The product $N = X^{\operatorname{T}} X$ is a [Gram matrix](https://en.wikipedia.org/wiki/Gram_matrix "Gram matrix"), and its inverse, $Q = N^{-1}$, is the *cofactor matrix* of *β*,[\[6\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-6)[\[7\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-7)[\[8\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-8) closely related to its [covariance matrix](https://en.wikipedia.org/wiki/Ordinary_least_squares#Covariance_matrix), $C_\beta$. The matrix $(X^{\operatorname{T}} X)^{-1} X^{\operatorname{T}} = Q X^{\operatorname{T}}$ is called the [MooreāPenrose pseudoinverse](https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_pseudoinverse "MooreāPenrose pseudoinverse") matrix of *X*. This formulation highlights the point that estimation can be carried out if, and only if, there is no perfect [multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity "Multicollinearity") between the explanatory variables (which would cause the Gram matrix to have no inverse).
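The pseudoinverse connection can be checked numerically; this short sketch (synthetic full-rank data assumed) verifies that the explicit formula $(X^{\operatorname{T}}X)^{-1}X^{\operatorname{T}}$ agrees with a general-purpose pseudoinverse routine when *X* has full column rank.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])

pinv_explicit = np.linalg.inv(X.T @ X) @ X.T   # (X'X)^{-1} X'
pinv_numpy = np.linalg.pinv(X)                 # MooreāPenrose pseudoinverse

# The two agree when X has full column rank (no perfect multicollinearity)
assert np.allclose(pinv_explicit, pinv_numpy)
```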
## Prediction
After we have estimated *β*, the *[fitted values](https://en.wikipedia.org/wiki/Fitted_value "Fitted value")* (or *predicted values*) from the regression will be
$$\hat{y} = X \hat{\beta} = P y,$$
where $P = X (X^{\operatorname{T}} X)^{-1} X^{\operatorname{T}}$ is the *[projection matrix](https://en.wikipedia.org/wiki/Projection_matrix "Projection matrix")* onto the space *V* spanned by the columns of *X*. This matrix *P* is also sometimes called the *[hat matrix](https://en.wikipedia.org/wiki/Hat_matrix "Hat matrix")*, because it "puts a hat" onto the variable *y*. Another matrix, closely related to *P*, is the *annihilator* matrix $M = I_n - P$; this is a projection matrix onto the space orthogonal to *V*. Both matrices *P* and *M* are [symmetric](https://en.wikipedia.org/wiki/Symmetric_matrix "Symmetric matrix") and [idempotent](https://en.wikipedia.org/wiki/Idempotent_matrix "Idempotent matrix") (meaning that $P^2 = P$ and $M^2 = M$), and relate to the data matrix *X* via the identities $PX = X$ and $MX = 0$.[\[9\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Hayashi_2000_loc=page_19-9) Matrix *M* creates the *residuals* from the regression:
$$\hat{\varepsilon} = y - \hat{y} = y - X\hat{\beta} = My = M(X\beta + \varepsilon) = (MX)\beta + M\varepsilon = M\varepsilon.$$
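This algebra is easy to verify numerically. The following sketch (NumPy, synthetic data assumed) builds *P* and *M* explicitly and checks idempotency, $PX = X$, $MX = 0$, and that $My$ reproduces the residuals:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection ("hat") matrix
M = np.eye(n) - P                      # annihilator matrix

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat

assert np.allclose(P @ P, P) and np.allclose(M @ M, M)   # idempotent
assert np.allclose(P @ X, X) and np.allclose(M @ X, 0)   # PX = X, MX = 0
assert np.allclose(M @ y, residuals)                     # My gives residuals
```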

The variances of the predicted values, $s_{\hat{y}_i}^2$, are found in the main diagonal of the [variance-covariance matrix](https://en.wikipedia.org/wiki/Variance-covariance_matrix "Variance-covariance matrix") of predicted values:
$$C_{\hat{y}} = s^2 P,$$
where *P* is the projection matrix and $s^2$ is the sample variance.[\[10\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-q011-10) The full matrix is very large; its diagonal elements can be calculated individually as:
$$s_{\hat{y}_i}^2 = s^2 X_i (X^{\operatorname{T}} X)^{-1} X_i^{\operatorname{T}},$$
where $X_i$ is the *i*\-th row of matrix *X*.
## Sample statistics
Using these residuals we can estimate the sample variance $s^2$ using the *[reduced chi-squared](https://en.wikipedia.org/wiki/Reduced_chi-squared "Reduced chi-squared")* statistic:
$$s^2 = \frac{\hat{\varepsilon}^{\operatorname{T}} \hat{\varepsilon}}{n - p} = \frac{(My)^{\operatorname{T}} My}{n - p} = \frac{y^{\operatorname{T}} M^{\operatorname{T}} M y}{n - p} = \frac{y^{\operatorname{T}} M y}{n - p} = \frac{S(\hat{\beta})}{n - p}, \qquad \hat{\sigma}^2 = \frac{n - p}{n}\, s^2$$
The denominator, $n - p$, is the [statistical degrees of freedom](https://en.wikipedia.org/wiki/Degrees_of_freedom_\(statistics\) "Degrees of freedom (statistics)"). The first quantity, $s^2$, is the OLS estimate for $\sigma^2$, whereas the second, $\hat{\sigma}^2$, is the MLE estimate for $\sigma^2$. The two estimators are quite similar in large samples; the first estimator is always [unbiased](https://en.wikipedia.org/wiki/Estimator_bias "Estimator bias"), while the second estimator is biased but has a smaller [mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error "Mean squared error"). In practice $s^2$ is used more often, since it is more convenient for hypothesis testing. The square root of $s^2$ is called the *[regression standard error](https://en.wikipedia.org/wiki/Regression_standard_error "Regression standard error")*,[\[11\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-11) *standard error of the regression*,[\[12\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-12)[\[13\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-13) or *standard error of the equation*.[\[9\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Hayashi_2000_loc=page_19-9)
It is common to assess the goodness-of-fit of the OLS regression by comparing how much the initial variation in the sample can be reduced by regressing onto *X*. The *[coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination "Coefficient of determination")* $R^2$ is defined as the ratio of the "explained" variance to the "total" variance of the dependent variable *y*, in the cases where the total sum of squares decomposes into the regression and residual sums of squares:[\[14\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-14)
$$R^2 = \frac{\sum (\hat{y}_i - \overline{y})^2}{\sum (y_i - \overline{y})^2} = \frac{y^{\operatorname{T}} P^{\operatorname{T}} L P y}{y^{\operatorname{T}} L y} = 1 - \frac{y^{\operatorname{T}} M y}{y^{\operatorname{T}} L y} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$
where TSS is the *[total sum of squares](https://en.wikipedia.org/wiki/Total_sum_of_squares "Total sum of squares")* for the dependent variable, $L = I_n - \frac{1}{n} J_n$, and $J_n$ is an $n \times n$ matrix of ones. ($L$ is a [centering matrix](https://en.wikipedia.org/wiki/Centering_matrix "Centering matrix") which is equivalent to regression on a constant; it simply subtracts the mean from a variable.) In order for $R^2$ to be meaningful, the matrix *X* of data on regressors must contain a column vector of ones to represent the constant whose coefficient is the regression intercept. In that case, $R^2$ will always be a number between 0 and 1, with values close to 1 indicating a good degree of fit.
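Continuing the same kind of numerical sketch (synthetic data and variable names are assumptions), $s^2$, $\hat\sigma^2$, and $R^2$ follow directly from the residuals:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

s2 = resid @ resid / (n - p)        # unbiased OLS estimate of sigma^2
sigma2_mle = resid @ resid / n      # biased MLE estimate

rss = resid @ resid                 # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
r2 = 1 - rss / tss                  # meaningful because X has an intercept
print(s2, sigma2_mle, r2)
```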
### Simple linear regression model
Main article: [Simple linear regression](https://en.wikipedia.org/wiki/Simple_linear_regression "Simple linear regression")
If the data matrix *X* contains only two variables, a constant and a scalar regressor $x_i$, then this is called the "simple regression model". This case is often considered in beginner statistics classes, as it provides much simpler formulas, suitable even for manual calculation. The parameters are commonly denoted as (*α*, *β*):
$$y_i = \alpha + \beta x_i + \varepsilon_i.$$
The least squares estimates in this case are given by simple formulas
$$\begin{aligned} \widehat{\beta} &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\ \widehat{\alpha} &= \bar{y} - \widehat{\beta}\, \bar{x}\,, \end{aligned}$$
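These formulas translate into a few lines of code; here is a minimal sketch on made-up data (the numbers are arbitrary assumptions chosen so the true slope is about 2 and the intercept about 0):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # toy regressor values (assumption)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # toy responses (assumption)

beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()
print(alpha_hat, beta_hat)   # close to intercept 0 and slope 2 for this toy set
```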
## Alternative derivations
In the previous section the least squares estimator $\hat{\beta}$ was obtained as a value that minimizes the sum of squared residuals of the model. However, it is also possible to derive the same estimator from other approaches. In all cases the formula for the OLS estimator remains the same: $\hat{\beta} = (X^{\operatorname{T}} X)^{-1} X^{\operatorname{T}} y$; the only difference is in how we interpret this result.
### Projection
Figure: OLS estimation can be viewed as a projection onto the linear space spanned by the regressors. (Here each of $X_1$ and $X_2$ refers to a column of the data matrix.)
Figure: Least squares as projection of *y* onto col(*X*) for three observations; $\hat{y} = X\hat{\beta}$ gives the fitted values and $y - \hat{y}$ is the residual.
For mathematicians, OLS is an approximate solution to an overdetermined system of linear equations $X\boldsymbol{\beta} \approx \mathbf{y}$, where *β* is the unknown. Assuming the system cannot be solved exactly (the number of equations *n* being much larger than the number of unknowns *p*), we are looking for a solution that provides the smallest discrepancy between the right- and left-hand sides. In other words, we are looking for the solution that satisfies
$$\hat{\beta} = \operatorname{arg}\min_{\beta}\, \lVert \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \rVert^2,$$
where $\lVert \cdot \rVert$ is the standard [*L*2 norm](https://en.wikipedia.org/wiki/Norm_\(mathematics\)#Euclidean_norm "Norm (mathematics)") in the *n*\-dimensional [Euclidean space](https://en.wikipedia.org/wiki/Euclidean_space "Euclidean space") $\mathbb{R}^n$. The predicted quantity $X\boldsymbol{\beta}$ is just a certain linear combination of the vectors of regressors. Thus, the residual vector $\mathbf{y} - X\boldsymbol{\beta}$ will have the smallest length when $\mathbf{y}$ is [projected orthogonally](https://en.wikipedia.org/wiki/Projection_\(linear_algebra\) "Projection (linear algebra)") onto the [linear subspace](https://en.wikipedia.org/wiki/Linear_subspace "Linear subspace") [spanned](https://en.wikipedia.org/wiki/Linear_span "Linear span") by the columns of *X*. The OLS estimator $\hat{\beta}$ in this case can be interpreted as the coefficients of the [vector decomposition](https://en.wikipedia.org/wiki/Vector_decomposition "Vector decomposition") of $\hat{y} = Py$ along the basis of *X*.
In other words, the gradient equations at the minimum can be written as:
$$(\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}})^{\top} \mathbf{X} = 0.$$
A geometrical interpretation of these equations is that the vector of residuals, $\mathbf{y} - X\hat{\boldsymbol{\beta}}$, is orthogonal to the [column space](https://en.wikipedia.org/wiki/Column_space "Column space") of *X*, since the [dot product](https://en.wikipedia.org/wiki/Dot_product "Dot product") $(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \cdot \mathbf{X}\mathbf{v}$ is equal to zero for *any* conformable vector $\mathbf{v}$. This means that $\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ is the shortest of all possible vectors $\mathbf{y} - \mathbf{X}\boldsymbol{\beta}$; that is, the variance of the residuals is the minimum possible. This is illustrated at the right.
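A quick numerical confirmation of this orthogonality (a sketch; the data are synthetic) is that $\mathbf{X}^{\operatorname{T}}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})$ vanishes up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
y = rng.normal(size=40)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ beta_hat

# The residual vector is orthogonal to every column of X
print(X.T @ residual)                          # ~ zeros
assert np.allclose(X.T @ residual, 0, atol=1e-9)
```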
Introducing $\hat{\boldsymbol{\gamma}}$ and a matrix *K* with the assumption that the matrix $[\mathbf{X}\ \mathbf{K}]$ is non-singular and $K^{\operatorname{T}} X = 0$ (cf. [Orthogonal projections](https://en.wikipedia.org/wiki/Linear_projection#Orthogonal_projections "Linear projection")), the residual vector should satisfy the following equation:
$$\hat{\mathbf{r}} := \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{K}\hat{\boldsymbol{\gamma}}.$$
The equation and solution of linear least squares are thus described as follows:
$$\begin{aligned} \mathbf{y} &= \begin{bmatrix} \mathbf{X} & \mathbf{K} \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\gamma}} \end{bmatrix}, \\ \Rightarrow \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\gamma}} \end{bmatrix} &= \begin{bmatrix} \mathbf{X} & \mathbf{K} \end{bmatrix}^{-1} \mathbf{y} = \begin{bmatrix} \left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1} \mathbf{X}^{\top} \\ \left(\mathbf{K}^{\top}\mathbf{K}\right)^{-1} \mathbf{K}^{\top} \end{bmatrix} \mathbf{y}. \end{aligned}$$
Another way of looking at it is to consider the regression line to be a weighted average of the lines passing through any two points in the dataset.[\[15\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-15) Although this way of calculating it is more computationally expensive, it provides better intuition about OLS.
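This pairwise-lines view is easy to check numerically in the simple-regression case, where one standard form of the identity weights the slope of the line through points $i$ and $j$ by $(x_i - x_j)^2$. The following sketch (synthetic data; an illustrative check, not the cited derivation) compares it against the standard slope formula:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
x = rng.normal(size=20)
y = 1.5 * x + 0.3 + rng.normal(scale=0.2, size=20)

# OLS slope, standard formula
slope_ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Same slope as a weighted average of all pairwise slopes,
# with weights proportional to (x_i - x_j)^2
num = den = 0.0
for i, j in combinations(range(len(x)), 2):
    w = (x[i] - x[j]) ** 2
    num += w * (y[i] - y[j]) / (x[i] - x[j])
    den += w
assert np.isclose(slope_ols, num / den)
```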
### Maximum likelihood
The OLS estimator is identical to the [maximum likelihood estimator](https://en.wikipedia.org/wiki/Maximum_likelihood_estimator "Maximum likelihood estimator") (MLE) under the normality assumption for the error terms.[\[16\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-16)[\[proof\]](https://en.wikipedia.org/wiki/Proofs_involving_ordinary_least_squares#Maximum_likelihood_approach "Proofs involving ordinary least squares") This normality assumption has historical importance, as it provided the basis for the early work in linear regression analysis by [Yule](https://en.wikipedia.org/wiki/Udny_Yule "Udny Yule") and [Pearson](https://en.wikipedia.org/wiki/Karl_Pearson "Karl Pearson").\[*[citation needed](https://en.wikipedia.org/wiki/Wikipedia:Citation_needed "Wikipedia:Citation needed")*\] From the properties of MLE, we can infer that the OLS estimator is asymptotically efficient (in the sense of attaining the [CramĆ©rāRao bound](https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93Rao_bound "CramĆ©rāRao bound") for variance) if the normality assumption is satisfied.[\[17\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Hayashi_2000_loc=page_52-17)
### Generalized method of moments
In the [i.i.d.](https://en.wikipedia.org/wiki/Iid "Iid") case, the OLS estimator can also be viewed as a [GMM](https://en.wikipedia.org/wiki/Generalized_method_of_moments "Generalized method of moments") estimator arising from the moment conditions
$$\mathrm{E}\bigl[\, x_i \left(y_i - x_i^{\operatorname{T}} \beta\right) \bigr] = 0.$$
These moment conditions state that the regressors should be uncorrelated with the errors. Since *xi* is a *p*\-vector, the number of moment conditions is equal to the dimension of the parameter vector *β*, and thus the system is exactly identified. This is the so-called classical GMM case, when the estimator does not depend on the choice of the weighting matrix.
Note that the original strict exogeneity assumption $\operatorname{E}[\varepsilon_i \mid x_i] = 0$ implies a far richer set of moment conditions than stated above. In particular, this assumption implies that for any vector-function *Ę*, the moment condition $\operatorname{E}[f(x_i)\cdot\varepsilon_i] = 0$ will hold. However, it can be shown using the [GaussāMarkov theorem](https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem "GaussāMarkov theorem") that the optimal choice of function *Ę* is to take $f(x) = x$, which results in the moment equation given above.
## Assumptions
See also: [Linear regression § Assumptions](https://en.wikipedia.org/wiki/Linear_regression#Assumptions "Linear regression")
There are several different frameworks in which the [linear regression model](https://en.wikipedia.org/wiki/Linear_regression_model "Linear regression model") can be cast in order to make the OLS technique applicable. Each of these settings produces the same formulas and same results. The only difference is the interpretation and the assumptions which have to be imposed in order for the method to give meaningful results. The choice of the applicable framework depends mostly on the nature of the data at hand and on the inference task to be performed.
One of the lines of difference in interpretation is whether to treat the regressors as random variables, or as predefined constants. In the first case (**random design**) the regressors *xi* are random and sampled together with the *yi*'s from some [population](https://en.wikipedia.org/wiki/Statistical_population "Statistical population"), as in an [observational study](https://en.wikipedia.org/wiki/Observational_study "Observational study"). This approach allows for more natural study of the [asymptotic properties](https://en.wikipedia.org/wiki/Asymptotic_theory_\(statistics\) "Asymptotic theory (statistics)") of the estimators. In the other interpretation (**fixed design**), the regressors *X* are treated as known constants set by a [design](https://en.wikipedia.org/wiki/Design_of_experiments "Design of experiments"), and *y* is sampled conditionally on the values of *X* as in an [experiment](https://en.wikipedia.org/wiki/Experiment "Experiment"). For practical purposes, this distinction is often unimportant, since estimation and inference are carried out while conditioning on *X*. All results stated in this article are within the random design framework.
The classical model focuses on "finite sample" estimation and inference, meaning that the number of observations *n* is fixed. This contrasts with the other approaches, which study the [asymptotic behavior](https://en.wikipedia.org/wiki/Asymptotic_theory_\(statistics\) "Asymptotic theory (statistics)") of OLS as the number of observations grows without bound. To prove finite sample unbiasedness of the OLS estimator, we require the following assumptions.
Figure: Example of a cubic polynomial regression, which is a type of linear regression. Although *polynomial regression* fits a curve model to the data, as a [statistical estimation](https://en.wikipedia.org/wiki/Estimation_theory "Estimation theory") problem it is linear, in the sense that the conditional expectation function $\mathbb{E}[y \mid x]$ is linear in the unknown [parameters](https://en.wikipedia.org/wiki/Parameter "Parameter") that are estimated from the [data](https://en.wikipedia.org/wiki/Data "Data"). For this reason, polynomial regression is considered to be a special case of [multiple linear regression](https://en.wikipedia.org/wiki/Multiple_linear_regression "Multiple linear regression").
- **Exogeneity**. The regressors do not [covary](https://en.wikipedia.org/wiki/Covariance "Covariance") with the error term:
$$\mathbb{E}[\varepsilon_i x_i] = 0.$$
This requires, for example, that there are no [omitted variables](https://en.wikipedia.org/wiki/Omitted_variable_bias "Omitted variable bias") that covary with observed variables and affect the response variable. An alternative (but stronger) statement that is often required when explaining linear regression in [mathematical statistics](https://en.wikipedia.org/wiki/Mathematical_statistics "Mathematical statistics") is that the predictor variables *x* can be treated as fixed values, rather than [random variables](https://en.wikipedia.org/wiki/Random_variable "Random variable"). This stronger form means, for example, that the predictor variables are assumed to be error-free, that is, not contaminated with measurement error. Although this assumption is not realistic in many settings, dropping it leads to more complex [errors-in-variables models](https://en.wikipedia.org/wiki/Errors-in-variables_models "Errors-in-variables models"), [instrumental variable models](https://en.wikipedia.org/wiki/Instrumental_variable "Instrumental variable") and the like.
- **Linearity**, or **correct specification**. This means that the mean of the response variable is a [linear combination](https://en.wikipedia.org/wiki/Linear_combination "Linear combination") of the parameters (regression coefficients) and the predictor variables. Note that this assumption is much less restrictive than it may at first seem. Because the predictor variables are treated as fixed values (see above), linearity is really only a restriction on the parameters. The predictor variables themselves can be arbitrarily transformed, and in fact multiple copies of the same underlying predictor variable can be added, each one transformed differently. This technique is used, for example, in [polynomial regression](https://en.wikipedia.org/wiki/Polynomial_regression "Polynomial regression"), which uses linear regression to fit the response variable as an arbitrary [polynomial](https://en.wikipedia.org/wiki/Polynomial "Polynomial") function (up to a given degree) of a predictor variable. With this much flexibility, models such as polynomial regression often have "too much power", in that they tend to [overfit](https://en.wikipedia.org/wiki/Overfit "Overfit") the data. As a result, some kind of [regularization](https://en.wikipedia.org/wiki/Regularization_\(mathematics\) "Regularization (mathematics)") must typically be used to prevent unreasonable solutions coming out of the estimation process. Common examples are [ridge regression](https://en.wikipedia.org/wiki/Ridge_regression "Ridge regression") and [lasso regression](https://en.wikipedia.org/wiki/Lasso_regression "Lasso regression"). [Bayesian linear regression](https://en.wikipedia.org/wiki/Bayesian_linear_regression "Bayesian linear regression") can also be used, which by its nature is more or less immune to the problem of overfitting. (In fact, [ridge regression](https://en.wikipedia.org/wiki/Ridge_regression "Ridge regression") and [lasso regression](https://en.wikipedia.org/wiki/Lasso_regression "Lasso regression") can both be viewed as special cases of Bayesian linear regression, with particular types of [prior distributions](https://en.wikipedia.org/wiki/Prior_distribution "Prior distribution") placed on the regression coefficients.)
- **Constant variance** or **[homoscedasticity](https://en.wikipedia.org/wiki/Homoscedasticity "Homoscedasticity")**. This means that the variance of the errors does not depend on the values of the predictor variables (Figure: visualization of heteroscedasticity in a scatter plot against 100 random fitted values, generated in MATLAB):
$$\mathbb{E}[\varepsilon_i^2 \mid x_i] = \sigma^2.$$
Thus the variability of the responses for given fixed values of the predictors is the same regardless of how large or small the responses are. This is often not the case, as a variable whose mean is large will typically have a greater variance than one whose mean is small. For example, a person whose income is predicted to be \$100,000 may easily have an actual income of \$80,000 or \$120,000āi.e., a [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation "Standard deviation") of around \$20,000āwhile another person with a predicted income of \$10,000 is unlikely to have the same \$20,000 standard deviation, since that would imply their actual income could vary anywhere between ā\$10,000 and \$30,000. (In fact, as this shows, in many casesāoften the same cases where the assumption of normally distributed errors failsāthe variance or standard deviation should be predicted to be proportional to the mean, rather than constant.) The absence of homoscedasticity is called [heteroscedasticity](https://en.wikipedia.org/wiki/Heteroscedasticity "Heteroscedasticity"). In order to check this assumption, a plot of residuals versus predicted values (or the values of each individual predictor) can be examined for a "fanning effect" (i.e., increasing or decreasing vertical spread as one moves left to right on the plot). A plot of the absolute or squared residuals versus the predicted values (or each predictor) can also be examined for a trend or curvature. Formal tests can also be used; see [Heteroscedasticity](https://en.wikipedia.org/wiki/Heteroscedasticity "Heteroscedasticity"). The presence of heteroscedasticity will result in an overall "average" estimate of variance being used instead of one that takes into account the true variance structure. This leads to less precise (but in the case of ordinary least squares, not biased) parameter estimates and biased standard errors, resulting in misleading tests and interval estimates. The [mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error "Mean squared error") for the model will also be wrong. Various estimation techniques including [weighted least squares](https://en.wikipedia.org/wiki/Weighted_least_squares "Weighted least squares") and the use of [heteroscedasticity-consistent standard errors](https://en.wikipedia.org/wiki/Heteroscedasticity-consistent_standard_errors "Heteroscedasticity-consistent standard errors") can handle heteroscedasticity in a quite general way. [Bayesian linear regression](https://en.wikipedia.org/wiki/Bayesian_linear_regression "Bayesian linear regression") techniques can also be used when the variance is assumed to be a function of the mean. It is also possible in some cases to fix the problem by applying a transformation to the response variable (e.g., fitting the [logarithm](https://en.wikipedia.org/wiki/Logarithm "Logarithm") of the response variable using a linear regression model, which implies that the response variable itself has a [log-normal distribution](https://en.wikipedia.org/wiki/Log-normal_distribution "Log-normal distribution") rather than a [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution "Normal distribution")).
Figure: To check for violations of the assumptions of linearity, constant variance, and independence of errors within a linear regression model, the residuals are typically plotted against the predicted values (or each of the individual predictors). An apparently random scatter of points about the horizontal midline at 0 is ideal, but cannot rule out certain kinds of violations, such as [autocorrelation](https://en.wikipedia.org/wiki/Autocorrelation "Autocorrelation") in the errors or their correlation with one or more covariates.
- **Uncorrelatedness of errors**. This assumes that the errors of the response variables are uncorrelated with each other:
$$\mathbb{E}[\varepsilon_i \varepsilon_j \mid x_i, x_j] = 0 \qquad \text{for } i \neq j.$$
Some methods such as [generalized least squares](https://en.wikipedia.org/wiki/Generalized_least_squares "Generalized least squares") are capable of handling correlated errors, although they typically require significantly more data unless some sort of [regularization](https://en.wikipedia.org/wiki/Regularization_\(mathematics\) "Regularization (mathematics)") is used to bias the model towards assuming uncorrelated errors. [Bayesian linear regression](https://en.wikipedia.org/wiki/Bayesian_linear_regression "Bayesian linear regression") is a general way of handling this issue. Full [statistical independence](https://en.wikipedia.org/wiki/Statistical_independence "Statistical independence") is a stronger condition than mere lack of correlation and is often not needed, although it implies mean-independence.
- **Lack of perfect multicollinearity** in the predictors. For standard [least squares](https://en.wikipedia.org/wiki/Least_squares "Least squares") estimation methods, the design matrix *X* must have full [column rank](https://en.wikipedia.org/wiki/Column_rank "Column rank") *p* (a small numerical check of this rank condition is sketched after this list):[\[18\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Hayashi_2000_loc=page_10-18)
$$\Pr\bigl[\operatorname{rank}(X) = p\bigr] = 1.$$
If this assumption is violated, perfect [multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity "Multicollinearity") exists in the predictor variables, meaning a linear relationship exists between two or more predictor variables. Multicollinearity can be caused by accidentally duplicating a variable in the data, using a linear transformation of a variable along with the original (e.g., the same temperature measurements expressed in Fahrenheit and Celsius), or including a linear combination of multiple variables in the model, such as their mean. It can also happen if there is too little data available compared to the number of parameters to be estimated (e.g., fewer data points than regression coefficients). Near violations of this assumption, where predictors are highly but not perfectly correlated, can reduce the precision of parameter estimates (see [Variance inflation factor](https://en.wikipedia.org/wiki/Variance_inflation_factor "Variance inflation factor")). In the case of perfect multicollinearity, the parameter vector ***β*** will be [non-identifiable](https://en.wikipedia.org/wiki/Non-identifiable "Non-identifiable")āit has no unique solution. In such a case, only some of the parameters can be identified (i.e., their values can only be estimated within some linear subspace of the full parameter space **R***p*). See [partial least squares regression](https://en.wikipedia.org/wiki/Partial_least_squares_regression "Partial least squares regression"). Methods for fitting linear models with multicollinearity have been developed,[\[19\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Tibshirani-1996-19)[\[20\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Efron-2004-20)[\[21\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Hawkins-1973-21)[\[22\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Jolliffe-1982-22) some of which require additional assumptions such as "effect sparsity"āthat a large fraction of the effects are exactly zero. Note that the more computationally expensive iterated algorithms for parameter estimation, such as those used in [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model "Generalized linear model"), do not suffer from this problem.
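As referenced in the rank-condition item above, here is a minimal numerical sketch (synthetic data; the Fahrenheit/Celsius duplication is the made-up example from the text) of how perfect multicollinearity shows up as a rank-deficient design matrix and a singular Gram matrix:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
celsius = rng.uniform(-10, 35, size=n)
fahrenheit = celsius * 9 / 5 + 32           # exact linear transform of celsius

X_ok = np.column_stack([np.ones(n), celsius])
X_bad = np.column_stack([np.ones(n), celsius, fahrenheit])

print(np.linalg.matrix_rank(X_ok))    # 2: full column rank
print(np.linalg.matrix_rank(X_bad))   # 2 again, but X_bad has 3 columns:
                                      # perfect multicollinearity

# The Gram matrix of X_bad is singular, so the normal equations
# have no unique solution
print(np.linalg.cond(X_bad.T @ X_bad))   # enormous / effectively infinite
```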
Violations of these assumptions can result in biased estimations of ***β***, biased standard errors, untrustworthy confidence intervals and significance tests. Beyond these assumptions, several other statistical properties of the data strongly influence the performance of different estimation methods:
- The statistical relationship between the error terms and the regressors plays an important role in determining whether an estimation procedure has desirable sampling properties such as being unbiased and consistent.
- The arrangement, or [probability distribution](https://en.wikipedia.org/wiki/Probability_distribution "Probability distribution"), of the predictor variables **x** has a major influence on the precision of estimates of ***β***. [Sampling](https://en.wikipedia.org/wiki/Sampling_\(statistics\) "Sampling (statistics)") and [design of experiments](https://en.wikipedia.org/wiki/Design_of_experiments "Design of experiments") are highly developed subfields of statistics that provide guidance for collecting data in such a way as to achieve a precise estimate of ***β***.
## Properties
### Finite sample properties
First of all, under the *strict exogeneity* assumption the OLS estimators $\hat{\beta}$ and *s*2 are [unbiased](https://en.wikipedia.org/wiki/Bias_of_an_estimator "Bias of an estimator"), meaning that their expected values coincide with the true values of the parameters:[\[23\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-23)[\[proof\]](https://en.wikipedia.org/wiki/Proofs_involving_ordinary_least_squares#Unbiasedness_of_.CE.B2.CC.82 "Proofs involving ordinary least squares")
$$\operatorname{E}[\,\hat{\beta}\mid X\,]=\beta,\qquad \operatorname{E}[\,s^{2}\mid X\,]=\sigma^{2}.$$
If strict exogeneity does not hold (as is the case with many [time series](https://en.wikipedia.org/wiki/Time_series "Time series") models, where exogeneity is assumed only with respect to past shocks but not future ones), then these estimators will be biased in finite samples.
The *[variance-covariance matrix](https://en.wikipedia.org/wiki/Variance-covariance_matrix "Variance-covariance matrix")* (or simply *covariance matrix*) of $\hat{\beta}$ is equal to[\[24\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-HayashiFSP-24)
$$\operatorname{Var}[\,\hat{\beta}\mid X\,]=\sigma^{2}\left(X^{\operatorname{T}}X\right)^{-1}=\sigma^{2}Q.$$
In particular, the standard error of each coefficient $\hat{\beta}_{j}$ is equal to the square root of the *j*-th diagonal element of this matrix. The estimate of this standard error is obtained by replacing the unknown quantity *Ļ*2 with its estimate *s*2. Thus,
$$\widehat{\operatorname{s.\!e.}}(\hat{\beta}_{j})=\sqrt{s^{2}\left(X^{\operatorname{T}}X\right)_{jj}^{-1}}$$
It can also be easily shown that the estimator $\hat{\beta}$ is uncorrelated with the residuals from the model:[\[24\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-HayashiFSP-24)
$$\operatorname{Cov}[\,\hat{\beta},\hat{\varepsilon}\mid X\,]=0.$$
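These finite-sample formulas are straightforward to evaluate numerically. A minimal sketch, assuming simulated data (numpy only; variable names are illustrative):

```python
# Compute beta-hat, s^2, Var[beta-hat | X] and the standard errors s.e.(beta_j).
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.7, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                 # OLS estimate of beta
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)                 # unbiased estimate of sigma^2
cov_beta = s2 * XtX_inv                      # estimate of Var[beta-hat | X]
se = np.sqrt(np.diag(cov_beta))              # sqrt of the jj-th diagonal element
print(beta_hat, s2, se)
```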
The *[GaussāMarkov theorem](https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem "GaussāMarkov theorem")* states that under the *spherical errors* assumption (that is, the errors should be [uncorrelated](https://en.wikipedia.org/wiki/Uncorrelated "Uncorrelated") and [homoscedastic](https://en.wikipedia.org/wiki/Homoscedastic "Homoscedastic")) the estimator $\hat{\beta}$ is efficient in the class of linear unbiased estimators; it is therefore called the *best linear unbiased estimator* (BLUE). Efficiency means that if we find any other estimator $\tilde{\beta}$ which is linear in *y* and unbiased, then[\[24\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-HayashiFSP-24)
$$\operatorname{Var}[\,\tilde{\beta}\mid X\,]-\operatorname{Var}[\,\hat{\beta}\mid X\,]\geq 0$$
in the sense that the difference of the two variance matrices is a [nonnegative-definite matrix](https://en.wikipedia.org/wiki/Nonnegative-definite_matrix "Nonnegative-definite matrix"). This theorem establishes optimality only in the class of linear unbiased estimators, which is quite restrictive. Depending on the distribution of the error terms *ε*, other, non-linear estimators may provide better results than OLS.
#### Assuming normality
The properties listed so far are all valid regardless of the underlying distribution of the error terms. However, if one is willing to assume that the *normality assumption* holds (that is, that *ε* ~ *N*(0, *Ļ*2*In*)), then additional properties of the OLS estimators can be stated.
The estimator $\hat{\beta}$ is normally distributed, with mean and variance as given before:[\[25\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-25)
$$\hat{\beta}\ \sim\ \mathcal{N}\big(\beta,\ \sigma^{2}(X^{\operatorname{T}}X)^{-1}\big).$$
This estimator reaches the [CramĆ©rāRao bound](https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93Rao_bound "CramĆ©rāRao bound") for the model, and thus is optimal in the class of all unbiased estimators.[\[17\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Hayashi_2000_loc=page_52-17) Note that unlike the [GaussāMarkov theorem](https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem "GaussāMarkov theorem"), this result establishes optimality among both linear and non-linear estimators, but only in the case of normally distributed error terms.
The estimator *s*2 will be proportional to the [chi-squared distribution](https://en.wikipedia.org/wiki/Chi-squared_distribution "Chi-squared distribution"):[\[26\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-26)
$$s^{2}\ \sim\ \frac{\sigma^{2}}{n-p}\cdot\chi_{n-p}^{2}$$
The variance of this estimator is equal to $2\sigma^{4}/(n-p)$, which does not attain the [CramĆ©rāRao bound](https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93Rao_bound "CramĆ©rāRao bound") of $2\sigma^{4}/n$. However, it has been shown that there are no unbiased estimators of *Ļ*2 with variance smaller than that of the estimator *s*2.[\[27\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-27) If we are willing to allow biased estimators, and consider the class of estimators that are proportional to the sum of squared residuals (SSR) of the model, then the best (in the sense of the [mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error "Mean squared error")) estimator in this class will be $\tilde{\sigma}^{2}=\mathrm{SSR}/(n-p+2)$, which even beats the CramĆ©rāRao bound when there is only one regressor (*p* = 1).[\[28\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-28)
Moreover, the estimators $\hat{\beta}$ and *s*2 are [independent](https://en.wikipedia.org/wiki/Independent_random_variables "Independent random variables"),[\[29\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-29) a fact which comes in useful when constructing the t- and F-tests for the regression.
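A small Monte Carlo sketch (the constants here are illustrative, not from the article) of the mean-squared-error comparison between $s^{2}=\mathrm{SSR}/(n-p)$ and $\tilde{\sigma}^{2}=\mathrm{SSR}/(n-p+2)$:

```python
# Compare the MSE of the unbiased s^2 with the shrunken SSR/(n-p+2)
# in an intercept-only model (p = 1).
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma2, reps = 20, 1, 1.0, 20000
mse_s2 = mse_shrunk = 0.0
for _ in range(reps):
    y = 3.0 + rng.normal(scale=np.sqrt(sigma2), size=n)
    ssr = np.sum((y - y.mean()) ** 2)        # SSR when the only regressor is 1
    mse_s2 += (ssr / (n - p) - sigma2) ** 2
    mse_shrunk += (ssr / (n - p + 2) - sigma2) ** 2
print(mse_s2 / reps)                         # about 2*sigma2^2/(n-p)   ~ 0.105
print(mse_shrunk / reps)                     # about 2*sigma2^2/(n-p+2) ~ 0.095
```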
#### Influential observations
Main article: [Influential observation](https://en.wikipedia.org/wiki/Influential_observation "Influential observation")
See also: [Leverage (statistics)](https://en.wikipedia.org/wiki/Leverage_\(statistics\) "Leverage (statistics)")
As was mentioned before, the estimator $\hat{\beta}$ is linear in *y*, meaning that it represents a linear combination of the dependent variables *yi*. The weights in this linear combination are functions of the regressors *X*, and generally are unequal. The observations with high weights are called **influential** because they have a more pronounced effect on the value of the estimator.
To analyze which observations are influential we remove a specific *j*\-th observation and consider how much the estimated quantities are going to change (similarly to the [jackknife method](https://en.wikipedia.org/wiki/Jackknife_method "Jackknife method")). It can be shown that the change in the OLS estimator for *β* will be equal to [\[30\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-DvdMck33-30)
$$\hat{\beta}^{(j)}-\hat{\beta}=-\frac{1}{1-h_{j}}\,(X^{\operatorname{T}}X)^{-1}x_{j}\,\hat{\varepsilon}_{j}\,,$$
where $h_{j}=x_{j}^{\operatorname{T}}(X^{\operatorname{T}}X)^{-1}x_{j}$ is the *j*-th diagonal element of the hat matrix *P*, and $x_{j}$ is the vector of regressors corresponding to the *j*-th observation. Similarly, the change in the predicted value for the *j*-th observation resulting from omitting that observation from the dataset will be equal to[\[30\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-DvdMck33-30)
$$\hat{y}_{j}^{(j)}-\hat{y}_{j}=x_{j}^{\operatorname{T}}\hat{\beta}^{(j)}-x_{j}^{\operatorname{T}}\hat{\beta}=-\frac{h_{j}}{1-h_{j}}\,\hat{\varepsilon}_{j}$$
From the properties of the hat matrix, $0\leq h_{j}\leq 1$, and the $h_{j}$ sum up to *p*, so that on average $h_{j}\approx p/n$. These quantities $h_{j}$ are called the **leverages**, and observations with high $h_{j}$ are called **leverage points**.[\[31\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-31) Usually the observations with high leverage ought to be scrutinized more carefully, in case they are erroneous, or outliers, or in some other way atypical of the rest of the dataset.
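A sketch of the leverage and leave-one-out identities above, on simulated data with one deliberately extreme design point (all names are illustrative):

```python
# Leverages and the effect of deleting one observation, checked two ways.
import numpy as np

rng = np.random.default_rng(3)
n = 30
x = np.append(rng.normal(size=n - 1), 6.0)   # one high-leverage design point
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)  # leverages: diag of X (X'X)^-1 X'
print(h.sum())                               # equals p = 2; average leverage p/n

j = n - 1                                    # the extreme observation
delta = -h[j] / (1 - h[j]) * resid[j]        # predicted-value change, closed form
# cross-check by actually refitting without observation j
beta_loo = np.linalg.lstsq(np.delete(X, j, axis=0), np.delete(y, j), rcond=None)[0]
print(delta, X[j] @ beta_loo - X[j] @ beta_hat)   # the two agree
```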
#### Partitioned regression
Sometimes the variables and corresponding parameters in the regression can be logically split into two groups, so that the regression takes form
$$y=X_{1}\beta_{1}+X_{2}\beta_{2}+\varepsilon,$$
where *X*1 and *X*2 have dimensions *n*×*p*1 and *n*×*p*2, and *β*1, *β*2 are *p*1×1 and *p*2×1 vectors, with *p*1 + *p*2 = *p*.
The **[FrischāWaughāLovell theorem](https://en.wikipedia.org/wiki/Frisch%E2%80%93Waugh%E2%80%93Lovell_theorem "FrischāWaughāLovell theorem")** states that in this regression the residuals $\hat{\varepsilon}$ and the OLS estimate $\hat{\beta}_{2}$ will be numerically identical to the residuals and the OLS estimate for *β*2 in the following regression:[\[32\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-32)
$$M_{1}y=M_{1}X_{2}\beta_{2}+\eta\,,$$
where *M*1 is the [annihilator matrix](https://en.wikipedia.org/wiki/Annihilator_matrix "Annihilator matrix") for regressors *X*1.
The theorem can be used to establish a number of theoretical results. For example, having a regression with a constant and another regressor is equivalent to subtracting the means from the dependent variable and the regressor and then running the regression for the de-meaned variables but without the constant term.
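The theorem is also easy to verify numerically. A minimal sketch on simulated data, comparing $\hat{\beta}_{2}$ from the full regression with the estimate from the partialled-out regression $M_{1}y=M_{1}X_{2}\beta_{2}+\eta$:

```python
# Numerical check of the Frisch-Waugh-Lovell theorem on simulated data.
import numpy as np

rng = np.random.default_rng(4)
n = 200
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])     # n x p1
X2 = rng.normal(size=(n, 2))                               # n x p2
X = np.hstack([X1, X2])
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=n)

beta_full = np.linalg.lstsq(X, y, rcond=None)[0]           # full regression

# M1 is the annihilator matrix for X1: it maps vectors to their X1-residuals
M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
beta2_fwl = np.linalg.lstsq(M1 @ X2, M1 @ y, rcond=None)[0]

print(beta_full[2:])                         # beta_2 from the full regression
print(beta2_fwl)                             # identical, as the theorem states
```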
### Large sample properties
The least squares estimators are [point estimates](https://en.wikipedia.org/wiki/Point_estimate "Point estimate") of the linear regression model parameters *β*. However, generally we also want to know how close those estimates might be to the true values of parameters. In other words, we want to construct the [interval estimates](https://en.wikipedia.org/wiki/Interval_estimate "Interval estimate").
Since we have not made any assumptions about the distribution of the error term *εi*, it is impossible to infer the distribution of the estimators $\hat{\beta}$ and $\hat{\sigma}^{2}$. Nevertheless, we can apply the [central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem "Central limit theorem") to derive their *asymptotic* properties as the sample size *n* goes to infinity. While the sample size is necessarily finite, it is customary to assume that *n* is "large enough" so that the true distribution of the OLS estimator is close to its asymptotic limit.
We can show that under the model assumptions, the least squares estimator for *β* is [consistent](https://en.wikipedia.org/wiki/Consistent_estimator "Consistent estimator") (that is, $\hat{\beta}$ [converges in probability](https://en.wikipedia.org/wiki/Convergence_of_random_variables#Convergence_in_probability "Convergence of random variables") to *β*) and asymptotically normal:[\[proof\]](https://en.wikipedia.org/wiki/Proofs_involving_ordinary_least_squares#Consistency_and_asymptotic_normality_of_.CE.B2.CC.82 "Proofs involving ordinary least squares")
$$(\hat{\beta}-\beta)\ \xrightarrow{d}\ \mathcal{N}\big(0,\;\sigma^{2}Q_{xx}^{-1}\big),$$
where $Q_{xx}=X^{\operatorname{T}}X$.
#### Inference
Main articles: [Confidence interval](https://en.wikipedia.org/wiki/Confidence_interval "Confidence interval") and [Prediction interval](https://en.wikipedia.org/wiki/Prediction_interval "Prediction interval")
Using this asymptotic distribution, approximate two-sided confidence intervals for the *j*-th component of the vector $\hat{\beta}$ can be constructed as
$$\beta_{j}\in\Big[\ \hat{\beta}_{j}\pm q_{1-\frac{\alpha}{2}}^{\mathcal{N}(0,1)}\sqrt{\hat{\sigma}^{2}\big[Q_{xx}^{-1}\big]_{jj}}\ \Big]$$
at the 1 ā *α* confidence level, where *q* denotes the [quantile function](https://en.wikipedia.org/wiki/Quantile_function "Quantile function") of the standard normal distribution, and $[\cdot]_{jj}$ is the *j*-th diagonal element of a matrix.
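A sketch of this interval on simulated data; note that with the article's convention $Q_{xx}=X^{\operatorname{T}}X$, the quantity $\hat{\sigma}^{2}[Q_{xx}^{-1}]_{jj}$ needs no extra $1/n$ scaling. scipy is assumed to be available for the normal quantile.

```python
# Approximate two-sided confidence intervals for each coefficient.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / n               # consistent estimate of sigma^2
Qxx_inv = np.linalg.inv(X.T @ X)

q = norm.ppf(1 - 0.05 / 2)                   # standard normal quantile, alpha = 0.05
for j in range(X.shape[1]):
    half = q * np.sqrt(sigma2_hat * Qxx_inv[j, j])
    print(f"beta_{j}: {beta_hat[j]:.3f} +/- {half:.3f}")
```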
Similarly, the least squares estimator for *Ļƒ*2 is also consistent and asymptotically normal (provided that the fourth moment of *εi* exists) with limiting distribution
$$(\hat{\sigma}^{2}-\sigma^{2})\ \xrightarrow{d}\ \mathcal{N}\left(0,\;\operatorname{E}\left[\varepsilon_{i}^{4}\right]-\sigma^{4}\right).$$
These asymptotic distributions can be used for prediction, testing hypotheses, constructing other estimators, etc. As an example consider the problem of prediction. Suppose $x_{0}$ is some point within the domain of distribution of the regressors, and one wants to know what the response variable would have been at that point. The [mean response](https://en.wikipedia.org/wiki/Mean_response "Mean response") is the quantity $y_{0}=x_{0}^{\operatorname{T}}\beta$, whereas the [predicted response](https://en.wikipedia.org/wiki/Predicted_response "Predicted response") is $\hat{y}_{0}=x_{0}^{\operatorname{T}}\hat{\beta}$. Clearly the predicted response is a random variable; its distribution can be derived from that of $\hat{\beta}$:
$$\left(\hat{y}_{0}-y_{0}\right)\ \xrightarrow{d}\ \mathcal{N}\left(0,\;\sigma^{2}x_{0}^{\operatorname{T}}Q_{xx}^{-1}x_{0}\right),$$
which allows confidence intervals for the mean response $y_{0}$ to be constructed:
$$y_{0}\in\left[\ x_{0}^{\operatorname{T}}\hat{\beta}\pm q_{1-\frac{\alpha}{2}}^{\mathcal{N}(0,1)}\sqrt{\hat{\sigma}^{2}x_{0}^{\operatorname{T}}Q_{xx}^{-1}x_{0}}\ \right]$$
at the 1 ā *α* confidence level.
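The corresponding computation at a chosen point $x_{0}$, again as a sketch on simulated data ($x_{0}$ and all names are illustrative):

```python
# Confidence interval for the mean response at x0.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / n
Qxx_inv = np.linalg.inv(X.T @ X)

x0 = np.array([1.0, 0.5])                    # first entry 1.0 for the intercept
half = norm.ppf(0.975) * np.sqrt(sigma2_hat * x0 @ Qxx_inv @ x0)
print(f"mean response at x0: {x0 @ beta_hat:.3f} +/- {half:.3f}")
```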
#### Hypothesis testing
Main article: [Hypothesis testing](https://en.wikipedia.org/wiki/Hypothesis_testing "Hypothesis testing")
Two hypothesis tests are particularly widely used. First, one wants to know if the estimated regression equation is any better than simply predicting that all values of the response variable equal its sample mean (if not, it is said to have no explanatory power). The [null hypothesis](https://en.wikipedia.org/wiki/Null_hypothesis "Null hypothesis") of no explanatory value of the estimated regression is tested using an [F-test](https://en.wikipedia.org/wiki/F-test "F-test"). If the calculated F-value is found to be large enough to exceed its critical value for the pre-chosen level of significance, the null hypothesis is rejected and the [alternative hypothesis](https://en.wikipedia.org/wiki/Alternative_hypothesis "Alternative hypothesis"), that the regression has explanatory power, is accepted. Otherwise, the null hypothesis of no explanatory power is accepted.
Second, for each explanatory variable of interest, one wants to know whether its estimated coefficient differs significantly from zero; that is, whether this particular explanatory variable in fact has explanatory power in predicting the response variable. Here the null hypothesis is that the true coefficient is zero. This hypothesis is tested by computing the coefficient's [t-statistic](https://en.wikipedia.org/wiki/T-statistic "T-statistic"), as the ratio of the coefficient estimate to its [standard error](https://en.wikipedia.org/wiki/Standard_error "Standard error"). If the t-statistic is larger than a predetermined value, the null hypothesis is rejected and the variable is found to have explanatory power, with its coefficient significantly different from zero. Otherwise, the null hypothesis of a zero value of the true coefficient is accepted.
In addition, the [Chow test](https://en.wikipedia.org/wiki/Chow_test "Chow test") is used to test whether two subsamples both have the same underlying true coefficient values. The sum of squared residuals of regressions on each of the subsets and on the combined data set are compared by computing an F-statistic; if this exceeds a critical value, the null hypothesis of no difference between the two subsets is rejected; otherwise, it is accepted.
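A minimal sketch of the overall F-test and the per-coefficient t-tests on simulated data (scipy is assumed for the reference distributions; names are illustrative):

```python
# Overall F-test against the intercept-only model, plus t-tests per coefficient.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, p = 100, 3                                # p counts the intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.8, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)

t = beta_hat / np.sqrt(s2 * np.diag(XtX_inv))         # t-statistics, H0: beta_j = 0
p_t = 2 * stats.t.sf(np.abs(t), df=n - p)             # two-sided p-values

ssr_full = resid @ resid
ssr_null = np.sum((y - y.mean()) ** 2)                # predicting the sample mean
F = ((ssr_null - ssr_full) / (p - 1)) / (ssr_full / (n - p))
p_F = stats.f.sf(F, p - 1, n - p)                     # F(p-1, n-p) under H0
print(t, p_t, F, p_F)
```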
### Violations of assumptions
#### Time series model
In a [time series](https://en.wikipedia.org/wiki/Time_series "Time series") model, we require the [stochastic process](https://en.wikipedia.org/wiki/Stochastic_process "Stochastic process") {*xi*, *yi*} to be [stationary](https://en.wikipedia.org/wiki/Stationary_process "Stationary process") and [ergodic](https://en.wikipedia.org/wiki/Ergodic_process "Ergodic process"); if {*xi*, *yi*} is nonstationary, OLS results are often biased unless {*xi*, *yi*} is [co-integrating](https://en.wikipedia.org/wiki/Cointegration "Cointegration").[\[33\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-33)
We still require the regressors to be *strictly exogenous*: $\operatorname{E}[x_{i}\varepsilon_{i}]=0$ for all *i* = 1, ..., *n*. If the regressors are only [predetermined](https://en.wikipedia.org/wiki/Weak_exogeneity "Weak exogeneity"), OLS is biased in finite samples, though it remains consistent.
Finally, the assumptions on the variance take the form of requiring that $\{x_{i}\varepsilon_{i}\}$ be a [martingale difference sequence](https://en.wikipedia.org/wiki/Martingale_difference_sequence "Martingale difference sequence") with a finite matrix of second moments $Q_{xx\varepsilon^{2}}=\operatorname{E}[\,\varepsilon_{i}^{2}x_{i}x_{i}^{\operatorname{T}}\,]$.
#### Constrained estimation
Main article: [Ridge regression](https://en.wikipedia.org/wiki/Ridge_regression "Ridge regression")
Suppose it is known that the coefficients in the regression satisfy a system of linear equations
$$A\colon\quad Q^{\operatorname{T}}\beta=c,$$
where *Q* is a *p*×*q* matrix of full rank, and *c* is a *q*×1 vector of known constants, where *q* < *p*. In this case least squares estimation is equivalent to minimizing the sum of squared residuals of the model subject to the constraint *A*. The **constrained least squares (CLS)** estimator can be given by an explicit formula:[\[34\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-34)
$$\hat{\beta}^{c}=\hat{\beta}-(X^{\operatorname{T}}X)^{-1}Q\Big(Q^{\operatorname{T}}(X^{\operatorname{T}}X)^{-1}Q\Big)^{-1}(Q^{\operatorname{T}}\hat{\beta}-c).$$
This expression for the constrained estimator is valid as long as the matrix $X^{\operatorname{T}}X$ is invertible. It was assumed from the beginning of this article that this matrix is of full rank, and it was noted that when the rank condition fails, *β* will not be identifiable. However, it may happen that adding the restriction *A* makes *β* identifiable, in which case one would like to find the formula for the estimator. The estimator is equal to[\[35\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Amemiya22-35)
$$\hat{\beta}^{c}=R(R^{\operatorname{T}}X^{\operatorname{T}}XR)^{-1}R^{\operatorname{T}}X^{\operatorname{T}}y+\Big(I_{p}-R(R^{\operatorname{T}}X^{\operatorname{T}}XR)^{-1}R^{\operatorname{T}}X^{\operatorname{T}}X\Big)Q(Q^{\operatorname{T}}Q)^{-1}c,$$
where *R* is a *p*×(*p* ā *q*) matrix such that the matrix $[Q\ R]$ is non-singular, and $R^{\operatorname{T}}Q=0$. Such a matrix can always be found, although it is generally not unique. The second formula coincides with the first in the case when $X^{\operatorname{T}}X$ is invertible.[\[35\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Amemiya22-35)
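The first (explicit) CLS formula is easy to apply when $X^{\operatorname{T}}X$ is invertible. A sketch on simulated data with a single linear restriction (the restriction itself is illustrative):

```python
# Constrained least squares via the explicit formula; here Q'beta = c
# imposes beta_1 + beta_2 = 3 (a single restriction, q = 1).
import numpy as np

rng = np.random.default_rng(8)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=n)

Q = np.array([[1.0], [1.0], [0.0]])          # p x q restriction matrix
c = np.array([3.0])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                 # unconstrained OLS
correction = XtX_inv @ Q @ np.linalg.inv(Q.T @ XtX_inv @ Q) @ (Q.T @ beta_hat - c)
beta_cls = beta_hat - correction
print(beta_cls, Q.T @ beta_cls)              # the restriction holds exactly
```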
## Example with real data
See also: [Simple linear regression § Example](https://en.wikipedia.org/wiki/Simple_linear_regression#Example "Simple linear regression") and [Linear least squares § Example](https://en.wikipedia.org/wiki/Linear_least_squares#Example "Linear least squares")
The following data set gives average heights and weights for American women aged 30ā39 (source: *The World Almanac and Book of Facts, 1975*).
| | | | | | |
|---|---|---|---|---|---|
| Height (m) | 1.47 | 1.50 | 1.52 | 1.55 | 1.57 |
| Weight (kg) | 52.21 | 53.12 | 54.48 | 55.84 | 57.20 |
| Height (m) | 1.60 | 1.63 | 1.65 | 1.68 | 1.70 |
| Weight (kg) | 58.57 | 59.93 | 61.29 | 63.11 | 64.47 |
| Height (m) | 1.73 | 1.75 | 1.78 | 1.80 | 1.83 |
| Weight (kg) | 66.28 | 68.10 | 69.92 | 72.19 | 74.46 |

[Scatterplot](https://en.wikipedia.org/wiki/File:OLS_example_weight_vs_height_scatterplot.svg) of the data: the relationship is slightly curved but close to linear.
When only one dependent variable is being modeled, a [scatterplot](https://en.wikipedia.org/wiki/Scatterplot "Scatterplot") will suggest the form and strength of the relationship between the dependent variable and regressors. It might also reveal outliers, heteroscedasticity, and other aspects of the data that may complicate the interpretation of a fitted regression model. The scatterplot suggests that the relationship is strong and can be approximated as a quadratic function. OLS can handle non-linear relationships by introducing the regressor HEIGHT2. The regression model then becomes a multiple linear model:
$$w_{i}=\beta_{1}+\beta_{2}h_{i}+\beta_{3}h_{i}^{2}+\varepsilon_{i}.$$
[Fitted regression](https://en.wikipedia.org/wiki/File:OLS_example_weight_vs_height_fitted_line.svg)
The output from most popular [statistical packages](https://en.wikipedia.org/wiki/List_of_statistical_packages "List of statistical packages") will look similar to this:
| | | | | |
|---|---|---|---|---|
| Method | Least squares | | | |
| Dependent variable | WEIGHT | | | |
| Observations | 15 | | | |
| Parameter | Value | [Std error](https://en.wikipedia.org/wiki/Standard_error "Standard error") | [t-statistic](https://en.wikipedia.org/wiki/T-statistic "T-statistic") | [p-value](https://en.wikipedia.org/wiki/P-value "P-value") |
| $\beta_{1}$ | 128.8128 | | | |
| $\beta_{2}$ | ā143.162 | | | |
| $\beta_{3}$ | 61.96033 | | | |
In this table:
- The *Value* column gives the least squares estimates of parameters *βj*
- The *Std error* column shows [standard errors](https://en.wikipedia.org/wiki/Standard_error_\(statistics\) "Standard error (statistics)") of each coefficient estimate:
$$\hat{\sigma}_{j}=\left(\hat{\sigma}^{2}\left[Q_{xx}^{-1}\right]_{jj}\right)^{\frac{1}{2}}$$
- The *[t-statistic](https://en.wikipedia.org/wiki/T-statistic "T-statistic")* and *p-value* columns are testing whether any of the coefficients might be equal to zero. The *t*\-statistic is calculated simply as
$$t=\hat{\beta}_{j}/\hat{\sigma}_{j}$$
If the errors *ε* follow a normal distribution, *t* follows a Student-*t* distribution. Under weaker conditions, *t* is asymptotically normal. Large values of *t* indicate that the null hypothesis can be rejected and that the corresponding coefficient is not zero. The second column, [*p*-value](https://en.wikipedia.org/wiki/P-value "P-value"), expresses the results of the hypothesis test as a [significance level](https://en.wikipedia.org/wiki/Statistical_significance "Statistical significance"). Conventionally, *p*-values smaller than 0.05 are taken as evidence that the population coefficient is nonzero.
- *R-squared* is the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination "Coefficient of determination") indicating goodness-of-fit of the regression. This statistic will be equal to one if fit is perfect, and to zero when regressors *X* have no explanatory power whatsoever. This is a biased estimate of the population *R-squared*, and will never decrease if additional regressors are added, even if they are irrelevant.
- *Adjusted R-squared* is a slightly modified version of $R^{2}$, designed to penalize for the excess number of regressors which do not add to the explanatory power of the regression. This statistic is always smaller than $R^{2}$, can decrease as new regressors are added, and can even be negative for poorly fitting models:
$$\overline{R}^{2}=1-\frac{n-1}{n-p}(1-R^{2})$$
- *Log-likelihood* is calculated under the assumption that the errors follow a normal distribution. Even though the assumption is not very reasonable, this statistic may still find use in conducting LR tests.
- *[DurbināWatson statistic](https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic "DurbināWatson statistic")* tests whether there is any evidence of serial correlation between the residuals. As a rule of thumb, a value smaller than 2 is evidence of positive serial correlation.
- *[Akaike information criterion](https://en.wikipedia.org/wiki/Akaike_information_criterion "Akaike information criterion")* and *[Schwarz criterion](https://en.wikipedia.org/wiki/Schwarz_criterion "Schwarz criterion")* are both used for model selection. Generally when comparing two alternative models, smaller values of one of these criteria will indicate a better model.[\[36\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-36)
- *Standard error of regression* is an estimate of *Ļƒ*, the standard error of the error term.
- *Total sum of squares*, *model sum of squares*, and *residual sum of squares* tell us how much of the initial variation in the sample was explained by the regression.
- *F-statistic* tests the hypothesis that all coefficients (except the intercept) are equal to zero. This statistic has an *F*(*p* ā 1, *n* ā *p*) distribution under the null hypothesis and normality assumption, and its *p*-value indicates the probability of observing such a value of the statistic when the null hypothesis is true. Note that when errors are not normal this statistic becomes invalid, and other tests such as the [Wald test](https://en.wikipedia.org/wiki/Wald_test "Wald test") or the [LR test](https://en.wikipedia.org/wiki/Likelihood_ratio_test "Likelihood ratio test") should be used.
[Residuals plot](https://en.wikipedia.org/wiki/File:OLS_example_weight_vs_height_residuals.svg)
Ordinary least squares analysis often includes the use of diagnostic plots designed to detect departures of the data from the assumed form of the model. These are some of the common diagnostic plots:
- Residuals against the explanatory variables in the model. A non-linear relation between these variables suggests that the linearity of the conditional mean function may not hold. Different levels of variability in the residuals for different levels of the explanatory variables suggest possible heteroscedasticity.
- Residuals against explanatory variables not in the model. Any relation of the residuals to these variables would suggest considering these variables for inclusion in the model.
- Residuals against the fitted values, $\hat{y}$.
- Residuals against the preceding residual. This plot may identify serial correlations in the residuals.
An important consideration when carrying out statistical inference using regression models is how the data were sampled. In this example, the data are averages rather than measurements on individual women. The fit of the model is very good, but this does not imply that the weight of an individual woman can be predicted with high accuracy based only on her height.
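For readers who want to reproduce the fit, the following sketch re-estimates the quadratic model from the tabulated data (numpy only; the printed coefficients should match the Const/Height/Height² values in the tables above and below):

```python
# Re-estimating the quadratic weight-vs-height model from the article's data.
import numpy as np

height = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65, 1.68,
                   1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                   63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

X = np.column_stack([np.ones_like(height), height, height ** 2])
beta_hat = np.linalg.lstsq(X, weight, rcond=None)[0]
resid = weight - X @ beta_hat
n, p = X.shape
s2 = resid @ resid / (n - p)                 # estimate of sigma^2
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
r2 = 1 - resid @ resid / np.sum((weight - weight.mean()) ** 2)
print(beta_hat)                              # approx [128.8128, -143.162, 61.96033]
print(se, r2)
```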
### Sensitivity to rounding
Main article: [Errors-in-variables models](https://en.wikipedia.org/wiki/Errors-in-variables_models "Errors-in-variables models")
See also: [Quantization error model](https://en.wikipedia.org/wiki/Quantization_error_model "Quantization error model")
This example also demonstrates that coefficients determined by these calculations are sensitive to how the data is prepared. The heights were originally given rounded to the nearest inch and have been converted and rounded to the nearest centimetre. Since the conversion factor is one inch to 2.54 cm this is *not* an exact conversion. The original inches can be recovered by Round(x/0.0254) and then re-converted to metric without rounding. If this is done the results become:
| | Const | Height | Height² |
|---|---|---|---|
| Converted to metric with rounding | 128.8128 | ā143.162 | 61.96033 |
| Converted to metric without rounding | 119.0205 | ā131.5076 | 58.5046 |
[Residuals to a quadratic fit for correctly and incorrectly converted data](https://en.wikipedia.org/wiki/File:HeightWeightResiduals.jpg)
Using either of these equations to predict the weight of a 5' 6" (1.6764 m) woman gives similar values: 62.94 kg with rounding vs. 62.98 kg without rounding. Thus a seemingly small variation in the data has a real effect on the coefficients but a small effect on the results of the equation.
While this may look innocuous in the middle of the data range, it could become significant at the extremes or in the case where the fitted model is used to project outside the data range ([extrapolation](https://en.wikipedia.org/wiki/Extrapolation "Extrapolation")).

This highlights a common error: this example is an abuse of OLS, which inherently requires that the errors in the independent variable (in this case height) are zero or at least negligible. The initial rounding to the nearest inch, plus any actual measurement errors, constitutes a finite and non-negligible error. As a result, the fitted parameters are not the best estimates they are presumed to be. Though not totally spurious, the error in the estimation will depend upon the relative size of the *x* and *y* errors.
## Another example with less real data
### Problem statement
We can use the least squares method to figure out the equation of a two-body orbit in polar coordinates. The equation typically used is $r(\theta)=\frac{p}{1-e\cos(\theta)}$, where $r(\theta)$ is the distance of the object from one of the bodies. In the equation the parameters $p$ and $e$ are used to determine the path of the orbit. We have measured the following data:
| $\theta$ (in degrees) | 43 | 45 | 52 | 93 | 108 | 116 |
|---|---|---|---|---|---|---|
| $r(\theta)$ | 4.7126 | 4.5542 | 4.0419 | 2.2187 | 1.8910 | 1.7599 |
We need to find the least-squares approximation of $e$ and $p$ for the given data.
### Solution
First we need to represent *e* and *p* in linear form, so we rewrite the equation as $\frac{1}{r(\theta)}=\frac{1}{p}-\frac{e}{p}\cos(\theta)$.
Furthermore, one could fit for [apsides](https://en.wikipedia.org/wiki/Apsides "Apsides") by expanding $\cos(\theta)$ with an extra parameter as $\cos(\theta-\theta_{0})=\cos(\theta)\cos(\theta_{0})+\sin(\theta)\sin(\theta_{0})$, which is linear in both $\cos(\theta)$ and the extra basis function $\sin(\theta)$.
We use the original two-parameter form to represent our observational data as:
$$A^{\operatorname{T}}A\binom{x}{y}=A^{\operatorname{T}}b,$$
where $x=1/p$ and $y=e/p$; $A$ contains the coefficients of $1/p$ in the first column, which are all 1, and the coefficients of $e/p$ in the second column, given by $-\cos(\theta)$; and $b=1/r(\theta)$, such that:
$$A={\begin{bmatrix}1&-0.731354\\1&-0.707107\\1&-0.615661\\1&0.052336\\1&0.309017\\1&0.438371\end{bmatrix}},\qquad b={\begin{bmatrix}0.21220\\0.21958\\0.24741\\0.45071\\0.52883\\0.56820\end{bmatrix}}.$$
On solving we get $\binom{x}{y}=\binom{0.43478}{0.30435}$, so $p=\frac{1}{x}=2.3000$ and $e=p\cdot y=0.70001$.
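A sketch of the same computation with numpy, using the matrices $A$ and $b$ given above:

```python
# Solving the normal equations A'A (x, y)' = A'b for the orbit parameters.
import numpy as np

A = np.array([[1, -0.731354],
              [1, -0.707107],
              [1, -0.615661],
              [1,  0.052336],
              [1,  0.309017],
              [1,  0.438371]])
b = np.array([0.21220, 0.21958, 0.24741, 0.45071, 0.52883, 0.56820])

x, y = np.linalg.solve(A.T @ A, A.T @ b)     # least-squares solution (x, y)
p_hat, e_hat = 1 / x, y / x                  # p = 1/x, e = p * y
print(x, y)                                  # approx 0.43478, 0.30435
print(p_hat, e_hat)                          # approx 2.3000, 0.70001
```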
## See also
- [Bayesian least squares](https://en.wikipedia.org/wiki/Minimum_mean_square_error "Minimum mean square error")
- [FamaāMacBeth regression](https://en.wikipedia.org/wiki/Fama%E2%80%93MacBeth_regression "FamaāMacBeth regression")
- [Nonlinear least squares](https://en.wikipedia.org/wiki/Non-linear_least_squares "Non-linear least squares")
- [Numerical methods for linear least squares](https://en.wikipedia.org/wiki/Numerical_methods_for_linear_least_squares "Numerical methods for linear least squares")
- [Nonlinear system identification](https://en.wikipedia.org/wiki/Nonlinear_system_identification "Nonlinear system identification")
## References
1. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-1)** ["The Origins of Ordinary Least Squares Assumptions"](https://mathvoices.ams.org/featurecolumn/2022/03/01/ordinary-least-squares/). *Feature Column*. 2022-03-01. Retrieved 2024-05-16.
2. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-2)** ["What is a complete list of the usual assumptions for linear regression?"](https://stats.stackexchange.com/q/16381). *Cross Validated*. Retrieved 2022-09-28.
3. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-3)** [Goldberger, Arthur S.](https://en.wikipedia.org/wiki/Arthur_Goldberger "Arthur Goldberger") (1964). ["Classical Linear Regression"](https://books.google.com/books?id=KZq5AAAAIAAJ&pg=PA156). [*Econometric Theory*](https://archive.org/details/econometrictheor0000gold/page/158). New York: John Wiley & Sons. p. [158](https://archive.org/details/econometrictheor0000gold/page/158). [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [0-471-31101-4](https://en.wikipedia.org/wiki/Special:BookSources/0-471-31101-4 "Special:BookSources/0-471-31101-4").
4. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-4)** [Hayashi, Fumio](https://en.wikipedia.org/wiki/Fumio_Hayashi "Fumio Hayashi") (2000). *Econometrics*. Princeton University Press. p. 15. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [9780691010182](https://en.wikipedia.org/wiki/Special:BookSources/9780691010182 "Special:BookSources/9780691010182").
5. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-5)** [Hayashi (2000](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFHayashi2000), page 18).
6. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-6)** Ghilani, Charles D.; Wolf, Paul R. (12 June 2006). [*Adjustment Computations: Spatial Data Analysis*](https://books.google.com/books?id=hZ4mAOXVowoC&pg=PA160). John Wiley & Sons. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [9780471697282](https://en.wikipedia.org/wiki/Special:BookSources/9780471697282 "Special:BookSources/9780471697282").
7. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-7)** Hofmann-Wellenhof, Bernhard; Lichtenegger, Herbert; Wasle, Elmar (20 November 2007). [*GNSS ā Global Navigation Satellite Systems: GPS, GLONASS, Galileo, and more*](https://books.google.com/books?id=Np7y43HU_m8C&pg=PA263). Springer. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [9783211730171](https://en.wikipedia.org/wiki/Special:BookSources/9783211730171 "Special:BookSources/9783211730171").
8. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-8)** Xu, Guochang (5 October 2007). [*GPS: Theory, Algorithms and Applications*](https://books.google.com/books?id=peYFZ69HqEsC&pg=PA134). Springer. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [9783540727156](https://en.wikipedia.org/wiki/Special:BookSources/9783540727156 "Special:BookSources/9783540727156").
9. ^ [***a***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Hayashi_2000_loc=page_19_9-0) [***b***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Hayashi_2000_loc=page_19_9-1) [Hayashi (2000](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFHayashi2000), page 19)
10. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-q011_10-0)** Hoaglin, David C.; Welsch, Roy E. (1978). ["The Hat Matrix in Regression and ANOVA"](https://doi.org/10.1080%2F00031305.1978.10479237). *The American Statistician*. **32** (1): 17ā22. [doi](https://en.wikipedia.org/wiki/Doi_\(identifier\) "Doi (identifier)"):[10.1080/00031305.1978.10479237](https://doi.org/10.1080%2F00031305.1978.10479237). [hdl](https://en.wikipedia.org/wiki/Hdl_\(identifier\) "Hdl (identifier)"):[1721.1/1920](https://hdl.handle.net/1721.1%2F1920). [ISSN](https://en.wikipedia.org/wiki/ISSN_\(identifier\) "ISSN (identifier)") [0003-1305](https://search.worldcat.org/issn/0003-1305).
11. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-11)** [Julian Faraway (2000), *Practical Regression and Anova using R*](https://cran.r-project.org/doc/contrib/Faraway-PRA.pdf)
12. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-12)** Kenney, J.; Keeping, E. S. (1963). *Mathematics of Statistics*. van Nostrand. p. 187.
13. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-13)** Zwillinger, Daniel (1995). [*Standard Mathematical Tables and Formulae*](https://en.wikipedia.org/wiki/CRC_Standard_Mathematical_Tables "CRC Standard Mathematical Tables"). Chapman & Hall/CRC. p. 626. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [0-8493-2479-3](https://en.wikipedia.org/wiki/Special:BookSources/0-8493-2479-3 "Special:BookSources/0-8493-2479-3").
14. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-14)** [Hayashi (2000](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFHayashi2000), page 20)
15. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-15)** Akbarzadeh, Vahab (7 May 2014). ["Line Estimation"](https://mlmadesimple.wordpress.com/2014/05/07/line-estimation/).
16. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-16)** [Hayashi (2000](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFHayashi2000), page 49)
17. ^ [***a***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Hayashi_2000_loc=page_52_17-0) [***b***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Hayashi_2000_loc=page_52_17-1) [Hayashi (2000](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFHayashi2000), page 52)
18. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Hayashi_2000_loc=page_10_18-0)** [Hayashi (2000](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFHayashi2000), page 10)
19. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Tibshirani-1996_19-0)** Tibshirani, Robert (1996). "Regression Shrinkage and Selection via the Lasso". *Journal of the Royal Statistical Society, Series B*. **58** (1): 267ā288. [doi](https://en.wikipedia.org/wiki/Doi_\(identifier\) "Doi (identifier)"):[10.1111/j.2517-6161.1996.tb02080.x](https://doi.org/10.1111%2Fj.2517-6161.1996.tb02080.x). [JSTOR](https://en.wikipedia.org/wiki/JSTOR_\(identifier\) "JSTOR (identifier)") [2346178](https://www.jstor.org/stable/2346178).
20. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Efron-2004_20-0)** Efron, Bradley; Hastie, Trevor; Johnstone, Iain; Tibshirani, Robert (2004). "Least Angle Regression". *The Annals of Statistics*. **32** (2): 407ā451. [arXiv](https://en.wikipedia.org/wiki/ArXiv_\(identifier\) "ArXiv (identifier)"):[math/0406456](https://arxiv.org/abs/math/0406456). [doi](https://en.wikipedia.org/wiki/Doi_\(identifier\) "Doi (identifier)"):[10.1214/009053604000000067](https://doi.org/10.1214%2F009053604000000067). [JSTOR](https://en.wikipedia.org/wiki/JSTOR_\(identifier\) "JSTOR (identifier)") [3448465](https://www.jstor.org/stable/3448465). [S2CID](https://en.wikipedia.org/wiki/S2CID_\(identifier\) "S2CID (identifier)") [204004121](https://api.semanticscholar.org/CorpusID:204004121).
21. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Hawkins-1973_21-0)** Hawkins, Douglas M. (1973). "On the Investigation of Alternative Regressions by Principal Component Analysis". *Journal of the Royal Statistical Society, Series C*. **22** (3): 275ā286. [doi](https://en.wikipedia.org/wiki/Doi_\(identifier\) "Doi (identifier)"):[10.2307/2346776](https://doi.org/10.2307%2F2346776). [JSTOR](https://en.wikipedia.org/wiki/JSTOR_\(identifier\) "JSTOR (identifier)") [2346776](https://www.jstor.org/stable/2346776).
22. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Jolliffe-1982_22-0)** Jolliffe, Ian T. (1982). "A Note on the Use of Principal Components in Regression". *Journal of the Royal Statistical Society, Series C*. **31** (3): 300ā303. [doi](https://en.wikipedia.org/wiki/Doi_\(identifier\) "Doi (identifier)"):[10.2307/2348005](https://doi.org/10.2307%2F2348005). [JSTOR](https://en.wikipedia.org/wiki/JSTOR_\(identifier\) "JSTOR (identifier)") [2348005](https://www.jstor.org/stable/2348005).
23. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-23)** [Hayashi (2000](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFHayashi2000), pages 27, 30)
24. ^ [***a***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-HayashiFSP_24-0) [***b***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-HayashiFSP_24-1) [***c***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-HayashiFSP_24-2) [Hayashi (2000](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFHayashi2000), page 27)
25. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-25)** [Amemiya, Takeshi](https://en.wikipedia.org/wiki/Takeshi_Amemiya "Takeshi Amemiya") (1985). [*Advanced Econometrics*](https://archive.org/details/advancedeconomet00amem). Harvard University Press. p. [13](https://archive.org/details/advancedeconomet00amem/page/13). [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [9780674005600](https://en.wikipedia.org/wiki/Special:BookSources/9780674005600 "Special:BookSources/9780674005600").
26. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-26)** [Amemiya (1985](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFAmemiya1985), page 14)
27. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-27)** [Rao, C. R.](https://en.wikipedia.org/wiki/C._R._Rao "C. R. Rao") (1973). *Linear Statistical Inference and its Applications* (Second ed.). New York: J. Wiley & Sons. p. 319. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [0-471-70823-2](https://en.wikipedia.org/wiki/Special:BookSources/0-471-70823-2 "Special:BookSources/0-471-70823-2").
28. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-28)** [Amemiya (1985](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFAmemiya1985), page 20)
29. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-29)** [Amemiya (1985](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFAmemiya1985), page 27)
30. ^ [***a***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-DvdMck33_30-0) [***b***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-DvdMck33_30-1) Davidson, Russell; [MacKinnon, James G.](https://en.wikipedia.org/wiki/James_G._MacKinnon "James G. MacKinnon") (1993). *Estimation and Inference in Econometrics*. New York: Oxford University Press. p. 33. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [0-19-506011-3](https://en.wikipedia.org/wiki/Special:BookSources/0-19-506011-3 "Special:BookSources/0-19-506011-3").
31. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-31)** [Davidson & MacKinnon (1993](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFDavidsonMacKinnon1993), page 36)
32. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-32)** [Davidson & MacKinnon (1993](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFDavidsonMacKinnon1993), page 20)
33. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-33)** ["Memento on EViews Output"](https://scholar.harvard.edu/files/jbenchimol/files/memento-eviews.pdf) (PDF). Retrieved 28 December 2020.
34. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-34)** [Amemiya (1985](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFAmemiya1985), page 21)
35. ^ [***a***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Amemiya22_35-0) [***b***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Amemiya22_35-1) [Amemiya (1985](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFAmemiya1985), page 22)
36. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-36)** Burnham, Kenneth P.; Anderson, David R. (2002). [*Model Selection and Multi-Model Inference*](https://archive.org/details/modelselectionmu0000burn) (2nd ed.). Springer. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [0-387-95364-7](https://en.wikipedia.org/wiki/Special:BookSources/0-387-95364-7 "Special:BookSources/0-387-95364-7").
## Further reading
- [Dougherty, Christopher](https://en.wikipedia.org/wiki/Christopher_Dougherty "Christopher Dougherty") (2002). *Introduction to Econometrics* (2nd ed.). New York: Oxford University Press. pp. 48ā113. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [0-19-877643-8](https://en.wikipedia.org/wiki/Special:BookSources/0-19-877643-8 "Special:BookSources/0-19-877643-8").
- [Gujarati, Damodar N.](https://en.wikipedia.org/wiki/Damodar_N._Gujarati "Damodar N. Gujarati"); [Porter, Dawn C.](https://en.wikipedia.org/wiki/Dawn_C._Porter "Dawn C. Porter") (2009). *Basic Econometrics* (Fifth ed.). Boston: McGraw-Hill Irwin. pp. 55ā96. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [978-0-07-337577-9](https://en.wikipedia.org/wiki/Special:BookSources/978-0-07-337577-9 "Special:BookSources/978-0-07-337577-9").
- [Heij, Christiaan](https://en.wikipedia.org/wiki/Christiaan_Heij "Christiaan Heij"); Boer, Paul; [Franses, Philip H.](https://en.wikipedia.org/wiki/Philip_Hans_Franses "Philip Hans Franses"); [Kloek, Teun](https://en.wikipedia.org/wiki/Teun_Kloek "Teun Kloek"); [van Dijk, Herman K.](https://en.wikipedia.org/wiki/Herman_K._van_Dijk "Herman K. van Dijk") (2004). *Econometric Methods with Applications in Business and Economics* (1st ed.). Oxford: Oxford University Press. pp. 76ā115. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [978-0-19-926801-6](https://en.wikipedia.org/wiki/Special:BookSources/978-0-19-926801-6 "Special:BookSources/978-0-19-926801-6").
- Hill, R. Carter; Griffiths, William E.; Lim, Guay C. (2008). *Principles of Econometrics* (3rd ed.). Hoboken, NJ: John Wiley & Sons. pp. 8ā47. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [978-0-471-72360-8](https://en.wikipedia.org/wiki/Special:BookSources/978-0-471-72360-8 "Special:BookSources/978-0-471-72360-8").
- [Wooldridge, Jeffrey](https://en.wikipedia.org/wiki/Jeffrey_Wooldridge "Jeffrey Wooldridge") (2008). ["The Simple Regression Model"](https://books.google.com/books?id=64vt5TDBNLwC&pg=PA22). *Introductory Econometrics: A Modern Approach* (4th ed.). Mason, OH: Cengage Learning. pp. 22ā67. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [978-0-324-58162-1](https://en.wikipedia.org/wiki/Special:BookSources/978-0-324-58162-1 "Special:BookSources/978-0-324-58162-1").
[Figure](https://en.wikipedia.org/wiki/File:Okuns_law_quarterly_differences.svg): [Okun's law](https://en.wikipedia.org/wiki/Okun%27s_law "Okun's law") in [macroeconomics](https://en.wikipedia.org/wiki/Macroeconomics "Macroeconomics") states that in an economy, [GDP](https://en.wikipedia.org/wiki/GDP "GDP") growth should depend linearly on changes in the unemployment rate. Here the ordinary least squares method is used to construct the regression line describing this law.
In [statistics](https://en.wikipedia.org/wiki/Statistics "Statistics"), **ordinary least squares** (**OLS**) is a type of [linear least squares](https://en.wikipedia.org/wiki/Linear_least_squares "Linear least squares") method for choosing the unknown [parameters](https://en.wikipedia.org/wiki/Statistical_parameter "Statistical parameter") in a [linear regression](https://en.wikipedia.org/wiki/Linear_regression "Linear regression") model (one in which the [dependent variable](https://en.wikipedia.org/wiki/Dependent_variable "Dependent variable") is modeled as a [linear function](https://en.wikipedia.org/wiki/Linear_function "Linear function") of a set of [explanatory variables](https://en.wikipedia.org/wiki/Explanatory_variable "Explanatory variable")) by the principle of [least squares](https://en.wikipedia.org/wiki/Least_squares "Least squares"): minimizing the sum of the squares of the differences between the values of the dependent variable observed in the input [dataset](https://en.wikipedia.org/wiki/Dataset "Dataset") and the values predicted by the linear function of the [independent variable](https://en.wikipedia.org/wiki/Independent_variable "Independent variable"). Some sources consider OLS to be linear regression.[\[1\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-1)
Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surfaceāthe smaller the differences, the better the model fits the data. The resulting [estimator](https://en.wikipedia.org/wiki/Statistical_estimation "Statistical estimation") can be expressed by a simple formula, especially in the case of a [simple linear regression](https://en.wikipedia.org/wiki/Simple_linear_regression "Simple linear regression"), in which there is a single [regressor](https://en.wikipedia.org/wiki/Regressor "Regressor") on the right side of the regression equation.
The OLS estimator is [consistent](https://en.wikipedia.org/wiki/Consistent_estimator "Consistent estimator") when the regressors are [exogenous](https://en.wikipedia.org/wiki/Exogenous "Exogenous") and there is no perfect [collinearity](https://en.wikipedia.org/wiki/Collinearity "Collinearity") (the rank condition), consistent for the variance estimate of the residuals when the regressors have finite fourth moments,[\[2\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-2) and, by the [Gauss–Markov theorem](https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem "Gauss–Markov theorem"), [optimal in the class of linear unbiased estimators](https://en.wikipedia.org/wiki/Best_linear_unbiased_estimator "Best linear unbiased estimator") when the [errors](https://en.wikipedia.org/wiki/Statistical_error "Statistical error") are [homoscedastic](https://en.wikipedia.org/wiki/Homoscedastic "Homoscedastic") and [serially uncorrelated](https://en.wikipedia.org/wiki/Autocorrelation "Autocorrelation"). Under these conditions, the method of OLS provides [minimum-variance mean-unbiased](https://en.wikipedia.org/wiki/UMVU "UMVU") estimation when the errors have finite [variances](https://en.wikipedia.org/wiki/Variance "Variance"). Under the additional assumption that the errors are [normally distributed](https://en.wikipedia.org/wiki/Normal_distribution "Normal distribution") with zero mean, OLS coincides with the [maximum likelihood estimator](https://en.wikipedia.org/wiki/Maximum_likelihood_estimator "Maximum likelihood estimator") and is optimal among all unbiased estimators, linear or not.
Suppose the data consists of $n$ [observations](https://en.wikipedia.org/wiki/Statistical_unit "Statistical unit") $\left\{ \mathbf{x}_i, y_i \right\}_{i=1}^{n}$. Each observation $i$ includes a scalar response $y_i$ and a column vector $\mathbf{x}_i$ of $p$ parameters (regressors), i.e., $\mathbf{x}_i = \left[ x_{i1}, x_{i2}, \dots, x_{ip} \right]^{\mathsf T}$. In a [linear regression model](https://en.wikipedia.org/wiki/Linear_regression_model "Linear regression model"), the response variable, $y_i$, is a linear function of the regressors:

$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i,$$

or in [vector](https://en.wikipedia.org/wiki/Row_and_column_vectors "Row and column vectors") form,

$$y_i = \mathbf{x}_i^{\mathsf T} \boldsymbol{\beta} + \varepsilon_i,$$

where $\mathbf{x}_i$, as introduced previously, is a column vector of the $i$-th observation of all the explanatory variables; $\boldsymbol{\beta}$ is a $p \times 1$ vector of unknown parameters; and the scalar $\varepsilon_i$ represents unobserved random variables ([errors](https://en.wikipedia.org/wiki/Errors_and_residuals_in_statistics "Errors and residuals in statistics")) of the $i$-th observation. $\varepsilon_i$ accounts for the influences upon the responses $y_i$ from sources other than the explanatory variables $\mathbf{x}_i$. This model can also be written in matrix notation as

$$\mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\varepsilon},$$

where $\mathbf{y}$ and $\boldsymbol{\varepsilon}$ are $n \times 1$ vectors of the response variables and the errors of the $n$ observations, and $X$ is an $n \times p$ matrix of regressors, also sometimes called the [design matrix](https://en.wikipedia.org/wiki/Design_matrix "Design matrix"), whose row $i$ is $\mathbf{x}_i^{\mathsf T}$ and contains the $i$-th observations on all the explanatory variables.

Typically, a constant term is included in the set of regressors $X$, say, by taking $x_{i1} = 1$ for all $i = 1, \dots, n$. The coefficient $\beta_1$ corresponding to this regressor is called the *intercept*. Without the intercept, the fitted line is forced to cross the origin when $x_i = 0$.
Regressors do not have to be independent for estimation to be consistent: for example, they may be non-linearly dependent. Short of perfect multicollinearity, parameter estimates may still be consistent; however, as multicollinearity rises, the standard errors of the estimates increase and their precision falls. When there is perfect multicollinearity, it is no longer possible to obtain unique estimates for the coefficients of the related regressors; estimation of these parameters cannot converge (and thus cannot be consistent).
As a concrete example where regressors are non-linearly dependent yet estimation may still be consistent, suppose we suspect the response depends linearly both on a value and on its square; we would then include one regressor whose value is the square of another regressor. In that case, the model is *quadratic* in the second regressor, but it is nonetheless still considered a *linear* model, because the model *is* still linear in the parameters ($\boldsymbol{\beta}$).
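The following sketch (a minimal illustration assuming only NumPy; the data and coefficient values are made up) fits such a model, whose design matrix contains both a regressor and its square:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-3, 3, size=n)
# Synthetic response: quadratic in x, but linear in the parameters (1.0, 2.0, -0.5)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.3, size=n)

# Design matrix with an intercept column, x, and its square: the second and
# third regressors are (non-linearly) dependent, yet OLS still applies.
X = np.column_stack([np.ones(n), x, x**2])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # approximately [1.0, 2.0, -0.5]
```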
### Matrix/vector formulation
Consider an [overdetermined system](https://en.wikipedia.org/wiki/Overdetermined_system "Overdetermined system")

$$\sum_{j=1}^{p} X_{ij} \beta_j = y_i, \qquad i = 1, 2, \dots, n,$$

of $n$ [linear equations](https://en.wikipedia.org/wiki/Linear_equation "Linear equation") in $p$ unknown [coefficients](https://en.wikipedia.org/wiki/Coefficients "Coefficients") $\beta_1, \beta_2, \dots, \beta_p$, with $n > p$. This can be written in [matrix](https://en.wikipedia.org/wiki/Matrix_\(mathematics\) "Matrix (mathematics)") form as

$$X \boldsymbol{\beta} = \mathbf{y},$$

where

$$X = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix}, \qquad \boldsymbol{\beta} = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.$$

(Note: for a linear model as above, not all elements in $X$ contain information on the data points. The first column is populated with ones, $X_{i1} = 1$; only the other columns contain actual data. So here $p$ is equal to the number of regressors plus one.)
Such a system usually has no exact solution, so the goal is instead to find the coefficients $\boldsymbol{\beta}$ which fit the equations "best", in the sense of solving the [quadratic](https://en.wikipedia.org/wiki/Quadratic_form_\(statistics\) "Quadratic form (statistics)") [minimization](https://en.wikipedia.org/wiki/Mathematical_optimization "Mathematical optimization") problem

$$\hat{\boldsymbol{\beta}} = \operatorname*{arg\,min}_{\boldsymbol{\beta}} S(\boldsymbol{\beta}),$$

where the objective function $S$ is given by

$$S(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left| y_i - \sum_{j=1}^{p} X_{ij} \beta_j \right|^2 = \left\| \mathbf{y} - X \boldsymbol{\beta} \right\|^2.$$

A justification for choosing this criterion is given in [Properties](https://en.wikipedia.org/wiki/Ordinary_least_squares#Properties) below. This minimization problem has a unique solution, provided that the $p$ columns of the matrix $X$ are [linearly independent](https://en.wikipedia.org/wiki/Linearly_independent "Linearly independent"), given by solving the so-called *normal equations*:

$$\left( X^{\mathsf T} X \right) \hat{\boldsymbol{\beta}} = X^{\mathsf T} \mathbf{y}.$$

The matrix $X^{\mathsf T} X$ is known as the *normal matrix* or [Gram matrix](https://en.wikipedia.org/wiki/Gram_matrix "Gram matrix"), and the matrix $X^{\mathsf T} \mathbf{y}$ is known as the [moment matrix](https://en.wikipedia.org/wiki/Moment_matrix "Moment matrix") of regressand by regressors.[\[3\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-3) Finally, $\hat{\boldsymbol{\beta}}$ is the coefficient vector of the least-squares [hyperplane](https://en.wikipedia.org/wiki/Hyperplane "Hyperplane"), expressed as

$$\hat{\boldsymbol{\beta}} = \left( X^{\mathsf T} X \right)^{-1} X^{\mathsf T} \mathbf{y}$$

or

$$\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + \left( X^{\mathsf T} X \right)^{-1} X^{\mathsf T} \boldsymbol{\varepsilon}.$$
Suppose *b* is a "candidate" value for the parameter vector *β*. The quantity $y_i - \mathbf{x}_i^{\mathsf T} b$, called the *[residual](https://en.wikipedia.org/wiki/Errors_and_residuals_in_statistics "Errors and residuals in statistics")* for the *i*-th observation, measures the vertical distance between the data point $(\mathbf{x}_i, y_i)$ and the hyperplane $y = \mathbf{x}^{\mathsf T} b$, and thus assesses the degree of fit between the actual data and the model. The *[sum of squared residuals](https://en.wikipedia.org/wiki/Sum_of_squared_residuals "Sum of squared residuals")* (*SSR*) (also called the *error sum of squares* (*ESS*) or *residual sum of squares* (*RSS*))[\[4\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-4) is a measure of the overall model fit:

$$S(b) = \sum_{i=1}^{n} \left( y_i - \mathbf{x}_i^{\mathsf T} b \right)^2 = (\mathbf{y} - X b)^{\mathsf T} (\mathbf{y} - X b),$$

where $\mathsf{T}$ denotes the matrix [transpose](https://en.wikipedia.org/wiki/Transpose "Transpose"), and the rows of $X$, denoting the values of all the independent variables associated with a particular value of the dependent variable, are $X_i = \mathbf{x}_i^{\mathsf T}$. The value of *b* which minimizes this sum is called the **OLS estimator for *β***. The function $S(b)$ is quadratic in *b* with positive-definite [Hessian](https://en.wikipedia.org/wiki/Hessian_matrix "Hessian matrix"), and therefore this function possesses a unique global minimum at $b = \hat{\beta}$, which can be given by the explicit formula[\[5\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-5)[\[proof\]](https://en.wikipedia.org/wiki/Proofs_involving_ordinary_least_squares#Least_squares_estimator_for_.CE.B2 "Proofs involving ordinary least squares")

$$\hat{\beta} = \operatorname*{arg\,min}_{b \in \mathbb{R}^p} S(b) = \left( X^{\mathsf T} X \right)^{-1} X^{\mathsf T} \mathbf{y}.$$

The product $N = X^{\mathsf T} X$ is a [Gram matrix](https://en.wikipedia.org/wiki/Gram_matrix "Gram matrix"), and its inverse, $Q = N^{-1}$, is the *cofactor matrix* of *β*,[\[6\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-6)[\[7\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-7)[\[8\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-8) closely related to its [covariance matrix](https://en.wikipedia.org/wiki/Ordinary_least_squares#Covariance_matrix), $C_\beta$. The matrix $(X^{\mathsf T} X)^{-1} X^{\mathsf T} = Q X^{\mathsf T}$ is called the [Moore–Penrose pseudoinverse](https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_pseudoinverse "Moore–Penrose pseudoinverse") matrix of $X$. This formulation highlights the point that estimation can be carried out if, and only if, there is no perfect [multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity "Multicollinearity") between the explanatory variables (which would cause the Gram matrix to have no inverse).
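As a minimal numerical sketch (assuming NumPy; the data are synthetic), the estimator can be computed either from the normal equations or through the pseudoinverse, and the two agree whenever $X$ has full column rank:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
beta_true = np.array([0.5, -1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.1, size=50)

# Normal equations: (X^T X) beta = X^T y
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Pseudoinverse: beta = X^+ y, where X^+ = (X^T X)^{-1} X^T for full-rank X
beta_pinv = np.linalg.pinv(X) @ y

assert np.allclose(beta_ne, beta_pinv)
print(beta_ne)  # close to beta_true
```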
After we have estimated *β*, the *[fitted values](https://en.wikipedia.org/wiki/Fitted_value "Fitted value")* (or *predicted values*) from the regression will be

$$\hat{\mathbf{y}} = X \hat{\beta} = P \mathbf{y},$$

where $P = X (X^{\mathsf T} X)^{-1} X^{\mathsf T}$ is the *[projection matrix](https://en.wikipedia.org/wiki/Projection_matrix "Projection matrix")* onto the space *V* spanned by the columns of $X$. This matrix *P* is also sometimes called the *[hat matrix](https://en.wikipedia.org/wiki/Hat_matrix "Hat matrix")* because it "puts a hat" onto the variable *y*. Another matrix, closely related to *P*, is the *annihilator* matrix $M = I_n - P$; this is a projection matrix onto the space orthogonal to *V*. Both matrices *P* and *M* are [symmetric](https://en.wikipedia.org/wiki/Symmetric_matrix "Symmetric matrix") and [idempotent](https://en.wikipedia.org/wiki/Idempotent_matrix "Idempotent matrix") (meaning that $P^2 = P$ and $M^2 = M$), and relate to the data matrix $X$ via the identities $PX = X$ and $MX = 0$.[\[9\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Hayashi_2000_loc=page_19-9) Matrix *M* creates the *residuals* from the regression:

$$\hat{\boldsymbol{\varepsilon}} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - X \hat{\beta} = M \mathbf{y} = M (X \boldsymbol{\beta} + \boldsymbol{\varepsilon}) = M \boldsymbol{\varepsilon}.$$
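A short sketch (NumPy, synthetic data) verifying the stated identities $P^2 = P$, $M^2 = M$, $PX = X$ and $MX = 0$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 4))
n = X.shape[0]

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projects onto col(X)
M = np.eye(n) - P                      # projects onto the orthogonal complement

assert np.allclose(P @ P, P) and np.allclose(M @ M, M)
assert np.allclose(P @ X, X) and np.allclose(M @ X, 0)

# M applied to y yields the residuals, i.e. y minus the fitted values.
y = rng.normal(size=n)
residuals = M @ y
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(residuals, y - X @ beta_hat))  # True
```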
The variances of the predicted values $\hat{y}_i$ are found on the main diagonal of the [variance-covariance matrix](https://en.wikipedia.org/wiki/Variance-covariance_matrix "Variance-covariance matrix") of predicted values:

$$\widehat{\operatorname{Var}}\left[ \hat{\mathbf{y}} \mid X \right] = s^2 P,$$

where $P$ is the projection matrix and $s^2$ is the sample variance.[\[10\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-q011-10) The full matrix is very large; its diagonal elements can be calculated individually as

$$\widehat{\operatorname{Var}}\left[ \hat{y}_i \mid X \right] = s^2\, X_i \left( X^{\mathsf T} X \right)^{-1} X_i^{\mathsf T},$$

where $X_i$ is the $i$-th row of matrix $X$.

Using these residuals we can estimate the sample variance $s^2$ using the *[reduced chi-squared](https://en.wikipedia.org/wiki/Reduced_chi-squared "Reduced chi-squared")* statistic:

$$s^2 = \frac{\hat{\boldsymbol{\varepsilon}}^{\mathsf T} \hat{\boldsymbol{\varepsilon}}}{n - p} = \frac{\mathbf{y}^{\mathsf T} M \mathbf{y}}{n - p} = \frac{S(\hat{\beta})}{n - p}, \qquad \hat{\sigma}^2 = \frac{n - p}{n}\, s^2.$$

The denominator, $n - p$, is the [statistical degrees of freedom](https://en.wikipedia.org/wiki/Degrees_of_freedom_\(statistics\) "Degrees of freedom (statistics)"). The first quantity, $s^2$, is the OLS estimate for $\sigma^2$, whereas the second, $\hat{\sigma}^2$, is the MLE estimate for $\sigma^2$. The two estimators are quite similar in large samples; the first is always [unbiased](https://en.wikipedia.org/wiki/Estimator_bias "Estimator bias"), while the second is biased but has a smaller [mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error "Mean squared error"). In practice $s^2$ is used more often, since it is more convenient for hypothesis testing. The square root of $s^2$ is called the *[regression standard error](https://en.wikipedia.org/wiki/Regression_standard_error "Regression standard error")*,[\[11\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-11) *standard error of the regression*,[\[12\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-12)[\[13\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-13) or *standard error of the equation*.[\[9\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Hayashi_2000_loc=page_19-9)
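A brief sketch (NumPy, synthetic data with a known error standard deviation) contrasting the unbiased estimate $s^2$ with the MLE $\hat{\sigma}^2$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 4
X = rng.normal(size=(n, p))
y = X @ np.ones(p) + rng.normal(scale=2.0, size=n)   # true sigma = 2

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta_hat) ** 2)

s2 = rss / (n - p)          # unbiased; divides by the degrees of freedom
sigma2_mle = rss / n        # biased downward, but smaller mean squared error
print(np.sqrt(s2), np.sqrt(sigma2_mle))   # both near 2
```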
It is common to assess the goodness-of-fit of the OLS regression by comparing how much the initial variation in the sample can be reduced by regressing onto $X$. The *[coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination "Coefficient of determination")* $R^2$ is defined as the ratio of the "explained" variance to the "total" variance of the dependent variable $y$, in the cases where the total sum of squares decomposes into the regression sum of squares plus the sum of squared residuals:[\[14\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-14)

$$R^2 = \frac{\sum \left( \hat{y}_i - \bar{y} \right)^2}{\sum \left( y_i - \bar{y} \right)^2} = 1 - \frac{\mathbf{y}^{\mathsf T} M \mathbf{y}}{\mathbf{y}^{\mathsf T} L \mathbf{y}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},$$

where TSS is the *[total sum of squares](https://en.wikipedia.org/wiki/Total_sum_of_squares "Total sum of squares")* for the dependent variable, $L = I_n - \frac{1}{n} J_n$, and $J_n$ is an $n \times n$ matrix of ones. ($L$ is a [centering matrix](https://en.wikipedia.org/wiki/Centering_matrix "Centering matrix") which is equivalent to regression on a constant; it simply subtracts the mean from a variable.) In order for $R^2$ to be meaningful, the matrix $X$ of data on regressors must contain a column vector of ones to represent the constant whose coefficient is the regression intercept. In that case, $R^2$ will always be a number between 0 and 1, with values close to 1 indicating a good degree of fit.
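A numerical sketch (NumPy; illustrative data) computing $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$ for a model with an intercept column:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n)
y = 3.0 + 1.5 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])     # intercept column is required
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

rss = np.sum((y - X @ beta_hat) ** 2)    # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)        # total sum of squares about the mean
r2 = 1.0 - rss / tss
print(r2)                                # between 0 and 1, close to 1 here
```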
### Simple linear regression model
If the data matrix $X$ contains only two variables, a constant and a scalar regressor $x_i$, then this is called the "simple regression model". This case is often considered in introductory statistics classes, as it provides much simpler formulas, suitable even for manual calculation. The parameters are commonly denoted as $(\alpha, \beta)$:

$$y_i = \alpha + \beta x_i + \varepsilon_i.$$
The least squares estimates in this case are given by simple formulas
![{\\displaystyle {\\begin{aligned}{\\widehat {\\beta }}&={\\frac {\\sum \_{i=1}^{n}{(x\_{i}-{\\bar {x}})(y\_{i}-{\\bar {y}})}}{\\sum \_{i=1}^{n}{(x\_{i}-{\\bar {x}})^{2}}}}\\\\\[2pt\]{\\widehat {\\alpha }}&={\\bar {y}}-{\\widehat {\\beta }}\\,{\\bar {x}}\\ ,\\end{aligned}}}](https://wikimedia.org/api/rest_v1/media/math/render/svg/932c6407f7ceba533fef69961fe504fc3b565e1e)
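These closed-form expressions translate directly into code; the sketch below (NumPy, synthetic data) checks them against the general matrix solution:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.7 * x + rng.normal(scale=0.4, size=50)

# Closed-form simple-regression estimates
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

# They match the general least-squares solution with an intercept column.
X = np.column_stack([np.ones_like(x), x])
assert np.allclose([alpha_hat, beta_hat], np.linalg.lstsq(X, y, rcond=None)[0])
```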
## Alternative derivations
In the previous section the least squares estimator $\hat{\beta}$ was obtained as a value that minimizes the sum of squared residuals of the model. However, it is also possible to derive the same estimator from other approaches. In all cases the formula for the OLS estimator remains the same, $\hat{\beta} = (X^{\mathsf T} X)^{-1} X^{\mathsf T} \mathbf{y}$; the only difference is in how we interpret this result.
[Figure](https://en.wikipedia.org/wiki/File:OLS_geometric_interpretation.svg): OLS estimation can be viewed as a projection onto the linear space spanned by the regressors. (Here each of $x_1$ and $x_2$ refers to a column of the data matrix.)
[Figure](https://en.wikipedia.org/wiki/File:Geometric_interpretation_of_least_squares_\(three_observations\).png): Least squares as the projection of $\mathbf{y}$ onto $\operatorname{col}(X)$ for three observations; $\hat{\mathbf{y}} = X\hat{\beta}$ gives the fitted values and $\mathbf{y} - \hat{\mathbf{y}}$ is the residual.
For mathematicians, OLS is an approximate solution to an overdetermined system of linear equations $X \boldsymbol{\beta} \approx \mathbf{y}$, where $\boldsymbol{\beta}$ is the unknown. Assuming the system cannot be solved exactly (the number of equations $n$ being much larger than the number of unknowns $p$), we are looking for a solution that provides the smallest discrepancy between the right- and left-hand sides. In other words, we are looking for the solution that satisfies

$$\hat{\boldsymbol{\beta}} = \operatorname*{arg\,min}_{\boldsymbol{\beta}} \left\| \mathbf{y} - X \boldsymbol{\beta} \right\|,$$

where $\|\cdot\|$ is the standard [*L*2 norm](https://en.wikipedia.org/wiki/Norm_\(mathematics\)#Euclidean_norm "Norm (mathematics)") in the $n$-dimensional [Euclidean space](https://en.wikipedia.org/wiki/Euclidean_space "Euclidean space") $\mathbb{R}^n$. The predicted quantity $X \boldsymbol{\beta}$ is just a certain linear combination of the vectors of regressors. Thus, the residual vector $\mathbf{y} - X \boldsymbol{\beta}$ will have the smallest length when $\mathbf{y}$ is [projected orthogonally](https://en.wikipedia.org/wiki/Projection_\(linear_algebra\) "Projection (linear algebra)") onto the [linear subspace](https://en.wikipedia.org/wiki/Linear_subspace "Linear subspace") [spanned](https://en.wikipedia.org/wiki/Linear_span "Linear span") by the columns of $X$. The OLS estimator $\hat{\boldsymbol{\beta}}$ in this case can be interpreted as the coefficients of the [vector decomposition](https://en.wikipedia.org/wiki/Vector_decomposition "Vector decomposition") of $\hat{\mathbf{y}} = P\mathbf{y}$ along the basis of $X$.

In other words, the gradient equations at the minimum can be written as

$$X^{\mathsf T} \left( \mathbf{y} - X \hat{\boldsymbol{\beta}} \right) = 0.$$

A geometrical interpretation of these equations is that the vector of residuals, $\mathbf{y} - X \hat{\boldsymbol{\beta}}$, is orthogonal to the [column space](https://en.wikipedia.org/wiki/Column_space "Column space") of $X$, since the [dot product](https://en.wikipedia.org/wiki/Dot_product "Dot product") $(\mathbf{y} - X \hat{\boldsymbol{\beta}}) \cdot (X \mathbf{v})$ is equal to zero for *any* conformal vector $\mathbf{v}$. This means that $\mathbf{y} - X \hat{\boldsymbol{\beta}}$ is the shortest of all possible vectors $\mathbf{y} - X \boldsymbol{\beta}$, that is, the variance of the residuals is the minimum possible. This is illustrated in the figures above.
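A quick sketch (NumPy, arbitrary synthetic data) of this orthogonality condition, $X^{\mathsf T}(\mathbf{y} - X\hat{\beta}) = 0$:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(30, 3))
y = rng.normal(size=30)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# The residual vector is orthogonal to every column of X.
print(X.T @ residuals)   # numerically zero in every component
```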
Introducing $\hat{\boldsymbol{\gamma}}$ and a matrix $K$ with the assumption that the matrix $[X \ K]$ is non-singular and $K^{\mathsf T} X = 0$ (cf. [Orthogonal projections](https://en.wikipedia.org/wiki/Linear_projection#Orthogonal_projections "Linear projection")), the residual vector should satisfy the following equation:

$$\hat{\mathbf{r}} := \mathbf{y} - X \hat{\boldsymbol{\beta}} = K \hat{\boldsymbol{\gamma}}.$$

The equation and solution of linear least squares are thus described as follows:

$$\mathbf{y} = \begin{bmatrix} X & K \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\gamma}} \end{bmatrix}, \qquad \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\gamma}} \end{bmatrix} = \begin{bmatrix} X & K \end{bmatrix}^{-1} \mathbf{y} = \begin{bmatrix} \left( X^{\mathsf T} X \right)^{-1} X^{\mathsf T} \\ \left( K^{\mathsf T} K \right)^{-1} K^{\mathsf T} \end{bmatrix} \mathbf{y}.$$
Another way of looking at it is to consider the regression line to be a weighted average of the lines passing through any two points in the dataset.[\[15\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-15) Although this way of calculation is more computationally expensive, it provides better intuition about OLS.
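A sketch of this pairwise-lines view for simple regression, under the assumption (consistent with the cited description) that each pairwise slope is weighted by the squared horizontal distance of its pair; the x values are drawn from a continuous distribution so no two points share the same x:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(10)
x = rng.normal(size=30)                      # continuous draws: distinct x's
y = 1.0 + 2.5 * x + rng.normal(scale=0.5, size=30)

pairs = list(combinations(range(len(x)), 2))
slopes = np.array([(y[j] - y[i]) / (x[j] - x[i]) for i, j in pairs])
weights = np.array([(x[j] - x[i]) ** 2 for i, j in pairs])

# Weighted average of all pairwise slopes...
slope_pairwise = np.sum(weights * slopes) / np.sum(weights)
# ...equals the OLS slope from the usual closed form.
slope_ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
assert np.allclose(slope_pairwise, slope_ols)
```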
The OLS estimator is identical to the [maximum likelihood estimator](https://en.wikipedia.org/wiki/Maximum_likelihood_estimator "Maximum likelihood estimator") (MLE) under the normality assumption for the error terms.[\[16\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-16)[\[proof\]](https://en.wikipedia.org/wiki/Proofs_involving_ordinary_least_squares#Maximum_likelihood_approach "Proofs involving ordinary least squares") This normality assumption has historical importance, as it provided the basis for the early work in linear regression analysis by [Yule](https://en.wikipedia.org/wiki/Udny_Yule "Udny Yule") and [Pearson](https://en.wikipedia.org/wiki/Karl_Pearson "Karl Pearson").\[*[citation needed](https://en.wikipedia.org/wiki/Wikipedia:Citation_needed "Wikipedia:Citation needed")*\] From the properties of MLE, we can infer that the OLS estimator is asymptotically efficient (in the sense of attaining the [Cramér–Rao bound](https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93Rao_bound "Cramér–Rao bound") for variance) if the normality assumption is satisfied.[\[17\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Hayashi_2000_loc=page_52-17)
### Generalized method of moments
In the [i.i.d.](https://en.wikipedia.org/wiki/Iid "Iid") case, the OLS estimator can also be viewed as a [GMM](https://en.wikipedia.org/wiki/Generalized_method_of_moments "Generalized method of moments") estimator arising from the moment conditions
![{\\displaystyle \\mathrm {E} {\\big \[}\\,x\_{i}\\left(y\_{i}-x\_{i}^{\\operatorname {T} }\\beta \\right)\\,{\\big \]}=0.}](https://wikimedia.org/api/rest_v1/media/math/render/svg/1d7894c141dad7e6dae3aed8bb708aada174daf2)
These moment conditions state that the regressors should be uncorrelated with the errors. Since *xi* is a *p*-vector, the number of moment conditions is equal to the dimension of the parameter vector *β*, and thus the system is exactly identified. This is the so-called classical GMM case, when the estimator does not depend on the choice of the weighting matrix.
Note that the original strict exogeneity assumption E\[*εi* \| *xi*\] = 0 implies a far richer set of moment conditions than stated above. In particular, this assumption implies that for any vector-function *f*, the moment condition E\[*f*(*xi*)·*εi*\] = 0 will hold. However, it can be shown using the [Gauss–Markov theorem](https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem "Gauss–Markov theorem") that the optimal choice of function *f* is to take *f*(*x*) = *x*, which results in the moment equation posted above.
There are several different frameworks in which the [linear regression model](https://en.wikipedia.org/wiki/Linear_regression_model "Linear regression model") can be cast in order to make the OLS technique applicable. Each of these settings produces the same formulas and same results. The only difference is the interpretation and the assumptions which have to be imposed in order for the method to give meaningful results. The choice of the applicable framework depends mostly on the nature of the data at hand, and on the inference task to be performed.
One of the lines of difference in interpretation is whether to treat the regressors as random variables, or as predefined constants. In the first case (**random design**) the regressors *xi* are random and sampled together with the *yi*'s from some [population](https://en.wikipedia.org/wiki/Statistical_population "Statistical population"), as in an [observational study](https://en.wikipedia.org/wiki/Observational_study "Observational study"). This approach allows for more natural study of the [asymptotic properties](https://en.wikipedia.org/wiki/Asymptotic_theory_\(statistics\) "Asymptotic theory (statistics)") of the estimators. In the other interpretation (**fixed design**), the regressors *X* are treated as known constants set by a [design](https://en.wikipedia.org/wiki/Design_of_experiments "Design of experiments"), and *y* is sampled conditionally on the values of *X* as in an [experiment](https://en.wikipedia.org/wiki/Experiment "Experiment"). For practical purposes, this distinction is often unimportant, since estimation and inference is carried out while conditioning on *X*. All results stated in this article are within the random design framework.
The classical model focuses on the "finite sample" estimation and inference, meaning that the number of observations *n* is fixed. This contrasts with the other approaches, which study the [asymptotic behavior](https://en.wikipedia.org/wiki/Asymptotic_theory_\(statistics\) "Asymptotic theory (statistics)") of OLS, and in which the behavior at a large number of samples is studied. To prove finite sample unbiasedness of the OLS estimator, we require the following assumptions.
[Figure](https://en.wikipedia.org/wiki/File:Polyreg_scheffe.svg): Example of a cubic polynomial regression, which is a type of linear regression. Although *polynomial regression* fits a curve model to the data, as a [statistical estimation](https://en.wikipedia.org/wiki/Estimation_theory "Estimation theory") problem it is linear, in the sense that the conditional expectation function $\mathbb{E}[y \mid x]$ is linear in the unknown [parameters](https://en.wikipedia.org/wiki/Parameter "Parameter") that are estimated from the [data](https://en.wikipedia.org/wiki/Data "Data"). For this reason, polynomial regression is considered to be a special case of [multiple linear regression](https://en.wikipedia.org/wiki/Multiple_linear_regression "Multiple linear regression").
- **Exogeneity**. The regressors do not [covary](https://en.wikipedia.org/wiki/Covariance "Covariance") with the error term: ![{\\displaystyle \\mathbb {E} \[\\varepsilon \_{i}x\_{i}\]=0.}](https://wikimedia.org/api/rest_v1/media/math/render/svg/865dc6e3427f19d12afd3bed41c45bb7661ef289) This requires, for example, that there are no [omitted variables](https://en.wikipedia.org/wiki/Omitted_variable_bias "Omitted variable bias") that covary with observed variables and affect the response variable. An alternative (but stronger) statement that is often required when explaining linear regression in [mathematical statistics](https://en.wikipedia.org/wiki/Mathematical_statistics "Mathematical statistics") is that the predictor variables *x* can be treated as fixed values, rather than [random variables](https://en.wikipedia.org/wiki/Random_variable "Random variable"). This stronger form means, for example, that the predictor variables are assumed to be error-free, that is, not contaminated with measurement error. Although this assumption is not realistic in many settings, dropping it leads to more complex [errors-in-variables models](https://en.wikipedia.org/wiki/Errors-in-variables_models "Errors-in-variables models"), [instrumental variable models](https://en.wikipedia.org/wiki/Instrumental_variable "Instrumental variable") and the like.
- **Linearity**, or **correct specification**. This means that the mean of the response variable is a [linear combination](https://en.wikipedia.org/wiki/Linear_combination "Linear combination") of the parameters (regression coefficients) and the predictor variables. Note that this assumption is much less restrictive than it may at first seem. Because the predictor variables are treated as fixed values (see above), linearity is really only a restriction on the parameters. The predictor variables themselves can be arbitrarily transformed, and in fact multiple copies of the same underlying predictor variable can be added, each one transformed differently. This technique is used, for example, in [polynomial regression](https://en.wikipedia.org/wiki/Polynomial_regression "Polynomial regression"), which uses linear regression to fit the response variable as an arbitrary [polynomial](https://en.wikipedia.org/wiki/Polynomial "Polynomial") function (up to a given degree) of a predictor variable. With this much flexibility, models such as polynomial regression often have "too much power", in that they tend to [overfit](https://en.wikipedia.org/wiki/Overfit "Overfit") the data. As a result, some kind of [regularization](https://en.wikipedia.org/wiki/Regularization_\(mathematics\) "Regularization (mathematics)") must typically be used to prevent unreasonable solutions coming out of the estimation process. Common examples are [ridge regression](https://en.wikipedia.org/wiki/Ridge_regression "Ridge regression") and [lasso regression](https://en.wikipedia.org/wiki/Lasso_regression "Lasso regression"). [Bayesian linear regression](https://en.wikipedia.org/wiki/Bayesian_linear_regression "Bayesian linear regression") can also be used, which by its nature is more or less immune to the problem of overfitting. (In fact, [ridge regression](https://en.wikipedia.org/wiki/Ridge_regression "Ridge regression") and [lasso regression](https://en.wikipedia.org/wiki/Lasso_regression "Lasso regression") can both be viewed as special cases of Bayesian linear regression, with particular types of [prior distributions](https://en.wikipedia.org/wiki/Prior_distribution "Prior distribution") placed on the regression coefficients.)
[Figure](https://en.wikipedia.org/wiki/File:Heteroscedasticity_in_Linear_Regression.png): Visualization of heteroscedasticity in a scatter plot against 100 random fitted values using Matlab.
- **Constant variance** or **[homoscedasticity](https://en.wikipedia.org/wiki/Homoscedasticity "Homoscedasticity")**. This means that the variance of the errors does not depend on the values of the predictor variables: $\mathbb{E}[\varepsilon_i^2 \mid x_i] = \sigma^2$. Thus the variability of the responses for given fixed values of the predictors is the same regardless of how large or small the responses are. This is often not the case, as a variable whose mean is large will typically have a greater variance than one whose mean is small. For example, a person whose income is predicted to be \$100,000 may easily have an actual income of \$80,000 or \$120,000 (i.e., a [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation "Standard deviation") of around \$20,000), while another person with a predicted income of \$10,000 is unlikely to have the same \$20,000 standard deviation, since that would imply their actual income could vary anywhere between −\$10,000 and \$30,000. (In fact, as this shows, in many cases, often the same cases where the assumption of normally distributed errors fails, the variance or standard deviation should be predicted to be proportional to the mean, rather than constant.) The absence of homoscedasticity is called [heteroscedasticity](https://en.wikipedia.org/wiki/Heteroscedasticity "Heteroscedasticity"). In order to check this assumption, a plot of residuals versus predicted values (or the values of each individual predictor) can be examined for a "fanning effect" (i.e., increasing or decreasing vertical spread as one moves left to right on the plot). A plot of the absolute or squared residuals versus the predicted values (or each predictor) can also be examined for a trend or curvature. Formal tests can also be used; see [Heteroscedasticity](https://en.wikipedia.org/wiki/Heteroscedasticity "Heteroscedasticity"). The presence of heteroscedasticity will result in an overall "average" estimate of variance being used instead of one that takes into account the true variance structure. This leads to less precise (but in the case of ordinary least squares, not biased) parameter estimates and biased standard errors, resulting in misleading tests and interval estimates. The [mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error "Mean squared error") for the model will also be wrong. Various estimation techniques including [weighted least squares](https://en.wikipedia.org/wiki/Weighted_least_squares "Weighted least squares") and the use of [heteroscedasticity-consistent standard errors](https://en.wikipedia.org/wiki/Heteroscedasticity-consistent_standard_errors "Heteroscedasticity-consistent standard errors") can handle heteroscedasticity in a quite general way. [Bayesian linear regression](https://en.wikipedia.org/wiki/Bayesian_linear_regression "Bayesian linear regression") techniques can also be used when the variance is assumed to be a function of the mean.
It is also possible in some cases to fix the problem by applying a transformation to the response variable (e.g., fitting the [logarithm](https://en.wikipedia.org/wiki/Logarithm "Logarithm") of the response variable using a linear regression model, which implies that the response variable itself has a [log-normal distribution](https://en.wikipedia.org/wiki/Log-normal_distribution "Log-normal distribution") rather than a [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution "Normal distribution")).
[Figure](https://en.wikipedia.org/wiki/File:Independence_of_Errors_Assumption_for_Linear_Regressions.png): To check for violations of the assumptions of linearity, constant variance, and independence of errors within a linear regression model, the residuals are typically plotted against the predicted values (or each of the individual predictors). An apparently random scatter of points about the horizontal midline at 0 is ideal, but cannot rule out certain kinds of violations, such as [autocorrelation](https://en.wikipedia.org/wiki/Autocorrelation "Autocorrelation") in the errors or their correlation with one or more covariates.
- **Uncorrelatedness of errors**. This assumes that the errors of the response variables are uncorrelated with each other: ![{\\displaystyle \\mathbb {E} \[\\varepsilon \_{i}\\varepsilon \_{j}\|x\_{i},x\_{j}\]=0.}](https://wikimedia.org/api/rest_v1/media/math/render/svg/d2049f9a6b8ce702724b8271a5912158bf3f0fc3) Some methods such as [generalized least squares](https://en.wikipedia.org/wiki/Generalized_least_squares "Generalized least squares") are capable of handling correlated errors, although they typically require significantly more data unless some sort of [regularization](https://en.wikipedia.org/wiki/Regularization_\(mathematics\) "Regularization (mathematics)") is used to bias the model towards assuming uncorrelated errors. [Bayesian linear regression](https://en.wikipedia.org/wiki/Bayesian_linear_regression "Bayesian linear regression") is a general way of handling this issue. Full [statistical independence](https://en.wikipedia.org/wiki/Statistical_independence "Statistical independence") is a stronger condition than mere lack of correlation and is often not needed, although it implies mean-independence.
- **Lack of perfect multicollinearity** in the predictors. For standard [least squares](https://en.wikipedia.org/wiki/Least_squares "Least squares") estimation methods, the design matrix *X* must have full [column rank](https://en.wikipedia.org/wiki/Column_rank "Column rank") *p*:[\[18\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Hayashi_2000_loc=page_10-18) $\Pr\!\big[\operatorname{rank}(X) = p\big] = 1.$ If this assumption is violated, perfect [multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity "Multicollinearity") exists in the predictor variables, meaning a linear relationship exists between two or more predictor variables. Multicollinearity can be caused by accidentally duplicating a variable in the data, using a linear transformation of a variable along with the original (e.g., the same temperature measurements expressed in Fahrenheit and Celsius), or including a linear combination of multiple variables in the model, such as their mean. It can also happen if there is too little data available compared to the number of parameters to be estimated (e.g., fewer data points than regression coefficients). Near violations of this assumption, where predictors are highly but not perfectly correlated, can reduce the precision of parameter estimates (see [Variance inflation factor](https://en.wikipedia.org/wiki/Variance_inflation_factor "Variance inflation factor")). In the case of perfect multicollinearity, the parameter vector ***β*** will be [non-identifiable](https://en.wikipedia.org/wiki/Non-identifiable "Non-identifiable"); it has no unique solution. In such a case, only some of the parameters can be identified (i.e., their values can only be estimated within some linear subspace of the full parameter space **R***p*). See [partial least squares regression](https://en.wikipedia.org/wiki/Partial_least_squares_regression "Partial least squares regression"). Methods for fitting linear models with multicollinearity have been developed,[\[19\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Tibshirani-1996-19)[\[20\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Efron-2004-20)[\[21\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Hawkins-1973-21)[\[22\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Jolliffe-1982-22) some of which require additional assumptions such as "effect sparsity": that a large fraction of the effects are exactly zero. Note that the more computationally expensive iterated algorithms for parameter estimation, such as those used in [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model "Generalized linear model"), do not suffer from this problem.
Violations of these assumptions can result in biased estimations of ***β***, biased standard errors, untrustworthy confidence intervals and significance tests. Beyond these assumptions, several other statistical properties of the data strongly influence the performance of different estimation methods:
- The statistical relationship between the error terms and the regressors plays an important role in determining whether an estimation procedure has desirable sampling properties such as being unbiased and consistent.
- The arrangement, or [probability distribution](https://en.wikipedia.org/wiki/Probability_distribution "Probability distribution") of the predictor variables **x** has a major influence on the precision of estimates of ***β***. [Sampling](https://en.wikipedia.org/wiki/Sampling_\(statistics\) "Sampling (statistics)") and [design of experiments](https://en.wikipedia.org/wiki/Design_of_experiments "Design of experiments") are highly developed subfields of statistics that provide guidance for collecting data in such a way to achieve a precise estimate of ***β***.
### Finite sample properties
First of all, under the *strict exogeneity* assumption the OLS estimators $\hat{\beta}$ and $s^2$ are [unbiased](https://en.wikipedia.org/wiki/Bias_of_an_estimator "Bias of an estimator"), meaning that their expected values coincide with the true values of the parameters:[\[23\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-23)[\[proof\]](https://en.wikipedia.org/wiki/Proofs_involving_ordinary_least_squares#Unbiasedness_of_.CE.B2.CC.82 "Proofs involving ordinary least squares")
![{\\displaystyle \\operatorname {E} \[\\,{\\hat {\\beta }}\\mid X\\,\]=\\beta ,\\quad \\operatorname {E} \[\\,s^{2}\\mid X\\,\]=\\sigma ^{2}.}](https://wikimedia.org/api/rest_v1/media/math/render/svg/67bc2fd0f90c46da207712893fdcea01e729026c)
If the strict exogeneity does not hold (as is the case with many [time series](https://en.wikipedia.org/wiki/Time_series "Time series") models, where exogeneity is assumed only with respect to the past shocks but not the future ones), then these estimators will be biased in finite samples.
The *[variance-covariance matrix](https://en.wikipedia.org/wiki/Variance-covariance_matrix "Variance-covariance matrix")* (or simply *covariance matrix*) of $\hat{\beta}$ is equal to[\[24\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-HayashiFSP-24)
![{\\displaystyle \\operatorname {Var} \[\\,{\\hat {\\beta }}\\mid X\\,\]=\\sigma ^{2}\\left(X^{\\operatorname {T} }X\\right)^{-1}=\\sigma ^{2}Q.}](https://wikimedia.org/api/rest_v1/media/math/render/svg/08f6cb596d94073731ee47f4a2571dbbfc1d214a)
In particular, the standard error of each coefficient $\hat{\beta}_j$ is equal to the square root of the $j$-th diagonal element of this matrix. The estimate of this standard error is obtained by replacing the unknown quantity $\sigma^2$ with its estimate $s^2$. Thus,

$$\widehat{\operatorname{s.e.}}\left( \hat{\beta}_j \right) = \sqrt{s^2 \left[ \left( X^{\mathsf T} X \right)^{-1} \right]_{jj}}.$$
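A sketch (NumPy, synthetic data) of these estimated standard errors, taken as the square roots of the diagonal of $s^2 (X^{\mathsf T} X)^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
s2 = np.sum((y - X @ beta_hat) ** 2) / (n - p)   # unbiased variance estimate

cov_beta = s2 * np.linalg.inv(X.T @ X)           # estimated covariance matrix
std_err = np.sqrt(np.diag(cov_beta))
print(np.column_stack([beta_hat, std_err]))      # coefficient, standard error
```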
It can also be easily shown that the estimator $\hat{\beta}$ is uncorrelated with the residuals from the model:[\[24\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-HayashiFSP-24)
![{\\displaystyle \\operatorname {Cov} \[\\,{\\hat {\\beta }},{\\hat {\\varepsilon }}\\mid X\\,\]=0.}](https://wikimedia.org/api/rest_v1/media/math/render/svg/664c1a5e37957a1aa2ae381b9bcb07350c2c816c)
The *[Gauss–Markov theorem](https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem "Gauss–Markov theorem")* states that under the *spherical errors* assumption (that is, the errors should be [uncorrelated](https://en.wikipedia.org/wiki/Uncorrelated "Uncorrelated") and [homoscedastic](https://en.wikipedia.org/wiki/Homoscedastic "Homoscedastic")) the estimator $\hat{\beta}$ is efficient in the class of linear unbiased estimators. This is called the *best linear unbiased estimator* (BLUE). Efficiency here means that if we were to find some other estimator $\tilde{\beta}$ which is linear in $y$ and unbiased, then[\[24\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-HayashiFSP-24)
![{\\displaystyle \\operatorname {Var} \[\\,{\\tilde {\\beta }}\\mid X\\,\]-\\operatorname {Var} \[\\,{\\hat {\\beta }}\\mid X\\,\]\\geq 0}](https://wikimedia.org/api/rest_v1/media/math/render/svg/53796c9205889cc4d675b9749a58eb97fcd998f1)
in the sense that this is a [nonnegative-definite matrix](https://en.wikipedia.org/wiki/Nonnegative-definite_matrix "Nonnegative-definite matrix"). This theorem establishes optimality only in the class of linear unbiased estimators, which is quite restrictive. Depending on the distribution of the error terms *ε*, other, non-linear estimators may provide better results than OLS.
The properties listed so far are all valid regardless of the underlying distribution of the error terms. However, if we are willing to assume that the *normality assumption* holds (that is, that $\varepsilon \sim N(0, \sigma^2 I_n)$), then additional properties of the OLS estimators can be stated.
The estimator $\hat{\beta}$ is normally distributed, with mean and variance as given before:[\[25\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-25)

$$\hat{\beta}\ \sim\ \mathcal{N}\!\left( \beta,\ \sigma^2 \left( X^{\mathsf T} X \right)^{-1} \right).$$
This estimator reaches the [Cramér–Rao bound](https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93Rao_bound "Cramér–Rao bound") for the model, and thus is optimal in the class of all unbiased estimators.[\[17\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Hayashi_2000_loc=page_52-17) Note that unlike the [Gauss–Markov theorem](https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem "Gauss–Markov theorem"), this result establishes optimality among both linear and non-linear estimators, but only in the case of normally distributed error terms.
The estimator $s^2$ will be proportional to the [chi-squared distribution](https://en.wikipedia.org/wiki/Chi-squared_distribution "Chi-squared distribution"):[\[26\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-26)

$$s^2\ \sim\ \frac{\sigma^2}{n - p} \cdot \chi^2_{n - p}.$$
The variance of this estimator is equal to $2\sigma^4/(n - p)$, which does not attain the [Cramér–Rao bound](https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93Rao_bound "Cramér–Rao bound") of $2\sigma^4/n$. However, it was shown that there are no unbiased estimators of $\sigma^2$ with variance smaller than that of the estimator $s^2$.[\[27\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-27) If we are willing to allow biased estimators, and consider the class of estimators that are proportional to the sum of squared residuals (SSR) of the model, then the best (in the sense of the [mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error "Mean squared error")) estimator in this class will be $\tilde{\sigma}^2 = \mathrm{SSR} / (n - p + 2)$, which even beats the Cramér–Rao bound in the case when there is only one regressor ($p = 1$).[\[28\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-28)
Moreover, the estimators $\hat{\beta}$ and $s^2$ are [independent](https://en.wikipedia.org/wiki/Independent_random_variables "Independent random variables"),[\[29\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-29) a fact which comes in useful when constructing the t- and F-tests for the regression.
#### Influential observations
As was mentioned before, the estimator $\hat\beta$ is linear in *y*, meaning that it represents a linear combination of the dependent variables *yi*. The weights in this linear combination are functions of the regressors *X*, and generally are unequal. The observations with high weights are called **influential** because they have a more pronounced effect on the value of the estimator.
To analyze which observations are influential we remove a specific *j*-th observation and consider how much the estimated quantities are going to change (similarly to the [jackknife method](https://en.wikipedia.org/wiki/Jackknife_method "Jackknife method")). It can be shown that the change in the OLS estimator for *β* will be equal to[\[30\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-DvdMck33-30)

$$\hat\beta^{(j)} - \hat\beta = -\frac{1}{1-h_j}\,(X^{\mathrm{T}}X)^{-1}x_j\,\hat\varepsilon_j\,,$$

where $h_j = x_j^{\mathrm{T}}(X^{\mathrm{T}}X)^{-1}x_j$ is the *j*-th diagonal element of the hat matrix *P*, and $x_j$ is the vector of regressors corresponding to the *j*-th observation. Similarly, the change in the predicted value for the *j*-th observation resulting from omitting that observation from the dataset will be equal to[\[30\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-DvdMck33-30)

$$\hat y_j^{(j)} - \hat y_j = x_j^{\mathrm{T}}\hat\beta^{(j)} - x_j^{\mathrm{T}}\hat\beta = -\frac{h_j}{1-h_j}\,\hat\varepsilon_j\,.$$
From the properties of the hat matrix, 0 ≤ *hj* ≤ 1, and they sum up to *p*, so that on average *hj* ā *p*/*n*. These quantities *hj* are called the **leverages**, and observations with high *hj* are called **leverage points**.[\[31\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-31) Usually the observations with high leverage ought to be scrutinized more carefully, in case they are erroneous, or outliers, or in some other way atypical of the rest of the dataset.
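The leave-one-out formula above is easy to verify numerically. A minimal NumPy sketch (names illustrative) compares the closed-form change in $\hat\beta$ against an explicit refit without the *j*-th observation:

```python
# Sketch: closed-form jackknife update vs. an explicit refit (illustrative).
import numpy as np

rng = np.random.default_rng(1)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 3.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)       # leverages: diagonal of the hat matrix

j = 5
delta = -XtX_inv @ X[j] * resid[j] / (1 - h[j])   # closed-form change in beta_hat

mask = np.arange(n) != j
beta_hat_j = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
print(np.allclose(beta_hat_j - beta_hat, delta))  # True
print(h.sum())                                    # leverages sum to p (here 2)
```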
#### Partitioned regression
Sometimes the variables and corresponding parameters in the regression can be logically split into two groups, so that the regression takes the form

$$y = X_1\beta_1 + X_2\beta_2 + \varepsilon,$$

where $X_1$ and $X_2$ have dimensions *n*Ɨ*p*ā, *n*Ɨ*p*ā, and *β*ā, *β*ā are *p*āƗ1 and *p*āƗ1 vectors, with *p*ā + *p*ā = *p*.
The **[Frisch–Waugh–Lovell theorem](https://en.wikipedia.org/wiki/Frisch%E2%80%93Waugh%E2%80%93Lovell_theorem "Frisch–Waugh–Lovell theorem")** states that in this regression the residuals $\hat\varepsilon$ and the OLS estimate $\hat\beta_2$ will be numerically identical to the residuals and the OLS estimate for *β*ā in the following regression:[\[32\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-32)

$$M_1 y = M_1 X_2 \beta_2 + \eta\,,$$

where $M_1 = I - X_1(X_1^{\mathrm{T}}X_1)^{-1}X_1^{\mathrm{T}}$ is the [annihilator matrix](https://en.wikipedia.org/wiki/Annihilator_matrix "Annihilator matrix") for regressors $X_1$.
The theorem can be used to establish a number of theoretical results. For example, having a regression with a constant and another regressor is equivalent to subtracting the means from the dependent variable and the regressor and then running the regression for the de-meaned variables but without the constant term.
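The theorem is also easy to verify numerically. A minimal NumPy sketch (illustrative names) compares $\hat\beta_2$ from the full regression with the estimate from the residual-on-residual regression:

```python
# Sketch: numerical check of the Frisch–Waugh–Lovell theorem (illustrative).
import numpy as np

rng = np.random.default_rng(2)
n, p1, p2 = 40, 2, 2
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, p2))
y = X1 @ np.array([1.0, 2.0]) + X2 @ np.array([3.0, -1.0]) + rng.normal(size=n)

# Full regression on [X1, X2]
beta_full = np.linalg.lstsq(np.hstack([X1, X2]), y, rcond=None)[0]

# Annihilate X1, then regress M1*y on M1*X2
M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
beta2 = np.linalg.lstsq(M1 @ X2, M1 @ y, rcond=None)[0]

print(np.allclose(beta_full[p1:], beta2))  # True
```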
### Large sample properties
The least squares estimators are [point estimates](https://en.wikipedia.org/wiki/Point_estimate "Point estimate") of the linear regression model parameters *β*. However, generally we also want to know how close those estimates might be to the true values of the parameters. In other words, we want to construct [interval estimates](https://en.wikipedia.org/wiki/Interval_estimate "Interval estimate").

Since we have not made any assumption about the distribution of the error term *εi*, it is impossible to infer the exact distribution of the estimators $\hat\beta$ and $\hat\sigma^2$. Nevertheless, we can apply the [central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem "Central limit theorem") to derive their *asymptotic* properties as the sample size *n* goes to infinity. While the sample size is necessarily finite, it is customary to assume that *n* is "large enough" so that the true distribution of the OLS estimator is close to its asymptotic limit.
We can show that under the model assumptions, the least squares estimator for *β* is [consistent](https://en.wikipedia.org/wiki/Consistent_estimator "Consistent estimator") (that is, $\hat\beta$ [converges in probability](https://en.wikipedia.org/wiki/Convergence_of_random_variables#Convergence_in_probability "Convergence of random variables") to *β*) and asymptotically normal:[\[proof\]](https://en.wikipedia.org/wiki/Proofs_involving_ordinary_least_squares#Consistency_and_asymptotic_normality_of_.CE.B2.CC.82 "Proofs involving ordinary least squares")

$$(\hat\beta - \beta)\ \xrightarrow{d}\ \mathcal{N}\big(0,\ \sigma^2 Q_{xx}^{-1}\big),$$

where $Q_{xx} = X^{\mathrm{T}}X$.
Using this asymptotic distribution, approximate two-sided confidence intervals for the *j*-th component of the vector $\hat\beta$ can be constructed as

$$\beta_j \in \Big[\ \hat\beta_j \pm q_{1-\frac{\alpha}{2}}^{\mathcal{N}(0,1)}\sqrt{\hat\sigma^2\big[Q_{xx}^{-1}\big]_{jj}}\ \Big]$$

at the 1 ā *α* confidence level, where *q* denotes the [quantile function](https://en.wikipedia.org/wiki/Quantile_function "Quantile function") of the standard normal distribution, and [Ā·]*jj* is the *j*-th diagonal element of a matrix.
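In code, such an interval comes directly from $(X^{\mathrm{T}}X)^{-1}$ and the estimated error variance. A minimal sketch with NumPy and SciPy (`ols_confint` is an illustrative name, not a standard API):

```python
# Sketch: normal-approximation confidence interval for one coefficient.
import numpy as np
from scipy.stats import norm

def ols_confint(X, y, j, alpha=0.05):
    """Approximate (1 - alpha) interval for beta_j from the asymptotic normal law."""
    n, _ = X.shape
    Qxx_inv = np.linalg.inv(X.T @ X)              # Q_xx = X'X as in the text
    beta_hat = Qxx_inv @ X.T @ y
    sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / n
    se = np.sqrt(sigma2_hat * Qxx_inv[j, j])
    q = norm.ppf(1 - alpha / 2)                   # standard-normal quantile
    return beta_hat[j] - q * se, beta_hat[j] + q * se
```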
Similarly, the least squares estimator for *ϲ* is also consistent and asymptotically normal (provided that the fourth moment of *εi* exists) with limiting distribution

$$(\hat\sigma^2 - \sigma^2)\ \xrightarrow{d}\ \mathcal{N}\big(0,\ \operatorname{E}\big[\varepsilon_i^4\big] - \sigma^4\big).$$
These asymptotic distributions can be used for prediction, testing hypotheses, constructing other estimators, etc. As an example consider the problem of prediction. Suppose $x_0$ is some point within the domain of distribution of the regressors, and one wants to know what the response variable would have been at that point. The [mean response](https://en.wikipedia.org/wiki/Mean_response "Mean response") is the quantity $y_0 = x_0^{\mathrm{T}}\beta$, whereas the [predicted response](https://en.wikipedia.org/wiki/Predicted_response "Predicted response") is $\hat y_0 = x_0^{\mathrm{T}}\hat\beta$. Clearly the predicted response is a random variable; its distribution can be derived from that of $\hat\beta$:

$$\big(\hat y_0 - y_0\big)\ \xrightarrow{d}\ \mathcal{N}\big(0,\ \sigma^2 x_0^{\mathrm{T}}Q_{xx}^{-1}x_0\big),$$
which allows confidence intervals for the mean response $y_0$ to be constructed:

$$y_0 \in \Big[\ x_0^{\mathrm{T}}\hat\beta \pm q_{1-\frac{\alpha}{2}}^{\mathcal{N}(0,1)}\sqrt{\hat\sigma^2\, x_0^{\mathrm{T}}Q_{xx}^{-1}x_0}\ \Big]$$

at the 1 ā *α* confidence level.
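The same ingredients give the mean-response interval; a minimal sketch under the same illustrative conventions as `ols_confint` above:

```python
# Sketch: normal-approximation interval for the mean response at a point x0.
import numpy as np
from scipy.stats import norm

def mean_response_ci(X, y, x0, alpha=0.05):
    Qxx_inv = np.linalg.inv(X.T @ X)
    beta_hat = Qxx_inv @ X.T @ y
    sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / X.shape[0]
    center = x0 @ beta_hat                        # predicted response at x0
    half = norm.ppf(1 - alpha / 2) * np.sqrt(sigma2_hat * x0 @ Qxx_inv @ x0)
    return center - half, center + half
```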
Two hypothesis tests are particularly widely used. First, one wants to know if the estimated regression equation is any better than simply predicting that all values of the response variable equal its sample mean (if not, it is said to have no explanatory power). The [null hypothesis](https://en.wikipedia.org/wiki/Null_hypothesis "Null hypothesis") of no explanatory value of the estimated regression is tested using an [F-test](https://en.wikipedia.org/wiki/F-test "F-test"). If the calculated F-value is found to be large enough to exceed its critical value for the pre-chosen level of significance, the null hypothesis is rejected and the [alternative hypothesis](https://en.wikipedia.org/wiki/Alternative_hypothesis "Alternative hypothesis"), that the regression has explanatory power, is accepted. Otherwise, the null hypothesis of no explanatory power is accepted.
Second, for each explanatory variable of interest, one wants to know whether its estimated coefficient differs significantly from zero, that is, whether this particular explanatory variable in fact has explanatory power in predicting the response variable. Here the null hypothesis is that the true coefficient is zero. This hypothesis is tested by computing the coefficient's [t-statistic](https://en.wikipedia.org/wiki/T-statistic "T-statistic"), as the ratio of the coefficient estimate to its [standard error](https://en.wikipedia.org/wiki/Standard_error "Standard error"). If the t-statistic is larger than a predetermined value, the null hypothesis is rejected and the variable is found to have explanatory power, with its coefficient significantly different from zero. Otherwise, the null hypothesis of a zero value of the true coefficient is accepted.
In addition, the [Chow test](https://en.wikipedia.org/wiki/Chow_test "Chow test") is used to test whether two subsamples both have the same underlying true coefficient values. The sum of squared residuals of regressions on each of the subsets and on the combined data set are compared by computing an F-statistic; if this exceeds a critical value, the null hypothesis of no difference between the two subsets is rejected; otherwise, it is accepted.
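As an illustration of the first test, the following sketch computes the overall-significance F-statistic from the residual and total sums of squares (assuming the first column of `X` is the intercept; the function name is illustrative, not a standard API):

```python
# Sketch: F-test of "no explanatory power" (all slopes equal to zero).
import numpy as np
from scipy.stats import f as f_dist

def overall_f_test(X, y):
    n, p = X.shape                                  # first column assumed to be 1's
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta_hat) ** 2)           # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)               # total sum of squares
    F = ((tss - rss) / (p - 1)) / (rss / (n - p))
    return F, f_dist.sf(F, p - 1, n - p)            # statistic and its p-value
```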
### Violations of assumptions
In a [time series](https://en.wikipedia.org/wiki/Time_series "Time series") model, we require the [stochastic process](https://en.wikipedia.org/wiki/Stochastic_process "Stochastic process") {*xi*, *yi*} to be [stationary](https://en.wikipedia.org/wiki/Stationary_process "Stationary process") and [ergodic](https://en.wikipedia.org/wiki/Ergodic_process "Ergodic process"); if {*xi*, *yi*} is nonstationary, OLS results are often biased unless {*xi*, *yi*} is [co-integrated](https://en.wikipedia.org/wiki/Cointegration "Cointegration").[\[33\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-33)

We still require the regressors to be *strictly exogenous*. If the regressors are only [predetermined](https://en.wikipedia.org/wiki/Weak_exogeneity "Weak exogeneity"), so that merely E[*xiεi*] = 0 for all *i* = 1, ..., *n*, then OLS is biased in finite samples.

Finally, the assumptions on the variance take the form of requiring that {*xiεi*} be a [martingale difference sequence](https://en.wikipedia.org/wiki/Martingale_difference_sequence "Martingale difference sequence"), with a finite matrix of second moments $Q_{xx\varepsilon^2} = \operatorname{E}\big[\varepsilon_i^2 x_i x_i^{\mathrm{T}}\big]$.
#### Constrained estimation
Suppose it is known that the coefficients in the regression satisfy a system of linear equations

$$A\colon\quad Q^{\mathrm{T}}\beta = c,$$

where *Q* is a *p*Ɨ*q* matrix of full rank, and *c* is a *q*Ɨ1 vector of known constants, with *q* < *p*. In this case least squares estimation is equivalent to minimizing the sum of squared residuals of the model subject to the constraint *A*. The **constrained least squares (CLS)** estimator can be given by an explicit formula:[\[34\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-34)

$$\hat\beta^{c} = \hat\beta - (X^{\mathrm{T}}X)^{-1}Q\big(Q^{\mathrm{T}}(X^{\mathrm{T}}X)^{-1}Q\big)^{-1}\big(Q^{\mathrm{T}}\hat\beta - c\big).$$
This expression for the constrained estimator is valid as long as the matrix $X^{\mathrm{T}}X$ is invertible. It was assumed from the beginning of this article that this matrix is of full rank, and it was noted that when the rank condition fails, *β* will not be identifiable. However, it may happen that adding the restriction *A* makes *β* identifiable, in which case one would like to find the formula for the estimator. The estimator is equal to[\[35\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Amemiya22-35)

$$\hat\beta^{c} = R\big(R^{\mathrm{T}}X^{\mathrm{T}}XR\big)^{-1}R^{\mathrm{T}}X^{\mathrm{T}}y + \Big(I_p - R\big(R^{\mathrm{T}}X^{\mathrm{T}}XR\big)^{-1}R^{\mathrm{T}}X^{\mathrm{T}}X\Big)Q\big(Q^{\mathrm{T}}Q\big)^{-1}c,$$
where *R* is a *p*Ɨ(*p* ā *q*) matrix such that the matrix [*Q R*] is non-singular, and $R^{\mathrm{T}}Q = 0$. Such a matrix can always be found, although generally it is not unique. The second formula coincides with the first when $X^{\mathrm{T}}X$ is invertible.[\[35\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-Amemiya22-35)
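A minimal sketch of the first (invertible-$X^{\mathrm{T}}X$) formula, with illustrative names:

```python
# Sketch: constrained least squares via the explicit CLS formula above.
import numpy as np

def constrained_ls(X, y, Q, c):
    """Minimize ||y - X b||^2 subject to Q' b = c (X'X assumed invertible)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y                        # unconstrained OLS
    A = XtX_inv @ Q
    shift = np.linalg.solve(Q.T @ A, Q.T @ beta_hat - c)
    return beta_hat - A @ shift                         # satisfies Q' b = c exactly
```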
## Example with real data
The following data set gives average heights and weights for American women aged 30–39 (source: *The World Almanac and Book of Facts, 1975*).
[Figure: [Scatterplot](https://en.wikipedia.org/wiki/Scatterplot "Scatterplot") of the data; the relationship is slightly curved but close to linear.]

| Height (m) | 1.47 | 1.50 | 1.52 | 1.55 | 1.57 |
|---|---|---|---|---|---|
| Weight (kg) | 52.21 | 53.12 | 54.48 | 55.84 | 57.20 |
| Height (m) | 1.60 | 1.63 | 1.65 | 1.68 | 1.70 |
| Weight (kg) | 58.57 | 59.93 | 61.29 | 63.11 | 64.47 |
| Height (m) | 1.73 | 1.75 | 1.78 | 1.80 | 1.83 |
| Weight (kg) | 66.28 | 68.10 | 69.92 | 72.19 | 74.46 |
When only one dependent variable is being modeled, a [scatterplot](https://en.wikipedia.org/wiki/Scatterplot "Scatterplot") will suggest the form and strength of the relationship between the dependent variable and regressors. It might also reveal outliers, heteroscedasticity, and other aspects of the data that may complicate the interpretation of a fitted regression model. The scatterplot suggests that the relationship is strong and can be approximated as a quadratic function. OLS can handle non-linear relationships by introducing the regressor HEIGHT². The regression model then becomes a multiple linear model:

$$\mathrm{WEIGHT}_i = \beta_1 + \beta_2\,\mathrm{HEIGHT}_i + \beta_3\,\mathrm{HEIGHT}_i^2 + \varepsilon_i.$$
[Figure: Fitted regression.]
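Before looking at package output, the estimates themselves can be reproduced with a minimal NumPy sketch (not the article's software) from the table above; the values should match the "converted to metric with rounding" row in the sensitivity section below:

```python
# Sketch: reproducing the quadratic fit from the tabulated data.
import numpy as np

height = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                   1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                   63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

X = np.column_stack([np.ones_like(height), height, height**2])
beta_hat, *_ = np.linalg.lstsq(X, weight, rcond=None)
print(beta_hat)   # approximately [128.81, -143.16, 61.96]
```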
The output from most popular [statistical packages](https://en.wikipedia.org/wiki/List_of_statistical_packages "List of statistical packages") will look similar to this:
| Method | Least squares |
|---|---|
| Dependent variable | WEIGHT |
| Observations | 15 |

| Parameter | Value | [Std error](https://en.wikipedia.org/wiki/Standard_error "Standard error") | [t-statistic](https://en.wikipedia.org/wiki/T-statistic "T-statistic") | [p-value](https://en.wikipedia.org/wiki/P-value "P-value") |
|---|---|---|---|---|
| Const | 128.8128 | | | |
| HEIGHT | ā143.162 | | | |
| HEIGHT² | 61.96033 | | | |
In this table:
- The *Value* column gives the least squares estimates of parameters *βj*
- The *Std error* column shows [standard errors](https://en.wikipedia.org/wiki/Standard_error_\(statistics\) "Standard error (statistics)") of each coefficient estimate: $\hat\sigma_j = \big(\hat\sigma^2\big[Q_{xx}^{-1}\big]_{jj}\big)^{1/2}$
- The *[t-statistic](https://en.wikipedia.org/wiki/T-statistic "T-statistic")* and *p-value* columns test whether any of the coefficients might be equal to zero. The *t*-statistic is calculated simply as $t = \hat\beta_j/\hat\sigma_j$. If the errors ε follow a normal distribution, *t* follows a Student's t-distribution. Under weaker conditions, *t* is asymptotically normal. Large values of *t* indicate that the null hypothesis can be rejected and that the corresponding coefficient is not zero. The [*p*-value](https://en.wikipedia.org/wiki/P-value "P-value") column expresses the results of the hypothesis test as a [significance level](https://en.wikipedia.org/wiki/Statistical_significance "Statistical significance"). Conventionally, *p*-values smaller than 0.05 are taken as evidence that the population coefficient is nonzero.
- *R-squared* is the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination "Coefficient of determination") indicating goodness-of-fit of the regression. This statistic will be equal to one if fit is perfect, and to zero when regressors *X* have no explanatory power whatsoever. This is a biased estimate of the population *R-squared*, and will never decrease if additional regressors are added, even if they are irrelevant.
- *Adjusted R-squared* is a slightly modified version of $R^2$, designed to penalize the excess number of regressors which do not add to the explanatory power of the regression. This statistic is always smaller than $R^2$, can decrease as new regressors are added, and can even be negative for poorly fitting models:

$$\overline{R}^{\,2} = 1 - \frac{n-1}{n-p}\left(1 - R^2\right)$$
- *Log-likelihood* is calculated under the assumption that the errors follow a normal distribution. Even when that assumption is questionable, this statistic may still find use in conducting LR tests.
- *[Durbin–Watson statistic](https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic "Durbin–Watson statistic")* tests whether there is any evidence of serial correlation between the residuals. As a rule of thumb, a value smaller than 2 is evidence of positive serial correlation.
- *[Akaike information criterion](https://en.wikipedia.org/wiki/Akaike_information_criterion "Akaike information criterion")* and *[Schwarz criterion](https://en.wikipedia.org/wiki/Schwarz_criterion "Schwarz criterion")* are both used for model selection. Generally when comparing two alternative models, smaller values of one of these criteria will indicate a better model.[\[36\]](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_note-36)
- *Standard error of regression* is an estimate of *σ*, the standard error of the error term.
- *Total sum of squares*, *model sum of squares*, and *residual sum of squares* tell us how much of the initial variation in the sample was explained by the regression.
- *F-statistic* tests the hypothesis that all coefficients (except the intercept) are equal to zero. This statistic has the *F*(*p* ā 1, *n* ā *p*) distribution under the null hypothesis and normality assumption, and its *p*-value gives the probability of observing such a value of the statistic when the null hypothesis is true. Note that when errors are not normal this statistic becomes invalid, and other tests such as the [Wald test](https://en.wikipedia.org/wiki/Wald_test "Wald test") or the [LR test](https://en.wikipedia.org/wiki/Likelihood_ratio_test "Likelihood ratio test") should be used.
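Several of these table statistics can be computed directly from the residuals; the sketch below is a minimal illustration of the definitions above (names are illustrative, not any package's API):

```python
# Sketch: summary statistics for an OLS fit (illustrative names).
import numpy as np

def summary_stats(X, y):
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    rss = resid @ resid                           # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)             # total sum of squares
    r2 = 1 - rss / tss                            # coefficient of determination
    r2_adj = 1 - (n - 1) / (n - p) * (1 - r2)     # adjusted R-squared
    s = np.sqrt(rss / (n - p))                    # standard error of regression
    dw = np.sum(np.diff(resid) ** 2) / rss        # Durbin–Watson statistic
    F = ((tss - rss) / (p - 1)) / (rss / (n - p)) # overall F-statistic
    return {"R2": r2, "adj R2": r2_adj, "s": s, "DW": dw, "F": F}
```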
[Figure: Residuals plot.]
Ordinary least squares analysis often includes the use of diagnostic plots designed to detect departures of the data from the assumed form of the model. These are some of the common diagnostic plots:
- Residuals against the explanatory variables in the model. A non-linear relation between these variables suggests that the linearity of the conditional mean function may not hold. Different levels of variability in the residuals for different levels of the explanatory variables suggest possible heteroscedasticity.
- Residuals against explanatory variables not in the model. Any relation of the residuals to these variables would suggest considering these variables for inclusion in the model.
- Residuals against the fitted values, $\hat y_i$.
- Residuals against the preceding residual. This plot may identify serial correlations in the residuals.
An important consideration when carrying out statistical inference using regression models is how the data were sampled. In this example, the data are averages rather than measurements on individual women. The fit of the model is very good, but this does not imply that the weight of an individual woman can be predicted with high accuracy based only on her height.
### Sensitivity to rounding
This example also demonstrates that coefficients determined by these calculations are sensitive to how the data are prepared. The heights were originally given rounded to the nearest inch and have been converted and rounded to the nearest centimetre. Since the conversion factor is one inch to 2.54 cm, this is *not* an exact conversion. The original inches can be recovered by Round(*x*/0.0254) and then re-converted to metric without rounding. If this is done, the results become:
| | Const | Height | Height² |
|---|---|---|---|
| Converted to metric with rounding | 128.8128 | ā143.162 | 61.96033 |
| Converted to metric without rounding | 119.0205 | ā131.5076 | 58.5046 |
[Figure: Residuals to a quadratic fit for correctly and incorrectly converted data.]
Using either of these equations to predict the weight of a 5' 6" (1.6764 m) woman gives similar values: 62.94 kg with rounding vs. 62.98 kg without rounding. Thus a seemingly small variation in the data has a real effect on the coefficients but a small effect on the results of the equation.
While this may look innocuous in the middle of the data range it could become significant at the extremes or in the case where the fitted model is used to project outside the data range ([extrapolation](https://en.wikipedia.org/wiki/Extrapolation "Extrapolation")).
This highlights a common error: this example is an abuse of OLS, which inherently requires that the errors in the independent variable (in this case height) are zero or at least negligible. The initial rounding to the nearest inch plus any actual measurement errors constitute a finite and non-negligible error. As a result, the fitted parameters are not the best estimates they are presumed to be. Though not totally spurious, the error in the estimation will depend upon the relative size of the *x* and *y* errors.
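The rounding effect itself is easy to reproduce; a minimal NumPy sketch following the recipe in the text (Round(*x*/0.0254), then exact re-conversion):

```python
# Sketch: refitting after recovering the original inches (illustrative).
import numpy as np

height_rounded = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                           1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                   63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

inches = np.round(height_rounded / 0.0254)   # recover the original inches
height_exact = inches * 0.0254               # exact metric conversion

for h in (height_rounded, height_exact):
    X = np.column_stack([np.ones_like(h), h, h**2])
    print(np.linalg.lstsq(X, weight, rcond=None)[0])
# First line matches the "with rounding" row, second the "without rounding" row.
```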
## Another example with less real data
We can use the method of least squares to figure out the equation of a two-body orbit in polar coordinates. The equation typically used is

$$r(\theta) = \frac{p}{1 - e\cos(\theta)},$$

where $r(\theta)$ is the distance of the orbiting body from one of the two bodies. In the equation the parameters $p$ and $e$ are used to determine the path of the orbit. We have measured the following data:

| $\theta$ (in degrees) | 43 | 45 | 52 | 93 | 108 | 116 |
|---|---|---|---|---|---|---|
| $r(\theta)$ | 4.7126 | 4.5542 | 4.0419 | 2.2187 | 1.8910 | 1.7599 |

We need to find the least-squares approximation of $e$ and $p$ for the given data.
First we need to represent *e* and *p* in a linear form. So we are going to rewrite the equation $r(\theta) = \frac{p}{1 - e\cos(\theta)}$ as $\frac{1}{r(\theta)} = \frac{1}{p} - \frac{e}{p}\cos(\theta)$.

Furthermore, one could fit for [apsides](https://en.wikipedia.org/wiki/Apsides "Apsides") by expanding $\cos(\theta)$ with an extra phase parameter as $\cos(\theta - \theta_0) = \cos(\theta_0)\cos(\theta) + \sin(\theta_0)\sin(\theta)$, which is linear both in $\cos(\theta)$ and in the extra basis function $\sin(\theta)$.
We use the original two-parameter form to represent our observational data as:

$$A\begin{pmatrix} x \\ y \end{pmatrix} \approx b,$$

where: $x = \frac{1}{p}$; $y = \frac{e}{p}$; $A$ contains the coefficients of $\frac{1}{p}$ in the first column, which are all 1, and the coefficients of $\frac{e}{p}$ in the second column, given by $-\cos(\theta_i)$; and $b$ has entries $b_i = \frac{1}{r(\theta_i)}$, such that:

$$A = \begin{pmatrix} 1 & -\cos(43^\circ) \\ 1 & -\cos(45^\circ) \\ \vdots & \vdots \\ 1 & -\cos(116^\circ) \end{pmatrix}, \qquad b = \begin{pmatrix} 1/4.7126 \\ 1/4.5542 \\ \vdots \\ 1/1.7599 \end{pmatrix}.$$
On solving we get $\begin{pmatrix} \hat{x} \\ \hat{y} \end{pmatrix} = (A^{\mathrm{T}}A)^{-1}A^{\mathrm{T}}b \approx \begin{pmatrix} 0.4348 \\ 0.3043 \end{pmatrix}$,

so $p = 1/\hat{x} \approx 2.3000$ and $e = \hat{y}/\hat{x} \approx 0.7000$.
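The same solution can be obtained numerically; a minimal NumPy sketch using the data and linearization above:

```python
# Sketch: least-squares fit of the linearized orbit equation.
import numpy as np

theta = np.radians([43, 45, 52, 93, 108, 116])
r = np.array([4.7126, 4.5542, 4.0419, 2.2187, 1.8910, 1.7599])

A = np.column_stack([np.ones_like(r), -np.cos(theta)])  # columns for 1/p and e/p
b = 1.0 / r
(x_hat, y_hat), *_ = np.linalg.lstsq(A, b, rcond=None)
p_hat, e_hat = 1.0 / x_hat, y_hat / x_hat
print(p_hat, e_hat)   # approximately 2.30 and 0.70
```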
## See also

- [Bayesian least squares](https://en.wikipedia.org/wiki/Minimum_mean_square_error "Minimum mean square error")
- [FamaāMacBeth regression](https://en.wikipedia.org/wiki/Fama%E2%80%93MacBeth_regression "FamaāMacBeth regression")
- [Nonlinear least squares](https://en.wikipedia.org/wiki/Non-linear_least_squares "Non-linear least squares")
- [Numerical methods for linear least squares](https://en.wikipedia.org/wiki/Numerical_methods_for_linear_least_squares "Numerical methods for linear least squares")
- [Nonlinear system identification](https://en.wikipedia.org/wiki/Nonlinear_system_identification "Nonlinear system identification")
## References

1. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-1)**
["The Origins of Ordinary Least Squares Assumptions"](https://mathvoices.ams.org/featurecolumn/2022/03/01/ordinary-least-squares/). *Feature Column*. 2022-03-01. Retrieved 2024-05-16.
2. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-2)**
["What is a complete list of the usual assumptions for linear regression?"](https://stats.stackexchange.com/q/16381). *Cross Validated*. Retrieved 2022-09-28.
3. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-3)**
[Goldberger, Arthur S.](https://en.wikipedia.org/wiki/Arthur_Goldberger "Arthur Goldberger") (1964). ["Classical Linear Regression"](https://books.google.com/books?id=KZq5AAAAIAAJ&pg=PA156). [*Econometric Theory*](https://archive.org/details/econometrictheor0000gold/page/158). New York: John Wiley & Sons. p. [158](https://archive.org/details/econometrictheor0000gold/page/158). [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [0-471-31101-4](https://en.wikipedia.org/wiki/Special:BookSources/0-471-31101-4 "Special:BookSources/0-471-31101-4").
4. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-4)**
[Hayashi, Fumio](https://en.wikipedia.org/wiki/Fumio_Hayashi "Fumio Hayashi") (2000). *Econometrics*. Princeton University Press. p. 15. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [9780691010182](https://en.wikipedia.org/wiki/Special:BookSources/9780691010182 "Special:BookSources/9780691010182").
5. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-5)** [Hayashi (2000](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFHayashi2000), page 18).
6. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-6)**
Ghilani, Charles D.; Wolf, Paul R. (12 June 2006). [*Adjustment Computations: Spatial Data Analysis*](https://books.google.com/books?id=hZ4mAOXVowoC&pg=PA160). John Wiley & Sons. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [9780471697282](https://en.wikipedia.org/wiki/Special:BookSources/9780471697282 "Special:BookSources/9780471697282").
7. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-7)**
Hofmann-Wellenhof, Bernhard; Lichtenegger, Herbert; Wasle, Elmar (20 November 2007). [*GNSS – Global Navigation Satellite Systems: GPS, GLONASS, Galileo, and more*](https://books.google.com/books?id=Np7y43HU_m8C&pg=PA263). Springer. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [9783211730171](https://en.wikipedia.org/wiki/Special:BookSources/9783211730171 "Special:BookSources/9783211730171").
8. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-8)**
Xu, Guochang (5 October 2007). [*GPS: Theory, Algorithms and Applications*](https://books.google.com/books?id=peYFZ69HqEsC&pg=PA134). Springer. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [9783540727156](https://en.wikipedia.org/wiki/Special:BookSources/9783540727156 "Special:BookSources/9783540727156").
9. ^ [***a***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Hayashi_2000_loc=page_19_9-0) [***b***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Hayashi_2000_loc=page_19_9-1) [Hayashi (2000](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFHayashi2000), page 19)
10. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-q011_10-0)**
Hoaglin, David C.; Welsch, Roy E. (1978). ["The Hat Matrix in Regression and ANOVA"](https://doi.org/10.1080%2F00031305.1978.10479237). *The American Statistician*. **32** (1): 17–22. [doi](https://en.wikipedia.org/wiki/Doi_\(identifier\) "Doi (identifier)"):[10.1080/00031305.1978.10479237](https://doi.org/10.1080%2F00031305.1978.10479237). [hdl](https://en.wikipedia.org/wiki/Hdl_\(identifier\) "Hdl (identifier)"):[1721.1/1920](https://hdl.handle.net/1721.1%2F1920). [ISSN](https://en.wikipedia.org/wiki/ISSN_\(identifier\) "ISSN (identifier)") [0003-1305](https://search.worldcat.org/issn/0003-1305).
11. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-11)** [Julian Faraway (2000), *Practical Regression and Anova using R*](https://cran.r-project.org/doc/contrib/Faraway-PRA.pdf)
12. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-12)**
Kenney, J.; Keeping, E. S. (1963). *Mathematics of Statistics*. van Nostrand. p. 187.
13. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-13)**
Zwillinger, Daniel (1995). [*Standard Mathematical Tables and Formulae*](https://en.wikipedia.org/wiki/CRC_Standard_Mathematical_Tables "CRC Standard Mathematical Tables"). Chapman & Hall/CRC. p. 626. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [0-8493-2479-3](https://en.wikipedia.org/wiki/Special:BookSources/0-8493-2479-3 "Special:BookSources/0-8493-2479-3").
14. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-14)** [Hayashi (2000](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFHayashi2000), page 20)
15. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-15)**
Akbarzadeh, Vahab (7 May 2014). ["Line Estimation"](https://mlmadesimple.wordpress.com/2014/05/07/line-estimation/).
16. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-16)** [Hayashi (2000](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFHayashi2000), page 49)
17. ^ [***a***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Hayashi_2000_loc=page_52_17-0) [***b***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Hayashi_2000_loc=page_52_17-1) [Hayashi (2000](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFHayashi2000), page 52)
18. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Hayashi_2000_loc=page_10_18-0)** [Hayashi (2000](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFHayashi2000), page 10)
19. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Tibshirani-1996_19-0)**
Tibshirani, Robert (1996). "Regression Shrinkage and Selection via the Lasso". *Journal of the Royal Statistical Society, Series B*. **58** (1): 267–288. [doi](https://en.wikipedia.org/wiki/Doi_\(identifier\) "Doi (identifier)"):[10.1111/j.2517-6161.1996.tb02080.x](https://doi.org/10.1111%2Fj.2517-6161.1996.tb02080.x). [JSTOR](https://en.wikipedia.org/wiki/JSTOR_\(identifier\) "JSTOR (identifier)") [2346178](https://www.jstor.org/stable/2346178).
20. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Efron-2004_20-0)**
Efron, Bradley; Hastie, Trevor; Johnstone, Iain; Tibshirani, Robert (2004). "Least Angle Regression". *The Annals of Statistics*. **32** (2): 407–451. [arXiv](https://en.wikipedia.org/wiki/ArXiv_\(identifier\) "ArXiv (identifier)"):[math/0406456](https://arxiv.org/abs/math/0406456). [doi](https://en.wikipedia.org/wiki/Doi_\(identifier\) "Doi (identifier)"):[10.1214/009053604000000067](https://doi.org/10.1214%2F009053604000000067). [JSTOR](https://en.wikipedia.org/wiki/JSTOR_\(identifier\) "JSTOR (identifier)") [3448465](https://www.jstor.org/stable/3448465). [S2CID](https://en.wikipedia.org/wiki/S2CID_\(identifier\) "S2CID (identifier)") [204004121](https://api.semanticscholar.org/CorpusID:204004121).
21. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Hawkins-1973_21-0)**
Hawkins, Douglas M. (1973). "On the Investigation of Alternative Regressions by Principal Component Analysis". *Journal of the Royal Statistical Society, Series C*. **22** (3): 275–286. [doi](https://en.wikipedia.org/wiki/Doi_\(identifier\) "Doi (identifier)"):[10.2307/2346776](https://doi.org/10.2307%2F2346776). [JSTOR](https://en.wikipedia.org/wiki/JSTOR_\(identifier\) "JSTOR (identifier)") [2346776](https://www.jstor.org/stable/2346776).
22. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Jolliffe-1982_22-0)**
Jolliffe, Ian T. (1982). "A Note on the Use of Principal Components in Regression". *Journal of the Royal Statistical Society, Series C*. **31** (3): 300–303. [doi](https://en.wikipedia.org/wiki/Doi_\(identifier\) "Doi (identifier)"):[10.2307/2348005](https://doi.org/10.2307%2F2348005). [JSTOR](https://en.wikipedia.org/wiki/JSTOR_\(identifier\) "JSTOR (identifier)") [2348005](https://www.jstor.org/stable/2348005).
23. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-23)** [Hayashi (2000](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFHayashi2000), pages 27, 30)
24. ^ [***a***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-HayashiFSP_24-0) [***b***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-HayashiFSP_24-1) [***c***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-HayashiFSP_24-2) [Hayashi (2000](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFHayashi2000), page 27)
25. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-25)**
[Amemiya, Takeshi](https://en.wikipedia.org/wiki/Takeshi_Amemiya "Takeshi Amemiya") (1985). [*Advanced Econometrics*](https://archive.org/details/advancedeconomet00amem). Harvard University Press. p. [13](https://archive.org/details/advancedeconomet00amem/page/13). [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [9780674005600](https://en.wikipedia.org/wiki/Special:BookSources/9780674005600 "Special:BookSources/9780674005600").
26. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-26)** [Amemiya (1985](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFAmemiya1985), page 14)
27. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-27)**
[Rao, C. R.](https://en.wikipedia.org/wiki/C._R._Rao "C. R. Rao") (1973). *Linear Statistical Inference and its Applications* (Second ed.). New York: J. Wiley & Sons. p. 319. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [0-471-70823-2](https://en.wikipedia.org/wiki/Special:BookSources/0-471-70823-2 "Special:BookSources/0-471-70823-2").
28. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-28)** [Amemiya (1985](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFAmemiya1985), page 20)
29. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-29)** [Amemiya (1985](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFAmemiya1985), page 27)
30. ^ [***a***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-DvdMck33_30-0) [***b***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-DvdMck33_30-1)
Davidson, Russell; [MacKinnon, James G.](https://en.wikipedia.org/wiki/James_G._MacKinnon "James G. MacKinnon") (1993). *Estimation and Inference in Econometrics*. New York: Oxford University Press. p. 33. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [0-19-506011-3](https://en.wikipedia.org/wiki/Special:BookSources/0-19-506011-3 "Special:BookSources/0-19-506011-3").
31. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-31)** [Davidson & MacKinnon (1993](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFDavidsonMacKinnon1993), page 36)
32. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-32)** [Davidson & MacKinnon (1993](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFDavidsonMacKinnon1993), page 20)
33. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-33)**
["Memento on EViews Output"](https://scholar.harvard.edu/files/jbenchimol/files/memento-eviews.pdf) (PDF). Retrieved 28 December 2020.
34. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-34)** [Amemiya (1985](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFAmemiya1985), page 21)
35. ^ [***a***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Amemiya22_35-0) [***b***](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-Amemiya22_35-1) [Amemiya (1985](https://en.wikipedia.org/wiki/Ordinary_least_squares#CITEREFAmemiya1985), page 22)
36. **[^](https://en.wikipedia.org/wiki/Ordinary_least_squares#cite_ref-36)**
Burnham, Kenneth P.; Anderson, David R. (2002). [*Model Selection and Multi-Model Inference*](https://archive.org/details/modelselectionmu0000burn) (2nd ed.). Springer. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [0-387-95364-7](https://en.wikipedia.org/wiki/Special:BookSources/0-387-95364-7 "Special:BookSources/0-387-95364-7").
## Further reading

- [Dougherty, Christopher](https://en.wikipedia.org/wiki/Christopher_Dougherty "Christopher Dougherty") (2002). *Introduction to Econometrics* (2nd ed.). New York: Oxford University Press. pp. 48–113. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [0-19-877643-8](https://en.wikipedia.org/wiki/Special:BookSources/0-19-877643-8 "Special:BookSources/0-19-877643-8").
- [Gujarati, Damodar N.](https://en.wikipedia.org/wiki/Damodar_N._Gujarati "Damodar N. Gujarati"); [Porter, Dawn C.](https://en.wikipedia.org/wiki/Dawn_C._Porter "Dawn C. Porter") (2009). *Basic Econometrics* (Fifth ed.). Boston: McGraw-Hill Irwin. pp. 55–96. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [978-0-07-337577-9](https://en.wikipedia.org/wiki/Special:BookSources/978-0-07-337577-9 "Special:BookSources/978-0-07-337577-9").
- [Heij, Christiaan](https://en.wikipedia.org/wiki/Christiaan_Heij "Christiaan Heij"); Boer, Paul; [Franses, Philip H.](https://en.wikipedia.org/wiki/Philip_Hans_Franses "Philip Hans Franses"); [Kloek, Teun](https://en.wikipedia.org/wiki/Teun_Kloek "Teun Kloek"); [van Dijk, Herman K.](https://en.wikipedia.org/wiki/Herman_K._van_Dijk "Herman K. van Dijk") (2004). *Econometric Methods with Applications in Business and Economics* (1st ed.). Oxford: Oxford University Press. pp. 76–115. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [978-0-19-926801-6](https://en.wikipedia.org/wiki/Special:BookSources/978-0-19-926801-6 "Special:BookSources/978-0-19-926801-6").
- Hill, R. Carter; Griffiths, William E.; Lim, Guay C. (2008). *Principles of Econometrics* (3rd ed.). Hoboken, NJ: John Wiley & Sons. pp. 8–47. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [978-0-471-72360-8](https://en.wikipedia.org/wiki/Special:BookSources/978-0-471-72360-8 "Special:BookSources/978-0-471-72360-8").
- [Wooldridge, Jeffrey](https://en.wikipedia.org/wiki/Jeffrey_Wooldridge "Jeffrey Wooldridge") (2008). ["The Simple Regression Model"](https://books.google.com/books?id=64vt5TDBNLwC&pg=PA22). *Introductory Econometrics: A Modern Approach* (4th ed.). Mason, OH: Cengage Learning. pp. 22–67. [ISBN](https://en.wikipedia.org/wiki/ISBN_\(identifier\) "ISBN (identifier)") [978-0-324-58162-1](https://en.wikipedia.org/wiki/Special:BookSources/978-0-324-58162-1 "Special:BookSources/978-0-324-58162-1").