ℹ️ Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
| Property | Value |
|---|---|
| URL | https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/ |
| Last Crawled | 2026-04-14 11:09:10 (14 hours ago) |
| First Indexed | 2015-10-31 19:12:34 (10 years ago) |
| HTTP Status Code | 200 |
| Meta Title | Cosine similarity, Pearson correlation, and OLS coefficients \| AI and Social Science – Brendan O'Connor |
| Meta Description | null |
| Meta Canonical | null |
| Boilerpipe Text | Cosine similarity, Pearson correlations, and OLS coefficients can all be viewed as variants on the inner product, tweaked in different ways for centering and magnitude (i.e. location and scale, or something like that).
Details:
You have two vectors \(x\) and \(y\) and want to measure similarity between them. A basic similarity function is the inner product
\[ Inner(x,y) = \sum_i x_i y_i = \langle x, y \rangle \]
If x tends to be high where y is also high, and low where y is low, the inner product will be high: the vectors are more similar.
The inner product is unbounded. One way to make it bounded between -1 and 1 is to divide by the vectors' L2 norms, giving the cosine similarity
\[ CosSim(x,y) = \frac{\sum_i x_i y_i}{ \sqrt{ \sum_i x_i^2} \sqrt{ \sum_i y_i^2 } }
= \frac{ \langle x,y \rangle }{ ||x||\ ||y|| }
\]
This is actually bounded between 0 and 1 if x and y are non-negative. Cosine similarity has an interpretation as the cosine of the angle between the two vectors; you can illustrate this for vectors in \(\mathbb{R}^2\) (e.g. here).
Cosine similarity is not invariant to shifts. If x was shifted to x+1, the cosine similarity would change. What is invariant, though, is the Pearson correlation. Let \(\bar{x}\) and \(\bar{y}\) be the respective means:
\begin{align}
Corr(x,y) &= \frac{ \sum_i (x_i-\bar{x}) (y_i-\bar{y}) }{
\sqrt{\sum (x_i-\bar{x})^2} \sqrt{ \sum (y_i-\bar{y})^2 } }
\\
& = \frac{\langle x-\bar{x},\ y-\bar{y} \rangle}{
||x-\bar{x}||\ ||y-\bar{y}||} \\
& = CosSim(x-\bar{x}, y-\bar{y})
\end{align}
Correlation is the cosine similarity between centered versions of x and y, again bounded between -1 and 1. People usually talk about cosine similarity in terms of vector angles, but it can be loosely thought of as a correlation, if you think of the vectors as paired samples. Unlike the cosine, which is invariant only to positive rescaling, the correlation is invariant to both scale and location changes of x and y.
This isn't the usual way to derive the Pearson correlation; usually it's presented as a normalized form of the covariance, which is a centered average inner product (no normalization)
\[ Cov(x,y) = \frac{\sum (x_i-\bar{x})(y_i-\bar{y}) }{n}
= \frac{ \langle x-\bar{x},\ y-\bar{y} \rangle }{n} \]
Finally, these are all related to the coefficient in a one-variable linear regression. For the OLS model \(y_i \approx ax_i\) with Gaussian noise, whose MLE is the least-squares problem \(\arg\min_a \sum (y_i - ax_i)^2\), a few lines of calculus show \(a\) is
\begin{align}
OLSCoef(x,y) &= \frac{ \sum x_i y_i }{ \sum x_i^2 }
= \frac{ \langle x, y \rangle}{ ||x||^2 }
\end{align}
This looks like another normalized inner product. But unlike cosine similarity, we aren't normalizing by \(y\)'s norm; instead we only use \(x\)'s norm (and use it twice): denominator of \(||x||\ ||y||\) versus \(||x||^2\).
Not normalizing for \(y\) is what you want for the linear regression: if \(y\) was stretched to span a larger range, you would need to increase \(a\) to match, to get your predictions spread out too.
Often it's desirable to do the OLS model with an intercept term: \(\min_{a,b} \sum (y_i - ax_i - b)^2\). Then \(a\) is
\begin{align}
OLSCoefWithIntercept(x,y) &= \frac
{ \sum (x_i - \bar{x}) y_i }
{ \sum (x_i - \bar{x})^2 }
= \frac{\langle x-\bar{x},\ y \rangle}{||x-\bar{x}||^2}
\\
&= OLSCoef(x-\bar{x}, y)
\end{align}
It's different because the intercept term picks up the slack associated with where x's center is. So OLSCoefWithIntercept is invariant to shifts of x. It's still different from cosine similarity since it's still not normalizing at all for y. Though, subtly, it does actually control for shifts of y. This isn't obvious in the equation, but with a little arithmetic it's easy to derive that \(
\langle x-\bar{x},\ y \rangle = \langle x-\bar{x},\ y+c \rangle \) for any constant \(c\). (There must be a nice geometric interpretation of this.)
Finally, what if x and y are standardized: both centered and normalized to unit standard deviation? The OLS coefficient for that is the same as the Pearson correlation between the original vectors. I'm not sure what this means or if it's a useful fact, but:
\[ OLSCoef\left(
\sqrt{n}\frac{x-\bar{x}}{||x-\bar{x}||},
\sqrt{n}\frac{y-\bar{y}}{||y-\bar{y}||} \right) = Corr(x,y) \]
Summarizing: Cosine similarity is normalized inner product. Pearson correlation is centered cosine similarity. A one-variable OLS coefficient is like cosine but with one-sided normalization. With an intercept, it's centered.
Of course we need a summary table. "Symmetric" means, if you swap the inputs, do you get the same answer. "Invariant to shift in input" means, if you add an arbitrary constant to either input, do you get the same answer.
| Function | Equation | Symmetric? | Output range | Invariant to shift in input? | Pithy explanation in terms of something else |
|---|---|---|---|---|---|
| Inner(x,y) | \[ \langle x, y\rangle\] | Yes | \(\mathbb{R}\) | No | |
| CosSim(x,y) | \[ \frac{\langle x,y \rangle}{\|\|x\|\|\ \|\|y\|\|} \] | Yes | [-1,1] or [0,1] if inputs non-neg | No | normalized inner product |
| Corr(x,y) | \[ \frac{\langle x-\bar{x},\ y-\bar{y} \rangle }{\|\|x-\bar{x}\|\|\ \|\|y-\bar{y}\|\|} \] | Yes | [-1,1] | Yes | centered cosine; or normalized covariance |
| Cov(x,y) | \[ \frac{\langle x-\bar{x},\ y-\bar{y} \rangle}{n} \] | Yes | \(\mathbb{R}\) | Yes | centered inner product |
| OLSCoefNoIntcpt(x,y) | \[\frac{ \langle x, y \rangle}{ \|\|x\|\|^2 }\] | No | \(\mathbb{R}\) | No | (compare to CosSim) |
| OLSCoefWithIntcpt(x,y) | \[ \frac{\langle x-\bar{x},\ y \rangle}{\|\|x-\bar{x}\|\|^2} \] | No | \(\mathbb{R}\) | Yes | |
Are there any implications? I've been wondering for a while why cosine similarity tends to be so useful for natural language processing applications. Maybe this has something to do with it. Or not. One implication of all the inner product stuff is computational strategies to make it faster when there's high-dimensional sparse data: the Friedman et al. 2010 glmnet paper talks about this in the context of coordinate descent text regression. I've heard Dhillon et al., NIPS 2011 applies LSH in a similar setting (but haven't read it yet). And there's lots of work using LSH for cosine similarity; e.g. van Durme and Lall 2010 [slides].
Any other cool identities? Any corrections to the above?
References: I use Hastie et al 2009, chapter 3 to look up linear regression, but it's covered in zillions of other places. I linked to a nice chapter in Tufte's little 1974 book that he wrote before he went off and did all that visualization stuff. (He calls it "two-variable regression", but I think "one-variable regression" is a better term. "one-feature" or "one-covariate" might be most accurate.) In my experience, cosine similarity is talked about more often in text processing or machine learning contexts. |
| Markdown | [AI and Social Science – Brendan O'Connor](https://brenocon.com/blog/ "AI and Social Science – Brendan O'Connor")
cognition, language, social systems; statistics, visualization, computation
# Cosine similarity, Pearson correlation, and OLS coefficients
Posted on [March 13, 2012](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/ "6:01 pm")
Cosine similarity, Pearson correlations, and OLS coefficients can all be viewed as variants on the inner product, tweaked in different ways for centering and magnitude (i.e. location and scale, or something like that).
Details:
You have two vectors \\(x\\) and \\(y\\) and want to measure similarity between them. A basic similarity function is the **[inner product](http://en.wikipedia.org/wiki/Dot_product)**
\\\[ Inner(x,y) = \\sum\_i x\_i y\_i = \\langle x, y \\rangle \\\]
If x tends to be high where y is also high, and low where y is low, the inner product will be high: the vectors are more similar.
The inner product is unbounded. One way to make it bounded between -1 and 1 is to divide by the vectors' L2 norms, giving the **[cosine similarity](http://en.wikipedia.org/wiki/Cosine_similarity)**
\\\[ CosSim(x,y) = \\frac{\\sum\_i x\_i y\_i}{ \\sqrt{ \\sum\_i x\_i^2} \\sqrt{ \\sum\_i y\_i^2 } }
= \\frac{ \\langle x,y \\rangle }{ \|\|x\|\|\\ \|\|y\|\| }
\\\]
This is actually bounded between 0 and 1 if x and y are non-negative. Cosine similarity has an interpretation as the cosine of the angle between the two vectors; you can illustrate this for vectors in \\(\\mathbb{R}^2\\) (e.g. [here](http://nlp.stanford.edu/IR-book/html/htmledition/dot-products-1.html)).
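The two definitions so far can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `inner` and `cos_sim` and the sample vectors are chosen here, not taken from any library.

```python
import math

def inner(x, y):
    """Inner product <x, y> of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(xi * yi for xi, yi in zip(x, y))

def cos_sim(x, y):
    """Cosine similarity: the inner product divided by both L2 norms."""
    norm_x = math.sqrt(inner(x, x))
    norm_y = math.sqrt(inner(y, y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return inner(x, y) / (norm_x * norm_y)

# Illustrative data (hypothetical, non-negative on purpose).
x = [1.0, 2.0, 3.0]
y = [5.0, 6.0, 10.0]

print(inner(x, y))                      # unbounded: grows with the scale of x or y
print(cos_sim(x, y))                    # in [0, 1] here because both vectors are non-negative
print(cos_sim([2 * v for v in x], y))   # positive rescaling of x leaves the cosine unchanged
```

Dividing by both norms is what removes the dependence on magnitude; only the direction of each vector is left.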
Cosine similarity is not invariant to shifts. If x was shifted to x+1, the cosine similarity would change. What is invariant, though, is the **[Pearson correlation](http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient)**. Let \\(\\bar{x}\\) and \\(\\bar{y}\\) be the respective means:
\\begin{align}
Corr(x,y) &= \\frac{ \\sum\_i (x\_i-\\bar{x}) (y\_i-\\bar{y}) }{
\\sqrt{\\sum (x\_i-\\bar{x})^2} \\sqrt{ \\sum (y\_i-\\bar{y})^2 } }
\\\\
& = \\frac{\\langle x-\\bar{x},\\ y-\\bar{y} \\rangle}{
\|\|x-\\bar{x}\|\|\\ \|\|y-\\bar{y}\|\|} \\\\
& = CosSim(x-\\bar{x}, y-\\bar{y})
\\end{align}
Correlation is the cosine similarity between centered versions of x and y, again bounded between -1 and 1. People usually talk about cosine similarity in terms of vector angles, but it can be loosely thought of as a correlation, if you think of the vectors as paired samples. Unlike the cosine, which is invariant only to positive rescaling, the correlation is invariant to both scale and location changes of x and y.
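A small sketch of "correlation is centered cosine", assuming plain Python lists and helper names (`center`, `cos_sim`, `corr`) that are made up for this example. It also checks the shift behaviour described above on one pair of vectors.

```python
import math

def center(v):
    """Subtract the mean from every entry."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def cos_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def corr(x, y):
    """Pearson correlation, computed as the cosine of the centered vectors."""
    return cos_sim(center(x), center(y))

# Illustrative data (hypothetical).
x = [1.0, 2.0, 3.0]
y = [5.0, 6.0, 10.0]
shifted = [xi + 1 for xi in x]

print(cos_sim(x, y), cos_sim(shifted, y))  # the two cosines differ
print(corr(x, y), corr(shifted, y))        # the two correlations agree up to rounding
```

Centering first is the only difference between the two functions, which is why the shift disappears for `corr` and not for `cos_sim`.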
This isn't the usual way to derive the Pearson correlation; usually it's presented as a normalized form of the **[covariance](http://en.wikipedia.org/wiki/Covariance)**, which is a centered average inner product (no normalization)
\\\[ Cov(x,y) = \\frac{\\sum (x\_i-\\bar{x})(y\_i-\\bar{y}) }{n}
= \\frac{ \\langle x-\\bar{x},\\ y-\\bar{y} \\rangle }{n} \\\]
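To connect the two presentations, here is the "normalized covariance" route written out with the definitions above. It is a sketch that assumes the same 1/n convention is used in all three covariances; the factors of n cancel, so an n-1 convention would give the same result.

```latex
\begin{align}
\frac{Cov(x,y)}{\sqrt{Cov(x,x)}\ \sqrt{Cov(y,y)}}
&= \frac{ \frac{1}{n} \langle x-\bar{x},\ y-\bar{y} \rangle }
        { \sqrt{\frac{1}{n} ||x-\bar{x}||^2}\ \sqrt{\frac{1}{n} ||y-\bar{y}||^2} } \\
&= \frac{ \langle x-\bar{x},\ y-\bar{y} \rangle }{ ||x-\bar{x}||\ ||y-\bar{y}|| }
 = Corr(x,y)
\end{align}
```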
Finally, these are all related to the coefficient in a **[one-variable linear regression](http://web.archive.org/web/20111103122217/http://www.edwardtufte.com/tufte/dapp/chapter3.html)**. For the OLS model \\(y\_i \\approx ax\_i\\) with Gaussian noise, whose MLE is the least-squares problem \\(\\arg\\min\_a \\sum (y\_i - ax\_i)^2\\), a few lines of calculus show \\(a\\) is
\\begin{align}
OLSCoef(x,y) &= \\frac{ \\sum x\_i y\_i }{ \\sum x\_i^2 }
= \\frac{ \\langle x, y \\rangle}{ \|\|x\|\|^2 }
\\end{align}
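For readers who want the calculus step spelled out, a short sketch follows. The name f for the objective is introduced here only for illustration, and the derivation assumes x is not the zero vector so that the division is defined. Since the second derivative is positive, the stationary point is the minimum.

```latex
\begin{align}
f(a) &= \sum_i (y_i - a x_i)^2 \\
f'(a) &= -2 \sum_i x_i (y_i - a x_i)
       = -2 \left( \sum_i x_i y_i - a \sum_i x_i^2 \right) \\
f'(a) = 0 \ &\Longrightarrow\ a = \frac{ \sum_i x_i y_i }{ \sum_i x_i^2 }
       = \frac{ \langle x, y \rangle }{ ||x||^2 },
\qquad f''(a) = 2 \sum_i x_i^2 > 0
\end{align}
```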
This looks like another normalized inner product. But unlike cosine similarity, we aren't normalizing by \\(y\\)'s norm; instead we only use \\(x\\)'s norm (and use it twice): denominator of \\(\|\|x\|\|\\ \|\|y\|\|\\) versus \\(\|\|x\|\|^2\\).
Not normalizing for \\(y\\) is what you want for the linear regression: if \\(y\\) was stretched to span a larger range, you would need to increase \\(a\\) to match, to get your predictions spread out too.
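The one-sided normalization is easy to see numerically. The sketch below assumes a hypothetical helper `ols_coef` and made-up sample data; it shows that swapping the arguments changes the answer and that stretching y stretches the slope by the same factor.

```python
def ols_coef(x, y):
    """Slope a minimizing sum((y_i - a * x_i) ** 2), with no intercept."""
    sxx = sum(xi * xi for xi in x)
    if sxx == 0:
        raise ValueError("x must not be the zero vector")
    return sum(xi * yi for xi, yi in zip(x, y)) / sxx

# Illustrative data (hypothetical).
x = [1.0, 2.0, 3.0]
y = [5.0, 6.0, 10.0]

print(ols_coef(x, y))                      # slope for predicting y from x
print(ols_coef(y, x))                      # a different slope: the function is not symmetric
print(ols_coef(x, [3 * yi for yi in y]))   # stretching y by 3 triples the slope
```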
Often it's desirable to do the OLS model with an intercept term: \\(\\min\_{a,b} \\sum (y\_i - ax\_i - b)^2\\). Then \\(a\\) is
\\begin{align}
OLSCoefWithIntercept(x,y) &= \\frac
{ \\sum (x\_i - \\bar{x}) y\_i }
{ \\sum (x\_i - \\bar{x})^2 }
= \\frac{\\langle x-\\bar{x},\\ y \\rangle}{\|\|x-\\bar{x}\|\|^2}
\\\\
&= OLSCoef(x-\\bar{x}, y)
\\end{align}
It's different because the intercept term picks up the slack associated with where x's center is. So OLSCoefWithIntercept is invariant to shifts of x. It's still different from cosine similarity since it's still not normalizing at all for y. Though, subtly, it does actually control for shifts of y. This isn't obvious in the equation, but with a little arithmetic it's easy to derive that \\(
\\langle x-\\bar{x},\\ y \\rangle = \\langle x-\\bar{x},\\ y+c \\rangle \\) for any constant \\(c\\). (There must be a nice geometric interpretation of this.)
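The "little arithmetic" can be written out as follows; it relies only on the fact that the entries of a centered vector sum to zero.

```latex
\begin{align}
\langle x-\bar{x},\ y+c \rangle
&= \sum_i (x_i-\bar{x})(y_i + c) \\
&= \sum_i (x_i-\bar{x})\, y_i + c \sum_i (x_i-\bar{x}) \\
&= \langle x-\bar{x},\ y \rangle + c\, (n\bar{x} - n\bar{x}) \\
&= \langle x-\bar{x},\ y \rangle
\end{align}
```

One way to read this geometrically: the centered vector is orthogonal to the all-ones vector, so adding any multiple of the all-ones vector to y cannot change the inner product. Choosing c equal to minus the mean of y gives the form of the coefficient with both inputs centered.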
Finally, what if x and y are standardized: both centered and normalized to unit standard deviation? The OLS coefficient for that is the same as the Pearson correlation between the original vectors. I'm not sure what this means or if it's a useful fact, but:
\\\[ OLSCoef\\left(
\\sqrt{n}\\frac{x-\\bar{x}}{\|\|x-\\bar{x}\|\|},
\\sqrt{n}\\frac{y-\\bar{y}}{\|\|y-\\bar{y}\|\|} \\right) = Corr(x,y) \\\]
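Both the intercept version and the standardized identity can be checked numerically. This is a sketch under assumptions: the helper names are invented for the example, the data is made up, and `standardize` divides by n (not n-1) to match the square-root-of-n scaling in the equation above.

```python
import math

def center(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def ols_coef(x, y):
    """No-intercept slope: <x, y> / ||x||^2."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def ols_coef_with_intercept(x, y):
    """Slope of the model with an intercept: center x, leave y alone."""
    return ols_coef(center(x), y)

def standardize(v):
    """Center v and rescale to unit standard deviation (dividing by n)."""
    c = center(v)
    sd = math.sqrt(sum(ci * ci for ci in c) / len(c))
    return [ci / sd for ci in c]

def corr(x, y):
    cx, cy = center(x), center(y)
    dot = sum(a * b for a, b in zip(cx, cy))
    return dot / (math.sqrt(sum(a * a for a in cx)) * math.sqrt(sum(b * b for b in cy)))

# Illustrative data (hypothetical).
x = [1.0, 2.0, 3.0]
y = [5.0, 6.0, 10.0]

print(ols_coef_with_intercept(x, y))                       # 2.5
print(ols_coef_with_intercept(x, [yi + 5 for yi in y]))    # 2.5: shifting y changes nothing
print(ols_coef_with_intercept([xi + 100 for xi in x], y))  # 2.5: shifting x changes nothing
print(ols_coef(standardize(x), standardize(y)), corr(x, y))  # equal up to rounding
```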
Summarizing: Cosine similarity is normalized inner product. Pearson correlation is centered cosine similarity. A one-variable OLS coefficient is like cosine but with one-sided normalization. With an intercept, it's centered.
Of course we need a summary table. "Symmetric" means, if you swap the inputs, do you get the same answer. "Invariant to shift in input" means, if you add an arbitrary constant to either input, do you get the same answer.
| Function | Equation | Symmetric? | Output range | Invariant to shift in input? | Pithy explanation in terms of something else |
|---|---|---|---|---|---|
| Inner(x,y) | \\\[ \\langle x, y\\rangle\\\] | Yes | \\(\\mathbb{R}\\) | No | |
| CosSim(x,y) | \\\[ \\frac{\\langle x,y \\rangle}{\|\|x\|\|\\ \|\|y\|\|} \\\] | Yes | \[-1,1\] or \[0,1\] if inputs non-neg | No | normalized inner product |
| Corr(x,y) | \\\[ \\frac{\\langle x-\\bar{x},\\ y-\\bar{y} \\rangle }{\|\|x-\\bar{x}\|\|\\ \|\|y-\\bar{y}\|\|} \\\] | Yes | \[-1,1\] | Yes | centered cosine; *or* normalized covariance |
| Cov(x,y) | \\\[ \\frac{\\langle x-\\bar{x},\\ y-\\bar{y} \\rangle}{n} \\\] | Yes | \\(\\mathbb{R}\\) | Yes | centered inner product |
| OLSCoefNoIntcpt(x,y) | \\\[\\frac{ \\langle x, y \\rangle}{ \|\|x\|\|^2 }\\\] | No | \\(\\mathbb{R}\\) | No | (compare to CosSim) |
| OLSCoefWithIntcpt(x,y) | \\\[ \\frac{\\langle x-\\bar{x},\\ y \\rangle}{\|\|x-\\bar{x}\|\|^2} \\\] | No | \\(\\mathbb{R}\\) | Yes | |
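The "Symmetric?" and "Invariant to shift" columns can be spot-checked empirically. The script below is a sketch, not a proof: it tests each function on a single made-up pair of vectors with one shift value, and all helper names are assumptions of this example.

```python
import math

def center(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

def cos_sim(x, y):
    return inner(x, y) / math.sqrt(inner(x, x) * inner(y, y))

def corr(x, y):
    return cos_sim(center(x), center(y))

def cov(x, y):
    return inner(center(x), center(y)) / len(x)

def ols_no_intcpt(x, y):
    return inner(x, y) / inner(x, x)

def ols_with_intcpt(x, y):
    cx = center(x)
    return inner(cx, y) / inner(cx, cx)

FUNCS = {
    "Inner": inner,
    "CosSim": cos_sim,
    "Corr": corr,
    "Cov": cov,
    "OLSCoefNoIntcpt": ols_no_intcpt,
    "OLSCoefWithIntcpt": ols_with_intcpt,
}

def is_symmetric(f, x, y, tol=1e-9):
    return abs(f(x, y) - f(y, x)) < tol

def is_shift_invariant(f, x, y, shift=3.0, tol=1e-9):
    # Shift each input separately; both must leave the value unchanged.
    xs = [v + shift for v in x]
    ys = [v + shift for v in y]
    return abs(f(xs, y) - f(x, y)) < tol and abs(f(x, ys) - f(x, y)) < tol

# Illustrative data (hypothetical).
x = [1.0, 2.0, 3.0]
y = [5.0, 6.0, 10.0]
for name, f in FUNCS.items():
    print(name, is_symmetric(f, x, y), is_shift_invariant(f, x, y))
```

A single example can only refute a property, so a `True` here is consistent with the table rather than a demonstration of it.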
Are there any implications? I've been wondering for a while why cosine similarity tends to be so useful for natural language processing applications. Maybe this has something to do with it. Or not. One implication of all the inner product stuff is computational strategies to make it faster when there's high-dimensional sparse data: the [Friedman et al. 2010 glmnet](http://www.jstatsoft.org/v33/i01) paper talks about this in the context of coordinate descent text regression. I've heard [Dhillon et al., NIPS 2011](http://www.cs.utexas.edu/~pradeepr/paperz/coord_nips.pdf) applies [LSH](http://en.wikipedia.org/wiki/Locality-sensitive_hashing) in a similar setting (but haven't read it yet). And there's lots of work using LSH for cosine similarity; e.g. [van Durme and Lall 2010 \[slides\]](http://cs.jhu.edu/~vandurme/papers/VanDurmeLallACL10-slides.pdf).
Any other cool identities? Any corrections to the above?
References: I use [Hastie et al 2009, chapter 3](http://www-stat.stanford.edu/~tibs/ElemStatLearn/) to look up linear regression, but it's covered in zillions of other places. I linked to a nice chapter in [Tufte's little 1974 book](http://web.archive.org/web/20111103122217/http://www.edwardtufte.com/tufte/dapp/chapter3.html) that he wrote before he went off and did all that visualization stuff. (He calls it "two-variable regression", but I think "one-variable regression" is a better term. "one-feature" or "one-covariate" might be most accurate.) In my experience, cosine similarity is talked about more often in text processing or machine learning contexts.
### 23 Responses to *Cosine similarity, Pearson correlation, and OLS coefficients*
1. 
[Victor Chahuneau](http://victor.chahuneau.fr/)
says:
[March 15, 2012 at 1:21 am](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-129691)
I think your OLSCoefWithIntercept is wrong unless y is centered: the right part of the dot product should be (y-)
Then the invariance by translation is obvious…
Otherwise you would get \<x-, y+c\> = \<x-,y\> + c(n-1)
See [Wikipedia](http://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_line) for the equation
- 
[Victor Chahuneau](http://victor.chahuneau.fr/)
says:
[March 15, 2012 at 1:24 am](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-129692)
… but of course WordPress doesn't like my brackets…
Line 1:\$(y-\\bar y)\$
Line 3: \$ = + c(n-1)\\bar x\$
2. 
[brendano](http://brenocon.com/)
says:
[March 15, 2012 at 5:05 am](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-129735)
Nope, you don't need to center y if you're centering x. The Wikipedia equation isn't as correct as Hastie :) I actually didn't believe this when I was writing the post, but if you write out the arithmetic like I said you can derive it.
Example:
\$ R
\> x=c(1,2,3); y=c(5,6,10)
\> inner\_and\_xnorm=function(x,y) sum(x\*y) / sum(x\*\*2)
\> inner\_and\_xnorm(x-mean(x),y)
\[1\] 2.5
\> inner\_and\_xnorm(x-mean(x),y+5)
\[1\] 2.5
… if you don't center x, then shifting y matters.
3. 
[Victor Chahuneau](http://victor.chahuneau.fr/)
says:
[March 15, 2012 at 4:15 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-129911)
Oops… I was wrong about the invariance\!
It turns out that we were both right on the formula for the coefficient… thanks to this same invariance.
Here is the full derivation:
<http://dl.dropbox.com/u/2803234/ols.pdf>
Wikipedia & Hastie can be reconciled now…
4. 
Mike
says:
[March 26, 2012 at 8:40 am](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-133347)
Nice breakdown Brendan.
I've been working recently with high-dimensional sparse data. The covariance/correlation matrices can be calculated without losing sparsity after rearranging some terms.
<http://stackoverflow.com/a/9626089/1257542>
- 
Mike
says:
[March 26, 2012 at 12:17 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-133439)
for instance, with two sparse vectors, you can get the correlation and covariance without subtracting the means
cov(x,y) = ( inner(x,y) - n mean(x) mean(y)) / (n-1)
cor(x,y) = ( inner(x,y) - n mean(x) mean(y)) / (sd(x) sd(y) (n-1))
- 
[Brendan O'Connor](http://brenocon.com/)
says:
[March 26, 2012 at 1:18 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-133486)
Oh awesome, thanks\!
5. 
Kat
says:
[April 24, 2012 at 11:12 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-143747)
Hey Brendan! Maybe you are the right person to ask this to: if I want to figure out how similar two sets of paired vectors are (both angle AND magnitude) how would I do that? I originally started by looking at cosine similarity (well, I started them all from 0,0 so I guess now I know it was correlation?) but of course that doesn't look at magnitude at all. Is there a way that people usually weight direction and magnitude, or is that arbitrary?
- 
[Brendan O'Connor](http://brenocon.com/)
says:
[April 24, 2012 at 11:25 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-143748)
Why not inner product?
- 
Kat
says:
[April 24, 2012 at 11:43 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-143751)
I would like and to be more similar than and , for example
- 
Kat
says:
[April 24, 2012 at 11:44 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-143752)
ok no tags this time: 1,1 and 1,1 to be more similar than 1,1 and 5,5
6. Pingback: [Triangle problem – finding height with given area and angles. « Math World – etidhor](http://etidhor.wordpress.com/2013/01/04/triangle-problem-finding-height-with-given-area-and-angles/)
7. 
[Adam](http://www.designandanalytics.com/)
says:
[February 1, 2013 at 5:57 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-270730)
This is one of the best technical summary blog posts that I can remember seeing. I've just started in NLP and was confused at first seeing cosine appear as the de facto relatedness measure; this really helped me mentally reconcile it with the alternatives. Very interesting and great post.
8. 
[Paul Moore](http://people.maths.ox.ac.uk/moorep/)
says:
[March 18, 2013 at 5:14 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-304935)
A very helpful discussion, thanks.
Have you seen "Thirteen Ways to Look at the Correlation Coefficient" by Joseph Lee Rodgers; W. Alan Nicewander, The American Statistician, Vol. 42, No. 1. (Feb., 1988), pp. 59-66? It covers a related discussion.
9. 
[brendano](http://brenocon.com/)
says:
[March 18, 2013 at 5:17 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-304941)
Great tip, I remember seeing that once but totally forgot about it.
Here's a link, [http://data.psych.udel.edu/laurenceau/PSYC861Regression%20Spring%202012/READINGS/rodgers-nicewander-1988-r-13-ways.pdf](http://data.psych.udel.edu/laurenceau/PSYC861Regression%20Spring%202012/READINGS/rodgers-nicewander-1988-r-13-ways.pdf)
10. Pingback: [Correlation picture \| AI and Social Science – Brendan O'Connor](http://brenocon.com/blog/2013/03/correlation-picture/)
11. 
Peter
says:
[March 29, 2013 at 3:24 am](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-312261)
Useful info:
I have a few questions (i am pretty new to that field). You say correlation is invariant of shifts.
i guess you just mean if the x-axis is not 1 2 3 4 but 10 20 30 or 30 20 10.. then it doesn't change anything.
but you don't mean that if i shift the signal i will get the same correlation right?
ex: \[1 2 1 2 1\] and \[1 2 1 2 1\], corr = 1
but if i cyclically shift \[1 2 1 2 1\] and \[2 1 2 1 2\], corr = -1
or if i just shift by padding zeros \[1 2 1 2 1 0\] and \[0 1 2 1 2 1\] then corr = -0.0588
Please elaborate on that.
Also could we say that distance correlation (1-correlation) can be considered as norm\_1 or norm\_2 distance somehow? for example when we want to minimize the squared errors, usually we need to use euclidean distance, but could pearson's correlation also be used?
And last, OLSCoef(x,y) can be considered as scale invariant? is very correlated to cosine similarity which is not scale invariant (Pearson's correlation is right?). Look at: "Patterns of Temporal Variation in Online Media" and "Fast time-series searching with scaling and shifting". That confuses me.. but maybe i am missing something.
- 
[Brendan O'Connor](http://brenocon.com/)
says:
[April 1, 2013 at 10:27 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-314504)
Hi Peter,
By "invariant to shift in input", I mean, if you \*add\* to the input. That is,
f(x, y) = f(x+a, y) for any scalar "a".
By "scale invariant", I mean, if you \*multiply\* the input by something.
For (1-corr), the problem is negative correlations. I think maximizing the squared correlation is the same thing as minimizing squared error .. that's why it's called R^2, the explained variance ratio.
I don't understand your question about OLSCoef and have not seen the papers you're talking about.
12. Pingback: [Machine learning literary genres from 19th century seafaring, horror and western novels \| Sub-Sub Algorithm](http://subsubalgorithm.wordpress.com/2013/12/07/machine-learning-literary-genres-from-19th-century-seafaring-horror-and-western-novels/)
13. Pingback: [Machine learning literary genres from 19th century seafaring, horror and western novels \| Sub-Subroutine](http://subsubroutine.wordpress.com/2013/12/08/machine-learning-literary-genres-from-19th-century-seafaring-horror-and-western-novels/)
14. 
[Waylon Flinn](http://www.crunchmagic.com/)
says:
[December 11, 2013 at 4:51 am](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-758397)
Wonderful post. The more I investigate it the more it looks like every relatedness measure around is just a different normalization of the inner product.
Similar analyses reveal that Lift, Jaccard Index and even the standard Euclidean metric can be viewed as different corrections to the dot product. It's not a viewpoint I've seen a lot of. It was this post that started my investigation of this phenomenon. For that, I'm grateful to you.
The fact that the basic dot product can be seen to underlie all these similarity measures turns out to be convenient. If you stack all the vectors in your space on top of each other to create a matrix, you can produce all the inner products simply by multiplying the matrix by its transpose. Furthermore, the extra ingredient in every similarity measure I've looked at so far involves the magnitudes (or squared magnitudes) of the individual vectors. These drop out of this matrix multiplication as well. Just extract the diagonal.
Because of its exceptional utility, I've dubbed the symmetric matrix that results from this product the base similarity matrix. I haven't been able to find many other references which formulate these metrics in terms of this matrix, or the inner product as you've done. Known mathematics is both broad and deep, so it seems likely that I'm stumbling upon something that's already been investigated.
Do you know of other work that explores this underlying structure of similarity measures? Is the construction of this base similarity matrix a standard technique in the calculation of these measures? Does it have a common name?
Thanks again for sharing your explorations of this topic.
P.S. Here's the other reference I've found that does similar work:
<http://arxiv.org/pdf/1308.3740.pdf>
15. Pingback: [Building the connection between cosine similarity and correlation in R \| Question and Answer](http://qandasys.info/building-the-connection-between-cosine-similarity-and-correlation-in-r/)
16. Pingback: [ēøä¼¼ę§åŗ¦é - CSerä¹å£°](http://www.cserzs.com/similarity-measure)
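Mike's rearrangement in the comments above (covariance and correlation from the raw inner product and the means, without materializing centered vectors) can be sketched for sparse data as follows. The dict-of-nonzeros representation and the function names are assumptions made for this example; the formulas are the n-1 versions given in that comment, and the test vectors are the zero-padded pair from Peter's comment.

```python
import math

def sparse_cov(x, y, n):
    """Sample covariance of two length-n vectors stored as {index: nonzero value}."""
    # Only indices present in both vectors contribute to the inner product.
    small, large = (x, y) if len(x) <= len(y) else (y, x)
    dot = sum(v * large[i] for i, v in small.items() if i in large)
    mean_x = sum(x.values()) / n
    mean_y = sum(y.values()) / n
    return (dot - n * mean_x * mean_y) / (n - 1)

def sparse_corr(x, y, n):
    """Correlation as covariance divided by the two sample standard deviations."""
    return sparse_cov(x, y, n) / math.sqrt(sparse_cov(x, x, n) * sparse_cov(y, y, n))

# [1 2 1 2 1 0] and [0 1 2 1 2 1], keeping only the nonzero entries.
x = {0: 1.0, 1: 2.0, 2: 1.0, 3: 2.0, 4: 1.0}
y = {1: 1.0, 2: 2.0, 3: 1.0, 4: 2.0, 5: 1.0}
print(round(sparse_corr(x, y, 6), 4))  # -0.0588
```

The centered vectors are dense even when the inputs are sparse, so avoiding the explicit subtraction is what preserves sparsity.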
[Proudly powered by WordPress.](http://wordpress.org/ "Semantic Personal Publishing Platform") |
| Readable Markdown | Cosine similarity, Pearson correlations, and OLS coefficients can all be viewed as variants on the inner product, tweaked in different ways for centering and magnitude (i.e. location and scale, or something like that).
Details:
You have two vectors \\(x\\) and \\(y\\) and want to measure similarity between them. A basic similarity function is the **[inner product](http://en.wikipedia.org/wiki/Dot_product)**
\\\[ Inner(x,y) = \\sum\_i x\_i y\_i = \\langle x, y \\rangle \\\]
If x tends to be high where y is also high, and low where y is low, the inner product will be high: the vectors are more similar.
The inner product is unbounded. One way to make it bounded between -1 and 1 is to divide by the vectors' L2 norms, giving the **[cosine similarity](http://en.wikipedia.org/wiki/Cosine_similarity)**
\\\[ CosSim(x,y) = \\frac{\\sum\_i x\_i y\_i}{ \\sqrt{ \\sum\_i x\_i^2} \\sqrt{ \\sum\_i y\_i^2 } }
= \\frac{ \\langle x,y \\rangle }{ \|\|x\|\|\\ \|\|y\|\| }
\\\]
This is actually bounded between 0 and 1 if x and y are non-negative. Cosine similarity has an interpretation as the cosine of the angle between the two vectors; you can illustrate this for vectors in \\(\\mathbb{R}^2\\) (e.g. [here](http://nlp.stanford.edu/IR-book/html/htmledition/dot-products-1.html)).
Cosine similarity is not invariant to shifts. If x was shifted to x+1, the cosine similarity would change. What is invariant, though, is the **[Pearson correlation](http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient)**. Let \\(\\bar{x}\\) and \\(\\bar{y}\\) be the respective means:
\\begin{align}
Corr(x,y) &= \\frac{ \\sum\_i (x\_i-\\bar{x}) (y\_i-\\bar{y}) }{
\\sqrt{\\sum (x\_i-\\bar{x})^2} \\sqrt{ \\sum (y\_i-\\bar{y})^2 } }
\\\\
& = \\frac{\\langle x-\\bar{x},\\ y-\\bar{y} \\rangle}{
\|\|x-\\bar{x}\|\|\\ \|\|y-\\bar{y}\|\|} \\\\
& = CosSim(x-\\bar{x}, y-\\bar{y})
\\end{align}
Correlation is the cosine similarity between centered versions of x and y, again bounded between -1 and 1. People usually talk about cosine similarity in terms of vector angles, but it can be loosely thought of as a correlation, if you think of the vectors as paired samples. Unlike the cosine, which is invariant only to positive rescaling, the correlation is invariant to both scale and location changes of x and y.
This isn't the usual way to derive the Pearson correlation; usually it's presented as a normalized form of the **[covariance](http://en.wikipedia.org/wiki/Covariance)**, which is a centered average inner product (no normalization)
\\\[ Cov(x,y) = \\frac{\\sum (x\_i-\\bar{x})(y\_i-\\bar{y}) }{n}
= \\frac{ \\langle x-\\bar{x},\\ y-\\bar{y} \\rangle }{n} \\\]
Finally, these are all related to the coefficient in a **[one-variable linear regression](http://web.archive.org/web/20111103122217/http://www.edwardtufte.com/tufte/dapp/chapter3.html)**. For the OLS model \\(y\_i \\approx ax\_i\\) with Gaussian noise, whose MLE is the least-squares problem \\(\\arg\\min\_a \\sum (y\_i - ax\_i)^2\\), a few lines of calculus show \\(a\\) is
\\begin{align}
OLSCoef(x,y) &= \\frac{ \\sum x\_i y\_i }{ \\sum x\_i^2 }
= \\frac{ \\langle x, y \\rangle}{ \|\|x\|\|^2 }
\\end{align}
This looks like another normalized inner product. But unlike cosine similarity, we arenāt normalizing by \\(y\\)ās norm ā instead we only use \\(x\\)ās norm (and use it twice): denominator of \\(\|\|x\|\|\\ \|\|y\|\|\\) versus \\(\|\|x\|\|^2\\).
Not normalizing for \\(y\\) is what you want for the linear regression: if \\(y\\) was stretched to span a larger range, you would need to increase \\(a\\) to match, to get your predictions spread out too.
Often it's desirable to do the OLS model with an intercept term: \\(\\min\_{a,b} \\sum (y\_i - ax\_i - b)^2\\). Then \\(a\\) is
\\begin{align}
OLSCoefWithIntercept(x,y) &= \\frac
{ \\sum (x\_i - \\bar{x}) y\_i }
{ \\sum (x\_i - \\bar{x})^2 }
= \\frac{\\langle x-\\bar{x},\\ y \\rangle}{\|\|x-\\bar{x}\|\|^2}
\\\\
&= OLSCoef(x-\\bar{x}, y)
\\end{align}
It's different because the intercept term picks up the slack associated with where x's center is. So OLSCoefWithIntercept is invariant to shifts of x. It's still different from cosine similarity since it's still not normalizing at all for y. Though, subtly, it does actually control for shifts of y. This isn't obvious in the equation, but with a little arithmetic it's easy to derive that \\(
\\langle x-\\bar{x},\\ y \\rangle = \\langle x-\\bar{x},\\ y+c \\rangle \\) for any constant \\(c\\), because the entries of \\(x-\\bar{x}\\) sum to zero. (There must be a nice geometric interpretation of this.)
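Both shift invariances can be checked numerically. A minimal sketch, assuming NumPy; `ols_coef` and `ols_coef_with_intercept` are hypothetical helper names, the data are random, and `np.polyfit` with degree 1 is used as a reference fit that includes an intercept.

```python
import numpy as np

def ols_coef(x, y):
    """No-intercept OLS slope: <x, y> / ||x||^2."""
    return np.dot(x, y) / np.dot(x, x)

def ols_coef_with_intercept(x, y):
    """Slope of the with-intercept fit: center x, then apply ols_coef."""
    return ols_coef(x - x.mean(), y)

# Arbitrary illustrative data (seed chosen only for reproducibility).
rng = np.random.default_rng(1)
x = rng.normal(size=40)
y = 0.5 * x + 3.0 + rng.normal(size=40)

a = ols_coef_with_intercept(x, y)
slope, intercept = np.polyfit(x, y, 1)  # degree-1 fit returns [slope, intercept]

print("closed form:", a, "polyfit slope:", slope)
# Shifting either input by a constant should leave the slope unchanged.
print("x shifted:", ols_coef_with_intercept(x + 10.0, y))
print("y shifted:", ols_coef_with_intercept(x, y + 10.0))
# The reason the y shift drops out: the centered x sums to (numerically) zero.
print("sum of centered x:", (x - x.mean()).sum())
```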
Finally, what if x and y are standardized: both centered and normalized to unit standard deviation? The OLS coefficient for that is the same as the Pearson correlation between the original vectors. I'm not sure what this means or if it's a useful fact, but:
\\\[ OLSCoef\\left(
\\sqrt{n}\\frac{x-\\bar{x}}{\|\|x-\\bar{x}\|\|},
\\sqrt{n}\\frac{y-\\bar{y}}{\|\|y-\\bar{y}\|\|} \\right) = Corr(x,y) \\\]
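The standardized-inputs identity can be verified in a few lines. Again a sketch assuming NumPy, with hypothetical helpers `standardize` and `ols_coef` and made-up data; the standardization uses the divide-by-n standard deviation, matching the \\(\\sqrt{n}\\) factor in the formula.

```python
import numpy as np

def ols_coef(x, y):
    """No-intercept OLS slope: <x, y> / ||x||^2."""
    return np.dot(x, y) / np.dot(x, x)

def standardize(x):
    """Center x and rescale to unit (population, ddof=0) standard deviation."""
    xc = x - x.mean()
    return np.sqrt(len(x)) * xc / np.linalg.norm(xc)

# Made-up illustrative data.
x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
y = np.array([0.5, 2.5, 2.0, 6.0, 7.5])

zx, zy = standardize(x), standardize(y)

print("std of standardized x:", zx.std())
print("OLS coef on standardized inputs:", ols_coef(zx, zy))
print("np.corrcoef reference:          ", np.corrcoef(x, y)[0, 1])
```

On standardized inputs the coefficient is also symmetric in its arguments, since both norms equal \\(\\sqrt{n}\\).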
Summarizing: Cosine similarity is normalized inner product. Pearson correlation is centered cosine similarity. A one-variable OLS coefficient is like cosine but with one-sided normalization. With an intercept, it's centered.
Of course we need a summary table. "Symmetric" means, if you swap the inputs, do you get the same answer. "Invariant to shift in input" means, if you add an arbitrary constant to either input, do you get the same answer.
| Function | Equation | Symmetric? | Output range | Invariant to shift in input? | Pithy explanation in terms of something else |
|---|---|---|---|---|---|
| Inner(x,y) | \\\[ \\langle x, y\\rangle\\\] | Yes | \\(\\mathbb{R}\\) | No | |
| CosSim(x,y) | \\\[ \\frac{\\langle x,y \\rangle}{\|\|x\|\|\\ \|\|y\|\|} \\\] | Yes | \[-1,1\] or \[0,1\] if inputs non-neg | No | normalized inner product |
| Corr(x,y) | \\\[ \\frac{\\langle x-\\bar{x},\\ y-\\bar{y} \\rangle }{\|\|x-\\bar{x}\|\|\\ \|\|y-\\bar{y}\|\|} \\\] | Yes | \[-1,1\] | Yes | centered cosine; *or* normalized covariance |
| Cov(x,y) | \\\[ \\frac{\\langle x-\\bar{x},\\ y-\\bar{y} \\rangle}{n} \\\] | Yes | \\(\\mathbb{R}\\) | Yes | centered inner product |
| OLSCoefNoIntcpt(x,y) | \\\[\\frac{ \\langle x, y \\rangle}{ \|\|x\|\|^2 }\\\] | No | \\(\\mathbb{R}\\) | No | (compare to CosSim) |
| OLSCoefWithIntcpt(x,y) | \\\[ \\frac{\\langle x-\\bar{x},\\ y \\rangle}{\|\|x-\\bar{x}\|\|^2} \\\] | No | \\(\\mathbb{R}\\) | Yes | |
Are there any implications? I've been wondering for a while why cosine similarity tends to be so useful for natural language processing applications. Maybe this has something to do with it. Or not. One implication of all the inner product stuff is computational strategies to make it faster when there's high-dimensional sparse data: the [Friedman et al. 2010 glmnet](http://www.jstatsoft.org/v33/i01) paper talks about this in the context of coordinate descent text regression. I've heard [Dhillon et al., NIPS 2011](http://www.cs.utexas.edu/~pradeepr/paperz/coord_nips.pdf) applies [LSH](http://en.wikipedia.org/wiki/Locality-sensitive_hashing) in a similar setting (but haven't read it yet). And there's lots of work using LSH for cosine similarity; e.g. [van Durme and Lall 2010 \[slides\]](http://cs.jhu.edu/~vandurme/papers/VanDurmeLallACL10-slides.pdf).
Any other cool identities? Any corrections to the above?
References: I use [Hastie et al 2009, chapter 3](http://www-stat.stanford.edu/~tibs/ElemStatLearn/) to look up linear regression, but it's covered in zillions of other places. I linked to a nice chapter in [Tufte's little 1974 book](http://web.archive.org/web/20111103122217/http://www.edwardtufte.com/tufte/dapp/chapter3.html) that he wrote before he went off and did all that visualization stuff. (He calls it "two-variable regression", but I think "one-variable regression" is a better term. "one-feature" or "one-covariate" might be most accurate.) In my experience, cosine similarity is talked about more often in text processing or machine learning contexts. |
| Shard | 34 (laksa) |
| Root Hash | 11773947750911624434 |
| Unparsed URL | com,brenocon!/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/ s443 |