šŸ•·ļø Crawler Inspector

URL Lookup

Direct Parameter Lookup

Raw Queries and Responses

1. Shard Calculation

Query:
Response:
Calculated Shard: 34 (from laksa067)

2. Crawled Status Check

Query:
Response:

3. Robots.txt Check

Query:
Response:

4. Spam/Ban Check

Query:
Response:

5. Seen Status Check

ā„¹ļø Skipped - page is already crawled

📄 INDEXABLE
✅ CRAWLED (14 hours ago)
🤖 ROBOTS ALLOWED

Page Info Filters

| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |

Page Details

| Property | Value |
|---|---|
| URL | https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/ |
| Last Crawled | 2026-04-14 11:09:10 (14 hours ago) |
| First Indexed | 2015-10-31 19:12:34 (10 years ago) |
| HTTP Status Code | 200 |
| Meta Title | Cosine similarity, Pearson correlation, and OLS coefficients \| AI and Social Science – Brendan O'Connor |
| Meta Description | null |
| Meta Canonical | null |
Boilerpipe Text
Cosine similarity, Pearson correlations, and OLS coefficients can all be viewed as variants on the inner product — tweaked in different ways for centering and magnitude (i.e. location and scale, or something like that). Details: You have two vectors \(x\) and \(y\) and want to measure similarity between them. A basic similarity function is the inner product \[ Inner(x,y) = \sum_i x_i y_i = \langle x, y \rangle \] If x tends to be high where y is also high, and low where y is low, the inner product will be high — the vectors are more similar. The inner product is unbounded. One way to make it bounded between -1 and 1 is to divide by the vectors’ L2 norms, giving the cosine similarity \[ CosSim(x,y) = \frac{\sum_i x_i y_i}{ \sqrt{ \sum_i x_i^2} \sqrt{ \sum_i y_i^2 } } = \frac{ \langle x,y \rangle }{ ||x||\ ||y|| } \] This is actually bounded between 0 and 1 if x and y are non-negative. Cosine similarity has an interpretation as the cosine of the angle between the two vectors; you can illustrate this for vectors in \(\mathbb{R}^2\) (e.g. here ). Cosine similarity is not invariant to shifts. If x was shifted to x+1, the cosine similarity would change. What is invariant, though, is the Pearson correlation . Let \(\bar{x}\) and \(\bar{y}\) be the respective means: \begin{align} Corr(x,y) &= \frac{ \sum_i (x_i-\bar{x}) (y_i-\bar{y}) }{ \sqrt{\sum (x_i-\bar{x})^2} \sqrt{ \sum (y_i-\bar{y})^2 } } \\ & = \frac{\langle x-\bar{x},\ y-\bar{y} \rangle}{ ||x-\bar{x}||\ ||y-\bar{y}||} \\ & = CosSim(x-\bar{x}, y-\bar{y}) \end{align} Correlation is the cosine similarity between centered versions of x and y, again bounded between -1 and 1. People usually talk about cosine similarity in terms of vector angles, but it can be loosely thought of as a correlation, if you think of the vectors as paired samples. Unlike the cosine, the correlation is invariant to both scale and location changes of x and y. 
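As a rough numerical check of the two claims above (correlation is the cosine similarity of the centered vectors, and only the correlation survives a shift of x), here is a minimal Python sketch. The helper names (`inner`, `cos_sim`, `center`, `corr`) and the sample vectors are illustrative assumptions, not part of the original post.

```python
import math

def inner(x, y):
    """Inner product <x, y> of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))

def cos_sim(x, y):
    """Inner product divided by both L2 norms."""
    return inner(x, y) / (math.sqrt(inner(x, x)) * math.sqrt(inner(y, y)))

def center(v):
    """Subtract the mean from every element."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def corr(x, y):
    """Pearson correlation written as cosine similarity of centered vectors."""
    return cos_sim(center(x), center(y))

# Hypothetical sample data, chosen only for illustration.
x = [1.0, 2.0, 3.0]
y = [5.0, 6.0, 10.0]
x_shifted = [a + 1.0 for a in x]  # shift x by a constant

cos_before, cos_after = cos_sim(x, y), cos_sim(x_shifted, y)
corr_before, corr_after = corr(x, y), corr(x_shifted, y)

print("cosine:     ", cos_before, "->", cos_after)    # changes under the shift
print("correlation:", corr_before, "->", corr_after)  # stays the same
```

The shift changes the angle between x and y as seen from the origin, so the cosine moves, while centering removes the shift before the cosine is taken, so the correlation does not.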
This isn’t the usual way to derive the Pearson correlation; usually it’s presented as a normalized form of the covariance, which is a centered average inner product (no normalization) \[ Cov(x,y) = \frac{\sum (x_i-\bar{x})(y_i-\bar{y}) }{n} = \frac{ \langle x-\bar{x},\ y-\bar{y} \rangle }{n} \]

Finally, these are all related to the coefficient in a one-variable linear regression. For the OLS model \(y_i \approx ax_i\) with Gaussian noise, whose MLE is the least-squares problem \(\arg\min_a \sum (y_i - ax_i)^2\), a few lines of calculus show \(a\) is \begin{align} OLSCoef(x,y) &= \frac{ \sum x_i y_i }{ \sum x_i^2 } = \frac{ \langle x, y \rangle}{ ||x||^2 } \end{align} This looks like another normalized inner product. But unlike cosine similarity, we aren’t normalizing by \(y\)’s norm; instead we only use \(x\)’s norm (and use it twice): denominator of \(||x||\ ||y||\) versus \(||x||^2\). Not normalizing for \(y\) is what you want for the linear regression: if \(y\) was stretched to span a larger range, you would need to increase \(a\) to match, to get your predictions spread out too.

Often it’s desirable to do the OLS model with an intercept term: \(\min_{a,b} \sum (y_i - ax_i - b)^2\). Then \(a\) is \begin{align} OLSCoefWithIntercept(x,y) &= \frac { \sum (x_i - \bar{x}) y_i } { \sum (x_i - \bar{x})^2 } = \frac{\langle x-\bar{x},\ y \rangle}{||x-\bar{x}||^2} \\ &= OLSCoef(x-\bar{x}, y) \end{align} It’s different because the intercept term picks up the slack associated with where x’s center is. So OLSCoefWithIntercept is invariant to shifts of x. It’s still different than cosine similarity since it’s still not normalizing at all for y. Though, subtly, it does actually control for shifts of y. This isn’t obvious in the equation, but with a little arithmetic it’s easy to derive that \( \langle x-\bar{x},\ y \rangle = \langle x-\bar{x},\ y+c \rangle \) for any constant \(c\). (There must be a nice geometric interpretation of this.) 
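The "little arithmetic" behind the shift invariance in y can be written out: \( \langle x-\bar{x},\ y+c \rangle = \langle x-\bar{x},\ y \rangle + c \sum_i (x_i-\bar{x}) \), and the deviations from the mean sum to zero, so the extra term vanishes. The Python sketch below checks this numerically under stated assumptions: the function names (`ols_coef`, `ols_coef_with_intercept`) and the sample vectors are hypothetical and chosen only for illustration.

```python
def inner(x, y):
    """Inner product <x, y> of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))

def center(v):
    """Subtract the mean from every element."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def ols_coef(x, y):
    """Slope of the no-intercept fit y ~ a*x: <x, y> / ||x||^2."""
    return inner(x, y) / inner(x, x)

def ols_coef_with_intercept(x, y):
    """Slope of the fit y ~ a*x + b: the no-intercept slope on centered x."""
    return ols_coef(center(x), y)

# Hypothetical sample data, chosen only for illustration.
x = [1.0, 2.0, 3.0]
y = [5.0, 6.0, 10.0]
y_shifted = [b + 5.0 for b in y]
x_shifted = [a + 5.0 for a in x]

slope = ols_coef_with_intercept(x, y)
slope_y_shifted = ols_coef_with_intercept(x, y_shifted)
slope_x_shifted = ols_coef_with_intercept(x_shifted, y)

no_intercept = ols_coef(x, y)
no_intercept_y_shifted = ols_coef(x, y_shifted)

print(slope, slope_y_shifted, slope_x_shifted)  # all three agree
print(no_intercept, no_intercept_y_shifted)     # these two differ
```

One way to read the result geometrically: the centered vector \(x-\bar{x}\) is orthogonal to the all-ones vector, so adding any multiple of the all-ones vector to y cannot change their inner product.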
Finally, what if x and y are standardized: both centered and normalized to unit standard deviation? The OLS coefficient for that is the same as the Pearson correlation between the original vectors. I’m not sure what this means or if it’s a useful fact, but: \[ OLSCoef\left( \sqrt{n}\frac{x-\bar{x}}{||x-\bar{x}||}, \sqrt{n}\frac{y-\bar{y}}{||y-\bar{y}||} \right) = Corr(x,y) \]

Summarizing: Cosine similarity is normalized inner product. Pearson correlation is centered cosine similarity. A one-variable OLS coefficient is like cosine but with one-sided normalization. With an intercept, it’s centered.

Of course we need a summary table. “Symmetric” means, if you swap the inputs, do you get the same answer. “Invariant to shift in input” means, if you add an arbitrary constant to either input, do you get the same answer.

| Function | Equation | Symmetric? | Output range | Invariant to shift in input? | Pithy explanation in terms of something else |
|---|---|---|---|---|---|
| Inner(x,y) | \[ \langle x, y\rangle \] | Yes | \(\mathbb{R}\) | No | |
| CosSim(x,y) | \[ \frac{\langle x,y \rangle}{\lVert x \rVert\ \lVert y \rVert} \] | Yes | [-1,1] or [0,1] if inputs non-neg | No | normalized inner product |
| Corr(x,y) | \[ \frac{\langle x-\bar{x},\ y-\bar{y} \rangle }{\lVert x-\bar{x} \rVert\ \lVert y-\bar{y} \rVert} \] | Yes | [-1,1] | Yes | centered cosine; or normalized covariance |
| Cov(x,y) | \[ \frac{\langle x-\bar{x},\ y-\bar{y} \rangle}{n} \] | Yes | \(\mathbb{R}\) | Yes | centered inner product |
| OLSCoefNoIntcpt(x,y) | \[ \frac{ \langle x, y \rangle}{ \lVert x \rVert^2 } \] | No | \(\mathbb{R}\) | No | (compare to CosSim) |
| OLSCoefWithIntcpt(x,y) | \[ \frac{\langle x-\bar{x},\ y \rangle}{\lVert x-\bar{x} \rVert^2} \] | No | \(\mathbb{R}\) | Yes | |

Are there any implications? I’ve been wondering for a while why cosine similarity tends to be so useful for natural language processing applications. Maybe this has something to do with it. Or not. One implication of all the inner product stuff is computational strategies to make it faster when there’s high-dimensional sparse data: the Friedman et al. 
2010 glmnet paper talks about this in the context of coordinate descent text regression. I’ve heard Dhillon et al., NIPS 2011 applies LSH in a similar setting (but haven’t read it yet). And there’s lots of work using LSH for cosine similarity; e.g. van Durme and Lall 2010 [slides]. Any other cool identities? Any corrections to the above? References: I use Hastie et al 2009, chapter 3 to look up linear regression, but it’s covered in zillions of other places. I linked to a nice chapter in Tufte’s little 1974 book that he wrote before he went off and did all that visualization stuff. (He calls it “two-variable regression”, but I think “one-variable regression” is a better term. “one-feature” or “one-covariate” might be most accurate.) In my experience, cosine similarity is talked about more often in text processing or machine learning contexts.
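The standardized-inputs identity stated in the post (the no-intercept OLS coefficient of the standardized vectors equals the Pearson correlation of the originals) can also be checked numerically. This is a minimal sketch under stated assumptions: `standardize` uses the population standard deviation (dividing by n), matching the \(\sqrt{n}\) factor in the formula, and all function names and sample values are hypothetical.

```python
import math

def inner(x, y):
    """Inner product <x, y> of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))

def center(v):
    """Subtract the mean from every element."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def standardize(v):
    """Center v and rescale it to unit population standard deviation."""
    c = center(v)
    scale = math.sqrt(len(v)) / math.sqrt(inner(c, c))
    return [scale * a for a in c]

def ols_coef(x, y):
    """Slope of the no-intercept fit y ~ a*x: <x, y> / ||x||^2."""
    return inner(x, y) / inner(x, x)

def corr(x, y):
    """Pearson correlation as cosine similarity of centered vectors."""
    cx, cy = center(x), center(y)
    return inner(cx, cy) / (math.sqrt(inner(cx, cx)) * math.sqrt(inner(cy, cy)))

# Hypothetical sample data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 7.0]
y = [5.0, 6.0, 10.0, 9.0]

coef_standardized = ols_coef(standardize(x), standardize(y))
pearson = corr(x, y)

print(coef_standardized, pearson)  # the two values agree up to rounding
```

The reason is visible in the algebra: after standardizing, the squared norm of x is n, and the inner product of the two standardized vectors is n times the correlation, so the ratio is the correlation itself.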
Markdown
[AI and Social Science – Brendan O'Connor](https://brenocon.com/blog/ "AI and Social Science – Brendan O'Connor") cognition, language, social systems; statistics, visualization, computation ![]() [← I don’t get this web parsing shared task](https://brenocon.com/blog/2012/03/i-dont-get-this-web-parsing-shared-task/) [F-scores, Dice, and Jaccard set similarity →](https://brenocon.com/blog/2012/04/f-scores-dice-and-jaccard-set-similarity/) # Cosine similarity, Pearson correlation, and OLS coefficients Posted on [March 13, 2012](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/ "6:01 pm") Cosine similarity, Pearson correlations, and OLS coefficients can all be viewed as variants on the inner product — tweaked in different ways for centering and magnitude (i.e. location and scale, or something like that). Details: You have two vectors \\(x\\) and \\(y\\) and want to measure similarity between them. A basic similarity function is the **[inner product](http://en.wikipedia.org/wiki/Dot_product)** \\\[ Inner(x,y) = \\sum\_i x\_i y\_i = \\langle x, y \\rangle \\\] If x tends to be high where y is also high, and low where y is low, the inner product will be high — the vectors are more similar. The inner product is unbounded. One way to make it bounded between -1 and 1 is to divide by the vectors’ L2 norms, giving the **[cosine similarity](http://en.wikipedia.org/wiki/Cosine_similarity)** \\\[ CosSim(x,y) = \\frac{\\sum\_i x\_i y\_i}{ \\sqrt{ \\sum\_i x\_i^2} \\sqrt{ \\sum\_i y\_i^2 } } = \\frac{ \\langle x,y \\rangle }{ \|\|x\|\|\\ \|\|y\|\| } \\\] This is actually bounded between 0 and 1 if x and y are non-negative. Cosine similarity has an interpretation as the cosine of the angle between the two vectors; you can illustrate this for vectors in \\(\\mathbb{R}^2\\) (e.g. [here](http://nlp.stanford.edu/IR-book/html/htmledition/dot-products-1.html)). Cosine similarity is not invariant to shifts. 
If x was shifted to x+1, the cosine similarity would change. What is invariant, though, is the **[Pearson correlation](http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient)**. Let \\(\\bar{x}\\) and \\(\\bar{y}\\) be the respective means: \\begin{align} Corr(x,y) &= \\frac{ \\sum\_i (x\_i-\\bar{x}) (y\_i-\\bar{y}) }{ \\sqrt{\\sum (x\_i-\\bar{x})^2} \\sqrt{ \\sum (y\_i-\\bar{y})^2 } } \\\\ & = \\frac{\\langle x-\\bar{x},\\ y-\\bar{y} \\rangle}{ \|\|x-\\bar{x}\|\|\\ \|\|y-\\bar{y}\|\|} \\\\ & = CosSim(x-\\bar{x}, y-\\bar{y}) \\end{align} Correlation is the cosine similarity between centered versions of x and y, again bounded between -1 and 1. People usually talk about cosine similarity in terms of vector angles, but it can be loosely thought of as a correlation, if you think of the vectors as paired samples. Unlike the cosine, the correlation is invariant to both scale and location changes of x and y. This isn’t the usual way to derive the Pearson correlation; usually it’s presented as a normalized form of the **[covariance](http://en.wikipedia.org/wiki/Covariance)**, which is a centered average inner product (no normalization) \\\[ Cov(x,y) = \\frac{\\sum (x\_i-\\bar{x})(y\_i-\\bar{y}) }{n} = \\frac{ \\langle x-\\bar{x},\\ y-\\bar{y} \\rangle }{n} \\\] Finally, these are all related to the coefficient in a **[one-variable linear regression](http://web.archive.org/web/20111103122217/http://www.edwardtufte.com/tufte/dapp/chapter3.html)**. For the OLS model \\(y\_i \\approx ax\_i\\) with Gaussian noise, whose MLE is the least-squares problem \\(\\arg\\min\_a \\sum (y\_i – ax\_i)^2\\), a few lines of calculus shows \\(a\\) is \\begin{align} OLSCoef(x,y) &= \\frac{ \\sum x\_i y\_i }{ \\sum x\_i^2 } = \\frac{ \\langle x, y \\rangle}{ \|\|x\|\|^2 } \\end{align} This looks like another normalized inner product. 
But unlike cosine similarity, we aren’t normalizing by \\(y\\)’s norm — instead we only use \\(x\\)’s norm (and use it twice): denominator of \\(\|\|x\|\|\\ \|\|y\|\|\\) versus \\(\|\|x\|\|^2\\). Not normalizing for \\(y\\) is what you want for the linear regression: if \\(y\\) was stretched to span a larger range, you would need to increase \\(a\\) to match, to get your predictions spread out too. Often it’s desirable to do the OLS model with an intercept term: \\(\\min\_{a,b} \\sum (y – ax\_i – b)^2\\). Then \\(a\\) is \\begin{align} OLSCoefWithIntercept(x,y) &= \\frac { \\sum (x\_i – \\bar{x}) y\_i } { \\sum (x\_i – \\bar{x})^2 } = \\frac{\\langle x-\\bar{x},\\ y \\rangle}{\|\|x-\\bar{x}\|\|^2} \\\\ &= OLSCoef(x-\\bar{x}, y) \\end{align} It’s different because the intercept term picks up the slack associated with where x’s center is. So OLSCoefWithIntercept is invariant to shifts of x. It’s still different than cosine similarity since it’s still not normalizing at all for y. Though, subtly, it does actually control for shifts of y. This isn’t obvious in the equation, but with a little arithmetic it’s easy to derive that \\( \\langle x-\\bar{x},\\ y \\rangle = \\langle x-\\bar{x},\\ y+c \\rangle \\) for any constant \\(c\\). (There must be a nice geometric interpretation of this.) Finally, what if x and y are standardized: both centered and normalized to unit standard deviation? The OLS coefficient for that is the same as the Pearson correlation between the original vectors. I’m not sure what this means or if it’s a useful fact, but: \\\[ OLSCoef\\left( \\sqrt{n}\\frac{x-\\bar{x}}{\|\|x-\\bar{x}\|\|}, \\sqrt{n}\\frac{y-\\bar{y}}{\|\|y-\\bar{y}\|\|} \\right) = Corr(x,y) \\\] Summarizing: Cosine similarity is normalized inner product. Pearson correlation is centered cosine similarity. A one-variable OLS coefficient is like cosine but with one-sided normalization. With an intercept, it’s centered. Of course we need a summary table. 
ā€œSymmetricā€ means, if you swap the inputs, do you get the same answer. ā€œInvariant to shift in inputā€ means, if you add an arbitrary constant to either input, do you get the same answer. | | | | | | | |---|---|---|---|---|---| | Function | Equation | Symmetric? | Output range | Invariant to shift in input? | Pithy explanation in terms of something else | | Inner(x,y) | \\\[ \\langle x, y\\rangle\\\] | Yes | \\(\\mathbb{R}\\) | No | | | CosSim(x,y) | \\\[ \\frac{\\langle x,y \\rangle}{\|\|x\|\|\\ \|\|y\|\|} \\\] | Yes | \[-1,1\] or \[0,1\] if inputs non-neg | No | normalized inner product | | Corr(x,y) | \\\[ \\frac{\\langle x-\\bar{x},\\ y-\\bar{y} \\rangle }{\|\|x-\\bar{x}\|\|\\ \|\|y-\\bar{y}\|\|} \\\] | Yes | \[-1,1\] | Yes | centered cosine; *or* normalized covariance | | Cov(x,y) | \\\[ \\frac{\\langle x-\\bar{x},\\ y-\\bar{y} \\rangle}{n} \\\] | Yes | \\(\\mathbb{R}\\) | Yes | centered inner product | | OLSCoefNoIntcpt(x,y) | \\\[\\frac{ \\langle x, y \\rangle}{ \|\|x\|\|^2 }\\\] | No | \\(\\mathbb{R}\\) | No | (compare to CosSim) | | OLSCoefWithIntcpt(x,y) | \\\[ \\frac{\\langle x-\\bar{x},\\ y \\rangle}{\|\|x-\\bar{x}\|\|^2} \\\] | No | \\(\\mathbb{R}\\) | Yes | | Are there any implications? I’ve been wondering for a while why cosine similarity tends to be so useful for natural language processing applications. Maybe this has something to do with it. Or not. One implication of all the inner product stuff is computational strategies to make it faster when there’s high-dimensional sparse data — the [Friedman et al. 2010 glmnet](http://www.jstatsoft.org/v33/i01) paper talks about this in the context of coordinate descent text regression. I’ve heard [Dhillon et al., NIPS 2011](http://www.cs.utexas.edu/~pradeepr/paperz/coord_nips.pdf) applies [LSH](http://en.wikipedia.org/wiki/Locality-sensitive_hashing) in a similar setting (but haven’t read it yet). And there’s lots of work using LSH for cosine similarity; e.g. 
[van Durme and Lall 2010 \[slides\]](http://cs.jhu.edu/~vandurme/papers/VanDurmeLallACL10-slides.pdf). Any other cool identities? Any corrections to the above? References: I use [Hastie et al 2009, chapter 3](http://www-stat.stanford.edu/~tibs/ElemStatLearn/) to look up linear regression, but it’s covered in zillions of other places. I linked to a nice chapter in [Tufte’s little 1974 book](http://web.archive.org/web/20111103122217/http://www.edwardtufte.com/tufte/dapp/chapter3.html) that he wrote before he went off and did all that visualization stuff. (He calls it “two-variable regression”, but I think “one-variable regression” is a better term. “one-feature” or “one-covariate” might be most accurate.) In my experience, cosine similarity is talked about more often in text processing or machine learning contexts. This entry was posted in [Uncategorized](https://brenocon.com/blog/category/uncategorized/ "View all posts in Uncategorized"). Bookmark the [permalink](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/ "Permalink to Cosine similarity, Pearson correlation, and OLS coefficients"). [← I don’t get this web parsing shared task](https://brenocon.com/blog/2012/03/i-dont-get-this-web-parsing-shared-task/) [F-scores, Dice, and Jaccard set similarity →](https://brenocon.com/blog/2012/04/f-scores-dice-and-jaccard-set-similarity/) ### 23 Responses to *Cosine similarity, Pearson correlation, and OLS coefficients* 1. 
![](https://secure.gravatar.com/avatar/965693aaf6fc1a5c24ac8746df8e17a1?s=40&d=identicon&r=R) [Victor Chahuneau](http://victor.chahuneau.fr/) says: [March 15, 2012 at 1:21 am](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-129691) I think your OLSCoefWithIntercept is wrong unless y is centered: the right part of the dot product should be (y-) Then the invariance by translation is obvious… Otherwise you would get \<x-, y+c\> = \<x-,y\> + c(n-1) See [Wikipedia](http://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_line) for the equation - ![](https://secure.gravatar.com/avatar/965693aaf6fc1a5c24ac8746df8e17a1?s=40&d=identicon&r=R) [Victor Chahuneau](http://victor.chahuneau.fr/) says: [March 15, 2012 at 1:24 am](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-129692) … but of course WordPress doesn’t like my brackets… Line 1:\$(y-\\bar y)\$ Line 3: \$ = + c(n-1)\\bar x\$ 2. ![](https://secure.gravatar.com/avatar/fd4b164e15fa2a834d16fb8743ec4f1b?s=40&d=identicon&r=R) [brendano](http://brenocon.com/) says: [March 15, 2012 at 5:05 am](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-129735) Nope, you don’t need to center y if you’re centering x. The Wikipedia equation isn’t as correct as Hastie :) I actually didn’t believe this when I was writing the post, but if you write out the arithmetic like I said you can derive it. Example: \$ R \> x=c(1,2,3); y=c(5,6,10) \> inner\_and\_xnorm=function(x,y) sum(x\*y) / sum(x\*\*2) \> inner\_and\_xnorm(x-mean(x),y) \[1\] 2.5 \> inner\_and\_xnorm(x-mean(x),y+5) \[1\] 2.5 … if you don’t center x, then shifting y matters. 3. 
![](https://secure.gravatar.com/avatar/965693aaf6fc1a5c24ac8746df8e17a1?s=40&d=identicon&r=R) [Victor Chahuneau](http://victor.chahuneau.fr/) says: [March 15, 2012 at 4:15 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-129911) Oops… I was wrong about the invariance\! It turns out that we were both right on the formula for the coefficient… thanks to this same invariance. Here is the full derivation: <http://dl.dropbox.com/u/2803234/ols.pdf> Wikipedia & Hastie can be reconciled now… 4. ![](https://secure.gravatar.com/avatar/e92ac569643b505ef24bf6de3f533954?s=40&d=identicon&r=R) Mike says: [March 26, 2012 at 8:40 am](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-133347) Nice breakdown Brendan. I’ve been working recently with high-dimensional sparse data. The covariance/correlation matrices can be calculated without losing sparsity after rearranging some terms. <http://stackoverflow.com/a/9626089/1257542> - ![](https://secure.gravatar.com/avatar/e92ac569643b505ef24bf6de3f533954?s=40&d=identicon&r=R) Mike says: [March 26, 2012 at 12:17 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-133439) for instance, with two sparse vectors, you can get the correlation and covariance without subtracting the means cov(x,y) = ( inner(x,y) – n mean(x) mean(y)) / (n-1) cor(x,y) = ( inner(x,y) – n mean(x) mean(y)) / (sd(x) sd(y) (n-1)) - ![](https://secure.gravatar.com/avatar/fd4b164e15fa2a834d16fb8743ec4f1b?s=40&d=identicon&r=R) [Brendan O'Connor](http://brenocon.com/) says: [March 26, 2012 at 1:18 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-133486) Oh awesome, thanks\! 5. 
![](https://secure.gravatar.com/avatar/382c47ed0c49339969bd0d37ac136c43?s=40&d=identicon&r=R) Kat says: [April 24, 2012 at 11:12 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-143747) Hey Brendan! Maybe you are the right person to ask this to – if I want to figure out how similar two sets of paired vectors are (both angle AND magnitude) how would I do that? I originally started by looking at cosine similarity (well, I started them all from 0,0 so I guess now I know it was correlation?) but of course that doesn’t look at magnitude at all. Is there a way that people usually weight direction and magnitude, or is that arbitrary? - ![](https://secure.gravatar.com/avatar/fd4b164e15fa2a834d16fb8743ec4f1b?s=40&d=identicon&r=R) [Brendan O'Connor](http://brenocon.com/) says: [April 24, 2012 at 11:25 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-143748) Why not inner product? - ![](https://secure.gravatar.com/avatar/382c47ed0c49339969bd0d37ac136c43?s=40&d=identicon&r=R) Kat says: [April 24, 2012 at 11:43 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-143751) I would like and to be more similar than and , for example - ![](https://secure.gravatar.com/avatar/382c47ed0c49339969bd0d37ac136c43?s=40&d=identicon&r=R) Kat says: [April 24, 2012 at 11:44 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-143752) ok no tags this time – 1,1 and 1,1 to be more similar than 1,1 and 5,5 6. Pingback: [Triangle problem – finding height with given area and angles. « Math World – etidhor](http://etidhor.wordpress.com/2013/01/04/triangle-problem-finding-height-with-given-area-and-angles/) 7. 
![](https://secure.gravatar.com/avatar/5c736be73387b6197ec41a533ec9663e?s=40&d=identicon&r=R) [Adam](http://www.designandanalytics.com/) says: [February 1, 2013 at 5:57 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-270730) This is one of the best technical summary blog posts that I can remember seeing. I’ve just started in NLP and was confused at first seeing cosine appear as the de facto relatedness measure; this really helped me mentally reconcile it with the alternatives. Very interesting and great post. 8. ![](https://secure.gravatar.com/avatar/e5d98b48bb89cdd3b53f8aae34798c18?s=40&d=identicon&r=R) [Paul Moore](http://people.maths.ox.ac.uk/moorep/) says: [March 18, 2013 at 5:14 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-304935) A very helpful discussion – thanks. Have you seen – ‘Thirteen Ways to Look at the Correlation Coefficient’ by Joseph Lee Rodgers; W. Alan Nicewander, The American Statistician, Vol. 42, No. 1. (Feb., 1988), pp. 59-66. It covers a related discussion. 9. ![](https://secure.gravatar.com/avatar/fd4b164e15fa2a834d16fb8743ec4f1b?s=40&d=identicon&r=R) [brendano](http://brenocon.com/) says: [March 18, 2013 at 5:17 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-304941) Great tip, I remember seeing that once but totally forgot about it. Here’s a link, [http://data.psych.udel.edu/laurenceau/PSYC861Regression%20Spring%202012/READINGS/rodgers-nicewander-1988-r-13-ways.pdf](http://data.psych.udel.edu/laurenceau/PSYC861Regression%20Spring%202012/READINGS/rodgers-nicewander-1988-r-13-ways.pdf) 10. Pingback: [Correlation picture \| AI and Social Science – Brendan O'Connor](http://brenocon.com/blog/2013/03/correlation-picture/) 11. 
![](https://secure.gravatar.com/avatar/1c5157ed916def68ae1700b8969ff1a8?s=40&d=identicon&r=R) Peter says: [March 29, 2013 at 3:24 am](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-312261) Useful info: I have a few questions (i am pretty new to that field). You say correlation is invariant of shifts. i guess you just mean if the x-axis is not 1 2 3 4 but 10 20 30 or 30 20 10.. then it doesn’t change anything. but you doesn’t mean that if i shift the signal i will get the same correlation right? ex: \[1 2 1 2 1\] and \[1 2 1 2 1\], corr = 1 but if i cyclically shift \[1 2 1 2 1\] and \[2 1 2 1 2\], corr = -1 or if i just shift by padding zeros \[1 2 1 2 1 0\] and \[0 1 2 1 2 1\] then corr = -0.0588 Please elaborate on that. Also could we say that distance correlation (1-correlation) can be considered as norm\_1 or norm\_2 distance somehow? for example when we want to minimize the squared errors, usually we need to use euclidean distance, but could pearson’s correlation also be used? Ans last, OLSCoef(x,y) can be considered as scale invariant? is very correlated to cosine similarity which is not scale invariant (Pearson’s correlation is right?). Look at: “Patterns of Temporal Variation in Online Media” and “Fast time-series searching with scaling and shifting”. That confuses me.. but maybe i am missing something. - ![](https://secure.gravatar.com/avatar/fd4b164e15fa2a834d16fb8743ec4f1b?s=40&d=identicon&r=R) [Brendan O'Connor](http://brenocon.com/) says: [April 1, 2013 at 10:27 pm](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-314504) Hi Peter – By “invariant to shift in input”, I mean, if you \*add\* to the input. That is, f(x, y) = f(x+a, y) for any scalar ‘a’. By “scale invariant”, I mean, if you \*multiply\* the input by something. For (1-corr), the problem is negative correlations. 
I think maximizing the squared correlation is the same thing as minimizing squared error .. that’s why it’s called R^2, the explained variance ratio. I don’t understand your question about OLSCoef and have not seen the papers you’re talking about. 12. Pingback: [Machine learning literary genres from 19th century seafaring, horror and western novels \| Sub-Sub Algorithm](http://subsubalgorithm.wordpress.com/2013/12/07/machine-learning-literary-genres-from-19th-century-seafaring-horror-and-western-novels/) 13. Pingback: [Machine learning literary genres from 19th century seafaring, horror and western novels \| Sub-Subroutine](http://subsubroutine.wordpress.com/2013/12/08/machine-learning-literary-genres-from-19th-century-seafaring-horror-and-western-novels/) 14. ![](https://secure.gravatar.com/avatar/4259fc571b7b0caa247f4149f0d3d902?s=40&d=identicon&r=R) [Waylon Flinn](http://www.crunchmagic.com/) says: [December 11, 2013 at 4:51 am](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/#comment-758397) Wonderful post. The more I investigate it the more it looks like every relatedness measure around is just a different normalization of the inner product. Similar analyses reveal that Lift, Jaccard Index and even the standard Euclidean metric can be viewed as different corrections to the dot product. It’s not a viewpoint I’ve seen a lot of. It was this post that started my investigation of this phenomenon. For that, I’m grateful to you. The fact that the basic dot product can be seen to underlie all these similarity measures turns out to be convenient. If you stack all the vectors in your space on top of each other to create a matrix, you can produce all the inner products simply by multiplying the matrix by it’s transpose. Furthermore, the extra ingredient in every similarity measure I’ve looked at so far involves the magnitudes (or squared magnitudes) of the individual vectors. 
These drop out of this matrix multiplication as well. Just extract the diagonal. Because of its exceptional utility, I’ve dubbed the symmetric matrix that results from this product the base similarity matrix. I haven’t been able to find many other references which formulate these metrics in terms of this matrix, or the inner product as you’ve done. Known mathematics is both broad and deep, so it seems likely that I’m stumbling upon something that’s already been investigated. Do you know of other work that explores this underlying structure of similarity measures? Is the construction of this base similarity matrix a standard technique in the calculation of these measures? Does it have a common name? Thanks again for sharing your explorations of this topic. P.S. Here’s the other reference I’ve found that does similar work: <http://arxiv.org/pdf/1308.3740.pdf> 15. Pingback: [Building the connection between cosine similarity and correlation in R \| Question and Answer](http://qandasys.info/building-the-connection-between-cosine-similarity-and-correlation-in-r/) 16. Pingback: [相似性度量 - CSer之声](http://www.cserzs.com/similarity-measure) - ### About This is a blog on artificial intelligence and "Social Science++", with an emphasis on computation and statistics. My website is [brenocon.com](http://brenocon.com/). 
- ### Blogroll - [NLPers (Daume)](http://nlpers.blogspot.com/) - [ML Theory (Langford)](http://hunch.net/) - [SMCISS (~Gelman)](http://andrewgelman.com/) - [Normal Deviate (Wasserman)](http://normaldeviate.wordpress.com/) - [LingPipe (~Carpenter)](http://lingpipe-blog.com/) - [Three-Toed Sloth (Shalizi)](http://cscs.umich.edu/~crshalizi/weblog/) - [Smola](http://blog.smola.org/) - [R-Bloggers](http://www.r-bloggers.com/) - [FiveThirtyEight](http://fivethirtyeight.blogs.nytimes.com/) - [Marginal Revolution](http://marginalrevolution.com/) - ### Blog Search - ### [Archives](https://brenocon.com/blog/archives/) [AI and Social Science – Brendan O'Connor](https://brenocon.com/blog/ "AI and Social Science – Brendan O'Connor") [Proudly powered by WordPress.](http://wordpress.org/ "Semantic Personal Publishing Platform")
Readable Markdown
Cosine similarity, Pearson correlations, and OLS coefficients can all be viewed as variants on the inner product — tweaked in different ways for centering and magnitude (i.e. location and scale, or something like that). Details: You have two vectors \\(x\\) and \\(y\\) and want to measure similarity between them. A basic similarity function is the **[inner product](http://en.wikipedia.org/wiki/Dot_product)** \\\[ Inner(x,y) = \\sum\_i x\_i y\_i = \\langle x, y \\rangle \\\] If x tends to be high where y is also high, and low where y is low, the inner product will be high — the vectors are more similar. The inner product is unbounded. One way to make it bounded between -1 and 1 is to divide by the vectors’ L2 norms, giving the **[cosine similarity](http://en.wikipedia.org/wiki/Cosine_similarity)** \\\[ CosSim(x,y) = \\frac{\\sum\_i x\_i y\_i}{ \\sqrt{ \\sum\_i x\_i^2} \\sqrt{ \\sum\_i y\_i^2 } } = \\frac{ \\langle x,y \\rangle }{ \|\|x\|\|\\ \|\|y\|\| } \\\] This is actually bounded between 0 and 1 if x and y are non-negative. Cosine similarity has an interpretation as the cosine of the angle between the two vectors; you can illustrate this for vectors in \\(\\mathbb{R}^2\\) (e.g. [here](http://nlp.stanford.edu/IR-book/html/htmledition/dot-products-1.html)). Cosine similarity is not invariant to shifts. If x was shifted to x+1, the cosine similarity would change. What is invariant, though, is the **[Pearson correlation](http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient)**. Let \\(\\bar{x}\\) and \\(\\bar{y}\\) be the respective means: \\begin{align} Corr(x,y) &= \\frac{ \\sum\_i (x\_i-\\bar{x}) (y\_i-\\bar{y}) }{ \\sqrt{\\sum (x\_i-\\bar{x})^2} \\sqrt{ \\sum (y\_i-\\bar{y})^2 } } \\\\ & = \\frac{\\langle x-\\bar{x},\\ y-\\bar{y} \\rangle}{ \|\|x-\\bar{x}\|\|\\ \|\|y-\\bar{y}\|\|} \\\\ & = CosSim(x-\\bar{x}, y-\\bar{y}) \\end{align} Correlation is the cosine similarity between centered versions of x and y, again bounded between -1 and 1. 
People usually talk about cosine similarity in terms of vector angles, but it can be loosely thought of as a correlation, if you think of the vectors as paired samples. Unlike the cosine, which is invariant only to (positive) rescaling, the correlation is invariant to both scale and location changes of x and y.

This isn’t the usual way to derive the Pearson correlation; usually it’s presented as a normalized form of the **[covariance](http://en.wikipedia.org/wiki/Covariance)**, which is a centered average inner product (no normalization)

\\\[ Cov(x,y) = \\frac{\\sum (x\_i-\\bar{x})(y\_i-\\bar{y}) }{n} = \\frac{ \\langle x-\\bar{x},\\ y-\\bar{y} \\rangle }{n} \\\]

Finally, these are all related to the coefficient in a **[one-variable linear regression](http://web.archive.org/web/20111103122217/http://www.edwardtufte.com/tufte/dapp/chapter3.html)**. For the OLS model \\(y\_i \\approx ax\_i\\) with Gaussian noise, whose MLE is the least-squares problem \\(\\arg\\min\_a \\sum (y\_i - ax\_i)^2\\), a few lines of calculus show \\(a\\) is

\\begin{align} OLSCoef(x,y) &= \\frac{ \\sum x\_i y\_i }{ \\sum x\_i^2 } = \\frac{ \\langle x, y \\rangle}{ \|\|x\|\|^2 } \\end{align}

This looks like another normalized inner product. But unlike cosine similarity, we aren’t normalizing by \\(y\\)’s norm; instead we only use \\(x\\)’s norm (and use it twice): denominator of \\(\|\|x\|\|\\ \|\|y\|\|\\) versus \\(\|\|x\|\|^2\\). Not normalizing for \\(y\\) is what you want for the linear regression: if \\(y\\) was stretched to span a larger range, you would need to increase \\(a\\) to match, to get your predictions spread out too.

Often it’s desirable to do the OLS model with an intercept term: \\(\\min\_{a,b} \\sum (y\_i - ax\_i - b)^2\\).
Then \\(a\\) is

\\begin{align} OLSCoefWithIntercept(x,y) &= \\frac { \\sum (x\_i - \\bar{x}) y\_i } { \\sum (x\_i - \\bar{x})^2 } = \\frac{\\langle x-\\bar{x},\\ y \\rangle}{\|\|x-\\bar{x}\|\|^2} \\\\ &= OLSCoef(x-\\bar{x}, y) \\end{align}

It’s different because the intercept term picks up the slack associated with where x’s center is. So OLSCoefWithIntercept is invariant to shifts of x. It’s still different from cosine similarity since it’s still not normalizing at all for y. Though, subtly, it does actually control for shifts of y. This isn’t obvious in the equation, but with a little arithmetic it’s easy to derive that \\( \\langle x-\\bar{x},\\ y \\rangle = \\langle x-\\bar{x},\\ y+c \\rangle \\) for any constant \\(c\\): the centered \\(x\\) sums to zero, so the extra term \\(c \\sum (x\_i - \\bar{x})\\) vanishes. (There must be a nice geometric interpretation of this.)

Finally, what if x and y are standardized: both centered and normalized to unit standard deviation? The OLS coefficient for that is the same as the Pearson correlation between the original vectors. I’m not sure what this means or if it’s a useful fact, but:

\\\[ OLSCoef\\left( \\sqrt{n}\\frac{x-\\bar{x}}{\|\|x-\\bar{x}\|\|}, \\sqrt{n}\\frac{y-\\bar{y}}{\|\|y-\\bar{y}\|\|} \\right) = Corr(x,y) \\\]

Summarizing:

- Cosine similarity is normalized inner product.
- Pearson correlation is centered cosine similarity.
- A one-variable OLS coefficient is like cosine but with one-sided normalization.
- With an intercept, it’s centered.

Of course we need a summary table. “Symmetric” means, if you swap the inputs, do you get the same answer. “Invariant to shift in input” means, if you add an arbitrary constant to either input, do you get the same answer.

| Function | Equation | Symmetric? | Output range | Invariant to shift in input? | Pithy explanation in terms of something else |
|---|---|---|---|---|---|
| Inner(x,y) | \\\[ \\langle x, y\\rangle\\\] | Yes | \\(\\mathbb{R}\\) | No | |
| CosSim(x,y) | \\\[ \\frac{\\langle x,y \\rangle}{\|\|x\|\|\\ \|\|y\|\|} \\\] | Yes | \[-1,1\] or \[0,1\] if inputs non-neg | No | normalized inner product |
| Corr(x,y) | \\\[ \\frac{\\langle x-\\bar{x},\\ y-\\bar{y} \\rangle }{\|\|x-\\bar{x}\|\|\\ \|\|y-\\bar{y}\|\|} \\\] | Yes | \[-1,1\] | Yes | centered cosine; *or* normalized covariance |
| Cov(x,y) | \\\[ \\frac{\\langle x-\\bar{x},\\ y-\\bar{y} \\rangle}{n} \\\] | Yes | \\(\\mathbb{R}\\) | Yes | centered inner product |
| OLSCoefNoIntcpt(x,y) | \\\[\\frac{ \\langle x, y \\rangle}{ \|\|x\|\|^2 }\\\] | No | \\(\\mathbb{R}\\) | No | (compare to CosSim) |
| OLSCoefWithIntcpt(x,y) | \\\[ \\frac{\\langle x-\\bar{x},\\ y \\rangle}{\|\|x-\\bar{x}\|\|^2} \\\] | No | \\(\\mathbb{R}\\) | Yes | |

Are there any implications? I’ve been wondering for a while why cosine similarity tends to be so useful for natural language processing applications. Maybe this has something to do with it. Or not. One implication of all the inner product stuff is computational strategies to make it faster when there’s high-dimensional sparse data; the [Friedman et al. 2010 glmnet](http://www.jstatsoft.org/v33/i01) paper talks about this in the context of coordinate descent text regression. I’ve heard [Dhillon et al., NIPS 2011](http://www.cs.utexas.edu/~pradeepr/paperz/coord_nips.pdf) applies [LSH](http://en.wikipedia.org/wiki/Locality-sensitive_hashing) in a similar setting (but haven’t read it yet). And there’s lots of work using LSH for cosine similarity; e.g. [van Durme and Lall 2010 \[slides\]](http://cs.jhu.edu/~vandurme/papers/VanDurmeLallACL10-slides.pdf).

Any other cool identities? Any corrections to the above?

References: I use [Hastie et al 2009, chapter 3](http://www-stat.stanford.edu/~tibs/ElemStatLearn/) to look up linear regression, but it’s covered in zillions of other places.
I linked to a nice chapter in [Tufte’s little 1974 book](http://web.archive.org/web/20111103122217/http://www.edwardtufte.com/tufte/dapp/chapter3.html) that he wrote before he went off and did all that visualization stuff. (He calls it “two-variable regression”, but I think “one-variable regression” is a better term. “one-feature” or “one-covariate” might be most accurate.) In my experience, cosine similarity is talked about more often in text processing or machine learning contexts.
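
For anyone who wants to poke at the OLS identities from this post, here is a rough numerical check in plain Python. It is a sketch under assumptions, not a reference implementation: the snake_case names (`ols_coef`, `ols_coef_with_intercept`, `standardize`) are illustrative stand-ins for OLSCoef and OLSCoefWithIntercept, the data are made up, and "unit standard deviation" is taken to mean the population (divide-by-n) convention, which matches the \\(\\sqrt{n}\\) factor in the standardized identity.

```python
import math

def inner(x, y):
    """Inner product <x, y>."""
    return sum(a * b for a, b in zip(x, y))

def center(x):
    """Subtract the mean from every entry."""
    mean = sum(x) / len(x)
    return [a - mean for a in x]

def ols_coef(x, y):
    """Slope of the no-intercept fit y_i ~ a * x_i: <x, y> / ||x||^2."""
    return inner(x, y) / inner(x, x)

def ols_coef_with_intercept(x, y):
    """Slope of the fit y_i ~ a * x_i + b, i.e. OLSCoef(x - mean(x), y)."""
    return ols_coef(center(x), y)

def corr(x, y):
    """Pearson correlation of x and y."""
    xc, yc = center(x), center(y)
    return inner(xc, yc) / math.sqrt(inner(xc, xc) * inner(yc, yc))

def standardize(x):
    """Center, then rescale to unit population standard deviation."""
    xc = center(x)
    scale = math.sqrt(len(x)) / math.sqrt(inner(xc, xc))
    return [scale * a for a in xc]

# Made-up example data, for illustration only.
x = [1.0, 2.0, 3.0, 5.0]
y = [2.0, 1.0, 4.0, 6.0]

slope = ols_coef_with_intercept(x, y)

# With an intercept, the slope ignores shifts of either input.
shift_x_ok = math.isclose(slope, ols_coef_with_intercept([a + 10.0 for a in x], y))
shift_y_ok = math.isclose(slope, ols_coef_with_intercept(x, [b - 7.0 for b in y]))

# Without an intercept, shifting x changes the coefficient.
no_intercept_shift_sensitive = not math.isclose(
    ols_coef(x, y), ols_coef([a + 10.0 for a in x], y)
)

# The coefficient is not symmetric in its arguments.
asymmetric = not math.isclose(ols_coef(x, y), ols_coef(y, x))

# OLS coefficient of the standardized vectors equals the correlation.
standardized_ok = math.isclose(
    ols_coef(standardize(x), standardize(y)), corr(x, y)
)
```

Each of the boolean flags should evaluate to true on this data, mirroring the "Symmetric?" and "Invariant to shift in input?" columns of the summary table for the two OLS rows.
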
Shard34 (laksa)
Root Hash11773947750911624434
Unparsed URLcom,brenocon!/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/ s443