⚠️ Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 3.4 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
| Property | Value |
|---|---|
| URL | https://andrewcharlesjones.github.io/journal/james-stein-estimator.html |
| Last Crawled | 2026-01-11 00:28:56 (3 months ago) |
| First Indexed | 2021-04-04 23:32:36 (5 years ago) |
| HTTP Status Code | 200 |
| Content | |
| Meta Title | Andy Jones |
| Meta Description | The James-Stein estimator dominates the MLE by sharing information across seemingly unrelated variables. |
| Meta Canonical | null |
| Boilerpipe Text | The James-Stein estimator dominates the MLE by sharing information across seemingly unrelated variables.
Admissibility
Suppose we want to estimate a parameter (or parameter vector) $\theta$ in some statistical model. Broadly, we do this by constructing an "estimator" $\hat{\theta}(x)$, which is a function of the data $x$. Let $\theta^\star$ denote the true value of $\theta$ (the one that actually corresponds to the data-generating process).
Given a set of data observations $x$, we can assess the quality of the estimator using a loss function $\mathcal{L}(\theta^\star, \hat{\theta})$, which compares the true $\theta^\star$ to our estimate. Lower loss values typically correspond to a "better" estimator. Example loss functions are the squared (L2) error $\mathcal{L}(\theta^\star, \hat{\theta}) = ||\hat{\theta} - \theta^\star||_2^2$ and the L1 error $\mathcal{L}(\theta^\star, \hat{\theta}) = ||\hat{\theta} - \theta^\star||_1$.
If we want to assess the estimator over *all possible* data (not just one set of observations), we can compute the estimator's **risk**, which is the expectation of $\mathcal{L}$ over the data distribution $p(x | \theta^\star)$.
Specifically, given a loss function $\mathcal{L}$, the risk function is defined as
\[R(\theta^\star, \hat{\theta}) = \mathbb{E}_{p(x | \theta^\star)}[\mathcal{L}(\theta^\star, \hat{\theta}(x))].\]
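As a sanity check on this definition, the risk of an estimator can be approximated by Monte Carlo: draw many datasets from $p(x | \theta^\star)$, apply the estimator to each, and average the loss. The sketch below (assuming Python with NumPy; `mc_risk` is our own name, not from the post) does this for the squared error loss under a Gaussian model, where the MLE $\hat{\theta}(x) = x$ has exact risk $p\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_risk(estimator, theta_star, sigma=1.0, n_trials=100_000):
    """Monte Carlo estimate of the squared-error risk
    E_{p(x | theta*)}[ ||estimator(x) - theta*||_2^2 ]."""
    # Each row of X is one dataset: a single draw of the p-vector x.
    X = rng.normal(theta_star, sigma, size=(n_trials, theta_star.size))
    est = estimator(X)  # estimator maps each row to an estimate
    return np.mean(np.sum((est - theta_star) ** 2, axis=1))

theta_star = np.array([1.0, -2.0, 0.5])
risk_mle = mc_risk(lambda X: X, theta_star)  # MLE just returns the data
# Exact risk of the MLE here is p * sigma^2 = 3; the estimate should be close.
```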
Now, we can use the risk function to compare different estimators. Suppose we have two different estimators $\hat{\theta}^{(1)}$ and $\hat{\theta}^{(2)}$. For any true parameter value $\theta^\star$, we can then compute the risk of each, $R(\theta^\star, \hat{\theta}^{(1)})$ and $R(\theta^\star, \hat{\theta}^{(2)})$, and compare them.
Often, one estimator will have a lower risk for some values of $\theta^\star$ and a higher risk for others. However, we say that an estimator $\hat{\theta}^{(1)}$ **dominates** another estimator $\hat{\theta}^{(2)}$ if $\hat{\theta}^{(1)}$ doesn't have a higher risk for any $\theta^\star$ and has a lower risk for at least one value of $\theta^\star$. In other words, $\hat{\theta}^{(1)}$ dominates $\hat{\theta}^{(2)}$ if:
1. $R(\theta^\star, \hat{\theta}^{(1)}) \leq R(\theta^\star, \hat{\theta}^{(2)}) \;\;\; \forall \theta^\star \in \Theta$, and
2. $\exists \theta^\star \text{ such that } R(\theta^\star, \hat{\theta}^{(1)}) < R(\theta^\star, \hat{\theta}^{(2)})$.
Finally, an estimator is **admissible** if it's not dominated by any other estimator. Otherwise, we say it's **inadmissible**.
Stein phenomenon
Consider $p$ Gaussian random variables $X_1, \dots, X_p$, where
\[X_i \sim \mathcal{N}(\mu_i, \sigma^2), \;\;\; i = 1, \dots, p\]
where $\sigma^2$ is known, and we'd like to estimate each $\mu_i$.
Suppose our data consists of one observation of each variable, $x_1, \dots, x_p$. With so little information to work with, under the squared error loss, the least squares estimator (the maximum likelihood estimator, also called the "ordinary" or "usual" estimator) would simply estimate each mean as the observed value:
\[\hat{\mu}_i^{(LS)} = x_i.\]
However, Charles Stein discovered an interesting and surprising result:
> The least squares estimator is **inadmissible** with respect to the squared error loss when $p \geq 3$. In other words, the least squares estimator is **dominated** by another estimator.
Prof. John Carlos Baez summarizes this unintuitive result nicely in this Twitter thread:
> I have a Gaussian distribution like this in 2d. You know its variance is 1 but don't know its mean. I randomly pick a point (x₁, x₂) according to this distribution and tell you. You try to guess the mean.
>
> Your best guess is (x₁, x₂).
>
> But this is not true in 3d!!!
>
> (1/n) pic.twitter.com/pWPD8sFmZ6
>
> — John Carlos Baez (@johncarlosbaez) August 25, 2020
So, what's this other estimator?
James-Stein estimator
The James-Stein estimator (concocted by Charles Stein and Willard James) is
\[\hat{\mu}_i^{(JS)} = \left( 1 - \frac{(p - 2) \sigma^2}{||\mathbf{x}||_2^2} \right) x_i.\]
where $\mathbf{x}$ is the $p$-vector of observations.
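In code, the formula above is a one-liner. The following is a minimal sketch (assuming Python with NumPy; the name `james_stein` is our own) that applies the estimator to a single observed $p$-vector:

```python
import numpy as np

def james_stein(x, sigma2=1.0):
    """James-Stein estimate of the mean vector from one observation x (p >= 3)."""
    x = np.asarray(x, dtype=float)
    p = x.size
    # Shared shrinkage coefficient: depends on all coordinates through ||x||^2.
    shrinkage = 1.0 - (p - 2) * sigma2 / np.sum(x ** 2)
    return shrinkage * x

x = np.array([2.0, -1.0, 0.5, 1.5])  # p = 4, ||x||_2^2 = 7.5
mu_js = james_stein(x)               # each coordinate is scaled by 1 - 2/7.5
```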
Notice that this estimator is essentially multiplying each $x_i$ by a term $\left( 1 - \frac{(p - 2) \sigma^2}{||\mathbf{x}||_2^2} \right)$ that depends on the other variables as well.
To start building intuition about what this estimator is doing, consider the case when $p = 3$, and $\sigma^2 = 1$. Then the James-Stein estimator reduces to
\[\hat{\mu}_i^{(JS)} = \left( 1 - \frac{1}{||\mathbf{x}||_2^2} \right) x_i.\]
Since $||\mathbf{x}||_2^2 > 0$, we know that $\left( 1 - \frac{1}{||\mathbf{x}||_2^2} \right) < 1$. If $||\mathbf{x}||_2^2 > 1$, the estimator shrinks each $\hat{\mu}_i$ toward 0 relative to the least squares estimator. If $||\mathbf{x}||_2^2 < 1$, the coefficient becomes negative, so the estimator not only rescales the LS estimate but also flips its sign.
More generally, if $||\mathbf{x}||_2^2 > (p - 2) \sigma^2$, then the James-Stein estimator shrinks each $\hat{\mu}_i$ toward zero. In other words, if the overall (squared L2) magnitude of the data vector $\mathbf{x}$ exceeds the variance (multiplied by $p-2$), the James-Stein estimator "regularizes" the estimates $\hat{\mu}_i$ by shrinking them toward zero.
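The dominance claim itself can be checked numerically. The simulation below is a sketch (assuming Python with NumPy; the particular true mean vector is an arbitrary choice) comparing Monte Carlo risk estimates of the least squares and James-Stein estimators at $p = 10$:

```python
import numpy as np

rng = np.random.default_rng(1)
p, sigma2, n_trials = 10, 1.0, 200_000
mu = rng.normal(0.0, 1.0, size=p)  # an arbitrary fixed true mean vector

# Each row is one dataset: a single draw of (X_1, ..., X_p).
X = rng.normal(mu, np.sqrt(sigma2), size=(n_trials, p))

norms2 = np.sum(X ** 2, axis=1, keepdims=True)
js = (1.0 - (p - 2) * sigma2 / norms2) * X   # James-Stein estimates

risk_ls = np.mean(np.sum((X - mu) ** 2, axis=1))   # should be near p * sigma2 = 10
risk_js = np.mean(np.sum((js - mu) ** 2, axis=1))  # smaller than risk_ls
```

Rerunning this with other choices of `mu` shows the same ordering: the James-Stein risk never exceeds the least squares risk, with the largest gains when $||\boldsymbol{\mu}||$ is small.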
Another way to think about the James-Stein estimator is as a "shrinkage" estimator. A closely related variant nudges each individual $\hat{\mu}_i$ toward the grand mean of the data points, $\bar{x} = \frac{1}{p} \sum_{i=1}^p x_i$, rather than toward zero. Brad Efron and Carl Morris have a nice figure demonstrating this in their paper ["Stein's Paradox in Statistics"](https://statweb.stanford.edu/~ckirby/brad/other/Article1977.pdf). In the figure below, the top row shows the batting averages for 18 baseball players, and the bottom row shows the corresponding James-Stein estimate of each. Notice how the estimates move closer together, thereby sharing information across the players.

Relationship to empirical Bayes
The James-Stein estimator also has strong connections to the empirical Bayes methodology. Under an empirical Bayes framework, instead of completely marginalizing out the prior (as in a fully Bayesian treatment), we estimate the prior from the data.
For example, consider again $p$ Gaussian random variables $X_1, \dots, X_p$, where
\[X_i \sim \mathcal{N}(\mu_i, \sigma^2), \;\;\; i = 1, \dots, p\]
where $\sigma^2$ is known, and we'd like to estimate each $\mu_i$. Now, let's place a shared normal prior on each $\mu_i$:
\[\mu_i \sim \mathcal{N}(0, \tau^2), \;\;\; i = 1, \dots, p.\]
We could manually set $\tau^2$ to some value, e.g. $\tau^2 = 1$, or we could place another prior on it and integrate it out.
On the other hand, the empirical Bayes approach seeks to estimate $\tau^2$ from the data itself, leveraging information across observations.
First, notice that the posterior $p(\mu_i | X_i)$ is
\[p(\mu_i | X_i) = \frac{p(X_i | \mu_i) p(\mu_i)}{\int p(X_i | \mu_i) p(\mu_i) d\mu_i}.\]
After some arithmetic and extra work to solve the integral (e.g., through completing the square), we see that the posterior is also Gaussian:
\[\mu_i | X_i \sim \mathcal{N}\left(\frac{\tau^2}{\tau^2 + \sigma^2} X_i, \frac{\tau^2 \sigma^2}{\tau^2 + \sigma^2} \right).\]
So the "Bayes" estimator (if we just take the expectation of the posterior above) is
\[\hat{\mu}_i^{(\text{Bayes})} = \frac{\tau^2}{\tau^2 + \sigma^2} X_i.\]
Notice that this estimator is effectively shrinking $X_i$ toward zero.
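Because $(\mu_i, X_i)$ are jointly Gaussian under this model, the posterior mean is linear in $X_i$, and the shrinkage coefficient can be recovered empirically as a least-squares slope. Below is a small simulation sketch (assuming Python with NumPy), with $\tau^2 = 2$ and $\sigma^2 = 1$ so that $\frac{\tau^2}{\tau^2 + \sigma^2} = \frac{2}{3}$:

```python
import numpy as np

rng = np.random.default_rng(2)
tau2, sigma2, n = 2.0, 1.0, 500_000

mu = rng.normal(0.0, np.sqrt(tau2), size=n)  # mu_i ~ N(0, tau^2)
X = rng.normal(mu, np.sqrt(sigma2))          # X_i | mu_i ~ N(mu_i, sigma^2)

# The least-squares slope of mu on X (through the origin) estimates
# the posterior-mean coefficient tau^2 / (tau^2 + sigma^2) = 2/3.
slope = np.sum(mu * X) / np.sum(X ** 2)
```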
Now, what if we want a more principled way to set $\tau^2$ above, as opposed to just setting it to some value manually? One way to do this would be to look for an unbiased estimator $\hat{\alpha}$ of the "shrinkage coefficient" $\frac{\tau^2}{\tau^2 + \sigma^2}$ such that
\[\mathbb{E}_{p(X_i)}[\hat{\alpha}] = \frac{\tau^2}{\tau^2 + \sigma^2}.\]
Notice that the marginal distribution
\[p(X_i) = \int p(X_i | \mu_i) p(\mu_i) d\mu_i\]
is again Gaussian with
\[X_i \sim \mathcal{N}(0, \sigma^2 + \tau^2).\]
Recall that for a standard Gaussian random vector $\mathbf{z} = (z_1, \dots, z_p)^\top$ with $z_i \sim \mathcal{N}(0, 1)$, the squared L2-norm $||\mathbf{z}||_2^2$ follows a $\chi^2$ distribution with $p$ degrees of freedom. Furthermore, $1 / ||\mathbf{z}||_2^2$ follows an inverse-$\chi^2$ distribution, again with $p$ degrees of freedom, whose mean is $\frac{1}{p - 2}$.
In the case of our data, the marginal data vector $\mathbf{X} = (X_1, \dots, X_p)^\top$ has squared norm $||\mathbf{X}||_2^2$ that follows a scaled $\chi^2_p$ distribution (scaled by $\sigma^2 + \tau^2$). Consequently,
\[\frac{1}{||\mathbf{X}||_2^2} \sim \frac{1}{\tau^2 + \sigma^2} \cdot \text{inverse-}\chi^2_p.\]
Notice that the James-Stein estimator's coefficient (let's call it $\hat{\alpha}^{(JS)}$) is exactly such an unbiased estimator!
\begin{align} \mathbb{E}[\hat{\alpha}^{(JS)}] &= \mathbb{E}\left[\left( 1 - \frac{(p - 2) \sigma^2}{||\mathbf{X}||_2^2} \right)\right] \\ &= 1 - \mathbb{E}\left[\left(\frac{(p - 2) \sigma^2}{||\mathbf{X}||_2^2} \right)\right] \\ &= 1 - (p - 2) \sigma^2 \mathbb{E}\left[\frac{1}{||\mathbf{X}||_2^2} \right] \\ &= 1 - (p - 2) \sigma^2 \frac{1}{(\tau^2 + \sigma^2) (p - 2)} \\ &= 1 - \frac{\sigma^2}{\tau^2 + \sigma^2} \\ &= \frac{\tau^2}{\tau^2 + \sigma^2} \\ \end{align}
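This unbiasedness is easy to verify by simulation. The sketch below (assuming Python with NumPy) draws from the marginal $X_i \sim \mathcal{N}(0, \sigma^2 + \tau^2)$ and averages the James-Stein coefficient, which should match $\frac{\tau^2}{\tau^2 + \sigma^2} = \frac{2}{3}$ for the chosen values:

```python
import numpy as np

rng = np.random.default_rng(3)
p, tau2, sigma2, n_trials = 8, 2.0, 1.0, 500_000

# Marginally, each X_i ~ N(0, sigma^2 + tau^2), independently.
X = rng.normal(0.0, np.sqrt(sigma2 + tau2), size=(n_trials, p))

# James-Stein coefficient for each simulated data vector.
alpha_js = 1.0 - (p - 2) * sigma2 / np.sum(X ** 2, axis=1)

mean_alpha = np.mean(alpha_js)  # should be close to tau^2/(tau^2+sigma^2) = 2/3
```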
Thus, although Charles Stein didn't originally derive the James-Stein estimator this way, we see that it also arises as a particular case of performing empirical Bayes.
References
- Stein, Charles. "Inadmissibility of the usual estimator for the mean of a multivariate normal distribution." Proc. Third Berkeley Symp. Math. Statist. Prob. Vol. 1. 1956.
- Wikipedia entry on [admissibility](https://www.wikiwand.com/en/Admissible_decision_rule).
- Wikipedia entry on the [James-Stein estimator](https://www.wikiwand.com/en/James%E2%80%93Stein_estimator).
- Efron, Bradley, and Carl Morris. "Stein's paradox in statistics." Scientific American 236.5 (1977): 119-127.
- Professor Efron's [notes on the James-Stein estimator](https://statweb.stanford.edu/~ckirby/brad/LSI/chapter1.pdf).
- This [Cross Validated post](https://stats.stackexchange.com/questions/304308/why-is-the-james-stein-estimator-called-a-shrinkage-estimator) on why the James-Stein estimator is called a shrinkage estimator.
- [Blog post](https://austinrochford.com/posts/2013-11-30-steins-paradox-and-empirical-bayes.html) by Austin Rochford on the relationship between empirical Bayes and the James-Stein estimator. |
| Readable Markdown | null |
| ML Classification | |
| ML Categories | null |
| ML Page Types | null |
| ML Intent Types | null |
| Content Metadata | |
| Language | null |
| Author | Andy Jones |
| Publish Time | 2020-09-05 00:00:00 (5 years ago) |
| Original Publish Time | 2020-09-05 00:00:00 (5 years ago) |
| Republished | No |
| Word Count (Total) | 1,382 |
| Word Count (Content) | 1,368 |
| Links | |
| External Links | 10 |
| Internal Links | 3 |
| Technical SEO | |
| Meta Nofollow | No |
| Meta Noarchive | No |
| JS Rendered | No |
| Redirect Target | null |
| Performance | |
| Download Time (ms) | 55 |
| TTFB (ms) | 55 |
| Download Size (bytes) | 6,118 |
| Shard | 143 (laksa) |
| Root Hash | 2566890010099092343 |
| Unparsed URL | io,github!andrewcharlesjones,/journal/james-stein-estimator.html s443 |