# Not Quite the James-Stein Estimator

Source: https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/
If you study enough econometrics or statistics, you'll eventually hear someone mention "Stein's Paradox" or the "James-Stein Estimator". You've probably learned in your introductory econometrics course that ordinary least squares (OLS) is the best linear unbiased estimator (BLUE) in a linear regression model under the Gauss-Markov assumptions. The stipulations "linear" and "unbiased" are crucial here. If we remove them, it's possible to do better (maybe even *much better*) than OLS.[^1]

Stein's paradox is a famous example of this phenomenon, one that created much consternation among statisticians and fellow-travelers when it was first pointed out by Charles Stein in the mid-1950s. The example is interesting in its own right, but also has deep connections to ideas in Bayesian inference and machine learning, making it much more than a mere curiosity.

The supposed *paradox* is most simply stated by considering a special case of linear regression: that of estimating multiple unknown means. Efron & Morris (1977) introduce the basic idea as follows:

> A baseball player who gets seven hits in 20 official times at bat is said to have a batting average of .350. In computing this statistic we are forming an estimate of the player's true batting ability in terms of his observed average rate of success. Asked how well the player will do in his next 100 times at bat, we would probably predict 35 more hits. In traditional statistical theory it can be proved that no other estimation rule is uniformly better than the observed average. The paradoxical element in Stein's result is that it sometimes contradicts this elementary law of statistical theory. If we have three or more baseball players, and if we are interested in predicting future batting averages for each of them, then there is a procedure that is better than simply extrapolating from the three separate averages. Here "better" has a strong meaning. The statistician who employs Stein's method can expect to predict the future averages more accurately no matter what the true batting abilities of the players may be.
I first encountered Stein's Paradox in an offhand remark by my PhD supervisor. I dutifully looked it up in an attempt to better understand the point he had been making, but lacked sufficient understanding of decision theory at the time to see what the fuss was all about. The second time I encountered it, after I knew a bit more, it seemed astounding: almost like magic. I decided to include the topic in my Econ 722 course at Penn, but struggled to make it accessible to my students. A big problem, in my view, is that the proof (see lecture 1 or section 7.3) is ultimately a bit of a let-down: algebra, followed by repeated integration by parts, and then a fact about the existence of moments for an inverse-chi-squared random variable. It seems like a sterile technical exercise when in fact the result itself is deep, surprising, and important. As if a benign deity were keen on making my point for me, the Wikipedia article on the James-Stein Estimator is flagged as "may be too technical for readers to understand" at the time of this writing!

After six months of pondering, this post is my attempt to explain the James-Stein Estimator in a way that is accessible to a broad audience. The assumed background is minimal: just an introductory course in probability and statistics. I'll show how we can arrive at something that is *very nearly* the James-Stein estimator by following some very simple and natural intuition. After you understand my "not quite James-Stein" estimator, it's a short step to the real thing. So the "let-down" proof I mentioned before becomes merely a technical justification for a slight modification of a formula that is already intuitively compelling. As far as possible, I've tried to keep this post self-contained by introducing, or at least reviewing, key background material as we go along. The cost of this approach, unfortunately, is that the post is pretty long! I hope you'll soldier on to the end and that you'll find the payoff worth your time and effort.
As far as I know, the precise way that I motivate the James-Stein estimator in this post is new, but there are many other papers that aim to make sense of the supposed paradox in an intuitive way. In keeping with my injunction that you should always consider reading something else instead, here are a few references that you may find helpful. Efron & Morris (1977) is a classic article aimed at the general reader without a background in statistics. Stigler (1988) is a more technical but still accessible discussion of the topic, while Casella (1985) is a very readable paper that discusses the James-Stein estimator in the context of empirical Bayes. A less well-known paper that I found helpful is Ijiri & Leitch (1980), who consider the James-Stein estimator in a real-world setting, namely "Audit Sampling" in accounting. They discuss several interesting practical and philosophical issues, including the distinction between "composite" and "individual" risk that I'll pick up on below.
## Warm-up Exercise

This section provides some important background that we'll need to understand Stein's Paradox later in the post, reviewing the ideas of *bias*, *variance*, and *mean-squared error* along with introducing a very simple *shrinkage estimator*. To make these ideas as transparent as possible we'll start with a ridiculously simple problem. Suppose that you observe $X \sim \text{Normal}(\mu, 1)$, a single draw from a normal distribution with variance one and unknown mean $\mu$. Your task is to estimate $\mu$. This may strike you as a very silly problem: it only involves a single datapoint and we assume the variance of $X$ is one! But in fact there's nothing special about $n = 1$ and a variance of one: these merely make the notation simpler. If you prefer, you can think of $X$ as the sample mean of $n$ iid draws from a population with unknown mean $\mu$, where we've *rescaled* everything to have variance one. So how should we estimate $\mu$? A natural and reasonable idea is to use the sample mean, in this case $X$ itself. This is in fact the *maximum likelihood estimator* for $\mu$, so I'll define $\hat{\mu}_{\text{ML}} = X$. But is this estimator any good? And can we find something better?
### Review of Bias, Variance and MSE

The concepts of *bias* and *variance* are key ideas that we typically reach for when considering the quality of an estimator. To refresh your memory, *bias* is the difference between an estimator's expected value and the true value of the parameter being estimated, while *variance* is the expected squared difference between an estimator and its expected value. So if $\hat{\theta}$ is an estimator of some unknown parameter $\theta$, then $\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$ while $\text{Var}(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]$. A bias of zero means that an estimator is *correctly centered*: its expectation equals the truth. We say that such an estimator is *unbiased*.[^2] A small variance means that an estimator is *precise*: it doesn't "jump around" too much. Ideally we'd like an estimator that is correctly centered and precise. But it turns out that there is generally a *trade-off* between bias and variance: if you want to reduce one of them, you have to accept an increase in the other.
A common way of trading off bias and variance relies on a concept called *mean-squared error* (MSE), defined as the *sum* of the squared bias and the variance.[^3] In particular: $\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + \text{Bias}(\hat{\theta})^2$. Equivalently, we can write $\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$.[^4] To borrow some terminology from introductory microeconomics, you can think of MSE as the *negative* of a utility function over bias and variance. Both bias and variance are "bads" in that we'd rather have less of each. This formula expresses our *preferences* in terms of how much of one we'd be willing to accept in exchange for less of the other. Slightly foreshadowing something that will come later in this post, we can think of MSE as the average squared distance that an archer's arrows land from the bulls-eye. Smaller values of MSE are better: variance measures how closely the arrows cluster together while bias measures how far the center of the cluster is from the bulls-eye, as in the following diagram:
### A Shrinkage Estimator

Returning to our maximum likelihood estimator: it's unbiased, $\text{Bias}(\hat{\mu}_{\text{ML}}) = 0$, so $\text{MSE}(\hat{\mu}_{\text{ML}}) = \text{Var}(\hat{\mu}_{\text{ML}}) = 1$. Suppose that low MSE is what we're after. Is there any way to improve on the ML estimator? In other words, can we achieve an MSE that's lower than one? The answer turns out to be *yes*. Here's the idea. Suppose we had some reason to believe that the true mean $\mu$ isn't very large. Then perhaps we could try to adjust our maximum likelihood estimate by *shrinking* it slightly towards zero. One way to do this would be by taking a weighted average of the ML estimator and zero:
$$\hat{\mu}(\lambda) = (1 - \lambda) \times \hat{\mu}_{\text{ML}} + \lambda \times 0 = (1 - \lambda) X$$
for $0 \leq \lambda \leq 1$. The constant $(1 - \lambda)$ is called the "shrinkage factor" and controls how strongly the ML estimator gets pulled towards zero.[^5] We get a different estimator for every value of $\lambda$. If $\lambda = 0$ then we get the ML estimator back. If $\lambda = 1$ then we get a very silly estimator that ignores the data and simply reports zero no matter what! So let's see how the MSE depends on our choice of $\lambda$. Substituting the definition of $\hat{\mu}(\lambda)$ into the formulas for bias and variance gives:
$$\begin{aligned}
\text{Bias}[\hat{\mu}(\lambda)] &= E[(1 - \lambda)\hat{\mu}_{\text{ML}}] - \mu = (1 - \lambda)E[\hat{\mu}_{\text{ML}}] - \mu = (1 - \lambda)\mu - \mu = -\lambda\mu \\
\text{Var}[\hat{\mu}(\lambda)] &= \text{Var}[(1 - \lambda)\hat{\mu}_{\text{ML}}] = (1 - \lambda)^2\,\text{Var}[\hat{\mu}_{\text{ML}}] = (1 - \lambda)^2 \\
\text{MSE}[\hat{\mu}(\lambda)] &= \text{Var}[\hat{\mu}(\lambda)] + \text{Bias}[\hat{\mu}(\lambda)]^2 = (1 - \lambda)^2 + \lambda^2\mu^2
\end{aligned}$$
Unless $\lambda = 0$, the shrinkage estimator is *biased*. And while the MSE of the ML estimator is always one, regardless of the true value of $\mu$, the MSE of the shrinkage estimator *depends on the unknown parameter* $\mu$.

So why should we use a biased estimator? The answer is that by tolerating a small amount of bias we may be able to achieve a *larger* reduction in variance, resulting in a lower MSE compared to the higher-variance but unbiased ML estimator. A quick plot shows us that the shrinkage estimator *can indeed* have a lower MSE than the ML estimator, depending on the value of $\lambda$ and the true value of $\mu$:
```r
# Range of values for the unknown parameter mu
mu <- seq(-4, 4, length = 100)

# Try three different values of lambda
lambda1 <- 0.1
lambda2 <- 0.2
lambda3 <- 0.3

# Plot the MSE of the shrinkage estimator as a function of mu for all
# three values of lambda at once
matplot(mu, cbind((1 - lambda1)^2 + lambda1^2 * mu^2,
                  (1 - lambda2)^2 + lambda2^2 * mu^2,
                  (1 - lambda3)^2 + lambda3^2 * mu^2),
        type = 'l', lty = 1, lwd = 2,
        col = c('red', 'blue', 'green'),
        xlab = expression(mu), ylab = 'MSE',
        main = 'MSE of Shrinkage Estimator')

# Add legend
legend('topright', legend = c(expression(lambda == 0.1),
                              expression(lambda == 0.2),
                              expression(lambda == 0.3)),
       col = c('red', 'blue', 'green'), lty = 1, lwd = 2)

# Add dashed line for MSE of ML estimator
abline(h = 1, lty = 2, lwd = 2)
```
### Some Algebra

It's time for some algebra. If you're tempted to skip this, *please don't*: this section is a warm-up for our main event. If you thoroughly understand the mechanics of shrinkage in this simple example, everything that follows below will seem much more natural.
As seen from the plot above, the MSE of our shrinkage estimator (the solid lines) is lower than that of the ML estimator (the dashed line) provided that our chosen value of $\lambda$ isn't too large relative to the true value of $\mu$. With a bit of algebra, we can work out *precisely* how large $\lambda$ can be to make shrinkage worthwhile. Since $\text{MSE}[\hat{\mu}_{\text{ML}}] = 1$, by expanding and simplifying the expression for $\text{MSE}[\hat{\mu}(\lambda)]$ we see that $\text{MSE}[\hat{\mu}(\lambda)] < \text{MSE}[\hat{\mu}_{\text{ML}}]$ if and only if
$$\begin{aligned}
(1 - \lambda)^2 + \lambda^2\mu^2 &< 1 \\
1 - 2\lambda + \lambda^2 + \lambda^2\mu^2 &< 1 \\
\lambda^2(1 + \mu^2) - 2\lambda &< 0 \\
\lambda\left[\lambda(1 + \mu^2) - 2\right] &< 0.
\end{aligned}$$
Since $\lambda \geq 0$, the final inequality can only hold if the factor inside the square brackets is negative, i.e.
$$\lambda(1 + \mu^2) - 2 < 0 \iff \lambda < \frac{2}{1 + \mu^2}.$$
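As a quick numerical sanity check on this threshold (a sketch of my own, not from the original post; the value $\mu = 1.5$ is an arbitrary choice), we can evaluate the MSE formula on either side of $2/(1+\mu^2)$:

```r
# MSE of the shrinkage estimator (1 - lambda) * X when X ~ Normal(mu, 1)
shrinkage_mse <- function(lambda, mu) (1 - lambda)^2 + lambda^2 * mu^2

mu <- 1.5
threshold <- 2 / (1 + mu^2)  # largest lambda for which shrinkage beats ML

shrinkage_mse(0.9 * threshold, mu)  # below the threshold: less than 1
shrinkage_mse(threshold, mu)        # at the threshold: exactly 1
shrinkage_mse(1.1 * threshold, mu)  # above the threshold: greater than 1
```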
This shows that any choice of $\lambda$ between $0$ and $2/(1 + \mu^2)$ will give us a shrinkage estimator with an MSE less than one. To check our algebra, we can change the inequality to an equality and solve for $\mu$ to obtain the boundary of the region where shrinkage is better than ML:
$$\lambda(1 + \mu^2) - 2 = 0 \iff 1 + \mu^2 = 2/\lambda \iff \mu = \pm\sqrt{2/\lambda - 1}.$$
Adding these boundaries to a simplified version of our previous plot with only $\lambda = 0.3$, we see that everything works out correctly: the dashed red lines intersect the blue curve at the points where the MSE of the shrinkage estimator equals that of the ML estimator.
```r
# Plot the MSE of the shrinkage estimator as a function of mu for lambda = 0.3
lambda <- 0.3
plot(mu, (1 - lambda)^2 + lambda^2 * mu^2, type = 'l', lty = 1, lwd = 2,
     col = 'blue', xlab = expression(mu), ylab = 'MSE',
     main = 'Boundary of Region Where Shrinkage is Better than ML')

# Add dashed line for MSE of ML estimator
abline(h = 1, lty = 2, lwd = 2)

# Add boundaries of region where shrinkage is better than ML estimator
abline(v = c(sqrt(2 / lambda - 1), -sqrt(2 / lambda - 1)), lty = 3, lwd = 2,
       col = 'red')
```
But there's still more to learn! Suppose we wanted to take things *one step further* and find the *optimal* value of $\lambda$ for any given value of $\mu$. In other words, suppose we wanted the value of $\lambda$ that *minimizes* the MSE of our shrinkage estimator given a particular assumed value for $\mu$. Since $\text{MSE}[\hat{\mu}(\lambda)]$ is a quadratic function of $\lambda$, as shown above, this turns out to be a fairly straightforward calculation. Differentiating,
$$\begin{aligned}
\frac{d}{d\lambda}\text{MSE}[\hat{\mu}(\lambda)] &= \frac{d}{d\lambda}\left[(1 - \lambda)^2 + \lambda^2\mu^2\right] = -2(1 - \lambda) + 2\lambda\mu^2 = 2\left[\lambda(1 + \mu^2) - 1\right] \\
\frac{d^2}{d\lambda^2}\text{MSE}[\hat{\mu}(\lambda)] &= 2(1 + \mu^2) > 0
\end{aligned}$$
so there is a unique global minimum at $\lambda^* \equiv 1/(1 + \mu^2)$. This gives the *optimal* shrinkage factor in the sense that it minimizes the MSE of the shrinkage estimator. Substituting $\lambda^*$ into the expression for $\text{MSE}[\hat{\mu}(\lambda)]$ gives:
$$\begin{aligned}
\text{MSE}[\hat{\mu}(\lambda^*)] &= \left(1 - \frac{1}{1 + \mu^2}\right)^2 + \left(\frac{1}{1 + \mu^2}\right)^2 \mu^2 = \left(\frac{\mu^2}{1 + \mu^2}\right)^2 + \left(\frac{1}{1 + \mu^2}\right)^2 \mu^2 \\
&= \left(\frac{1}{1 + \mu^2}\right)^2 (\mu^4 + \mu^2) = \left(\frac{1}{1 + \mu^2}\right)^2 \mu^2(1 + \mu^2) = \frac{\mu^2}{1 + \mu^2} < 1.
\end{aligned}$$
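To double-check this result by brute force, here is a quick Monte Carlo sketch of my own (the choice $\mu = 2$ is arbitrary): simulate many draws of $X$, apply the optimal shrinkage factor, and compare the empirical MSE to $\mu^2/(1+\mu^2)$.

```r
set.seed(1)
mu <- 2
lambda_star <- 1 / (1 + mu^2)   # optimal shrinkage weight on zero
x <- mu + rnorm(1e6)            # one million draws of X ~ Normal(mu, 1)

# Empirical MSE of the optimally shrunk estimator (1 - lambda_star) * X
empirical_mse <- mean(((1 - lambda_star) * x - mu)^2)
theoretical_mse <- mu^2 / (1 + mu^2)

c(empirical = empirical_mse, theoretical = theoretical_mse)  # both close to 0.8
```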
## Stein's Paradox

### Recap

We're moments away from having all the ingredients we need to introduce Stein's Paradox! But first let's review what we've uncovered thus far. We've seen that the shrinkage estimator can improve on the ML estimator in terms of MSE provided that $\lambda$ is chosen judiciously: it needs to be between zero and $2/(1 + \mu^2)$. The optimal choice of $\lambda$, namely $\lambda^* = 1/(1 + \mu^2)$, gives an MSE of $\mu^2/(1 + \mu^2)$. This is always lower than one, the MSE of the ML estimator.

There's just one massive problem we've ignored this whole time: *we don't know the value of* $\mu$! As seen from the figure plotted above, the MSE curves for different values of $\lambda$ *cross each other*: the best one to use depends on the true value of $\mu$. This doesn't mean that all is lost. Perhaps in practice we have some outside information about the likely value of $\mu$ that could help guide our choice of $\lambda$. What it does mean is that there's no "one-size-fits-all" value.
### Admissibility

It's time to introduce a bit of technical vocabulary. We say that an estimator $\tilde{\theta}$ *dominates* another estimator $\hat{\theta}$ if $\text{MSE}[\tilde{\theta}] \leq \text{MSE}[\hat{\theta}]$ for *all* possible values of the parameter $\theta$ being estimated, and $\text{MSE}[\tilde{\theta}] < \text{MSE}[\hat{\theta}]$ for at least *one* possible value of $\theta$.[^6] In words, this means that it never makes sense to use $\hat{\theta}$ in preference to $\tilde{\theta}$. No matter what the true parameter value is, you can't do worse with $\tilde{\theta}$ and you might do better. An estimator that is *not dominated* by any other estimator is called *admissible*; an estimator that *is dominated* by some other estimator is called *inadmissible*. The concept of *admissibility* in decision theory is a bit like the concept of *Pareto efficiency* in microeconomics. An admissible estimator is only "good" in the sense that it doesn't leave any money on the table: there's no way to do better for one parameter value without doing worse for another. In a similar way, a Pareto efficient allocation in economics is one in which no individual can be made better off without making another person worse off.
It's quite challenging to prove, but in fact the ML estimator $\hat{\mu}_{\text{ML}} = X$ turns out to be admissible in our little example. So while we could potentially do better by using shrinkage, it's not a slam-dunk case. If we really have no idea how large $\mu$ is likely to be, the ML estimator is a reasonable choice. Because it's admissible, at the very least we know that there's no free lunch!
### A More General Example

Now let's make things a bit more interesting. For the rest of this post, suppose that we observe not a single draw $X$ from a $\text{Normal}(\mu, 1)$ distribution but a *collection* of $p$ independent draws from $p$ *different* normal distributions:
$$X_1, X_2, \dots, X_p \sim \text{independent Normal}(\mu_j, 1), \quad j = 1, \dots, p.$$
You can think of this as $p$ copies of our original problem: we observe $X_j \sim \text{Normal}(\mu_j, 1)$ and our task is to estimate $\mu_j$. The observations are all independent, and each comes from a distribution with a potentially *different mean*. At first glance it seems like these $p$ separate problems should have *absolutely nothing to do with each other*. And indeed the maximum likelihood estimator for the collection of $p$ means is simply $\hat{\mu}_{\text{ML}}(j) = X_j$. As above in our example with $p = 1$, the question is: how good is the ML estimator, and can we do any better?
### Composite MSE

But first things first: how can we evaluate the quality of $p$ estimators for $p$ different parameters *at the same time*? A common approach, and the one we will follow here, is to take the *sum* of the individual MSEs of each estimator, yielding a quantity called *composite MSE*. If $\hat{\mu}_1, \hat{\mu}_2, \dots, \hat{\mu}_p$ is a collection of estimators for each of the individual unknown means, then the composite MSE is defined as
$$\text{Composite MSE} \equiv \sum_{j=1}^{p} \text{MSE}(\hat{\mu}_j) = \sum_{j=1}^{p} \left[\text{Bias}(\hat{\mu}_j)^2 + \text{Var}(\hat{\mu}_j)\right] = \sum_{j=1}^{p} E\left[(\hat{\mu}_j - \mu_j)^2\right].$$
Adopting composite MSE as our measure of *good* performance means that we view each of the $p$ estimation problems as in some way "interchangeable": we're happy to accept a trade in which we do a slightly worse job estimating $\mu_j$ in exchange for doing a much better job estimating $\mu_k$. At the end of the post I'll say a few more words about this idea and when it may or may not be reasonable. But for the rest of the post, we will assume that our goal is to *minimize the composite MSE*. The concept of composite MSE will be crucial in understanding why the James-Stein estimator works the way it does.
### Stein's Paradox

Putting our new idea into practice, we see that the composite MSE of the ML estimator is $p$ regardless of the true values of the individual means $\mu_1, \dots, \mu_p$, since
$$\sum_{j=1}^{p} \text{MSE}[\hat{\mu}_{\text{ML}}(j)] = \sum_{j=1}^{p} \text{MSE}(X_j) = \sum_{j=1}^{p} \text{Var}(X_j) = p.$$
If the ML estimator is admissible, then there should be no other estimator that always has a composite MSE less than or equal to $p$ and sometimes has a composite MSE strictly less than $p$. I've already told you that this is true when $p = 1$. When $p = 2$ it's still true: the ML estimator remains admissible. But when $p \geq 3$ something very unexpected happens: it becomes possible to construct an estimator that *dominates* the ML estimator by using information from *all* of the observations $(X_1, \dots, X_p)$ to estimate $\mu_j$. This is in spite of the fact that there is *no obvious connection* between the observations. Again: they are all independent and come from distributions with different means!
The estimator that does the trick is the so-called "James-Stein Estimator" (JS), defined according to
$$\hat{\mu}_{\text{JS}}(j) = \left(1 - \frac{p - 2}{\sum_{k=1}^{p} X_k^2}\right) X_j.$$
This estimator dominates the ML estimator when $p \geq 3$ in that
$$\sum_{j=1}^{p} \text{MSE}[\hat{\mu}_{\text{JS}}(j)] \leq \sum_{j=1}^{p} \text{MSE}[\hat{\mu}_{\text{ML}}(j)] = p$$
for *all* possible values of the $p$ unknown means $\mu_j$, with strict inequality for at least *some* values. Taking a closer look at the formula, we see that the James-Stein estimator is just a *shrinkage* estimator applied to each of the $p$ means, namely
$$\hat{\mu}_{\text{JS}}(j) = (1 - \hat{\lambda}_{\text{JS}}) X_j, \qquad \hat{\lambda}_{\text{JS}} \equiv \frac{p - 2}{\sum_{k=1}^{p} X_k^2}.$$
The shrinkage factor in the James-Stein estimator depends on the number of means we're estimating, $p$, along with the *overall* sum of the squared observations. All else equal, the more parameters we need to estimate, the more we shrink each of them towards zero. And the farther the observations are from zero *overall*, the less we shrink *each of them* towards zero.
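The domination claim is easy to check numerically. Here is a small simulation sketch of my own (the choice $p = 10$ and the particular true means are arbitrary) comparing the empirical composite MSE of the ML and James-Stein estimators:

```r
set.seed(42)
p <- 10
nreps <- 5000
mu <- seq(-1, 1, length.out = p)   # arbitrary true means (hypothetical choice)

mse_ml <- mse_js <- numeric(nreps)
for (r in 1:nreps) {
  x <- mu + rnorm(p)                # X_j ~ Normal(mu_j, 1)
  lambda_js <- (p - 2) / sum(x^2)   # James-Stein shrinkage amount
  js <- (1 - lambda_js) * x
  mse_ml[r] <- sum((x - mu)^2)
  mse_js[r] <- sum((js - mu)^2)
}

mean(mse_ml)  # close to p = 10, as the theory predicts
mean(mse_js)  # substantially smaller than p
```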
Just like our simple shrinkage estimator from above, the James-Stein estimator achieves a lower MSE by tolerating a small bias in exchange for a larger reduction in variance, compared to the higher-variance but unbiased ML estimator. Unlike our simple shrinkage estimator, the James-Stein estimator uses the *data* to determine the shrinkage factor. And as long as $p \geq 3$ it is always *at least as good* as the ML estimator and sometimes *much better*. The *paradox* is that this seems impossible: how can information from *all* of the observations be useful when they come from *different* distributions with no obvious connection?
The rest of this post will *not* prove that the James-Stein estimator dominates the ML estimator. Instead it will try to convince you that there is some *very good intuition* behind the formula for the James-Stein estimator. By the end, I hope you'll feel that, far from seeming paradoxical, using *all* of the observations to determine the shrinkage factor for one particular $\mu_j$ makes perfect sense.
## Where does the James-Stein Estimator Come From?

### An Infeasible Estimator When $p = 2$
To start the ball rolling, let's *assume a can-opener*: suppose that we don't know any of the *individual* means $\mu_j$, but for some strange reason a benevolent deity has told us the value of their sum of squares:
$$c^2 \equiv \sum_{j=1}^{p} \mu_j^2.$$
It turns out that this is enough information to construct a shrinkage estimator that *always* has a lower composite MSE than the ML estimator. Let's see why this is the case. If $p = 1$, then telling you $c^2$ is the same as telling you $\mu^2$. Granted, knowledge of $\mu^2$ isn't as informative as knowledge of $\mu$. For example, if I told you that $\mu^2 = 9$ you couldn't tell whether $\mu = 3$ or $\mu = -3$. But, as we showed above, the optimal shrinkage estimator when $p = 1$ sets $\lambda^* = 1/(1 + \mu^2)$ and yields an MSE of $\mu^2/(1 + \mu^2) < 1$. Since $\lambda^*$ only depends on $\mu$ through $\mu^2$, we've *already shown* that knowledge of $c^2$ allows us to construct a shrinkage estimator that dominates the ML estimator when $p = 1$.
So what if $p$ equals 2? In this case, knowledge of $c^2 = \mu_1^2 + \mu_2^2$ is equivalent to knowing the *radius* of a circle centered at the origin in the $(\mu_1, \mu_2)$ plane where the two unknown means must lie. For example, if I told you that $c^2 = 1$ you would know that $(\mu_1, \mu_2)$ lies somewhere on a circle of radius one centered at the origin. As illustrated in the following plot, the points $(x_1, x_2)$ and $(y_1, y_2)$ would then be potential values of $(\mu_1, \mu_2)$, as would all other points on the blue circle.
So how can we construct a shrinkage estimator of $(\mu_1, \mu_2)$ with lower composite MSE than the ML estimator if $c^2$ is known? While there are other possibilities, the simplest would be to use the *same* shrinkage factor for each of the two coordinates. In other words, our estimator would be
$$\hat{\mu}_1(\lambda) = (1 - \lambda) X_1, \qquad \hat{\mu}_2(\lambda) = (1 - \lambda) X_2$$
for some $\lambda$ between zero and one. The composite MSE of this estimator is just the sum of the MSE of each *individual* component, so we can re-use our algebra from above to obtain
$$\begin{aligned}
\text{MSE}[\hat{\mu}_1(\lambda)] + \text{MSE}[\hat{\mu}_2(\lambda)] &= \left[(1 - \lambda)^2 + \lambda^2\mu_1^2\right] + \left[(1 - \lambda)^2 + \lambda^2\mu_2^2\right] \\
&= 2(1 - \lambda)^2 + \lambda^2(\mu_1^2 + \mu_2^2) \\
&= 2(1 - \lambda)^2 + \lambda^2 c^2.
\end{aligned}$$
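As a spot check of my own (the particular numbers are arbitrary), we can confirm this formula against a direct Monte Carlo estimate of the composite MSE:

```r
set.seed(3)
mu1 <- 0.7; mu2 <- -1.2   # arbitrary true means (hypothetical choice)
lambda <- 0.4
csq <- mu1^2 + mu2^2

# Simulate many (X1, X2) pairs and average the total squared estimation error
x1 <- mu1 + rnorm(1e5)
x2 <- mu2 + rnorm(1e5)
empirical <- mean(((1 - lambda) * x1 - mu1)^2 + ((1 - lambda) * x2 - mu2)^2)
theoretical <- 2 * (1 - lambda)^2 + lambda^2 * csq

c(empirical = empirical, theoretical = theoretical)  # nearly identical
```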
Notice that the composite MSE only depends on $(\mu_1, \mu_2)$ through their sum of squares, $c^2$. Differentiating with respect to $\lambda$, just as we did above in the $p = 1$ case,
$$\begin{aligned}
\frac{d}{d\lambda}\left[2(1 - \lambda)^2 + \lambda^2 c^2\right] &= -4(1 - \lambda) + 2\lambda c^2 = 2\left[\lambda(2 + c^2) - 2\right] \\
\frac{d^2}{d\lambda^2}\left[2(1 - \lambda)^2 + \lambda^2 c^2\right] &= 2(2 + c^2) > 0
\end{aligned}$$
so there is a unique global minimum at $\lambda^* = 2/(2 + c^2)$. Substituting this value of $\lambda$ into the expression for the composite MSE, a few lines of algebra give
$$\text{MSE}[\hat{\mu}_1(\lambda^*)] + \text{MSE}[\hat{\mu}_2(\lambda^*)] = 2\left(1 - \frac{2}{2 + c^2}\right)^2 + \left(\frac{2}{2 + c^2}\right)^2 c^2 = 2\left(\frac{c^2}{2 + c^2}\right).$$
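If you'd like to fill in the "few lines of algebra" yourself, one route is:

$$2\left(1 - \frac{2}{2 + c^2}\right)^2 + \left(\frac{2}{2 + c^2}\right)^2 c^2 = \frac{2c^4}{(2 + c^2)^2} + \frac{4c^2}{(2 + c^2)^2} = \frac{2c^2(c^2 + 2)}{(2 + c^2)^2} = \frac{2c^2}{2 + c^2},$$

using $1 - 2/(2 + c^2) = c^2/(2 + c^2)$ in the first step.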
Since $c^2/(2 + c^2) < 1$ for all $c^2 > 0$, the optimal shrinkage estimator *always* has a composite MSE lower than $2$, the composite MSE of the ML estimator. Strictly speaking this estimator is *infeasible* since we don't know $c^2$. But it's a crucial step on our journey: it makes the leap from applying shrinkage to an estimator of a *single* unknown mean to using the same idea for *more than one* unknown mean.
### A Simulation Experiment for $p = 2$

You may have already noticed that it's easy to generalize this argument to $p > 2$. But before we consider the general case, let's take a moment to understand the geometry of shrinkage estimation for $p = 2$ a bit more deeply. The nice thing about two-dimensional problems is that they're easy to plot. So here's a graphical representation of both the ML estimator and our infeasible optimal shrinkage estimator when $p = 2$. I've set the true, unknown, values of $\mu_1$ and $\mu_2$ to one, so the true value of $c^2$ is $2$, the optimal choice of $\lambda$ is $\lambda^* = 2/(2 + c^2) = 2/4 = 0.5$, and the shrinkage factor is $(1 - \lambda^*) = 0.5$. The following R code simulates our estimators and visualizes their performance, helping us see the shrinkage effect in action.
```r
set.seed(1983)
nreps <- 50
mu1 <- mu2 <- 1
x1 <- mu1 + rnorm(nreps)
x2 <- mu2 + rnorm(nreps)
csq <- mu1^2 + mu2^2
# Shrinkage factor (1 - lambda*) = csq / (2 + csq); here this equals 0.5
lambda <- csq / (2 + csq)

par(mfrow = c(1, 2))

# Left panel: ML Estimator
plot(x1, x2, main = 'MLE', pch = 20, col = 'black', cex = 2,
     xlab = expression(mu[1]), ylab = expression(mu[2]))
abline(v = mu1, lty = 1, col = 'red', lwd = 2)
abline(h = mu2, lty = 1, col = 'red', lwd = 2)

# Add MSE to the plot
text(x = 2, y = 3,
     labels = paste("MSE =", round(mean((x1 - mu1)^2 + (x2 - mu2)^2), 2)))

# Right panel: Shrinkage Estimator
plot(x1, x2, main = 'Shrinkage', xlab = expression(mu[1]),
     ylab = expression(mu[2]))
points(lambda * x1, lambda * x2, pch = 20, col = 'blue', cex = 2)
segments(x0 = x1, y0 = x2, x1 = lambda * x1, y1 = lambda * x2, lty = 2)
abline(v = mu1, lty = 1, col = 'red', lwd = 2)
abline(h = mu2, lty = 1, col = 'red', lwd = 2)
abline(v = 0, lty = 1, lwd = 2)
abline(h = 0, lty = 1, lwd = 2)

# Add MSE to the plot
text(x = 2, y = 3,
     labels = paste("MSE =", round(mean((lambda * x1 - mu1)^2 +
                                          (lambda * x2 - mu2)^2), 2)))
```
My plot has two panels. The left panel shows the raw data. Each black point is a pair $(X_1, X_2)$ of independent normal draws with means $(\mu_1 = 1, \mu_2 = 1)$ and variances $(1, 1)$. As such, each point is also the *ML estimate* (MLE) of $(\mu_1, \mu_2)$ based on $(X_1, X_2)$. The red cross shows the location of the true values of $(\mu_1, \mu_2)$, namely $(1, 1)$. There are 50 points in the plot, representing 50 replications of the simulation, each independent of the rest and with the same parameter values. This allows us to measure how close the ML estimator is to the true value of $(\mu_1, \mu_2)$ in repeated sampling, approximating the composite MSE.
The right panel is more complicated. This shows *both* the ML estimates (unfilled black circles) *and* the corresponding shrinkage estimates (filled blue circles), along with dashed lines connecting them. Each shrinkage estimate is constructed by "pulling" the corresponding MLE towards the origin by a shrinkage factor of $(1 - \lambda^*) = 0.5$. Thus, if a given unfilled black circle is located at $(X_1, X_2)$, the corresponding filled blue circle is located at $(0.5 X_1, 0.5 X_2)$. As in the left panel, the red cross in the right panel shows the true values of $(\mu_1, \mu_2)$, namely $(1, 1)$. The black cross, on the other hand, shows the point towards which the shrinkage estimator pulls the ML estimator, namely $(0, 0)$.
We see immediately that the ML estimator is *unbiased*: the black filled dots in the left panel (along with the unfilled ones in the right) are centered at $(1, 1)$. But the ML estimator is also *high-variance*: the black dots are quite spread out around $(1, 1)$. We can approximate the composite MSE of the ML estimator by computing the average squared Euclidean distance between the black points and the red cross.[^7] And in keeping with our theoretical calculations, the simulation gives a composite MSE of almost exactly 2 for the ML estimator.
In contrast, the optimal shrinkage estimator is *biased*: the filled blue dots in the right panel are centered somewhere between the red cross (the true means) and the origin. But the shrinkage estimator also has a lower variance: the filled blue dots are much closer together than the black ones. Even more importantly, they are on average closer to $(\mu_1, \mu_2)$, as indicated by the red cross and as measured by composite MSE. Our theoretical calculations showed that the composite MSE of the optimal shrinkage estimator equals $2c^2/(2 + c^2)$. When $c^2 = 2$, as in this case, we obtain $2 \times 2/(2 + 2) = 1$. Again, this is almost exactly what we see in the simulation.
If we had used more than 50 simulation replications, the composite MSE values would have been even closer to our theoretical predictions, at the cost of making the plot much harder to read! But I hope the key point is still clear: shrinkage *pulls* the MLE towards the origin, and can give a *much* lower composite MSE.
### An Infeasible Estimator: The General Case

Now that we understand the case of $p = 2$, the general case is a snap. Our shrinkage estimator of each $\mu_j$ will take the form
$$\hat{\mu}_j(\lambda) = (1 - \lambda) X_j, \quad j = 1, \dots, p$$
for some $\lambda$ between zero and one. To find the optimal choice of $\lambda$, we minimize
$$\sum_{j=1}^{p} \text{MSE}[\hat{\mu}_j(\lambda)] = \sum_{j=1}^{p} \left[(1 - \lambda)^2 + \lambda^2\mu_j^2\right] = p(1 - \lambda)^2 + \lambda^2 c^2$$
with respect to $\lambda$. Again, the key is that the composite MSE only depends on the unknown means through $c^2$. Using almost exactly the same calculations as above for the case of $p = 2$, we find that
$$\lambda^* = \frac{p}{p + c^2}, \qquad \sum_{j=1}^{p} \text{MSE}[\hat{\mu}_j(\lambda^*)] = p\left(\frac{c^2}{p + c^2}\right).$$
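For completeness, here is how those calculations run in the general case:

$$\frac{d}{d\lambda}\left[p(1 - \lambda)^2 + \lambda^2 c^2\right] = -2p(1 - \lambda) + 2\lambda c^2 = 2\left[\lambda(p + c^2) - p\right] = 0 \implies \lambda^* = \frac{p}{p + c^2},$$

and substituting $\lambda^*$ back into the composite MSE,

$$p\left(1 - \frac{p}{p + c^2}\right)^2 + \left(\frac{p}{p + c^2}\right)^2 c^2 = \frac{p c^4 + p^2 c^2}{(p + c^2)^2} = \frac{p c^2 (c^2 + p)}{(p + c^2)^2} = p\left(\frac{c^2}{p + c^2}\right).$$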
Since $c^2/(p + c^2) < 1$ for all $c^2 > 0$, the optimal shrinkage estimator *always* has a composite MSE less than $p$, the composite MSE of the ML estimator.
## Not Quite the James-Stein Estimator

The end is in sight! We've shown that if we knew the sum of squares of the unknown means, $c^2$, we could construct a shrinkage estimator that always has a lower composite MSE than the ML estimator. But we don't know $c^2$. So what can we do? To start off, re-write $\lambda^*$ as follows:
$$\lambda^* = \frac{p}{p + c^2} = \frac{1}{1 + c^2/p}.$$
This way of writing things makes it clear that it's not $c^2$ *per se* that matters but rather $c^2/p$. And this quantity is simply the *average* of the unknown squared means:

$$\frac{c^2}{p} = \frac{1}{p} \sum_{j=1}^{p} \mu_j^2.$$
So how could we learn $c^2/p$? An idea that immediately suggests itself is to estimate this quantity by replacing each unobserved $\mu_j$ with the corresponding observation $X_j$, in other words

$$\frac{1}{p} \sum_{j=1}^{p} X_j^2.$$
This is a good starting point, but we can do better. Since $X_j \sim \text{Normal}(\mu_j, 1)$, we see that

$$E\left[\frac{1}{p}\sum_{j=1}^{p} X_j^2\right] = \frac{1}{p}\sum_{j=1}^{p} E\left[X_j^2\right] = \frac{1}{p}\sum_{j=1}^{p} \left[\text{Var}(X_j) + E(X_j)^2\right] = \frac{1}{p}\sum_{j=1}^{p} \left(1 + \mu_j^2\right) = 1 + \frac{c^2}{p}.$$
This means that $\left(\sum_{j=1}^{p} X_j^2\right)/p$ will on average *overestimate* $c^2/p$ by one. But that's a problem that's easy to fix: simply subtract one! This is a rare situation in which there is *no bias-variance tradeoff*. Subtracting a constant, in this case one, doesn't contribute any additional variation while completely removing the bias. Plugging into our formula for $\lambda^*$, this suggests using the estimator

$$\hat{\lambda} \equiv \frac{1}{1 + \left[\left(\frac{1}{p}\sum_{j=1}^{p} X_j^2\right) - 1\right]} = \frac{1}{\frac{1}{p}\sum_{j=1}^{p} X_j^2} = \frac{p}{\sum_{j=1}^{p} X_j^2}$$

as our stand-in for the unknown $\lambda^*$, yielding a shrinkage estimator that I'll call "NQ" for "not quite" for reasons that will become apparent in a moment:

$$\hat{\mu}_{\text{NQ}}(j) = \left(1 - \frac{p}{\sum_{k=1}^{p} X_k^2}\right) X_j.$$
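The bias-correction step is easy to verify by simulation. This Python sketch (the particular values of the means are arbitrary illustrations) checks that the average of the squared observations overestimates $c^2/p$ by one, and then applies the resulting shrinkage factor:

```python
import numpy as np

rng = np.random.default_rng(42)

mu = np.array([0.5, -1.0, 2.0, 0.0, 1.5])  # illustrative unknown means
p = len(mu)
c2_over_p = np.mean(mu**2)                 # the target quantity c^2 / p

# 200,000 replications of the experiment: p independent Normal(mu_j, 1) draws
X = rng.normal(loc=mu, scale=1.0, size=(200_000, p))

naive = np.mean(X**2, axis=1)  # plug-in estimate; on average 1 + c^2/p
corrected = naive - 1          # unbiased estimate of c^2/p

print(np.mean(naive), 1 + c2_over_p)  # approximately equal
print(np.mean(corrected), c2_over_p)  # approximately equal

# The NQ estimator for the first replication, using lambda-hat = p / sum(X^2)
lam_hat = p / np.sum(X[0]**2)
mu_nq = (1 - lam_hat) * X[0]
print(mu_nq)  # the shrunken estimates of the five means
```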
Notice what's happening here: our optimal shrinkage estimator depends on $c^2/p$, something we can't observe. But we've constructed an *unbiased estimator* of this quantity by using *all of the observations* $X_j$. This is the resolution of the paradox discussed above: all of the observations contain information about $c^2$ since this is simply the sum of the squared means. And because we've chosen to minimize composite MSE, the optimal shrinkage factor only depends on the individual $\mu_j$ parameters through $c^2$! This is the sense in which it's possible to learn something useful about, say, $\mu_1$ from $X_2$ in spite of the fact that $E[X_2] = \mu_2$ may bear no relationship to $\mu_1$.
But wait a minute! This looks *suspiciously familiar*. Recall that the James-Stein estimator is given by

$$\hat{\mu}_{\text{JS}}(j) = \left(1 - \frac{p - 2}{\sum_{k=1}^{p} X_k^2}\right) X_j.$$
Just like the JS estimator, my NQ estimator shrinks each of the $p$ means towards zero by a factor that depends on the number of means we're estimating, $p$, and the overall sum of the squared observations. The key difference between JS and NQ is that JS uses $p - 2$ in the numerator instead of $p$. This means that NQ is a more "aggressive" shrinkage estimator than JS: it pulls the means towards zero by a larger amount than JS. This difference turns out to be crucial for proving that the JS estimator dominates the ML estimator. But when it comes to understanding why the JS estimator has the *form* that it does, I would argue that the difference is minor. If you want all the gory details of where that extra $-2$ comes from, along with the closely related issue of why $p \geq 3$ is crucial for JS to dominate the ML estimator, see [lecture 1](https://ditraglia.com/econ722/slides/econ722slides.pdf) or [section 7.3](https://ditraglia.com/econ722/main.pdf) from my Econ 722 teaching materials.
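To see how NQ and JS compare in practice, here is a small Monte Carlo sketch in Python. The true means and $p = 5$ are arbitrary choices of mine; the estimator formulas are the ones from the text:

```python
import numpy as np

rng = np.random.default_rng(7)

mu = np.array([0.5, -1.0, 2.0, 0.0, 1.5])  # illustrative true means, p = 5
p = len(mu)

n_sims = 200_000
X = rng.normal(loc=mu, scale=1.0, size=(n_sims, p))
ss = np.sum(X**2, axis=1, keepdims=True)   # sum of squared observations

estimators = {
    "MLE": X,
    "NQ": (1 - p / ss) * X,        # shrinks by p / sum(X_k^2)
    "JS": (1 - (p - 2) / ss) * X,  # shrinks by (p - 2) / sum(X_k^2)
}

# Composite MSE of each estimator, averaged over replications
results = {name: np.mean(np.sum((est - mu)**2, axis=1))
           for name, est in estimators.items()}
for name, cmse in results.items():
    print(f"{name}: composite MSE = {cmse:.3f}")
```

In this configuration both shrinkage estimators beat the MLE's composite MSE of $p = 5$, with JS ahead of the more aggressive NQ.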
## Conclusion
Before we conclude, there's one important caveat to bear in mind. In addition to the qualifications that NQ isn't *quite* JS, and that JS only dominates the MLE when $p \geq 3$, there's one more fundamental issue that could easily be missed. Our decision to minimize *composite* MSE is *absolutely crucial* to the reasoning given above. The magic of shrinkage depends on our willingness to accept a trade-off in which we do a worse job estimating one mean in exchange for doing a better job estimating another, as composite MSE imposes. Whether this makes sense in practice depends on the context.
If we're searching for a lost submarine in the ocean (a 3-dimensional problem), it makes perfect sense to be willing to be farther from the submarine in one dimension in exchange for being closer in another. That's because *Euclidean distance* is obviously what we're after here. But if instead we're estimating *teacher value-added* and the results of our estimation exercise will be used to determine which teachers lose their jobs, it's less clear that we should be willing to be farther from one teacher in exchange for being closer to another. Certainly that would be no consolation to someone who had been wrongly dismissed! If we were merely using this information to identify teachers who might need extra help, it's another story. But the point I'm trying to make here is that our choice of which criterion to minimize necessarily encodes our *values* in a particular problem.
But with that said, I hope you're satisfied that this extremely long post was worth the effort. Without using any fancy mathematics or statistical theory, we've managed to invent something that is *nearly identical* to the James-Stein estimator and thus to resolve Stein's paradox. We started by pretending that we knew $c^2$ and showed that this would allow us to derive a shrinkage estimator with a lower composite MSE than the ML estimator. Then we simply plugged in an unbiased estimator of the key unknown quantity: $c^2/p$. Because all the observations contain information about $c^2$, it makes sense that we should decide how much to shrink one component $X_j$ by using all of the others. At this point, I hope that the James-Stein estimator seems not only plausible but practically *obvious*, excepting of course that pesky $-2$ in the numerator.
If I ruled the universe, the Gauss-Markov Theorem would be demoted to a much less exalted status in econometrics teaching! ↩
Don't let words do your thinking for you: "bias" sounds like a very bad thing, like kicking puppies. But that's because the word "bias" has a negative connotation in English. In statistics, it's just a technical term for "not centered". An estimator can be biased and still be very good. Indeed the punchline of this post is that the James-Stein estimator is biased but can be much better than the obvious alternative! ↩
Why squared bias and not simply bias itself? The answer is units: bias is measured in the same units as the parameter being estimated while the variance is in squared units. It doesn't make sense to add things with different units, so we either have to square the bias or take the square root of the variance, i.e. replace it with the standard deviation. But bias can be negative, and we wouldn't want a large negative bias to cancel out a large standard deviation, so MSE squares the bias instead. ↩
See if you can prove this as a homework exercise! ↩
In Bayesian terms, we could view this "shrinkage" idea as calculating the posterior mean of $\mu$ conditional on our data $X$ under a normal prior. In this case $\lambda$ would equal $\tau/(1 + \tau)$ where $\tau$ is the *prior precision*, i.e. the reciprocal of the prior variance. But for this post we'll mainly stick to the Frequentist perspective. ↩
Strictly speaking all of this presupposes that we're working with squared-error loss so that MSE is the right thing to minimize. There are other loss functions we could have used instead and these would lead to different risk functions. But for the purposes of this post, I prefer to keep things simple. See [lecture 1](https://ditraglia.com/econ722/slides/econ722slides.pdf) of my Econ 722 slides for more detail. ↩
Remember that there are two equivalent definitions of MSE: bias squared plus variance on the one hand and expected squared distance from the truth on the other hand. ↩
# Not Quite the James-Stein Estimator
[Frank DiTraglia](https://www.econometrics.blog/author/frank-ditraglia/)
Last updated on Aug 10, 2024 33 min read [shrinkage](https://www.econometrics.blog/category/shrinkage/), [decision theory](https://www.econometrics.blog/category/decision-theory/)
If you study enough econometrics or statistics, youāll eventually hear someone mention āSteinās Paradoxā or the [āJames-Stein Estimatorā](https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator). Youāve probably learned in your introductory econometrics course that ordinary least squares (OLS) is the [best linear unbiased estimator](https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem) (BLUE) in a linear regression model under the Gauss-Markov assumptions. The stipulations ālinearā and āunbiasedā are crucial here. If we remove them, itās possible to do betterāmaybe even *much better*āthan OLS.[1](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fn1) Steinās paradox is a famous example of this phenomenon, one that created much consternation among statisticians and fellow-travelers when it was first pointed out by [Charles Stein](https://en.wikipedia.org/wiki/Charles_M._Stein) in the mid-1950s. The example is interesting in its own right, but also has deep connections to ideas in Bayesian inference and machine learning making it much more than a mere curiosity.
The supposed [paradox](https://youtu.be/XXhJKzI1u48?si=cS--uLd09_JnAXdr) is most simply stated by considering a special case of linear regressionāthat of estimating multiple unknown means. [Efron & Morris (1977)](https://www.jstor.org/stable/24954030) introduce the basic idea as follows:
> A baseball player who gets seven hits in 20 official times at bat is said to have a batting average of .350. In computing this statistic we are forming an estimate of the playerās true batting ability in terms of his observed average rate of success. Asked how well the player will do in his next 100 times at bat, we would probably predict 35 more hits. In traditional statistical theory it can be proved that no other estimation rule is uniformly better than the observed average. The paradoxical element in Steinās result is that it sometimes contradicts this elementary law of statistical theory. If we have three or more baseball players, and if we are interested in predicting future batting averages for each of them, then there is a procedure that is better than simply extrapolating from the three separate averages. Here ābetterā has a strong meaning. The statistician who employs Steinās method can expect to predict the future averages more accurately no matter what the true batting abilities of the players may be.
I first encountered Steinās Paradox in an offhand remark by my PhD supervisor. I dutifully looked it up in an attempt to better understand the point he had been making, but lacked sufficient understanding of decision theory at the time to see what the fuss was all about. The second time I encountered it, after I knew a bit more, it seemed astounding: almost like magic. I decided to include the topic in my [Econ 722](https://ditraglia.com/econ722) course at Penn, but struggled to make it accessible to my students. A big problem, in my view, is that the proofāsee [lecture 1](https://ditraglia.com/econ722/slides/econ722slides.pdf) or [section 7.3](https://ditraglia.com/econ722/main.pdf)āis ultimately a bit of a let-down: algebra, followed by repeated integration by parts, and then a fact about the existence of moments for an [inverse-chi-squared random variable](https://en.wikipedia.org/wiki/Inverse-chi-squared_distribution). It seems like a sterile technical exercise when in fact that result itself is deep, surprising, and important. As if a benign deity were keen on making my point for me, the wikipedia article on the [James-Stein Estimator](https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator) is flagged as āmay be too technical for readers to understandā at the time of this writing\!
After six months of pondering, this post is my attempt to explain the James-Stein Estimator in a way that is accessible to a broad audience. The assumed background is minimal: just an introductory course in probability and statistics. I'll show how we can arrive at something that is *very nearly* the James-Stein estimator by following some very simple and natural intuition. After you understand my "not quite James-Stein" estimator, it's a short step to the real thing. So the "let-down" proof I mentioned before becomes merely a technical justification for a slight modification of a formula that is already intuitively compelling. As far as possible, I've tried to keep this post self-contained by introducing, or at least reviewing, key background material as we go along. The cost of this approach, unfortunately, is that the post is pretty long! I hope you'll soldier on to the end and that you'll find the payoff worth your time and effort.
As far as I know, the precise way that I motivate the James-Stein estimator in this post is new, but there are many other papers that aim to make sense of the supposed paradox in an intuitive way. In keeping with my injunction that you should always consider [reading something else instead](https://www.econometrics.blog/post/how-to-read-an-econometrics-paper/), here are a few references that you may find helpful. [Efron & Morris (1977)](https://www.jstor.org/stable/24954030) is a classic article aimed at the general reader without a background in statistics. [Stigler (1988)](https://projecteuclid.org/journals/statistical-science/volume-5/issue-1/The-1988-Neyman-Memorial-Lecture--A-Galtonian-Perspective-on/10.1214/ss/1177012274.full) is a more technical but still accessible discussion of the topic, while [Casella (1985)](https://www.jstor.org/stable/2682801) is a very readable paper that discusses the James-Stein estimator in the context of empirical Bayes. A less well-known paper that I found helpful is [Ijiri & Leitch (1980)](https://www.jstor.org/stable/2490394), who consider the James-Stein estimator in a real-world setting, namely "Audit Sampling" in accounting. They discuss several interesting practical and philosophical issues, including the distinction between "composite" and "individual" risk that I'll pick up on below.
## Warm-up Exercise
This section provides some important background that we'll need in order to understand Stein's Paradox later in the post, reviewing the ideas of **bias**, **variance** and **mean-squared error** along with introducing a very simple **shrinkage estimator**. To make these ideas as transparent as possible we'll start with a ridiculously simple problem. Suppose that you observe X ~ Normal(μ, 1), a single draw from a normal distribution with variance one and unknown mean μ. Your task is to estimate μ. This may strike you as a very silly problem: it only involves a single datapoint and we assume the variance of X is one! But in fact there's nothing special about n = 1 and a variance of one: these merely make the notation simpler. If you prefer, you can think of X as the sample mean of n iid draws from a population with unknown mean μ where we've *rescaled* everything to have variance one. So how should we estimate μ? A natural and reasonable idea is to use the sample mean, in this case X itself. This is in fact the [maximum likelihood estimator](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) for μ, so I'll define μ̂_ML = X. But is this estimator any good? And can we find something better?
### Review of Bias, Variance and MSE
The concepts of *bias* and *variance* are key ideas that we typically reach for when considering the quality of an estimator. To refresh your memory, *bias* is the difference between an estimator's expected value and the true value of the parameter being estimated, while *variance* is the expected squared difference between an estimator and its expected value. So if θ̂ is an estimator of some unknown parameter θ, then Bias(θ̂) = E[θ̂] − θ while Var(θ̂) = E[(θ̂ − E[θ̂])²]. A bias of zero means that an estimator is *correctly centered*: its expectation equals the truth. We say that such an estimator is *unbiased*.[2](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fn2) A small variance means that an estimator is *precise*: it doesn't "jump around" too much. Ideally we'd like an estimator that is correctly centered and precise. But it turns out that there is generally a *trade-off* between bias and variance: if you want to reduce one of them, you have to accept an increase in the other.
A common way of trading off bias and variance relies on a concept called *mean-squared error* (MSE) defined as the *sum* of the squared bias and the variance.[3](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fn3) In particular: MSE ( Īø ^ ) \= Var ( Īø ^ ) \+ Bias ( Īø ^ ) 2. Equivalently, we can write MSE ( Īø ^ ) \= E \[ ( Īø ^ ā Īø ) 2 \].[4](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fn4) To borrow some terminology from introductory microeconomics, you can think of MSE as the *negative* of a utility function over bias and variance. Both bias and variance are ābadsā in that weād rather have less rather than more of each. This formula expresses our *preferences* in terms of how much of one weād be willing to accept in exchange for less of the other. Slightly foreshadowing something that will come later in this post, we can think of MSE as the square of the average distance that an archerās arrows land from the bulls-eye. Smaller values of MSE are better: variance measures how closely the arrows cluster together while bias measures how far the center of the cluster is from the bulls-eye, as in the following diagram:

### A Shrinkage Estimator
Returning to our maximum likelihood estimator: itās unbiased, Bias ( μ ^ ML ) \= 0, so MSE ( μ ^ ML ) \= Var ( μ ^ ML ) \= 1. Suppose that low MSE is what weāre after. Is there any way to improve on the ML estimator? In other words, can we achieve an MSE thatās lower than one? The answer turns out to be *yes*. Hereās the idea. Suppose we had some reason to believe that the true mean μ isnāt very large. Then perhaps we could try to adjust our maximum likelihood estimate by *shrinking* slightly towards zero. One way to do this would be by taking a weighted average of the ML estimator and zero: μ ^ ( Ī» ) \= ( 1 ā Ī» ) à μ ^ ML \+ Ī» Ć 0 \= ( 1 ā Ī» ) X for 0 ⤠λ ⤠1. The constant ( 1 ā Ī» ) is called the āshrinkage factorā and controls how the ML estimator gets pulled towards zero.[5](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fn5) We get a different estimator for every value of Ī». If Ī» \= 0 then we get the ML estimator back. If Ī» \= 1 then we get a very silly estimator that ignores the data and simply reports zero no matter what! So letās see how the MSE depends on our choice of Ī». Substituting the definition of μ ^ ( Ī» ) into the formulas for bias and variance gives: Bias \[ μ ^ ( Ī» ) \] \= E \[ ( 1 ā Ī» ) μ ^ ML \] ā μ \= ( 1 ā Ī» ) E \[ μ ^ ML \] ā μ \= ( 1 ā Ī» ) μ ā μ \= ā Ī» μ Var \[ μ ^ ( Ī» ) \] \= Var \[ ( 1 ā Ī» ) μ ^ ML \] \= ( 1 ā Ī» ) 2 Var \[ μ ^ ML \] \= ( 1 ā Ī» ) 2 MSE \[ μ ^ ( Ī» ) \] \= Var \[ μ ^ ( Ī» ) \] \+ Bias \[ μ ^ ( Ī» ) \] 2 \= ( 1 ā Ī» ) 2 \+ Ī» 2 μ 2 Unless Ī» \= 0, the shrinkage estimator is *biased*. And while the MSE of the ML estimator is always one, regardless of the true value of μ, the MSE of the shrinkage estimator *depends on the unknown parameter* μ.
So why should we use a biased estimator? The answer is that by tolerating a small amount of bias we may be able to achieve a *larger* reduction in variance, resulting in a lower MSE compared to the higher variance but unbiased ML estimator. A quick plot shows us that the shrinkage estimator *can indeed* have a lower MSE than the ML estimator depending on the value of λ and the true value of μ:
```
# Range of values for the unknown parameter mu
mu <- seq(-4, 4, length = 100)
# Try three different values of lambda
lambda1 <- 0.1
lambda2 <- 0.2
lambda3 <- 0.3
# Plot the MSE of the shrinkage estimator as a function of mu for all
# three values of lambda at once
matplot(mu, cbind((1 - lambda1)^2 + lambda1^2 * mu^2,
(1 - lambda2)^2 + lambda2^2 * mu^2,
(1 - lambda3)^2 + lambda3^2 * mu^2),
type = 'l', lty = 1, lwd = 2,
col = c('red', 'blue', 'green'),
xlab = expression(mu), ylab = 'MSE',
main = 'MSE of Shrinkage Estimator')
# Add legend
legend('topright', legend = c(expression(lambda == 0.1),
expression(lambda == 0.2),
expression(lambda == 0.3)),
col = c('red', 'blue', 'green'), lty = 1, lwd = 2)
# Add dashed line for MSE of ML estimator
abline(h = 1, lty = 2, lwd = 2)
```

### Some Algebra
Itās time for some algebra. If youāre tempted to skip this *please donāt*: this section is a warm-up for our main event. If you thoroughly understand the mechanics of shrinkage in this simple example, everything that follows below will seem much more natural.
As seen from the plot above, the MSE of our shrinkage estimator (the solid lines) is lower than that of the ML estimator (the dashed line) provided that our chosen value of Ī» isnāt too large relative to the true value of μ. With a bit of algebra, we can work out *precisely* how large Ī» can be to make shrinkage worthwhile. Since MSE \[ μ ^ ML \] \= 1, by expanding and simplifying the expression for MSE \[ μ ^ ( Ī» ) \] we see that MSE \[ μ ^ ( Ī» ) \] \< MSE \[ μ ^ ML \] if and only if ( 1 ā Ī» ) 2 \+ Ī» 2 μ 2 \< 1 1 ā 2 Ī» \+ Ī» 2 \+ Ī» 2 μ 2 \< 1 Ī» 2 ( 1 \+ μ 2 ) ā 2 Ī» \< 0 Ī» \[ Ī» ( 1 \+ μ 2 ) ā 2 \] \< 0\. Since Ī» ā„ 0, the final inequality can only hold if the factor inside the square brackets is negative, i.e. Ī» ( 1 \+ μ 2 ) ā 2 \< 0 Ī» \< 2 1 \+ μ 2 . This shows that any choice of Ī» between 0 and 2 / ( 1 \+ μ 2 ) will give us a shrinkage estimator with an MSE less than one. To check our algebra, we can change the inequality to an equality and solve for μ to obtain the boundary of the region where shrinkage is better than ML: Ī» ( 1 \+ μ 2 ) ā 2 \= 0 1 \+ μ 2 \= 2 / Ī» μ \= ± 2 / Ī» ā 1 . Adding these boundaries to a simplified version of our previous plot with only Ī» \= 0\.3 we see that everything works out correctly: the dashed red lines intersect the blue curve at the points where the MSE of the shrinkage estimator equals that of the ML estimator.
```
# Plot the MSE of the shrinkage estimator as a function of mu for lambda = 0.3
lambda <- 0.3
plot(mu, (1 - lambda)^2 + lambda^2 * mu^2, type = 'l', lty = 1, lwd = 2,
col = 'blue', xlab = expression(mu), ylab = 'MSE',
main = 'Boundary of Region Where Shrinkage is Better than ML')
# Add dashed line for MSE of ML estimator
abline(h = 1, lty = 2, lwd = 2)
# Add boundaries of region where shrinkage is better than ML estimator
abline(v = c(sqrt(2/lambda - 1), -sqrt(2/lambda - 1)), lty = 3, lwd = 2,
col = 'red')
```

But thereās still more to learn! Suppose we wanted to take things *one step further* and find the *optimal* value of Ī» for any given value of μ. In other words, suppose we wanted the value of Ī» that *minimizes* the MSE of our shrinkage estimator given a particular assumed value for μ. Since MSE \[ μ ^ ( Ī» ) \] is a quadratic function of Ī», as shown above, this turns out to be a fairly straightforward calculation. Differentiating, d d Ī» MSE \[ μ ^ ( Ī» ) \] \= d d Ī» \[ ( 1 ā Ī» ) 2 \+ Ī» 2 μ 2 \] \= ā 2 ( 1 ā Ī» ) \+ 2 Ī» μ 2 \= 2 \[ Ī» ( 1 \+ μ 2 ) ā 1 \] d 2 d Ī» 2 MSE \[ μ ^ ( Ī» ) \] \= 2 ( 1 \+ μ 2 ) \> 0 so there is a unique global minimum at Ī» ā ā” 1 / ( 1 \+ μ 2 ). This gives the *optimal* shrinkage factor in the sense that it minimizes the MSE of the shrinkage estimator. Substituting Ī» ā into the expression for MSE \[ μ ^ ( Ī» ) \] gives: MSE \[ μ ^ ( Ī» ā ) \] \= ( 1 ā 1 1 \+ μ 2 ) 2 \+ ( 1 1 \+ μ 2 ) 2 μ 2 \= ( μ 2 1 \+ μ 2 ) 2 \+ ( 1 1 \+ μ 2 ) 2 μ 2 \= ( 1 1 \+ μ 2 ) 2 ( μ 4 \+ μ 2 ) \= ( 1 1 \+ μ 2 ) 2 μ 2 ( 1 \+ μ 2 ) \= μ 2 1 \+ μ 2 \< 1\.
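The algebra above is easy to spot-check with a quick Monte Carlo. The post's own code is in R; here is a Python sketch (the values μ = 1.5 and λ = 0.3 are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)

mu = 1.5  # illustrative true mean
X = rng.normal(mu, 1.0, 1_000_000)

def mse(lam):
    """Monte Carlo MSE of the shrinkage estimator (1 - lam) * X."""
    return np.mean(((1 - lam) * X - mu)**2)

# Check MSE[mu-hat(lam)] = (1 - lam)^2 + lam^2 * mu^2 at lam = 0.3
print(mse(0.3), (1 - 0.3)**2 + 0.3**2 * mu**2)  # both close to 0.6925

# The optimal lam* = 1 / (1 + mu^2) attains MSE = mu^2 / (1 + mu^2) < 1
lam_star = 1 / (1 + mu**2)
print(mse(lam_star), mu**2 / (1 + mu**2))       # both close to 0.6923
```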
## Steinās Paradox
### Recap
Weāre moments away from having all the ingredients we need to introduce Steinās Paradox! But first letās review what weāve uncovered thus far. Weāve seen that the shrinkage estimator can improve on the ML estimator in terms of MSE provided that Ī» is chosen judiciously: it needs to be between zero and 2 / ( 1 \+ μ 2 ). The optimal choice of Ī», namely Ī» ā \= 1 / ( 1 \+ μ 2 ), gives an MSE of μ 2 / ( 1 \+ μ 2 ). This is always lower than one, the MSE of the ML estimator.
Thereās just one massive problem weāve ignored this whole time: **we donāt know the value of** μ! As seen from the figure plotted above, the MSE curves for different values of Ī» *cross each other*: the best one to use depends on the true value of μ. This doesnāt mean that all is lost. Perhaps in practice we have some outside information about the likely value of μ that could help guide our choice of Ī». What it does mean is that thereās no āone-size-fits-allā value.
### Admissibility
Itās time to introduce a bit of technical vocabulary. We say that an estimator Īø ~ **dominates** another estimator Īø ^ if MSE \[ Īø ~ \] ⤠MSE \[ Īø ^ \] for *all* possible values of the parameter Īø being estimated and MSE \[ Īø ~ \] \< MSE \[ Īø ^ \] for at least *one* possible value of Īø.[6](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fn6) In words, this means that it never makes sense to use Īø ^ in preference to Īø ~. No matter what the true parameter value is, you canāt do worse with Īø ~ and you might do better. An estimator that is *not dominated* by any other estimator is called **admissible**; an estimator that *is dominated* by some other estimator is called **inadmissible**. The concept of *admissibility* in decision theory is a bit like the concept of [Pareto efficiency](https://en.wikipedia.org/wiki/Pareto_efficiency) in microeconomics. An admissible estimator is only āgoodā in the sense that it doesnāt leave any money on the table: thereās no way to do better for one parameter value without doing worse for another. In a similar way, a Pareto efficient allocation in economics is one in which no individual can be made better off without making another person worse off.
Itās quite challenging to prove, but in fact the ML estimator Īø ^ M L \= X turns out to be admissible in our little example. So while we could potentially do better by using shrinkage, itās not a slam-dunk case. If we really have no idea of how large μ is likely to be, the ML estimator is a reasonable choice. Because itās admissible, at the very least we know that thereās no free lunch\!
### A More General Example
Now letās make things a bit more interesting. For the rest of this post, suppose that we observe not a single draw X from a Normal ( μ , 1 ) distribution but a *collection* of p independent draws from p *different* normal distributions: X 1 , X 2 , . . . , X p ā¼ independent Normal ( μ j , 1 ) , j \= 1 , . . . , p . You can think of this as p copies of our original problem: we observe X j ā¼ Normal ( μ j , 1 ) and our task is to estimate μ j. The observations are all independent, and each comes from a distribution with a potentially **different mean**. At first glance it seems like these p separate problems should have *absolutely nothing to do with each other*. And indeed the maximum likelihood estimator for the collection of p means is simply μ ^ ML ( j ) \= X j. As above in our example with p \= 1, the question is: how good is the ML estimator, and can we do any better?
### Composite MSE
But first things first: how can we evaluate the quality of p estimators for p different parameters *at the same time*? A common approach, and the one we will follow here, is to take the *sum* of the individual MSEs of each estimator, yielding a quantity called **composite MSE**. If μ ^ 1 , μ ^ 2 , ⦠, μ ^ p is a collection of estimators for each of the individual unknown means, then the composite MSE is defined as Composite MSE ā” ā j \= 1 p MSE ( μ ^ j ) \= ā j \= 1 p \[ Bias ( μ ^ j ) 2 \+ Var ( μ ^ j ) \] \= ā j \= 1 p E \[ ( μ ^ j ā μ j ) 2 \] . Adopting composite MSE as our measure of *good* performance means that we view each of the p estimation problems as in some way āinterchangeableāāweāre happy to accept a trade in which we do a slightly worse job estimating μ j in exchange for doing a much better job estimating μ k. At the end of the post Iāll say a few more words about this idea and when it may or may not be reasonable. But for the rest of the post, we will assume that our goal is to **minimize the composite MSE**. The concept of composite MSE will be crucial in understanding why the James-Stein estimator works the way it does.
### Steinās Paradox
Putting our new idea into practice, we see that the composite MSE of the ML estimator is p regardless of the true values of the individual means μ 1 , ⦠, μ p since ā j \= 1 p MSE \[ μ ^ ML ( j ) \] \= ā j \= 1 p MSE ( X j ) \= ā j \= 1 p Var ( X j ) \= p . If the ML estimator is admissible, then there should be no other estimator that always has an MSE less than or equal to p and sometimes has an MSE strictly less than p. Iāve already told you that this is true when p \= 1. When p \= 2 itās still true: the ML estimator remains admissible. But when p ā„ 3 something very unexpected happens: it becomes possible to construct an estimator that **dominates** the ML estimator by using information from *all* of the ( X 1 , . . . , X p ) observations to estimate μ j. This is spite of the fact that there is *no obvious connection* between the observations. Again: they are all independent and come from distributions with different means\!
The estimator that does the trick is the so-called "James-Stein Estimator" (JS), defined according to μ̂_JS(j) = (1 − (p − 2)/∑_{k=1}^p X_k²) X_j. This estimator dominates the ML estimator when p ≥ 3 in that
ā j \= 1 p MSE \[ μ ^ JS ( j ) \] ⤠ā j \= 1 p MSE \[ μ ^ ML ( j ) \] \= p for *all* possible values of the p unknown means μ j with strict inequality for at least *some* values. Taking a closer look at the formula, we see that the James-Stein estimator is just a *shrinkage* estimator applied to each of the p means, namely μ ^ JS ( j ) \= ( 1 ā Ī» ^ JS ) X j , Ī» ^ JS ā” p ā 2 ā k \= 1 p X k 2 . The shrinkage factor in the James-Stein estimator depends on the number of means weāre estimating, p, along with the *overall* sum of the squared observations. All else equal, the more parameters we need to estimate, the more we shrink each of them towards zero. And the farther the observations are from zero *overall*, the less we shrink *each of them* towards zero.
Just like our simple shrinkage estimator from above, the James-Stein estimator achieves a lower MSE by tolerating a small bias in exchange for a larger reduction in variance, compared to the higher-variance but unbiased ML estimator. Unlike our simple shrinkage estimator, the James-Stein estimator uses the *data* to determine the shrinkage factor. And as long as p ≥ 3 it is always *at least as good* as the ML estimator and sometimes *much better*. The **paradox** is that this seems impossible: how can information from *all* of the observations be useful when they come from *different* distributions with no obvious connection?
The rest of this post will *not* prove that the James-Stein estimator dominates the ML estimator. Instead it will try to convince you that there is some *very good intuition* for why the James-Stein estimator takes the form that it does. By the end, I hope you'll feel that, far from seeming paradoxical, using *all* of the observations to determine the shrinkage factor for one particular μ_j makes perfect sense.
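To make the dominance claim concrete, here is a Python sketch estimating the composite MSE of the James-Stein estimator at a few configurations of true means with p = 5 (the configurations are arbitrary choices of mine; the MLE's composite MSE is always p):

```python
import numpy as np

rng = np.random.default_rng(3)

def composite_mse_js(mu, n_sims=200_000):
    """Monte Carlo composite MSE of the James-Stein estimator at true means mu."""
    p = len(mu)
    X = rng.normal(loc=mu, scale=1.0, size=(n_sims, p))
    ss = np.sum(X**2, axis=1, keepdims=True)
    js = (1 - (p - 2) / ss) * X
    return np.mean(np.sum((js - mu)**2, axis=1))

# The MLE's composite MSE is p = 5 everywhere; JS stays below it for every mu.
# The gain is largest near the origin and shrinks as the means grow.
for mu in ([0.0] * 5, [0.5, -1.0, 2.0, 0.0, 1.5], [3.0, -2.0, 4.0, 1.0, 0.0]):
    print(mu, composite_mse_js(np.array(mu)))
```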
## Where does the James-Stein Estimator Come From?
### An Infeasible Estimator When p \= 2
To start the ball rolling, let's [assume a can-opener](https://en.wikipedia.org/wiki/Assume_a_can_opener): suppose that we don't know any of the *individual* means μ_j, but for some strange reason a benevolent deity has told us the value of their sum of squares: c² ≡ ∑_{j=1}^p μ_j². It turns out that this is enough information to construct a shrinkage estimator that *always* has a lower composite MSE than the ML estimator. Let's see why this is the case. If p = 1, then telling you c² is the same as telling you μ². Granted, knowledge of μ² isn't as informative as knowledge of μ. For example, if I told you that μ² = 9 you couldn't tell whether μ = 3 or μ = −3. But, as we showed above, the optimal shrinkage estimator when p = 1 sets λ* = 1/(1 + μ²) and yields an MSE of μ²/(1 + μ²) < 1. Since λ* only depends on μ through μ², we've *already shown* that knowledge of c² allows us to construct a shrinkage estimator that dominates the ML estimator when p = 1.
So what if p equals 2? In this case, knowledge of c 2 \= μ 1 2 \+ μ 2 2 is equivalent to knowing the *radius* of a circle centered at the origin in the ( μ 1 , μ 2 ) plane where the two unknown means must lie. For example, if I told you that c 2 \= 1 you would know that ( μ 1 , μ 2 ) lies somewhere on a circle of radius one centered at the origin. As illustrated in the following plot, the points ( x 1 , x 2 ) and ( y 1 , y 2 ) would then be potential values of ( μ 1 , μ 2 ) as would all other points on the blue circle.

So how can we construct a shrinkage estimator of ( μ 1 , μ 2 ) with lower composite MSE than the ML estimator if c 2 is known? While there are other possibilities, the simplest would be to use the *same* shrinkage factor for each of the two coordinates. In other words, our estimator would be μ ^ 1 ( Ī» ) \= ( 1 ā Ī» ) X 1 , μ ^ 2 ( Ī» ) \= ( 1 ā Ī» ) X 2 for some Ī» between zero and one. The composite MSE of this estimator is just the sum of the MSE of each *individual* component, so we can re-use our algebra from above to obtain MSE \[ μ ^ 1 ( Ī» ) \] \+ MSE \[ μ ^ 2 ( Ī» ) \] \= \[ ( 1 ā Ī» ) 2 \+ Ī» 2 μ 1 2 \] \+ \[ ( 1 ā Ī» ) 2 \+ Ī» 2 μ 2 2 \] \= 2 ( 1 ā Ī» ) 2 \+ Ī» 2 ( μ 1 2 \+ μ 2 2 ) \= 2 ( 1 ā Ī» ) 2 \+ Ī» 2 c 2 . Notice that the composite MSE only depends on ( μ 1 , μ 2 ) through their sum of squares, c 2. Differentiating with respect to Ī», just as we did above in the p \= 1 case, d d Ī» \[ 2 ( 1 ā Ī» ) 2 \+ Ī» 2 c 2 \] \= ā 4 ( 1 ā Ī» ) \+ 2 Ī» c 2 \= 2 \[ Ī» ( 2 \+ c 2 ) ā 2 \] d 2 d Ī» 2 \[ 2 ( 1 ā Ī» ) 2 \+ Ī» 2 c 2 \] \= 2 ( 2 \+ c 2 ) \> 0 so there is a unique global minimum at Ī» ā \= 2 / ( 2 \+ c 2 ). Substituting this value of Ī» into the expression for the composite MSE, a few lines of algebra give MSE \[ μ ^ 1 ( Ī» ā ) \] \+ MSE \[ μ ^ 2 ( Ī» ā ) \] \= 2 ( 1 ā 2 2 \+ c 2 ) 2 \+ ( 2 2 \+ c 2 ) 2 c 2 \= 2 ( c 2 2 \+ c 2 ) . Since c 2 / ( 2 \+ c 2 ) \< 1 for all c 2 \> 0, the optimal shrinkage estimator *always* has a composite MSE less than 2, the composite MSE of the ML estimator. Strictly speaking this estimator is **infeasible** since we donāt know c 2. But itās a crucial step on our journey: making the leap from applying shrinkage to an estimator of a *single* unknown mean to using the same idea for *more than one* unknown mean.
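One way to double-check the calculus without any simulation is to evaluate the composite MSE 2 ( 1 ā Ī» ) 2 \+ Ī» 2 c 2 on a fine grid of Ī» values and confirm that the minimum lands at Ī» ā \= 2 / ( 2 \+ c 2 ), with minimized value 2 c 2 / ( 2 \+ c 2 ). A minimal sketch (c 2 \= 5 is an arbitrary choice of mine):

```r
csq <- 5                           # an arbitrary "known" value of c^2
lambda <- seq(0, 1, by = 1e-4)     # fine grid of lambda values over [0, 1]
mse <- 2 * (1 - lambda)^2 + lambda^2 * csq  # composite MSE at each grid point
lambda_hat <- lambda[which.min(mse)]        # grid minimizer
lambda_star <- 2 / (2 + csq)                # theoretical optimum: 2/7
min_mse <- 2 * csq / (2 + csq)              # theoretical minimized MSE: 10/7
```

The grid minimizer agrees with the closed-form Ī» ā to within the grid spacing, and the minimized MSE matches 2 c 2 / ( 2 \+ c 2 ).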
### A Simulation Experiment for p \= 2
You may have already noticed that itās easy to generalize this argument to p \> 2. But before we consider the general case, letās take a moment to understand the geometry of shrinkage estimation for p \= 2 a bit more deeply. The nice thing about two-dimensional problems is that theyāre easy to plot. So hereās a graphical representation of both the ML estimator and our infeasible optimal shrinkage estimator when p \= 2. Iāve set the true, unknown, values of μ 1 and μ 2 to one, so the true value of c 2 is 2 and the optimal choice of Ī» is Ī» ā \= 2 / ( 2 \+ c 2 ) \= 2 / 4 \= 0\.5. The following R code simulates both estimators and plots the results.
```
set.seed(1983)
nreps <- 50
mu1 <- mu2 <- 1
x1 <- mu1 + rnorm(nreps)
x2 <- mu2 + rnorm(nreps)
csq <- mu1^2 + mu2^2
lambda <- csq / (2 + csq) # shrinkage multiplier (1 - lambda*) = c^2 / (2 + c^2)
par(mfrow = c(1, 2))
# Left panel: ML Estimator
plot(x1, x2, main = 'MLE', pch = 20, col = 'black', cex = 2,
     xlab = expression(mu[1]), ylab = expression(mu[2]))
abline(v = mu1, lty = 1, col = 'red', lwd = 2)
abline(h = mu2, lty = 1, col = 'red', lwd = 2)
# Add MSE to the plot
text(x = 2, y = 3, labels = paste("MSE =",
     round(mean((x1 - mu1)^2 + (x2 - mu2)^2), 2)))
# Right panel: Shrinkage Estimator
plot(x1, x2, main = 'Shrinkage', xlab = expression(mu[1]),
     ylab = expression(mu[2]))
points(lambda * x1, lambda * x2, pch = 20, col = 'blue', cex = 2)
segments(x0 = x1, y0 = x2, x1 = lambda * x1, y1 = lambda * x2, lty = 2)
abline(v = mu1, lty = 1, col = 'red', lwd = 2)
abline(h = mu2, lty = 1, col = 'red', lwd = 2)
abline(v = 0, lty = 1, lwd = 2)
abline(h = 0, lty = 1, lwd = 2)
# Add MSE to the plot
text(x = 2, y = 3, labels = paste("MSE =",
     round(mean((lambda * x1 - mu1)^2 +
                (lambda * x2 - mu2)^2), 2)))
```

My plot has two panels. The left panel shows the raw data. Each black point is a pair ( X 1 , X 2 ) of independent normal draws with means ( μ 1 \= 1 , μ 2 \= 1 ) and variances ( 1 , 1 ). As such, each point is also the *ML estimate* (MLE) of ( μ 1 , μ 2 ) based on ( X 1 , X 2 ). The red cross shows the location of the true values of ( μ 1 , μ 2 ), namely ( 1 , 1 ). There are 50 points in the plot, representing 50 replications of the simulation, each independent of the rest and with the same parameter values. This allows us to measure how close the ML estimator is to the true value of ( μ 1 , μ 2 ) in repeated sampling, approximating the composite MSE.
The right panel is more complicated. This shows *both* the ML estimates (unfilled black circles) *and* the corresponding shrinkage estimates (filled blue circles) along with dashed lines connecting them. Each shrinkage estimate is constructed by āpullingā the corresponding MLE towards the origin by a factor of Ī» \= 0\.5. Thus, if a given unfilled black circle is located at ( X 1 , X 2 ), the corresponding filled blue circle is located at ( 0\.5 X 1 , 0\.5 X 2 ). As in the left panel, the red cross in the right panel shows the true values of ( μ 1 , μ 2 ), namely ( 1 , 1 ). The black cross, on the other hand, shows the point towards which the shrinkage estimator pulls the ML estimator, namely ( 0 , 0 ).
We see immediately that the ML estimator is *unbiased*: the black filled dots in the left panel (along with the unfilled ones in the right) are centered at ( 1 , 1 ). But the ML estimator is also *high-variance*: the black dots are quite spread out around ( 1 , 1 ). We can approximate the composite MSE of the ML estimator by computing the average squared Euclidean distance between the black points and the red cross.[7](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fn7) And in keeping with our theoretical calculations, the simulation gives a composite MSE of almost exactly 2 for the ML estimator.
In contrast, the optimal shrinkage estimator is *biased*: the filled blue dots in the right panel are centered somewhere between the red cross (the true means) and the origin. But the shrinkage estimator also has a lower variance: the filled blue dots are much closer together than the black ones. Even more importantly *they are on average closer to* ( μ 1 , μ 2 ), as indicated by the red cross and as measured by composite MSE. Our theoretical calculations showed that the composite MSE of the optimal shrinkage estimator equals 2 c 2 / ( 2 \+ c 2 ). When c 2 \= 2, as in this case, we obtain 2 à 2 / ( 2 \+ 2 ) \= 1. Again, this is almost exactly what we see in the simulation.
If we had used more than 50 simulation replications, the composite MSE values would have been even closer to our theoretical predictions, at the cost of making the plot much harder to read! But I hope the key point is still clear: shrinkage *pulls* the MLE towards the origin, and can give a *much* lower composite MSE.
### An Infeasible Estimator: The General Case
Now that we understand the case of p \= 2, the general case is a snap. Our shrinkage estimator of each μ j will take the form μ ^ j ( Ī» ) \= ( 1 ā Ī» ) X j , j \= 1 , ⦠, p for some Ī» between zero and one. To find the optimal choice of Ī», we minimize ā j \= 1 p MSE \[ μ ^ j ( Ī» ) \] \= ā j \= 1 p \[ ( 1 ā Ī» ) 2 \+ Ī» 2 μ j 2 \] \= p ( 1 ā Ī» ) 2 \+ Ī» 2 c 2 with respect to Ī». Again, the key is that the composite MSE only depends on the unknown means through c 2. Using almost exactly the same calculations as above for the case of p \= 2, we find that Ī» ā \= p p \+ c 2 , ā j \= 1 p MSE \[ μ ^ j ( Ī» ā ) \] \= p ( c 2 p \+ c 2 ) . Since c 2 / ( p \+ c 2 ) \< 1 for all c 2 \> 0, the optimal shrinkage estimator *always* has a composite MSE less than p, the composite MSE of the ML estimator.
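The same simulation check works for any p. Hereās a hedged sketch with p \= 10 and arbitrary true means (both my choices): the composite MSE of the infeasible estimator should land near p c 2 / ( p \+ c 2 ), well below the ML estimatorās p.

```r
set.seed(1)
p <- 10
mu <- rnorm(p)                   # arbitrary true means
csq <- sum(mu^2)                 # the "known" sum of squares
lambda_star <- p / (p + csq)     # infeasible optimal shrinkage weight
nreps <- 1e4
x <- matrix(mu + rnorm(p * nreps), nrow = p)  # each column is one draw of X
mse_ml <- mean(colSums((x - mu)^2))                         # near p = 10
mse_shrink <- mean(colSums(((1 - lambda_star) * x - mu)^2)) # near p*csq/(p + csq)
```

Note that `mu` recycles correctly down the columns of `x` because the matrix has `p` rows, so each replication gets the same vector of true means.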
### Not Quite the James-Stein Estimator
The end is in sight! Weāve shown that if we knew the sum of squares of the unknown means, c 2, we could construct a shrinkage estimator that always has a lower composite MSE than the ML estimator. But we donāt know c 2. So what can we do? To start off, re-write Ī» ā as follows Ī» ā \= p p \+ c 2 \= 1 1 \+ c 2 / p . This way of writing things makes it clear that itās not c 2 *per se* that matters but rather c 2 / p. And this quantity is simply the *average* of the unknown squared means: c 2 p \= 1 p ā j \= 1 p μ j 2 . So how could we learn c 2 / p? An idea that immediately suggests itself is to estimate this quantity by replacing each unobserved μ j with the corresponding observation X j, in other words 1 p ā j \= 1 p X j 2 . This is a good starting point, but we can do better. Since X j ā¼ Normal ( μ j , 1 ), we see that E \[ 1 p ā j \= 1 p X j 2 \] \= 1 p ā j \= 1 p E \[ X j 2 \] \= 1 p ā j \= 1 p \[ Var ( X j ) \+ E ( X j ) 2 \] \= 1 p ā j \= 1 p ( 1 \+ μ j 2 ) \= 1 \+ c 2 p . This means that ( ā j \= 1 p X j 2 ) / p will on average *overestimate* c 2 / p by one. But thatās a problem thatās easy to fix: simply subtract one! This is a rare situation in which there is *no bias-variance tradeoff*. Subtracting a constant, in this case one, doesnāt contribute any additional variation while completely removing the bias. Plugging into our formula for Ī» ā, this suggests using the estimator Ī» ^ ā” 1 1 \+ \[ ( 1 p ā j \= 1 p X j 2 ) ā 1 \] \= 1 1 p ā j \= 1 p X j 2 \= p ā j \= 1 p X j 2 as our stand-in for the unknown Ī» ā, yielding a shrinkage estimator that Iāll call āNQā for ānot quiteā for reasons that will become apparent in a moment: μ ^ NQ ( j ) \= ( 1 ā p ā k \= 1 p X k 2 ) X j . Notice whatās happening here: our optimal shrinkage estimator depends on c 2 / p, something we canāt observe. But weāve constructed an *unbiased estimator* of this quantity by using *all of the observations* X j.
This is the resolution of the paradox discussed above: all of the observations contain information about c 2 since this is simply the sum of the squared means. And because weāve chosen to minimize composite MSE, the optimal shrinkage factor only depends on the individual μ j parameters through c 2! This is the sense in which itās possible to learn something useful about, say, μ 1 from X 2 in spite of the fact that E \[ X 2 \] \= μ 2 may bear no relationship to μ 1.
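Because everything needed for Ī» ^ is observed, the NQ estimator is easy to simulate. The sketch below (p \= 10 and the true means are arbitrary choices of mine) checks that its composite MSE comes in below the ML estimatorās p, even though the shrinkage factor now varies from one replication to the next:

```r
set.seed(2)
p <- 10
mu <- rnorm(p)                   # arbitrary true means, unknown to the estimator
nreps <- 1e4
x <- matrix(mu + rnorm(p * nreps), nrow = p)  # each column is one draw of X
ssq <- colSums(x^2)                    # sum of X_j^2 within each replication
shrink <- 1 - p / ssq                  # NQ shrinkage multiplier per replication
nq <- sweep(x, 2, shrink, `*`)         # NQ estimate: (1 - p / sum X^2) * X
mse_ml <- mean(colSums((x - mu)^2))    # near p = 10
mse_nq <- mean(colSums((nq - mu)^2))   # reliably below p in this setting
```

In this run the data-driven NQ estimator beats the ML estimator despite knowing nothing about c 2 in advance.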
But wait a minute! This looks *suspiciously familiar*. Recall that the James-Stein estimator is given by μ ^ JS ( j ) \= ( 1 ā p ā 2 ā k \= 1 p X k 2 ) X j . Just like the JS estimator, my NQ estimator shrinks each of the p means towards zero by a factor that depends on the number of means weāre estimating, p, and the overall sum of the squared observations. The key difference between JS and NQ is that JS uses p ā 2 in the numerator instead of p. This means that NQ is a more āaggressiveā shrinkage estimator than JS: it pulls the means towards zero by a larger amount than JS. This difference turns out to be crucial for proving that the JS estimator dominates the ML estimator. But when it comes to understanding why the JS estimator has the *form* that it does, I would argue that the difference is minor. If you want all the gory details of where that extra ā 2 comes from, along with the closely related issue of why p ā„ 3 is crucial for JS to dominate the ML estimator, see [lecture 1](https://ditraglia.com/econ722/slides/econ722slides.pdf) or [section 7.3](https://ditraglia.com/econ722/main.pdf) from my Econ 722 teaching materials.
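The difference between the two shrinkage factors is easy to see numerically. A minimal sketch (the data here are a single arbitrary draw): for the same observations, JS always shrinks *less* than NQ, since ( p ā 2 ) / ā X k 2 \< p / ā X k 2, and the two multipliers differ by exactly 2 / ā X k 2.

```r
set.seed(3)
p <- 5
mu <- rep(1, p)
x <- mu + rnorm(p)              # one draw of X
ssq <- sum(x^2)
factor_nq <- 1 - p / ssq        # NQ multiplier: more aggressive shrinkage
factor_js <- 1 - (p - 2) / ssq  # JS multiplier: always larger (less shrinkage)
```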
## Conclusion
Before we conclude, thereās one important caveat to bear in mind. In addition to the qualifications that NQ isnāt *quite* JS, and that JS only dominates the MLE when p ā„ 3, thereās one more fundamental issue that could easily be missed. Our decision to minimize *composite* MSE is *absolutely crucial* to the reasoning given above. The magic of shrinkage depends on our willingness to accept a trade-off in which we do a worse job estimating one mean in exchange for doing a better job estimating another; this is precisely the trade-off that composite MSE encodes. Whether this makes sense in practice depends on the context.
If weāre searching for a lost submarine in the ocean (a 3-dimensional problem), it makes perfect sense to be willing to be farther from the submarine in one dimension in exchange for being closer in another. Thatās because *Euclidean distance* is obviously what weāre after here. But if instead weāre estimating [teacher value-added](https://www.nber.org/papers/w27094) and the results of our estimation exercise will be used to determine which teachers lose their jobs, itās less clear that we should be willing to be farther from one teacher in exchange for being closer to another. Certainly that would be no consolation to someone who had been wrongly dismissed! If we were merely using this information to identify teachers who might need extra help, itās another story. But the point Iām trying to make here is that our choice of which criterion to minimize necessarily encodes our *values* in a particular problem.
But with that said, I hope youāre satisfied that this extremely long post was worth the effort. Without using any fancy mathematics or statistical theory, weāve managed to invent something that is *nearly identical* to the James-Stein estimator and thus to resolve Steinās paradox. We started by pretending that we knew c 2 and showed that this would allow us to derive a shrinkage estimator with a lower composite MSE than the ML estimator. Then we simply plugged in an unbiased estimator of the key unknown quantity: c 2 / p. Because all the observations contain information about c 2, it makes sense that we should decide how much to shrink one component X j by using all of the others. At this point, I hope that the James-Stein estimator seems not only plausible but practically *obvious*, excepting of course that pesky ā 2 in the numerator.
***
1. If I ruled the universe, the Gauss-Markov Theorem would be demoted to a much less exalted status in econometrics teaching.
2. Donāt let words do your thinking for you: ābiasā sounds like a very bad thing, like kicking puppies. But thatās because the word ābiasā has a negative connotation in English. In statistics, itās just a technical term for ānot centeredā. An estimator can be biased and still be very good. Indeed, the punchline of this post is that the James-Stein estimator is biased but can be much better than the obvious alternative.
3. Why squared bias and not simply bias itself? The answer is units: bias is measured in the same units as the parameter being estimated while the variance is in squared units. It doesnāt make sense to add things with different units, so we either have to square the bias or take the square root of the variance, i.e. replace it with the standard deviation. But bias can be negative, and we wouldnāt want a large negative bias to cancel out a large standard deviation so MSE squares the bias instead.[ā©ļø](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fnref3)
4. See if you can prove this as a homework exercise.
5. In Bayesian terms, we could view this āshrinkageā idea as calculating the posterior mean of μ conditional on our data X under a normal prior. In this case Ī» would equal Ļ / ( 1 \+ Ļ ) where Ļ is the *prior precision*, i.e. the reciprocal of the prior variance. But for this post weāll mainly stick to the Frequentist perspective.[ā©ļø](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fnref5)
6. Strictly speaking all of this pre-supposes that weāre working with squared-error loss so that MSE is the right thing to minimize. There are other loss functions we could have used instead and these would lead to different risk functions. But for the purposes of this post, I prefer to keep things simple. See [lecture 1](https://ditraglia.com/econ722/slides/econ722slides.pdf) of my Econ 722 slides for more detail.[ā©ļø](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fnref6)
7. Remember that there are two equivalent definitions of MSE: bias squared plus variance on the one hand and expected squared distance from the truth on the other hand.[ā©ļø](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fnref7)
If you study enough econometrics or statistics, youāll eventually hear someone mention āSteinās Paradoxā or the [āJames-Stein Estimatorā](https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator). Youāve probably learned in your introductory econometrics course that ordinary least squares (OLS) is the [best linear unbiased estimator](https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem) (BLUE) in a linear regression model under the Gauss-Markov assumptions. The stipulations ālinearā and āunbiasedā are crucial here. If we remove them, itās possible to do betterāmaybe even *much better*āthan OLS.[1](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fn1) Steinās paradox is a famous example of this phenomenon, one that created much consternation among statisticians and fellow-travelers when it was first pointed out by [Charles Stein](https://en.wikipedia.org/wiki/Charles_M._Stein) in the mid-1950s. The example is interesting in its own right, but also has deep connections to ideas in Bayesian inference and machine learning making it much more than a mere curiosity.
The supposed [paradox](https://youtu.be/XXhJKzI1u48?si=cS--uLd09_JnAXdr) is most simply stated by considering a special case of linear regressionāthat of estimating multiple unknown means. [Efron & Morris (1977)](https://www.jstor.org/stable/24954030) introduce the basic idea as follows:
> A baseball player who gets seven hits in 20 official times at bat is said to have a batting average of .350. In computing this statistic we are forming an estimate of the playerās true batting ability in terms of his observed average rate of success. Asked how well the player will do in his next 100 times at bat, we would probably predict 35 more hits. In traditional statistical theory it can be proved that no other estimation rule is uniformly better than the observed average. The paradoxical element in Steinās result is that it sometimes contradicts this elementary law of statistical theory. If we have three or more baseball players, and if we are interested in predicting future batting averages for each of them, then there is a procedure that is better than simply extrapolating from the three separate averages. Here ābetterā has a strong meaning. The statistician who employs Steinās method can expect to predict the future averages more accurately no matter what the true batting abilities of the players may be.
I first encountered Steinās Paradox in an offhand remark by my PhD supervisor. I dutifully looked it up in an attempt to better understand the point he had been making, but lacked sufficient understanding of decision theory at the time to see what the fuss was all about. The second time I encountered it, after I knew a bit more, it seemed astounding: almost like magic. I decided to include the topic in my [Econ 722](https://ditraglia.com/econ722) course at Penn, but struggled to make it accessible to my students. A big problem, in my view, is that the proofāsee [lecture 1](https://ditraglia.com/econ722/slides/econ722slides.pdf) or [section 7.3](https://ditraglia.com/econ722/main.pdf)āis ultimately a bit of a let-down: algebra, followed by repeated integration by parts, and then a fact about the existence of moments for an [inverse-chi-squared random variable](https://en.wikipedia.org/wiki/Inverse-chi-squared_distribution). It seems like a sterile technical exercise when in fact the result itself is deep, surprising, and important. As if a benign deity were keen on making my point for me, the Wikipedia article on the [James-Stein Estimator](https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator) is flagged as āmay be too technical for readers to understandā at the time of this writing!
After six months of pondering, this post is my attempt to explain the James-Stein Estimator in a way that is accessible to a broad audience. The assumed background is minimal: just an introductory course in probability and statistics. Iāll show how we can arrive at something that is *very nearly* the James-Stein estimator by following some very simple and natural intuition. After you understand my ānot quite James-Steinā estimator, itās a short step to the real thing. So the ālet-downā proof I mentioned before becomes merely a technical justification for a slight modification of a formula that is already intuitively compelling. As far as possible, Iāve tried to keep this post self-contained by introducing, or at least reviewing, key background material as we go along. The cost of this approach, unfortunately, is that the post is pretty long! I hope youāll soldier on to the end and that youāll find the payoff worth your time and effort.
As far as I know, the precise way that I motivate the James-Stein estimator in this post is new, but there are many other papers that aim to make sense of the supposed paradox in an intuitive way. In keeping with my injunction that you should always consider [reading something else instead](https://www.econometrics.blog/post/how-to-read-an-econometrics-paper/), here are a few references that you may find helpful. [Efron & Morris (1977)](https://www.jstor.org/stable/24954030) is a classic article aimed at the general reader without a background in statistics. [Stigler (1988)](https://projecteuclid.org/journals/statistical-science/volume-5/issue-1/The-1988-Neyman-Memorial-Lecture--A-Galtonian-Perspective-on/10.1214/ss/1177012274.full) is a more technical but still accessible discussion of the topic while [Casella (1985)](https://www.jstor.org/stable/2682801) is a very readable paper that discusses the James-Stein estimator in the context of empirical Bayes. A less well-known paper that I found helpful is [Ijiri & Leitch (1980)](https://www.jstor.org/stable/2490394), who consider the James-Stein estimator in a real-world setting, namely āAudit Samplingā in accounting. They discuss several interesting practical and philosophical issues including the distinction between ācompositeā and āindividualā risk that Iāll pick up on below.
## Warm-up Exercise
This section provides some important background that weāll need to understand Steinās Paradox later in the post, reviewing the ideas of **bias**, **variance**, and **mean-squared error** and introducing a very simple **shrinkage estimator**. To make these ideas as transparent as possible weāll start with a ridiculously simple problem. Suppose that you observe X ā¼ Normal ( μ , 1 ), a single draw from a normal distribution with variance one and unknown mean μ. Your task is to estimate μ. This may strike you as a very silly problem: it only involves a single datapoint and we assume the variance of X is one! But in fact thereās nothing special about n \= 1 and a variance of one: these merely make the notation simpler. If you prefer, you can think of X as the sample mean of n iid draws from a population with unknown mean μ where weāve *rescaled* everything to have variance one. So how should we estimate μ? A natural and reasonable idea is to use the sample mean, in this case X itself. This is in fact the [maximum likelihood estimator](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) for μ, so Iāll define μ ^ ML \= X. But is this estimator any good? And can we find something better?
### Review of Bias, Variance and MSE
The concepts of *bias* and *variance* are key ideas that we typically reach for when considering the quality of an estimator. To refresh your memory, *bias* is the difference between an estimatorās expected value and the true value of the parameter being estimated while *variance* is the expected squared difference between an estimator and its expected value. So if Īø ^ is an estimator of some unknown parameter Īø, then Bias ( Īø ^ ) \= E \[ Īø ^ \] ā Īø while Var ( Īø ^ ) \= E \[ ( Īø ^ ā E \[ Īø ^ \] ) 2 \]. A bias of zero means that an estimator is *correctly centered*: its expectation equals the truth. We say that such an estimator is *unbiased*.[2](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fn2) A small variance means that an estimator is *precise*: it doesnāt ājump aroundā too much. Ideally weād like an estimator that is correctly centered and precise. But it turns out that there is generally a *trade-off* between bias and variance: if you want to reduce one of them, you have to accept an increase in the other.
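As a concrete illustration of these definitions (my own example, not from the post), consider the textbook case of the maximum likelihood variance estimator, which divides by n rather than n ā 1: itās biased downward by Ļ 2 / n, and a quick Monte Carlo sketch recovers Bias ( Īø ^ ) \= E \[ Īø ^ \] ā Īø numerically.

```r
set.seed(4)
n <- 5
sigma_sq <- 1                    # true variance: the parameter theta here
nreps <- 1e5
theta_hat <- replicate(nreps, {
  x <- rnorm(n, mean = 0, sd = 1)
  mean((x - mean(x))^2)          # ML variance estimator: divides by n, not n - 1
})
bias <- mean(theta_hat) - sigma_sq             # theory: -sigma_sq / n = -0.2
variance <- mean((theta_hat - mean(theta_hat))^2)  # theory: 2(n-1)/n^2 = 0.32
```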
A common way of trading off bias and variance relies on a concept called *mean-squared error* (MSE) defined as the *sum* of the squared bias and the variance.[3](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fn3) In particular: MSE ( Īø ^ ) \= Var ( Īø ^ ) \+ Bias ( Īø ^ ) 2. Equivalently, we can write MSE ( Īø ^ ) \= E \[ ( Īø ^ ā Īø ) 2 \].[4](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fn4) To borrow some terminology from introductory microeconomics, you can think of MSE as the *negative* of a utility function over bias and variance. Both bias and variance are ābadsā in that weād rather have less of each. This formula expresses our *preferences* in terms of how much of one weād be willing to accept in exchange for less of the other. Slightly foreshadowing something that will come later in this post, we can think of MSE as the average squared distance that an archerās arrows land from the bulls-eye. Smaller values of MSE are better: variance measures how closely the arrows cluster together while bias measures how far the center of the cluster is from the bulls-eye, as in the following diagram:

### A Shrinkage Estimator
Returning to our maximum likelihood estimator: itās unbiased, Bias ( μ ^ ML ) \= 0, so MSE ( μ ^ ML ) \= Var ( μ ^ ML ) \= 1. Suppose that low MSE is what weāre after. Is there any way to improve on the ML estimator? In other words, can we achieve an MSE thatās lower than one? The answer turns out to be *yes*. Hereās the idea. Suppose we had some reason to believe that the true mean μ isnāt very large. Then perhaps we could try to adjust our maximum likelihood estimate by *shrinking* slightly towards zero. One way to do this would be by taking a weighted average of the ML estimator and zero: μ ^ ( Ī» ) \= ( 1 ā Ī» ) à μ ^ ML \+ Ī» Ć 0 \= ( 1 ā Ī» ) X for 0 ⤠λ ⤠1. The constant ( 1 ā Ī» ) is called the āshrinkage factorā and controls how strongly the ML estimator gets pulled towards zero.[5](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fn5) We get a different estimator for every value of Ī». If Ī» \= 0 then we get the ML estimator back. If Ī» \= 1 then we get a very silly estimator that ignores the data and simply reports zero no matter what! So letās see how the MSE depends on our choice of Ī». Substituting the definition of μ ^ ( Ī» ) into the formulas for bias and variance gives: Bias \[ μ ^ ( Ī» ) \] \= E \[ ( 1 ā Ī» ) μ ^ ML \] ā μ \= ( 1 ā Ī» ) E \[ μ ^ ML \] ā μ \= ( 1 ā Ī» ) μ ā μ \= ā Ī» μ Var \[ μ ^ ( Ī» ) \] \= Var \[ ( 1 ā Ī» ) μ ^ ML \] \= ( 1 ā Ī» ) 2 Var \[ μ ^ ML \] \= ( 1 ā Ī» ) 2 MSE \[ μ ^ ( Ī» ) \] \= Var \[ μ ^ ( Ī» ) \] \+ Bias \[ μ ^ ( Ī» ) \] 2 \= ( 1 ā Ī» ) 2 \+ Ī» 2 μ 2 Unless Ī» \= 0, the shrinkage estimator is *biased*. And while the MSE of the ML estimator is always one, regardless of the true value of μ, the MSE of the shrinkage estimator *depends on the unknown parameter* μ.
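These three formulas are easy to verify by simulation. A minimal sketch with Ī» \= 0\.2 and μ \= 2 (arbitrary values of mine): the bias should land near ā Ī» μ \= ā 0\.4, the variance near ( 1 ā Ī» ) 2 \= 0\.64, and the MSE near their combination, 0\.8.

```r
set.seed(5)
mu <- 2
lambda <- 0.2
x <- mu + rnorm(1e5)             # 1e5 replications of X ~ Normal(mu, 1)
est <- (1 - lambda) * x          # shrinkage estimator for each replication
bias <- mean(est) - mu                   # theory: -lambda * mu = -0.4
variance <- mean((est - mean(est))^2)    # theory: (1 - lambda)^2 = 0.64
mse <- mean((est - mu)^2)                # theory: 0.64 + 0.04 * 4 = 0.8
```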
So why should we use a biased estimator? The answer is that by tolerating a small amount of bias we may be able to achieve a *larger* reduction in variance, resulting in a lower MSE compared to the higher variance but unbiased ML estimator. A quick plot shows us that the shrinkage estimator *can indeed* have a lower MSE than the ML estimator depending on the value of λ and the true value of μ:
```
# Range of values for the unknown parameter mu
mu <- seq(-4, 4, length = 100)
# Try three different values of lambda
lambda1 <- 0.1
lambda2 <- 0.2
lambda3 <- 0.3
# Plot the MSE of the shrinkage estimator as a function of mu for all
# three values of lambda at once
matplot(mu, cbind((1 - lambda1)^2 + lambda1^2 * mu^2,
(1 - lambda2)^2 + lambda2^2 * mu^2,
(1 - lambda3)^2 + lambda3^2 * mu^2),
type = 'l', lty = 1, lwd = 2,
col = c('red', 'blue', 'green'),
xlab = expression(mu), ylab = 'MSE',
main = 'MSE of Shrinkage Estimator')
# Add legend
legend('topright', legend = c(expression(lambda == 0.1),
expression(lambda == 0.2),
expression(lambda == 0.3)),
col = c('red', 'blue', 'green'), lty = 1, lwd = 2)
# Add dashed line for MSE of ML estimator
abline(h = 1, lty = 2, lwd = 2)
```

### Some Algebra
Itās time for some algebra. If youāre tempted to skip this *please donāt*: this section is a warm-up for our main event. If you thoroughly understand the mechanics of shrinkage in this simple example, everything that follows below will seem much more natural.
As seen from the plot above, the MSE of our shrinkage estimator (the solid lines) is lower than that of the ML estimator (the dashed line) provided that our chosen value of Ī» isnāt too large relative to the true value of μ. With a bit of algebra, we can work out *precisely* how large Ī» can be to make shrinkage worthwhile. Since MSE \[ μ ^ ML \] \= 1, by expanding and simplifying the expression for MSE \[ μ ^ ( Ī» ) \] we see that MSE \[ μ ^ ( Ī» ) \] \< MSE \[ μ ^ ML \] if and only if ( 1 ā Ī» ) 2 \+ Ī» 2 μ 2 \< 1 1 ā 2 Ī» \+ Ī» 2 \+ Ī» 2 μ 2 \< 1 Ī» 2 ( 1 \+ μ 2 ) ā 2 Ī» \< 0 Ī» \[ Ī» ( 1 \+ μ 2 ) ā 2 \] \< 0\. Since Ī» ā„ 0, the final inequality can only hold if the factor inside the square brackets is negative, i.e. Ī» ( 1 \+ μ 2 ) ā 2 \< 0 Ī» \< 2 / ( 1 \+ μ 2 ) . This shows that any choice of Ī» between 0 and 2 / ( 1 \+ μ 2 ) will give us a shrinkage estimator with an MSE less than one. To check our algebra, we can change the inequality to an equality and solve for μ to obtain the boundary of the region where shrinkage is better than ML: Ī» ( 1 \+ μ 2 ) ā 2 \= 0 1 \+ μ 2 \= 2 / Ī» μ \= ± ā ( 2 / Ī» ā 1 ) . Adding these boundaries to a simplified version of our previous plot with only Ī» \= 0\.3 we see that everything works out correctly: the dotted red lines intersect the blue curve at the points where the MSE of the shrinkage estimator equals that of the ML estimator.
```
# Plot the MSE of the shrinkage estimator as a function of mu for lambda = 0.3
lambda <- 0.3
plot(mu, (1 - lambda)^2 + lambda^2 * mu^2, type = 'l', lty = 1, lwd = 2,
col = 'blue', xlab = expression(mu), ylab = 'MSE',
main = 'Boundary of Region Where Shrinkage is Better than ML')
# Add dashed line for MSE of ML estimator
abline(h = 1, lty = 2, lwd = 2)
# Add boundaries of region where shrinkage is better than ML estimator
abline(v = c(sqrt(2/lambda - 1), -sqrt(2/lambda - 1)), lty = 3, lwd = 2,
col = 'red')
```

But thereās still more to learn! Suppose we wanted to take things *one step further* and find the *optimal* value of Ī» for any given value of μ. In other words, suppose we wanted the value of Ī» that *minimizes* the MSE of our shrinkage estimator given a particular assumed value for μ. Since MSE \[ μ ^ ( Ī» ) \] is a quadratic function of Ī», as shown above, this turns out to be a fairly straightforward calculation. Differentiating, d d Ī» MSE \[ μ ^ ( Ī» ) \] \= d d Ī» \[ ( 1 ā Ī» ) 2 \+ Ī» 2 μ 2 \] \= ā 2 ( 1 ā Ī» ) \+ 2 Ī» μ 2 \= 2 \[ Ī» ( 1 \+ μ 2 ) ā 1 \] d 2 d Ī» 2 MSE \[ μ ^ ( Ī» ) \] \= 2 ( 1 \+ μ 2 ) \> 0 so there is a unique global minimum at Ī» ā ā” 1 / ( 1 \+ μ 2 ). This gives the *optimal* shrinkage factor in the sense that it minimizes the MSE of the shrinkage estimator. Substituting Ī» ā into the expression for MSE \[ μ ^ ( Ī» ) \] gives: MSE \[ μ ^ ( Ī» ā ) \] \= ( 1 ā 1 1 \+ μ 2 ) 2 \+ ( 1 1 \+ μ 2 ) 2 μ 2 \= ( μ 2 1 \+ μ 2 ) 2 \+ ( 1 1 \+ μ 2 ) 2 μ 2 \= ( 1 1 \+ μ 2 ) 2 ( μ 4 \+ μ 2 ) \= ( 1 1 \+ μ 2 ) 2 μ 2 ( 1 \+ μ 2 ) \= μ 2 1 \+ μ 2 \< 1\.
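One more numerical cross-check (a sketch of my own): minimizing ( 1 ā Ī» ) 2 \+ Ī» 2 μ 2 over a fine grid of Ī» values should recover Ī» ā \= 1 / ( 1 \+ μ 2 ) and the minimized MSE μ 2 / ( 1 \+ μ 2 ). With μ \= 3 those values are 0\.1 and 0\.9.

```r
mu <- 3
lambda <- seq(0, 1, by = 1e-4)           # fine grid over [0, 1]
mse <- (1 - lambda)^2 + lambda^2 * mu^2  # MSE of the shrinkage estimator
lambda_hat <- lambda[which.min(mse)]     # should be near 1 / (1 + 9) = 0.1
min_mse <- min(mse)                      # should be near 9 / 10 = 0.9
```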
## Steinās Paradox
### Recap
Weāre moments away from having all the ingredients we need to introduce Steinās Paradox! But first letās review what weāve uncovered thus far. Weāve seen that the shrinkage estimator can improve on the ML estimator in terms of MSE provided that $\lambda$ is chosen judiciously: it needs to be between zero and $2/(1 + \mu^2)$. The optimal choice of $\lambda$, namely $\lambda^* = 1/(1 + \mu^2)$, gives an MSE of $\mu^2/(1 + \mu^2)$. This is always lower than one, the MSE of the ML estimator.
Thereās just one massive problem weāve ignored this whole time: **we donāt know the value of** $\mu$! As seen from the figure plotted above, the MSE curves for different values of $\lambda$ *cross each other*: the best one to use depends on the true value of $\mu$. This doesnāt mean that all is lost. Perhaps in practice we have some outside information about the likely value of $\mu$ that could help guide our choice of $\lambda$. What it does mean is that thereās no āone-size-fits-allā value.
### Admissibility
Itās time to introduce a bit of technical vocabulary. We say that an estimator $\tilde{\theta}$ **dominates** another estimator $\hat{\theta}$ if $\text{MSE}[\tilde{\theta}] \leq \text{MSE}[\hat{\theta}]$ for *all* possible values of the parameter $\theta$ being estimated and $\text{MSE}[\tilde{\theta}] < \text{MSE}[\hat{\theta}]$ for at least *one* possible value of $\theta$.[6](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fn6) In words, this means that it never makes sense to use $\hat{\theta}$ in preference to $\tilde{\theta}$. No matter what the true parameter value is, you canāt do worse with $\tilde{\theta}$ and you might do better. An estimator that is *not dominated* by any other estimator is called **admissible**; an estimator that *is dominated* by some other estimator is called **inadmissible**. The concept of *admissibility* in decision theory is a bit like the concept of [Pareto efficiency](https://en.wikipedia.org/wiki/Pareto_efficiency) in microeconomics. An admissible estimator is only āgoodā in the sense that it doesnāt leave any money on the table: thereās no way to do better for one parameter value without doing worse for another. In a similar way, a Pareto efficient allocation in economics is one in which no individual can be made better off without making another person worse off.
Itās quite challenging to prove, but in fact the ML estimator $\hat{\theta}_{\text{ML}} = X$ turns out to be admissible in our little example. So while we could potentially do better by using shrinkage, itās not a slam-dunk case. If we really have no idea how large $\mu$ is likely to be, the ML estimator is a reasonable choice. Because itās admissible, at the very least we know that thereās no free lunch!
### A More General Example
Now letās make things a bit more interesting. For the rest of this post, suppose that we observe not a single draw $X$ from a $\text{Normal}(\mu, 1)$ distribution but a *collection* of $p$ independent draws from $p$ *different* normal distributions:
$$
X_j \sim \text{independent } \text{Normal}(\mu_j, 1), \quad j = 1, \dots, p.
$$
You can think of this as $p$ copies of our original problem: we observe $X_j \sim \text{Normal}(\mu_j, 1)$ and our task is to estimate $\mu_j$. The observations are all independent, and each comes from a distribution with a potentially **different mean**. At first glance it seems like these $p$ separate problems should have *absolutely nothing to do with each other*. And indeed the maximum likelihood estimator for the collection of $p$ means is simply $\hat{\mu}_{\text{ML}}(j) = X_j$. As above in our example with $p = 1$, the question is: how good is the ML estimator, and can we do any better?
### Composite MSE
But first things first: how can we evaluate the quality of $p$ estimators for $p$ different parameters *at the same time*? A common approach, and the one we will follow here, is to take the *sum* of the individual MSEs of each estimator, yielding a quantity called **composite MSE**. If $\hat{\mu}_1, \hat{\mu}_2, \dots, \hat{\mu}_p$ is a collection of estimators for each of the individual unknown means, then the composite MSE is defined as
$$
\text{Composite MSE} \equiv \sum_{j=1}^p \text{MSE}(\hat{\mu}_j) = \sum_{j=1}^p \left[\text{Bias}(\hat{\mu}_j)^2 + \text{Var}(\hat{\mu}_j)\right] = \sum_{j=1}^p E\left[(\hat{\mu}_j - \mu_j)^2\right].
$$
Adopting composite MSE as our measure of *good* performance means that we view each of the $p$ estimation problems as in some way āinterchangeableā: weāre happy to accept a trade in which we do a slightly worse job estimating $\mu_j$ in exchange for doing a much better job estimating $\mu_k$. At the end of the post Iāll say a few more words about this idea and when it may or may not be reasonable. But for the rest of the post, we will assume that our goal is to **minimize the composite MSE**. The concept of composite MSE will be crucial in understanding why the James-Stein estimator works the way it does.
### Steinās Paradox
Putting our new idea into practice, we see that the composite MSE of the ML estimator is $p$ regardless of the true values of the individual means $\mu_1, \dots, \mu_p$, since
$$
\sum_{j=1}^p \text{MSE}[\hat{\mu}_{\text{ML}}(j)] = \sum_{j=1}^p \text{MSE}(X_j) = \sum_{j=1}^p \text{Var}(X_j) = p.
$$
If the ML estimator is admissible, then there should be no other estimator that always has a composite MSE less than or equal to $p$ and sometimes has a composite MSE strictly less than $p$. Iāve already told you that this is true when $p = 1$. When $p = 2$ itās still true: the ML estimator remains admissible. But when $p \geq 3$ something very unexpected happens: it becomes possible to construct an estimator that **dominates** the ML estimator by using information from *all* of the observations $(X_1, \dots, X_p)$ to estimate $\mu_j$. This is in spite of the fact that there is *no obvious connection* between the observations. Again: they are all independent and come from distributions with different means!
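If you find this suspicious, a quick Monte Carlo check confirms that the composite MSE of the ML estimator is $p$; the true means below are arbitrary choices.

```r
# Monte Carlo sanity check: the composite MSE of the ML estimator equals p
# regardless of the true means. (The means below are arbitrary choices.)
set.seed(42)
p <- 5
mu_true <- c(-3, 0, 1, 2.5, 10) # arbitrary true means
nreps <- 1e5
# Each column of x is one draw of (X_1, ..., X_p); mu_true recycles down columns
x <- matrix(rnorm(nreps * p, mean = mu_true), nrow = p)
composite_mse <- sum(rowMeans((x - mu_true)^2))
composite_mse # approx. 5, i.e. p
```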
The estimator that does the trick is the so-called āJames-Stein Estimatorā (JS), defined according to
$$
\hat{\mu}_{\text{JS}}(j) = \left(1 - \frac{p - 2}{\sum_{k=1}^p X_k^2}\right) X_j.
$$
This estimator dominates the ML estimator when $p \geq 3$ in that
$$
\sum_{j=1}^p \text{MSE}[\hat{\mu}_{\text{JS}}(j)] \leq \sum_{j=1}^p \text{MSE}[\hat{\mu}_{\text{ML}}(j)] = p
$$
for *all* possible values of the $p$ unknown means $\mu_j$, with strict inequality for at least *some* values. Taking a closer look at the formula, we see that the James-Stein estimator is just a *shrinkage* estimator applied to each of the $p$ means, namely
$$
\hat{\mu}_{\text{JS}}(j) = (1 - \hat{\lambda}_{\text{JS}}) X_j, \quad \hat{\lambda}_{\text{JS}} \equiv \frac{p - 2}{\sum_{k=1}^p X_k^2}.
$$
The shrinkage factor in the James-Stein estimator depends on the number of means weāre estimating, $p$, along with the *overall* sum of the squared observations. All else equal, the more parameters we need to estimate, the more we shrink each of them towards zero. And the farther the observations are from zero *overall*, the less we shrink *each of them* towards zero.
Just like our simple shrinkage estimator from above, the James-Stein estimator achieves a lower MSE by tolerating a small bias in exchange for a larger reduction in variance, compared to the higher-variance but unbiased ML estimator. Unlike our simple shrinkage estimator, the James-Stein estimator uses the *data* to determine the shrinkage factor. And as long as $p \geq 3$ it is always *at least as good* as the ML estimator and sometimes *much better*. The **paradox** is that this seems impossible: how can information from *all* of the observations be useful when they come from *different* distributions with no obvious connection?
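Hereās a short simulation illustrating this for $p = 10$, with all true means arbitrarily set to $0.5$; the James-Stein estimator comes out well ahead of the ML estimator in composite MSE.

```r
# Simulate composite MSE of the ML and James-Stein estimators
# (p = 10 and the true means are illustrative choices)
set.seed(1234)
p <- 10
mu_true <- rep(0.5, p)
nreps <- 1e4
mse_ml <- mse_js <- numeric(nreps)
for (r in 1:nreps) {
  x <- rnorm(p, mean = mu_true)
  js <- (1 - (p - 2) / sum(x^2)) * x # James-Stein shrinkage
  mse_ml[r] <- sum((x - mu_true)^2)
  mse_js[r] <- sum((js - mu_true)^2)
}
mean(mse_ml) # approx. 10, the composite MSE of the ML estimator
mean(mse_js) # well below 10
```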
The rest of this post will *not* prove that the James-Stein estimator dominates the ML estimator. Instead it will try to convince you that there is some *very good intuition* for why the James-Stein estimator takes the form that it does. By the end, I hope youāll feel that, far from seeming paradoxical, using *all* of the observations to determine the shrinkage factor for one particular $\mu_j$ makes perfect sense.
## Where does the James-Stein Estimator Come From?
### An Infeasible Estimator When p \= 2
To start the ball rolling, letās [assume a can-opener](https://en.wikipedia.org/wiki/Assume_a_can_opener): suppose that we donāt know any of the *individual* means $\mu_j$, but for some strange reason a benevolent deity has told us the value of their sum of squares:
$$
c^2 \equiv \sum_{j=1}^p \mu_j^2.
$$
It turns out that this is enough information to construct a shrinkage estimator that *always* has a lower composite MSE than the ML estimator. Letās see why this is the case. If $p = 1$, then telling you $c^2$ is the same as telling you $\mu^2$. Granted, knowledge of $\mu^2$ isnāt as informative as knowledge of $\mu$. For example, if I told you that $\mu^2 = 9$ you couldnāt tell whether $\mu = 3$ or $\mu = -3$. But, as we showed above, the optimal shrinkage estimator when $p = 1$ sets $\lambda^* = 1/(1 + \mu^2)$ and yields an MSE of $\mu^2/(1 + \mu^2) < 1$. Since $\lambda^*$ only depends on $\mu$ through $\mu^2$, weāve *already shown* that knowledge of $c^2$ allows us to construct a shrinkage estimator that dominates the ML estimator when $p = 1$.
So what if $p$ equals 2? In this case, knowledge of $c^2 = \mu_1^2 + \mu_2^2$ is equivalent to knowing the *radius* of a circle centered at the origin in the $(\mu_1, \mu_2)$ plane on which the two unknown means must lie. For example, if I told you that $c^2 = 1$ you would know that $(\mu_1, \mu_2)$ lies somewhere on a circle of radius one centered at the origin. As illustrated in the following plot, the points $(x_1, x_2)$ and $(y_1, y_2)$ would then be potential values of $(\mu_1, \mu_2)$, as would all other points on the blue circle.

So how can we construct a shrinkage estimator of $(\mu_1, \mu_2)$ with lower composite MSE than the ML estimator if $c^2$ is known? While there are other possibilities, the simplest would be to use the *same* shrinkage factor for each of the two coordinates. In other words, our estimator would be
$$
\hat{\mu}_1(\lambda) = (1 - \lambda) X_1, \quad \hat{\mu}_2(\lambda) = (1 - \lambda) X_2
$$
for some $\lambda$ between zero and one. The composite MSE of this estimator is just the sum of the MSEs of each *individual* component, so we can re-use our algebra from above to obtain
$$
\begin{aligned}
\text{MSE}[\hat{\mu}_1(\lambda)] + \text{MSE}[\hat{\mu}_2(\lambda)] &= \left[(1 - \lambda)^2 + \lambda^2 \mu_1^2\right] + \left[(1 - \lambda)^2 + \lambda^2 \mu_2^2\right] \\
&= 2(1 - \lambda)^2 + \lambda^2 (\mu_1^2 + \mu_2^2) = 2(1 - \lambda)^2 + \lambda^2 c^2.
\end{aligned}
$$
Notice that the composite MSE only depends on $(\mu_1, \mu_2)$ through their sum of squares, $c^2$. Differentiating with respect to $\lambda$, just as we did above in the $p = 1$ case,
$$
\begin{aligned}
\frac{d}{d\lambda}\left[2(1 - \lambda)^2 + \lambda^2 c^2\right] &= -4(1 - \lambda) + 2\lambda c^2 = 2\left[\lambda(2 + c^2) - 2\right] \\
\frac{d^2}{d\lambda^2}\left[2(1 - \lambda)^2 + \lambda^2 c^2\right] &= 2(2 + c^2) > 0
\end{aligned}
$$
so there is a unique global minimum at $\lambda^* = 2/(2 + c^2)$. Substituting this value of $\lambda$ into the expression for the composite MSE, a few lines of algebra give
$$
\text{MSE}[\hat{\mu}_1(\lambda^*)] + \text{MSE}[\hat{\mu}_2(\lambda^*)] = 2\left(1 - \frac{2}{2 + c^2}\right)^2 + \left(\frac{2}{2 + c^2}\right)^2 c^2 = 2\left(\frac{c^2}{2 + c^2}\right).
$$
Since $c^2/(2 + c^2) < 1$ for all $c^2 > 0$, the optimal shrinkage estimator *always* has a composite MSE less than 2, the composite MSE of the ML estimator. Strictly speaking this estimator is **infeasible** since we donāt know $c^2$. But itās a crucial step on our journey from applying shrinkage to an estimator of a *single* unknown mean to using the same idea for *more than one* unknown mean.
### A Simulation Experiment for p \= 2
You may have already noticed that itās easy to generalize this argument to $p > 2$. But before we consider the general case, letās take a moment to understand the geometry of shrinkage estimation for $p = 2$ a bit more deeply. The nice thing about two-dimensional problems is that theyāre easy to plot. So hereās a graphical representation of both the ML estimator and our infeasible optimal shrinkage estimator when $p = 2$. Iāve set the true, unknown values of $\mu_1$ and $\mu_2$ to one, so the true value of $c^2$ is 2 and the optimal choice of $\lambda$ is $\lambda^* = 2/(2 + c^2) = 2/4 = 0.5$. The following R code simulates our estimators and visualizes their performance.
```
set.seed(1983)
nreps <- 50
mu1 <- mu2 <- 1
x1 <- mu1 + rnorm(nreps)
x2 <- mu2 + rnorm(nreps)
csq <- mu1^2 + mu2^2
lambda <- csq / (2 + csq) # the multiplier (1 - lambda*) = c^2 / (2 + c^2) applied to each draw
par(mfrow = c(1, 2))
# Left panel: ML Estimator
plot(x1, x2, main = 'MLE', pch = 20, col = 'black', cex = 2,
xlab = expression(mu[1]), ylab = expression(mu[2]))
abline(v = mu1, lty = 1, col = 'red', lwd = 2)
abline(h = mu2, lty = 1, col = 'red', lwd = 2)
# Add MSE to the plot
text(x = 2, y = 3, labels = paste("MSE =",
round(mean((x1 - mu1)^2 + (x2 - mu2)^2), 2)))
# Right panel: Shrinkage Estimator
plot(x1, x2, main = 'Shrinkage', xlab = expression(mu[1]),
ylab = expression(mu[2]))
points(lambda * x1, lambda * x2, pch = 20, col = 'blue', cex = 2)
segments(x0 = x1, y0 = x2, x1 = lambda * x1, y1 = lambda * x2, lty = 2)
abline(v = mu1, lty = 1, col = 'red', lwd = 2)
abline(h = mu2, lty = 1, col = 'red', lwd = 2)
abline(v = 0, lty = 1, lwd = 2)
abline(h = 0, lty = 1, lwd = 2)
# Add MSE to the plot
text(x = 2, y = 3, labels = paste("MSE =",
round(mean((lambda * x1 - mu1)^2 +
(lambda * x2 - mu2)^2), 2)))
```

My plot has two panels. The left panel shows the raw data. Each black point is a pair $(X_1, X_2)$ of independent normal draws with means $(\mu_1 = 1, \mu_2 = 1)$ and variances $(1, 1)$. As such, each point is also the *ML estimate* (MLE) of $(\mu_1, \mu_2)$ based on $(X_1, X_2)$. The red cross shows the location of the true values of $(\mu_1, \mu_2)$, namely $(1, 1)$. There are 50 points in the plot, representing 50 replications of the simulation, each independent of the rest and with the same parameter values. This allows us to measure how close the ML estimator is to the true value of $(\mu_1, \mu_2)$ in repeated sampling, approximating the composite MSE.
The right panel is more complicated. This shows *both* the ML estimates (unfilled black circles) *and* the corresponding shrinkage estimates (filled blue circles), along with dashed lines connecting them. Each shrinkage estimate is constructed by āpullingā the corresponding MLE towards the origin by a factor of $\lambda = 0.5$. Thus, if a given unfilled black circle is located at $(X_1, X_2)$, the corresponding filled blue circle is located at $(0.5 X_1, 0.5 X_2)$. As in the left panel, the red cross in the right panel shows the true values of $(\mu_1, \mu_2)$, namely $(1, 1)$. The black cross, on the other hand, shows the point towards which the shrinkage estimator pulls the ML estimator, namely $(0, 0)$.
We see immediately that the ML estimator is *unbiased*: the black filled dots in the left panel (along with the unfilled ones in the right) are centered at $(1, 1)$. But the ML estimator is also *high-variance*: the black dots are quite spread out around $(1, 1)$. We can approximate the composite MSE of the ML estimator by computing the average squared Euclidean distance between the black points and the red cross.[7](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fn7) And in keeping with our theoretical calculations, the simulation gives a composite MSE of almost exactly 2 for the ML estimator.
In contrast, the optimal shrinkage estimator is *biased*: the filled blue dots in the right panel are centered somewhere between the red cross (the true means) and the origin. But the shrinkage estimator also has a lower variance: the filled blue dots are much closer together than the black ones. Even more importantly, *they are on average closer to* $(\mu_1, \mu_2)$, as indicated by the red cross and as measured by composite MSE. Our theoretical calculations showed that the composite MSE of the optimal shrinkage estimator equals $2c^2/(2 + c^2)$. When $c^2 = 2$, as in this case, we obtain $2 \times 2/(2 + 2) = 1$. Again, this is almost exactly what we see in the simulation.
If we had used more than 50 simulation replications, the composite MSE values would have been even closer to our theoretical predictions, at the cost of making the plot much harder to read! But I hope the key point is still clear: shrinkage *pulls* the MLE towards the origin, and can give a *much* lower composite MSE.
### An Infeasible Estimator: The General Case
Now that we understand the case of $p = 2$, the general case is a snap. Our shrinkage estimator of each $\mu_j$ will take the form
$$
\hat{\mu}_j(\lambda) = (1 - \lambda) X_j, \quad j = 1, \dots, p
$$
for some $\lambda$ between zero and one. To find the optimal choice of $\lambda$, we minimize
$$
\sum_{j=1}^p \text{MSE}[\hat{\mu}_j(\lambda)] = \sum_{j=1}^p \left[(1 - \lambda)^2 + \lambda^2 \mu_j^2\right] = p(1 - \lambda)^2 + \lambda^2 c^2
$$
with respect to $\lambda$. Again, the key is that the composite MSE only depends on the unknown means through $c^2$. Using almost exactly the same calculations as above for the case of $p = 2$, we find that
$$
\lambda^* = \frac{p}{p + c^2}, \quad \sum_{j=1}^p \text{MSE}[\hat{\mu}_j(\lambda^*)] = p\left(\frac{c^2}{p + c^2}\right).
$$
Since $c^2/(p + c^2) < 1$ for all $c^2 > 0$, the optimal shrinkage estimator *always* has a composite MSE less than $p$, the composite MSE of the ML estimator.
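As before, we can double-check these formulas numerically; the values of $p$ and the true means below are arbitrary illustrative choices.

```r
# Numerical check of the general-case formulas (p and the means are arbitrary)
p <- 4
mu_true <- c(1, -2, 0.5, 3)
csq <- sum(mu_true^2) # c^2 = 14.25
composite_mse <- function(lambda) p * (1 - lambda)^2 + lambda^2 * csq
opt <- optimize(composite_mse, interval = c(0, 1))
c(opt$minimum, p / (p + csq))         # both approx. 0.2192
c(opt$objective, p * csq / (p + csq)) # both approx. 3.1233
```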
### Not Quite the James-Stein Estimator
The end is in sight! Weāve shown that if we knew the sum of squares of the unknown means, $c^2$, we could construct a shrinkage estimator that always has a lower composite MSE than the ML estimator. But we donāt know $c^2$. So what can we do? To start off, re-write $\lambda^*$ as follows:
$$
\lambda^* = \frac{p}{p + c^2} = \frac{1}{1 + c^2/p}.
$$
This way of writing things makes it clear that itās not $c^2$ *per se* that matters but rather $c^2/p$. And this quantity is simply the *average* of the unknown squared means:
$$
\frac{c^2}{p} = \frac{1}{p}\sum_{j=1}^p \mu_j^2.
$$
So how could we learn $c^2/p$? An idea that immediately suggests itself is to estimate this quantity by replacing each unobserved $\mu_j$ with the corresponding observation $X_j$, in other words
$$
\frac{1}{p}\sum_{j=1}^p X_j^2.
$$
This is a good starting point, but we can do better. Since $X_j \sim \text{Normal}(\mu_j, 1)$, we see that
$$
E\left[\frac{1}{p}\sum_{j=1}^p X_j^2\right] = \frac{1}{p}\sum_{j=1}^p E[X_j^2] = \frac{1}{p}\sum_{j=1}^p \left[\text{Var}(X_j) + E(X_j)^2\right] = \frac{1}{p}\sum_{j=1}^p (1 + \mu_j^2) = 1 + \frac{c^2}{p}.
$$
This means that $\left(\sum_{j=1}^p X_j^2\right)/p$ will on average *overestimate* $c^2/p$ by one. But thatās a problem thatās easy to fix: simply subtract one! This is a rare situation in which there is *no bias-variance tradeoff*. Subtracting a constant, in this case one, doesnāt contribute any additional variation while completely removing the bias. Plugging into our formula for $\lambda^*$, this suggests using the estimator
$$
\hat{\lambda} \equiv \frac{1}{1 + \left[\left(\frac{1}{p}\sum_{j=1}^p X_j^2\right) - 1\right]} = \frac{1}{\frac{1}{p}\sum_{j=1}^p X_j^2} = \frac{p}{\sum_{j=1}^p X_j^2}
$$
as our stand-in for the unknown $\lambda^*$, yielding a shrinkage estimator that Iāll call āNQā for ānot quite,ā for reasons that will become apparent in a moment:
$$
\hat{\mu}_{\text{NQ}}(j) = \left(1 - \frac{p}{\sum_{k=1}^p X_k^2}\right) X_j.
$$
Notice whatās happening here: our optimal shrinkage estimator depends on $c^2/p$, something we canāt observe. But weāve constructed an *unbiased estimator* of this quantity by using *all of the observations* $X_j$.
This is the resolution of the paradox discussed above: all of the observations contain information about $c^2$, since this is simply the sum of the squared means. And because weāve chosen to minimize composite MSE, the optimal shrinkage factor only depends on the individual $\mu_j$ parameters through $c^2$! This is the sense in which itās possible to learn something useful about, say, $\mu_1$ from $X_2$ in spite of the fact that $E[X_2] = \mu_2$ may bear no relationship to $\mu_1$.
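To see the NQ estimator in action, hereās a short simulation with $p = 10$ and arbitrarily chosen true means; in this setup NQ handily beats the ML estimator in composite MSE, though nothing here proves it does so for *every* parameter value.

```r
# Simulate the 'NQ' estimator and compare its composite MSE to the ML estimator's
# (p = 10 and the randomly drawn true means are illustrative choices)
set.seed(2024)
p <- 10
mu_true <- rnorm(p) # arbitrary true means
nreps <- 1e4
mse_ml <- mse_nq <- numeric(nreps)
for (r in 1:nreps) {
  x <- rnorm(p, mean = mu_true)
  nq <- (1 - p / sum(x^2)) * x # NQ: plug-in estimate of lambda*
  mse_ml[r] <- sum((x - mu_true)^2)
  mse_nq[r] <- sum((nq - mu_true)^2)
}
mean(mse_ml) # approx. p = 10
mean(mse_nq) # smaller, despite NQ's more aggressive shrinkage
```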
But wait a minute! This looks *suspiciously familiar*. Recall that the James-Stein estimator is given by
$$
\hat{\mu}_{\text{JS}}(j) = \left(1 - \frac{p - 2}{\sum_{k=1}^p X_k^2}\right) X_j.
$$
Just like the JS estimator, my NQ estimator shrinks each of the $p$ means towards zero by a factor that depends on the number of means weāre estimating, $p$, and the overall sum of the squared observations. The key difference between JS and NQ is that JS uses $p - 2$ in the numerator instead of $p$. This means that NQ is a more āaggressiveā shrinkage estimator than JS: it pulls the means towards zero by a larger amount than JS. This difference turns out to be crucial for proving that the JS estimator dominates the ML estimator. But when it comes to understanding why the JS estimator has the *form* that it does, I would argue that the difference is minor. If you want all the gory details of where that extra $-2$ comes from, along with the closely related issue of why $p \geq 3$ is crucial for JS to dominate the ML estimator, see [lecture 1](https://ditraglia.com/econ722/slides/econ722slides.pdf) or [section 7.3](https://ditraglia.com/econ722/main.pdf) from my Econ 722 teaching materials.
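A single simulated draw makes the comparison concrete: the only difference between the two shrinkage factors is the numerator, so NQ always shrinks a little more than JS. (The values of $p$ and the means below are arbitrary.)

```r
# JS vs NQ shrinkage factors on one simulated draw (p = 10, means illustrative)
set.seed(7)
p <- 10
x <- rnorm(p, mean = rep(1, p))
s <- sum(x^2)
c(js = (p - 2) / s, nq = p / s) # nq is slightly larger: more shrinkage
```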
## Conclusion
Before we conclude, thereās one important caveat to bear in mind. In addition to the qualifications that NQ isnāt *quite* JS, and that JS only dominates the MLE when p ā„ 3, thereās one more fundamental issue that could be easily missed. Our decision to minimize *composite* MSE is *absolutely crucial* to the reasoning given above. The magic of shrinkage depends on our willingness to accept a trade-off in which we do a worse job estimating one mean in exchange for doing a better job estimating another, as composite MSE imposes. Whether this makes sense in practice depends on the context.
If weāre searching for a lost submarine in the ocean (a 3-dimensional problem), it makes perfect sense to be willing to be farther from the submarine in one dimension in exchange for being closer in another. Thatās because *Euclidean distance* is obviously what weāre after here. But if instead weāre estimating [teacher value-added](https://www.nber.org/papers/w27094) and the results of our estimation exercise will be used to determine which teachers lose their jobs, itās less clear that we should be willing to be farther from one teacher in exchange for being closer to another. Certainly that would be no consolation to someone who had been wrongly dismissed! If we were merely using this information to identify teachers who might need extra help, itās another story. But the point Iām trying to make here is that our choice of which criterion to minimize necessarily encodes our *values* in a particular problem.
But with that said, I hope youāre satisfied that this extremely long post was worth the effort. Without using any fancy mathematics or statistical theory, weāve managed to invent something that is *nearly identical* to the James-Stein estimator and thus to resolve Steinās paradox. We started by pretending that we knew $c^2$ and showed that this would allow us to derive a shrinkage estimator with a lower composite MSE than the ML estimator. Then we simply plugged in an unbiased estimator of the key unknown quantity: $c^2/p$. Because all the observations contain information about $c^2$, it makes sense that we should decide how much to shrink one component $X_j$ by using all of the others. At this point, I hope that the James-Stein estimator seems not only plausible but practically *obvious*, excepting of course that pesky $-2$ in the numerator.
***
1. If I ruled the universe, the Gauss-Markov Theorem would be demoted to a much less exalted status in econometrics teaching.[ā©ļø](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fnref1)
2. Donāt let words do your thinking for you: ābiasā sounds like a very bad thing, like kicking puppies. But thatās because the word ābiasā has a negative connotation in English. In statistics, itās just a technical term for ānot centered.ā An estimator can be biased and still be very good. Indeed, the punchline of this post is that the James-Stein estimator is biased but can be much better than the obvious alternative.[ā©ļø](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fnref2)
3. Why squared bias and not simply bias itself? The answer is units: bias is measured in the same units as the parameter being estimated while the variance is in squared units. It doesnāt make sense to add things with different units, so we either have to square the bias or take the square root of the variance, i.e. replace it with the standard deviation. But bias can be negative, and we wouldnāt want a large negative bias to cancel out a large standard deviation so MSE squares the bias instead.[ā©ļø](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fnref3)
4. See if you can prove this as a homework exercise.[ā©ļø](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fnref4)
5. In Bayesian terms, we could view this āshrinkageā idea as calculating the posterior mean of μ conditional on our data X under a normal prior. In this case Ī» would equal Ļ / ( 1 \+ Ļ ) where Ļ is the *prior precision*, i.e. the reciprocal of the prior variance. But for this post weāll mainly stick to the Frequentist perspective.[ā©ļø](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fnref5)
6. Strictly speaking all of this pre-supposes that weāre working with squared-error loss so that MSE is the right thing to minimize. There are other loss functions we could have used instead and these would lead to different risk functions. But for the purposes of this post, I prefer to keep things simple. See [lecture 1](https://ditraglia.com/econ722/slides/econ722slides.pdf) of my Econ 722 slides for more detail.[ā©ļø](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fnref6)
7. Remember that there are two equivalent definitions of MSE: bias squared plus variance on the one hand and expected squared distance from the truth on the other hand.[ā©ļø](https://www.econometrics.blog/post/not-quite-the-james-stein-estimator/#fnref7)