🕷️ Crawler Inspector

URL Lookup

Raw Queries and Responses

1. Shard Calculation

Query:

Response:

Calculated Shard: 143 (from laksa173)

2. Crawled Status Check

Query:

curl -X POST \
  'http://laksa143.int.ahrefs:8124/' \
  -H 'Content-Type: text/plain' \
  -H 'X-ClickHouse-Database: crawler3' \
  -H 'Authorization: Basic YXBpOg==' \
  --data-binary @- <<'SQL'
SELECT getAhrefsURLFromUnparsed(src_unparsed) AS found_url, ifNull(toUnixTimestamp(download_stamp), 0) AS crawl_time, ifNull(toUnixTimestamp(props_url_first_seen), 0) AS first_indexed_time, download_http_code AS http_code, src_unparsed AS src_unparsed, src_root_hash AS src_root_hash, history_drop_reason AS history_drop_reason, meta_title AS meta_title, meta_descriptions AS meta_descriptions, attrs_boilerpipe_text AS attrs_boilerpipe_text, attrs_markdown AS attrs_markdown, attrs_readable_markdown AS attrs_readable_markdown, meta_canonical AS meta_canonical, ml_categories_json AS ml_categories_json, ml_types_json AS ml_types_json, ml_intent_types_json AS ml_intent_types_json, meta_language AS meta_language, attrs_author AS attrs_author, ifNull(toUnixTimestamp(attrs_publish_time), 0) AS attrs_publish_time, ifNull(toUnixTimestamp(attrs_original_publish_time), 0) AS attrs_original_publish_time, ifNull(attrs_is_republished, 0) AS attrs_is_republished, ifNull(attrs_nr_words, 0) AS attrs_nr_words, ifNull(attrs_boilerpipe_nr_words, 0) AS attrs_boilerpipe_nr_words, ifNull(body_ext_links_number, 0) AS body_ext_links_number, ifNull(body_int_links_number, 0) AS body_int_links_number, ifNull(meta_nofollow, 0) AS meta_nofollow, ifNull(meta_noarchive, 0) AS meta_noarchive, ifNull(props_was_rendered, 0) AS props_was_rendered, ifNull(src_redirect, '') AS src_redirect, ifNull(download_time_msec, 0) AS download_time_msec, ifNull(download_ttfb_msec, 0) AS download_ttfb_msec, ifNull(download_size, 0) AS download_size FROM crawler3.page_info_local FINAL PREWHERE (src_root_hash, src_unparsed) IN ((getAhrefsRootHashFromUnparsed(getAhrefsUnparsedNoserviceFromURL('https://joe-antognini.github.io/machine-learning/steins-paradox')), getAhrefsUnparsedNoserviceFromURL('https://joe-antognini.github.io/machine-learning/steins-paradox'))) FORMAT JSONEachRow
SQL

Response:

{"found_url":"https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox","crawl_time":1773829246,"first_indexed_time":1609631921,"http_code":200,"src_unparsed":"io,github!joe-antognini,\/machine-learning\/steins-paradox s443","src_root_hash":"2566890010099092343","history_drop_reason":null,"meta_title":"Understanding Stein's paradox – Joe Antognini","meta_descriptions":["An intuitive explanation of the James-Stein estimator."],"attrs_boilerpipe_text":"The paradox\nStein’s paradox is among the most surprising results in statistics.  The basic\nidea is easily stated, but it is difficult to understand how it could possibly\nbe true.  The premise is this: suppose that I have a Gaussian distribution with\na variance of unity and some mean which I don’t tell you.  I then draw a single\nsample from this distribution, give it to you, and ask you to guess the mean.\nWhat do you do?  Well, you don’t have a lot of information to go on here, so\nyou just guess that the mean is the number I gave you.  This is a good guess!\n(We will make the notion of a “good guess” a little more precise later on.)\nNo big surprise there.  Now we play again, but this time, my distribution is a\ntwo\n-dimensional Gaussian.  The covariance is the identity matrix (so this is\nequivalent to sampling from two independent one-dimensional Gaussians).  But\nagain I have not told you the mean (which is now a two-dimensional vector).\nOnce more I draw a single sample from the distribution, hand it over to you,\nand ask you to guess the mean.  You simply guess that the mean is the sample I\nhave given you.  Once more you have guessed well!\nNow we do the same thing in three dimensions.  I draw a single sample, hand it\nover to you, and ask you to guess the mean.  Just as before, you guess that the\nmean is the sample I gave you.  But this is no longer a good guess!  
Stein’s\nparadox is that if we play this game in three dimensions or more, a better\nguess is to say that the mean is this:\nμ\n^\n=\nReLU\n(\n1\n−\nD\n−\n2\n|\nx\n|\n2\n)\nx\n,\nwhere\nD\nis the dimensionality of the Gaussian, and\nx\nis the\nsample drawn from the distribution.  This is the so-called “James-Stein\nestimator.”\nWho would have thought!  What is going on here?\nWhat makes a guess good?\nBefore we go on, we should clarify exactly what we mean by a “good guess.”  We\nare trying to do what is called “parameter estimation” in statistics — based\non a sample from a distribution, we want to infer some underlying parameter (or\nparameters) of the distribution.  (In this case the parameter we are interested\nin is the mean.)  In order to quantify how good or bad our estimate is we\nchoose a function called a “loss function.”  There is some freedom in choosing\na loss function, but the mean squared error is a common choice and has a lot of\nvaluable properties.  Stein’s paradox assumes that we are using the mean\nsquared error.  So if we guess that the mean is\nμ\n^\nand the true\nvalue of the mean is\nμ\n, then the loss is\nL\n=\n|\nμ\n^\n−\nμ\n|\n2\n.\nNow, of course, we need some rule to go from the sample\nx\nto the\nestimate\nμ\n^\n.  This rule is just a function of some kind, say,\nf\n(\nx\n)\n.  This function has the special name of an “estimator.”  We\ncan choose whatever function we want here.  Our original guess was to just use\nf\n(\nx\n)\n=\nx\n.  But another choice here is to say\nf\n(\nx\n)\n=\nx\n+\n7\n, or\nf\n(\nx\n)\n=\nsin\n⁡\n(\nx\n)\n\/\nx\n71\n, or even just\nf\n(\nx\n)\n=\n31\n.  It doesn’t take much\nimagination to see that there are an infinite number of possible choices.  But\npresumably some of these choices are better than others.  How do we know which\nones are good?\nStatisticians use the concept of\nrisk\nfor this purpose.  Risk is simply the\nexpected value of your loss function.  
One thing that can be a little confusing\nis that the risk is a function of\nboth\nyour choice of estimator\nand\nthe\ntrue value of the parameter itself.  So in the original game where you’re\nguessing the mean of a one-dimensional Gaussian, the risk will be a function of\nwhatever rule you decide to use and the actual, unknown value of the mean.\nThe fact that the risk is a function of the true value of the parameter makes\nthings a little tricky.  If you’re trying to decide between two estimators, you\nmight find that one estimator works better for certain values that the\nparameter can take, and the other works better for others.  As a dumb example,\nlet’s go back to guessing the mean of a one-dimensional Gaussian.  Our original\nestimator was\nμ\n^\n=\nx\n.  But another, perfectly valid, estimator is\nμ\n^\n=\n7\n.  In other words we ignore the sample entirely and say that\nthe mean is 7 no matter what.  Generally this doesn’t seem like a smart thing\nto do.  But if the mean turns out to actually be pretty close to 7, on average\nthis will be the better guess!  Specifically, the risk of our initial estimator\nis\nR\nx\n=\nE\n[\n(\nx\n−\nμ\n)\n2\n]\n=\n1\n2\nπ\n∫\n(\nx\n−\nμ\n)\n2\ne\n−\n(\nx\n−\nμ\n)\n2\n\/\n2\nd\nx\n=\n1.\nAnd the risk on our second, dumb estimator is\nR\ndumb\n=\nE\n[\n(\n7\n−\nμ\n)\n2\n]\n=\n1\n2\nπ\n∫\n(\n7\n−\nμ\n)\n2\ne\n−\n(\nx\n−\nμ\n)\n2\n\/\n2\nd\nx\n=\n(\n7\n−\nμ\n)\n2\n.\nAs long as the true mean,\nμ\n, happens to be between 6 and 8, the dumb\nestimator of just saying 7 actually has lower risk!\nBased on this example it might seem that we’re stuck.  Since we don’t know the\ntrue value of the mean, we can’t generally say if one estimator is better than\nanother.  And indeed this is often the case.  But there are certain situations\nwhere this is not true.  If we have two estimators and one of them has a lower\nrisk\nfor any possible value the parameter can take\n, we can say that one is\ndefinitively better than the other.  
In statistical parlance, we say that the\nworse estimator is “inadmissable.”\nIn more precise terms, Stein’s paradox states that in three dimensions or more,\nthe naive estimator (just guessing that the mean is\nx\n) is\ninadmissable because the risk of the James-Stein estimator is lower for any\npossible mean I could choose.\nWhat is the James-Stein estimator doing?\nBefore we can understand\nwhy\nthe James-Stein estimator gives you a better\nguess than the naive estimator\nx\n, we should understand\nwhat\nexactly it’s doing.  The idea is that we take the naive guess\nx\nand then we scale it towards the origin by some amount.  The factor by which we\nscale it is:\nReLU\n(\n1\n−\nD\n−\n2\n|\nx\n|\n2\n)\n.\nThe ReLU function simply takes the maximum of its argument and zero, so this\nscale factor will be either positive or zero.  Let’s suppose that it’s\npositive.  In this case we scale it towards the origin more if the magnitude of\nthe sample is smaller and less if it is lower.  In the limit of\n|\nx\n|\n→\n∞\nwe don’t change it at all and our guess reduces to the naive\nestimator\nx\n.\nIn the other limit, if\n|\nx\n|\nis very small, then the ReLU function\nwill kick in and just set the scale factor to zero.  Hence anytime we get a\nsufficiently small sample we throw it out and just guess that the mean is zero\ninstead.  And all else being equal we will shrink our estimate more in higher\ndimensional spaces than in lower dimensional spaces.\nSo that is what the James-Stein estimator is doing.  Why does it work?  Before\nwe can answer that we need to take a quick detour.\nSamples in high dimensional spaces\nHigh dimensional spaces are counterintuitive.  One of the counterintuitive\nproperties of high dimensional distributions is this: a sample from a symmetric\nhigh dimensional distribution is highly likely to be further from the origin\nthan the mean.  
Specifically, for an isotropic\nD\n-dimensional Gaussian, the\ndifference between the average distance to a sample and the distance to the\nmean grows as\n∼\nD\n.\nIt’s a little strange to put it this way, but isn’t so surprising with a little\nbit of thought.  Even in two dimensions we can see that this is the case just\nby drawing it:\nThe shaded area of the circle is less than half the area, so we are less likely\nto choose a sample closer to the origin than the mean.  What perhaps makes this\ncounterintuitive is that in two dimensions a fairly large fraction of the\ncircle is still shaded.  But as the dimensionality increases, this fraction\ndecreases exponentially.  Once the dimensionality is even moderately large we\nare highly unlikely to sample a point in this shaded region.\nOne caveat here is that this effect decreases the larger the mean is.  You can\nimagine that as we move the circle further away from the origin, the shaded\nfraction gets closer and closer to ½.  So long as the mean is sufficiently\nlarge, the probability of sampling a point closer to the origin than the mean\ncan be close to ½ even in high dimensional spaces; it just requires a very\nlarge mean.  We can start to see here the connection to the James-Stein\nestimator, which also gets very close to the naive estimator\nx\nas\nthe mean (and hence\n|\nx\n|\n) gets very large.\nWhat we are doing by shrinking the estimate towards the origin is correcting\nfor the tendency of the typical sample to be slightly further away from the\norigin than the mean.  This correction allows us to reduce the overall risk of\nthe estimate. [\n1\n]\nHow arbitrary is the origin, really?\nStein’s paradox is particularly strange because there are actually two\ncounterintuitive things going on:\nThe origin is arbitrary, so why does moving your estimate towards the origin\nhelp?\nWhy does this not work in one or two dimensions?\nLet’s take a look at the first of these.  
A central principle in physics is\nthat of relativity — coordinate systems are arbitrary, so the laws of physics\nmust be valid in all of them.  Surely this is also true in statistics as well.\nWe can choose the origin to be wherever we like, so it cannot contain any\ninformation.  But this sensible assertion is false.  Statistics is not physics.\nIf we truly had no information about the mean, what value would the sample\nhave?  If it could really be\nanything\nthen presumably its value would be\nexceedingly large.  After all, there are a lot of numbers, almost all of them\nare too big to write down, and the mean could be absolutely\nany\nof them.  But\nif we pick a sample and find that its distance from the origin is 3.72 there\nmust be something fishy going on.  Clearly we have, in fact, managed to embed\nsome\ninformation in our choice of coordinate system.  The only reason we have\ngotten out sensible values in our sample is because we have some prior as to\nwhere the mean is expected to be.\nIn fact, in the limit of no information\n|\nx\n|\nwill be infinitely\nlarge and the James-Stein estimator will reduce to the naive estimator,\nx\n.  So it was our clever choice of origin that smuggled\ninformation into our estimator and allowed us to do better than the naive\nestimate.\nIt’s important to distinguish between the\ndirection\nof the origin and the\ndistance\nbetween the origin and the mean.  The direction of the origin is\nunimportant.  In fact, we could shrink our estimate in any direction and we\nwould still get the benefits of the James-Stein estimator.  The\nimportant thing is the distance between the origin and the mean — this is\nwhat has encoded some prior information that we can exploit to make a better\nprediction.\nThe bias-variance tradeoff\nThe way the James-Stein estimator works can be understood by looking at the\nbias-variance tradeoff.  
The bias-variance tradeoff states that the risk of an\nestimator can be decomposed into two components: a constant “bias” term, which\nreflects how far off the average value of the estimate is from the correct\nvalue; and an unbiased “variance” term, which accounts for the randomness of\nthe sample.\nThe naive estimator\nx\nis unbiased but has a high variance.  One\nreason that Stein’s paradox seems unnatural is that we tend to confuse unbiased\nestimators with estimators that minimize risk.  But in high dimensional spaces,\nsamples from an isotropic Gaussian encompass a tremendous volume.  Although our\nnaive estimate is unbiased it has very high variance.\nWhat the James-Stein estimator does is scale the overall distribution towards\nthe origin, thereby shrinking the volume of the distribution (and hence its\nvariance), at the cost of introducing a little bit of bias.  Although the\nestimator is now biased its overall risk is lower.\nDeriving the James-Stein estimator geometrically\nAt this point we have some idea as to why shrinking the estimate towards the\norigin could be helpful, but we haven’t yet figured out why the specific form\nof the James-Stein estimator seems to work.  To do this we’ll follow an\nargument presented by\nBrown & Zhao (2012)\n.\nRemember that the direction of the coordinate system is arbitrary.  So let’s\nrotate into a new one where one axis points directly toward the mean, and the\nother\nD\n−\n1\naxes are pointed in arbitrary (orthogonal) directions.  (Of\ncourse,\nyou\ncan’t do this because you don’t know where the true mean is.  But\nI\ndo, and I’ve decided to help you out.)  In this coordinate system the mean\nis just\n(\n|\nμ\n|\n,\n0\n,\n…\n,\n0\n)\n.  
Let’s write the sample in this coordinate\nsystem as\n(\nζ\n1\n,\nζ\n2\n,\n…\n,\nζ\nD\n)\n.\nThe loss for our guess can now be broken into two orthogonal components: one\ncomponent,\n(\nζ\n1\n−\n|\nμ\n|\n)\n2\n, which tells us how far off our estimate is\nfrom the true value when it’s projected onto the correct direction of the mean;\nand a second component,\n∑\ni\n=\n2\nD\nζ\ni\n2\n, which tells us how far off\nour estimate is from this correct direction.  Let’s call this second component\nthe “residual” component and define it as\nρ\n≡\n∑\ni\n=\n2\nD\nζ\ni\n2\n.\nOur underlying distribution is isotropic, so rotating the coordinate system\ndoesn’t change the fact that each coordinate\nζ\ni\nis a sample from an\nindependent one-dimensional Gaussian with unit variance.  And if\ni\n>\n1\nthen\nthe mean of this distribution is zero as well.\nρ\ntherefore follows a\nχ\ndistribution\nwith\nD\n−\n1\ndegrees of freedom.\nLet’s suppose that we’ve gone and chosen a sample from our distribution and by\ngood fortune\nζ\n1\nhappens to be exactly equal to\n|\nμ\n|\n.  In general\nthe rest of the\nζ\ni\nin the sample won’t be exactly zero and so\nρ\nwill be positive.  If we plot\nζ\n1\non the\nx\n-axis and\nρ\non the\ny\n-axis, the situation will look like this:\nThe sample from our distribution is the point at the top right of the triangle.\nIf we just use this sample as our estimate of the mean then the loss of this\nestimate will simply be the squared distance between the point and the bottom\nright corner of the triangle, or\nρ\n2\n.\nHow\ncan\nwe transform the sample?\nWe only have two points in this problem: the origin, and the sample\nx\n.  By symmetry this means that the only way by which we can make\nan estimate is to choose a point somewhere on the line between the origin and\nx\n. [\n2\n] In other words, you don’t actually know what direction\nany of the\nζ\ni\nare in so you can’t simply move your guess down the\nright side of the triangle to reduce\nρ\n.  
The only direction you are\nallowed to move your sample is along the hypotenuse of this triangle.\nHow\nshould\nwe transform the sample?\nThe beauty of the James-Stein estimator is that even though we are constrained\nto move the sample along the hypotenuse, we can nevertheless reduce the\ndistance between our guess and the true mean by shrinking it until the\ndirection between our guess and the mean is perpendicular to the hypotenuse.\nSome simple geometry reveals that this new point is located at:\n(\n1\n−\nρ\n2\n|\nx\n|\n2\n)\nx\nThis is looking like the beginnings of the James-Stein estimator!\nWhat exactly is\nρ\n?  Unfortunately\nρ\nis random variable, but let’s\nrepresent it by some central point of its distribution.  Now\nρ\n2\nfollows a\nχ\n2\ndistribution with\nD\n−\n1\ndegrees of freedom and the mode of this\ndistribution is\nmax\n(\nD\n−\n2\n,\n0\n)\n.  So if we simply represent this distribution by its\nmode, our estimator becomes\n(\n1\n−\nD\n−\n2\n|\nx\n|\n2\n)\nx\n,\nfor\nD\n≥\n3\nand is the naive estimator\nx\nfor\nD\n<\n3\n, which\nis exactly the James-Stein estimator without the ReLU.\nTo be clear, this is a very hand-wavey argument.  Representing the entire\ndistribution by a single point is not particularly sophisticated, and there is\nno special reason to choose to use the mode instead of the mean or median\nexcept that it happens to give an answer that corresponds to the James-Stein\nestimator. [\n3\n]\nCan we derive the James-Stein estimator rigorously?\nGiven how hand-wavey this argument was, you might wonder whether it’s possible\nto rigorously derive the James-Stein estimator by explicitly calculating the\nrisk given the Gaussian and\nχ\ndistributions, and then finding the\nfunction that minimizes the risk using the calculus of variations.  However\nthis approach will not work.  The reason is that the James-Stein optimizer does\nnot actually minimize the risk!  
In fact, despite its fame, the James-Stein\nestimator is itself inadmissible.\nAs we’ll see below, we can improve our estimator by wrapping it in a ReLU\nfunction.  But the ReLU function introduces some difficulties, because there is\na very general theorem which states that any admissible estimator must be a\nBayes estimator for some prior, or the limit of Bayes estimators. [\n4\n]  This\nends up implying that the estimator must be a smooth function and the ReLU\nfunction is not smooth.  So while the James-Stein estimator beats the naive\nestimator\nx\n, there exists some other estimator which beats it!\nRecovering the ReLU\nThere is one last piece to tackle: the ReLU function.  Where does that come\nfrom?\nThe ReLU has no effect on our estimate as long as\n|\nx\n|\n≥\nD\n−\n2\n.\nThis will generally be the case if\n|\nμ\n|\nis large.  But if\n|\nμ\n|\nis\nsmall, then some of the points sampled will have\n|\nx\n|\n<\nD\n−\n2\n.  If\nwe were to just use the same scaling we derived above, we would reflect our\nestimate through the origin and make it negative!  This estimate has a worse\nloss than an estimate at the origin.\nThe reason for this is that once we have shrunk the sample all the way down to\nthe origin we have already traded away all the variance for bias.  In other\nwords, we started with an estimate that had no bias and high variance, and\nended up with an estimate that had high bias and no variance.  But if we were\nto keep going past the origin, we’d continue to increase the bias but would\nstart to increase the variance as well!  This is strictly worse than keeping\nour estimate at the origin, so we are better off clamping the estimate at zero\nwith a ReLU function.\nWhy does Stein’s paradox not hold in two dimensions?\nLet’s now turn to the second counterintuitive property of Stein’s paradox:\nwhat’s so special about three dimensions?  
We saw a hint that three dimensions\nis special since the mode of the\nχ\n2\ndistribution is 0 for one and two\ndimensions, but is\nD\n−\n2\nfor higher numbers of dimensions.  But as I\npointed out, that came about from representing the entire\nρ\n2\ndistribution by its mode, and it’s not really obvious why we should pick the\nmode in particular rather than, say, the mean or median.\nThinking back to our geometric argument, it is not hard to see why the\nJames-Stein estimator doesn’t help in one dimension — we arrived at the\nJames-Stein estimator by separating the sample into two components, one along\nthe direction to the true mean, and the other as a residual component\nperpendicular to the first.  But in one dimension there is no residual\ncomponent!  By shrinking your estimate towards the origin you introduce some\nbias, but this is not counterbalanced by any reduction in variance.\nBut what about the two dimensional case?  Here again we unfortunately must wave\nour hands.  In the two dimensional case we do, in fact, reduce the variance by\nshrinking our estimate towards the origin, just not by enough to offset the\nbias we introduce.  While the orthogonal component,\nρ\n, is not\nidentically zero as it was in the one-dimensional case, it is strongly bunched\nup close to zero in the two dimensional case.  The idea of the James-Stein\nestimator is to note that a certain proportion of a sample’s magnitude is due\nto a component orthogonal to the mean and to shrink the estimate to reduce that\northogonal component somewhat.  But if the magnitude of the orthogonal component\nis too small, this won’t always work.  There are some values of the mean for\nwhich the James-Stein estimator is better, but others where it is worse.  It is\nnot the better estimator no matter what.\nA digression on random walks\nAs an aside, there is a\ndeep correspondence\nfound by Lawrence Brown\nbetween random walks and the admissibility of the naive estimator,\nx\n.  
Brown showed that the naive estimator is admissible if and\nonly if a random walk returns to the origin an infinite number of times.\nRandom walks in one or two dimensions do this, but random walks in three or\nmore dimensions do not.  At a high level this is because the random walk drifts\naway from the origin at a rate linear in the dimensionality of the space, but\nthe volume subtended by the origin at a fixed distance decays exponentially\nwith dimension.  In one or two dimensions the volume subtended by the origin is\nlarge enough that you are guaranteed to eventually make your way back to it.\nBut in higher dimensions, the volume subtended is smaller, so that as time\nprogresses you are less and less likely to make your way back and the probability\napproaches zero even in the limit of an infinite number of steps.\nWhile rigorously proving this correspondence is non-trivial, both problems have\nat their core a comparison between the distance between a point and the origin\nand the volume of the unit sphere in the space.\nThere be dragons far from the origin\nSo that, at a high level, is why the James-Stein estimator works.  We’ve been\nfocused in this post on the simplest form of Stein’s paradox, which is a bit of\na toy problem.  But the artificiality of the problem shouldn’t befog us as to\nthe significance of these ideas.  The artificiality eases the analysis, but in\nfact the concepts that Stein’s paradox illustrates run much deeper through\nstatistics and machine learning.\nAs we discussed earlier on, while we usually don’t think that there is anything\nspecial about our choice of origin, we do nevertheless use it to smuggle in\nsome priors, even if those priors are weak and we perhaps do so unwittingly.\nThis is also true of machine learning models, maybe even more so.  For most\nmachine learning models the origin really\nis\nspecial.  
For example, at the\norigin most garden-variety neural networks will exhibit particularly simple\nbehavior — they will output all zeros independent of the input. [\n5\n]  Any\nmovement away from this special point will introduce complexity into the\nfunctional behavior of the model.  Neural networks are, of course, non-linear,\nso it is not always true that points further from the origin in parameter space\nare more complicated, but it is\ngenerally\ntrue.\nMost ML practitioners appreciate that techniques like L2 regularization have\nthe effect of making models simpler, and therefore less likely to overfit.  But\nStein’s paradox dramatically illustrates the relationship between the risk of\nan estimate and the dimensionality of the space — in high dimensional spaces\nthere is\nmuch\nmore volume further away from the origin than closer to it.  We\nare oftentimes better off introducing bias in order to reduce variance because\nin high dimensional spaces shrinking a small amount towards the origin reduces\nan\nenormous\nvolume of parameter space.  In other words, for a large machine\nlearning model, there are\nvastly\nmore ways to overfit than there are to\nunderfit so we should bias our models towards underfitting because that is a\nsimpler problem to solve than overfitting.  (Or maybe to put it another way,\nunderfitting is a small number of problems, whereas overfitting is a stupendous\nnumber of problems.)\nThis phenomenon can manifest itself during neural network training.  Suppose we\nhave converged on a local minimum in the loss landscape after training with\nsome variant of stochastic gradient descent.  At convergence there will be\nequality between the stochastic component of SGD, for which the random walk is\ncausing the parameters to drift away from the minimum, and the true gradient,\nwhich is pushing the parameters towards the minimum. [\n6\n]  At this point the\nneural network is wandering on some high-dimensional ellipsoid around the\nminimum.  
But by the nature of high dimensional spaces, the neural network will\nalways\nbe further from the origin than the minimum, and therefore on the side\nof too much complexity.  Thus this converged model will always slightly\noverfit.\nThe further the model recedes from the origin, the more inexplicable its\nfunctional behavior can become.  The moral of this story is that far from the\norigin there be dragons, and in a high dimensional space, moving just a little\nfurther away from the origin introduces lots and lots of dragons.  The\nJames-Stein estimator advises you to stay close to the origin to keep your risk\ndown and the dragons at bay.\nFootnotes\nIt’s important to note that this does not imply that our original estimate\nis biased.  In fact the naive estimator is unbiased and shrinking it\ntowards the origin introduces bias.  It is the distance of the naive\nestimator from the origin that is biased high.  But by correcting for\nthis\nbias we happen to reduce the risk. \n↩\nWhile there’s nothing stopping us from choosing some other random\ndirection, this adds no new information and so is effectively the same as\nchoosing a different origin. \n↩\nThe median and mean of the\nχ\ndistribution are more complicated but\nare asymptotically equal to the mode in the limit of a large number of\ndegrees of freedom.  But the asymptotic limit does not help you prove the\ncase for\nD\n=\n3\n.\nIn fact Brown & Zhao (2012) chose to use the mean instead and derive a\n(\nD\n−\n1\n)\n\/\n|\nx\n|\n2\nterm instead of\n(\nD\n−\n2\n)\n\/\n|\nx\n|\n2\nas we do from the mode.  They then argue that the extra\n−\n1\n\/\n|\nx\n|\n2\nis due to the variance in\nζ\n1\n.  I am not fully\nconvinced of this argument; it seems to me that once one has represented\nthe entire distribution by a central point one is already waving one’s\nhands. \n↩\nA Bayes estimator is determined by a particular choice of prior.  
A “limit\nof Bayes estimators” is found by taking the limit of the Bayes estimators\nfound from a sequence of priors as the sequence tends to infinity. \n↩\nOf course the origin is not the only point that exhibits this behavior.  I\nspeculate that if we shrank to any point for which the model had functional\nsimplicity we would observe the same benefits as shrinking towards the\norigin. \n↩\nLearning rate decay also mitigates this issue. \n↩","attrs_markdown":"# [Joe Antognini](https:\/\/joe-antognini.github.io\/)\n[☰](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#nav)\n\n- [Publications](https:\/\/joe-antognini.github.io\/publications)\n- [Projects](https:\/\/joe-antognini.github.io\/projects)\n- [About](https:\/\/joe-antognini.github.io\/about)\n\n# Understanding Stein's paradox\nJanuary 2, 2021\n## The paradox\nStein’s paradox is among the most surprising results in statistics. The basic idea is easily stated, but it is difficult to understand how it could possibly be true. The premise is this: suppose that I have a Gaussian distribution with a variance of unity and some mean which I don’t tell you. I then draw a single sample from this distribution, give it to you, and ask you to guess the mean. What do you do? Well, you don’t have a lot of information to go on here, so you just guess that the mean is the number I gave you. This is a good guess! (We will make the notion of a “good guess” a little more precise later on.)\n\nNo big surprise there. Now we play again, but this time, my distribution is a *two*\\-dimensional Gaussian. The covariance is the identity matrix (so this is equivalent to sampling from two independent one-dimensional Gaussians). But again I have not told you the mean (which is now a two-dimensional vector). Once more I draw a single sample from the distribution, hand it over to you, and ask you to guess the mean. You simply guess that the mean is the sample I have given you. 
Once more you have guessed well\\!\n\nNow we do the same thing in three dimensions. I draw a single sample, hand it over to you, and ask you to guess the mean. Just as before, you guess that the mean is the sample I gave you. But this is no longer a good guess! Stein’s paradox is that if we play this game in three dimensions or more, a better guess is to say that the mean is this:\n\nμ^\\=ReLU(1−D−2\\|x\\|2)x,\n\nμ\n\n^\n\n\\=\n\nReLU\n\n(\n\n1\n\n−\n\nD\n\n−\n\n2\n\n\\|\n\nx\n\n\\|\n\n2\n\n)\n\nx\n\n,\n\nwhere D D is the dimensionality of the Gaussian, and x x is the sample drawn from the distribution. This is the so-called “James-Stein estimator.”\n\nWho would have thought! What is going on here?\n\n## What makes a guess good?\nBefore we go on, we should clarify exactly what we mean by a “good guess.” We are trying to do what is called “parameter estimation” in statistics — based on a sample from a distribution, we want to infer some underlying parameter (or parameters) of the distribution. (In this case the parameter we are interested in is the mean.) In order to quantify how good or bad our estimate is we choose a function called a “loss function.” There is some freedom in choosing a loss function, but the mean squared error is a common choice and has a lot of valuable properties. Stein’s paradox assumes that we are using the mean squared error. So if we guess that the mean is μ^ μ ^ and the true value of the mean is μ μ, then the loss is\n\nL\\=\\|μ^−μ\\|2.\n\nL\n\n\\=\n\n\\|\n\nμ\n\n^\n\n−\n\nμ\n\n\\|\n\n2\n\n.\n\nNow, of course, we need some rule to go from the sample x x to the estimate μ^ μ ^. This rule is just a function of some kind, say, f(x) f ( x ). This function has the special name of an “estimator.” We can choose whatever function we want here. Our original guess was to just use f(x)\\=x f ( x ) \\= x. 
But another choice here is to say $f(x) = x + 7$, or $f(x) = \\sin(x)\/x^{71}$, or even just $f(x) = 31$. It doesn’t take much imagination to see that there are an infinite number of possible choices. But presumably some of these choices are better than others. How do we know which ones are good?\n\nStatisticians use the concept of *risk* for this purpose. Risk is simply the expected value of your loss function. One thing that can be a little confusing is that the risk is a function of *both* your choice of estimator *and* the true value of the parameter itself. So in the original game where you’re guessing the mean of a one-dimensional Gaussian, the risk will be a function of whatever rule you decide to use and the actual, unknown value of the mean.\n\nThe fact that the risk is a function of the true value of the parameter makes things a little tricky. If you’re trying to decide between two estimators, you might find that one estimator works better for certain values that the parameter can take, and the other works better for others. As a dumb example, let’s go back to guessing the mean of a one-dimensional Gaussian. Our original estimator was $\\hat{\\mu} = x$. But another, perfectly valid, estimator is $\\hat{\\mu} = 7$. In other words we ignore the sample entirely and say that the mean is 7 no matter what. Generally this doesn’t seem like a smart thing to do. But if the mean turns out to actually be pretty close to 7, on average this will be the better guess! 
Specifically, the risk of our initial estimator is\n\n$$R_x = E\\left[(x - \\mu)^2\\right] = \\frac{1}{\\sqrt{2\\pi}} \\int (x - \\mu)^2 e^{-(x - \\mu)^2\/2} \\, dx = 1.$$\n\nAnd the risk of our second, dumb estimator is\n\n$$R_{\\mathrm{dumb}} = E\\left[(7 - \\mu)^2\\right] = \\frac{1}{\\sqrt{2\\pi}} \\int (7 - \\mu)^2 e^{-(x - \\mu)^2\/2} \\, dx = (7 - \\mu)^2.$$\n\nAs long as the true mean, $\\mu$, happens to be between 6 and 8, the dumb estimator of just saying 7 actually has lower risk\\!\n\nBased on this example it might seem that we’re stuck. Since we don’t know the true value of the mean, we can’t generally say if one estimator is better than another. And indeed this is often the case. But there are certain situations where this is not true. If we have two estimators and one of them has a lower risk *for any possible value the parameter can take*, we can say that one is definitively better than the other. In statistical parlance, we say that the worse estimator is “inadmissible.”\n\nIn more precise terms, Stein’s paradox states that in three dimensions or more, the naive estimator (just guessing that the mean is $x$) is inadmissible because the risk of the James-Stein estimator is lower for any possible mean I could choose.\n\n## What is the James-Stein estimator doing?\nBefore we can understand *why* the James-Stein estimator gives you a better guess than the naive estimator $x$, we should understand *what* exactly it’s doing. The idea is that we take the naive guess $x$ and then we scale it towards the origin by some amount. 
The factor by which we scale it is:\n\n$$\\mathrm{ReLU}\\left(1 - \\frac{D - 2}{|x|^2}\\right).$$\n\nThe ReLU function simply takes the maximum of its argument and zero, so this scale factor will be either positive or zero. Let’s suppose that it’s positive. In this case we scale it towards the origin more if the magnitude of the sample is smaller and less if it is larger. In the limit of $|x| \\to \\infty$ we don’t change it at all and our guess reduces to the naive estimator $x$.\n\nIn the other limit, if $|x|$ is very small, then the ReLU function will kick in and just set the scale factor to zero. Hence anytime we get a sufficiently small sample we throw it out and just guess that the mean is zero instead. And all else being equal we will shrink our estimate more in higher dimensional spaces than in lower dimensional spaces.\n\nSo that is what the James-Stein estimator is doing. Why does it work? Before we can answer that we need to take a quick detour.\n\n## Samples in high dimensional spaces\nHigh dimensional spaces are counterintuitive. One of the counterintuitive properties of high dimensional distributions is this: a sample from a symmetric high dimensional distribution is highly likely to be further from the origin than the mean. Specifically, for an isotropic $D$-dimensional Gaussian, the difference between the average distance to a sample and the distance to the mean grows as $\\sim\\sqrt{D}$.\n\nIt’s a little strange to put it this way, but isn’t so surprising with a little bit of thought. Even in two dimensions we can see that this is the case just by drawing it:\n\n![](https:\/\/joe-antognini.github.io\/assets\/posts\/steins-paradox\/circle-origin.png)\n\nThe shaded area of the circle is less than half the area, so we are less likely to choose a sample closer to the origin than the mean. What perhaps makes this counterintuitive is that in two dimensions a fairly large fraction of the circle is still shaded. 
But as the dimensionality increases, this fraction decreases exponentially. Once the dimensionality is even moderately large we are highly unlikely to sample a point in this shaded region.\n\nOne caveat here is that this effect decreases the larger the mean is. You can imagine that as we move the circle further away from the origin, the shaded fraction gets closer and closer to ½. So long as the mean is sufficiently large, the probability of sampling a point closer to the origin than the mean can be close to ½ even in high dimensional spaces; it just requires a very large mean. We can start to see here the connection to the James-Stein estimator, which also gets very close to the naive estimator $x$ as the mean (and hence $|x|$) gets very large.\n\nWhat we are doing by shrinking the estimate towards the origin is correcting for the tendency of the typical sample to be slightly further away from the origin than the mean. This correction allows us to reduce the overall risk of the estimate. \\[[1](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fn:6)\\]\n\n## How arbitrary is the origin, really?\nStein’s paradox is particularly strange because there are actually two counterintuitive things going on:\n\n1. The origin is arbitrary, so why does moving your estimate towards the origin help?\n2. Why does this not work in one or two dimensions?\n\nLet’s take a look at the first of these. A central principle in physics is that of relativity: coordinate systems are arbitrary, so the laws of physics must be valid in all of them. Surely this is also true in statistics as well. We can choose the origin to be wherever we like, so it cannot contain any information. But this sensible assertion is false. Statistics is not physics.\n\nIf we truly had no information about the mean, what value would the sample have? If it could really be *anything* then presumably its value would be exceedingly large. 
After all, there are a lot of numbers, almost all of them are too big to write down, and the mean could be absolutely *any* of them. But if we pick a sample and find that its distance from the origin is 3.72 there must be something fishy going on. Clearly we have, in fact, managed to embed *some* information in our choice of coordinate system. The only reason we have gotten out sensible values in our sample is because we have some prior as to where the mean is expected to be.\n\nIn fact, in the limit of no information $|x|$ will be infinitely large and the James-Stein estimator will reduce to the naive estimator, $x$. So it was our clever choice of origin that smuggled information into our estimator and allowed us to do better than the naive estimate.\n\nIt’s important to distinguish between the *direction* of the origin and the *distance* between the origin and the mean. The direction of the origin is unimportant. In fact, we could shrink our estimate in any direction and we would still get the benefits of the James-Stein estimator. The important thing is the distance between the origin and the mean; this is what has encoded some prior information that we can exploit to make a better prediction.\n\n## The bias-variance tradeoff\nThe way the James-Stein estimator works can be understood by looking at the bias-variance tradeoff. The bias-variance tradeoff states that the risk of an estimator can be decomposed into two components: a constant “bias” term, which reflects how far off the average value of the estimate is from the correct value; and an unbiased “variance” term, which accounts for the randomness of the sample.\n\nThe naive estimator $x$ is unbiased but has a high variance. One reason that Stein’s paradox seems unnatural is that we tend to confuse unbiased estimators with estimators that minimize risk. But in high dimensional spaces, samples from an isotropic Gaussian encompass a tremendous volume. 
Although our naive estimate is unbiased it has very high variance.\n\nWhat the James-Stein estimator does is scale the overall distribution towards the origin, thereby shrinking the volume of the distribution (and hence its variance), at the cost of introducing a little bit of bias. Although the estimator is now biased its overall risk is lower.\n\n## Deriving the James-Stein estimator geometrically\nAt this point we have some idea as to why shrinking the estimate towards the origin could be helpful, but we haven’t yet figured out why the specific form of the James-Stein estimator seems to work. To do this we’ll follow an argument presented by [Brown & Zhao (2012)](https:\/\/projecteuclid.org\/download\/pdfview_1\/euclid.ss\/1331729980).\n\nRemember that the direction of the coordinate system is arbitrary. So let’s rotate into a new one where one axis points directly toward the mean, and the other $D - 1$ axes are pointed in arbitrary (orthogonal) directions. (Of course, *you* can’t do this because you don’t know where the true mean is. But *I* do, and I’ve decided to help you out.) In this coordinate system the mean is just $(|\\mu|, 0, \\ldots, 0)$. Let’s write the sample in this coordinate system as $(\\zeta_1, \\zeta_2, \\ldots, \\zeta_D)$.\n\nThe loss for our guess can now be broken into two orthogonal components: one component, $(\\zeta_1 - |\\mu|)^2$, which tells us how far off our estimate is from the true value when it’s projected onto the correct direction of the mean; and a second component, $\\sum_{i=2}^{D} \\zeta_i^2$, which tells us how far off our estimate is from this correct direction. 
Let’s call this second component the “residual” component and define it as\n\n$$\\rho \\equiv \\sqrt{\\sum_{i=2}^{D} \\zeta_i^2}.$$\n\nOur underlying distribution is isotropic, so rotating the coordinate system doesn’t change the fact that each coordinate $\\zeta_i$ is a sample from an independent one-dimensional Gaussian with unit variance. And if $i > 1$ then the mean of this distribution is zero as well. $\\rho$ therefore follows a [χ distribution](https:\/\/en.wikipedia.org\/wiki\/Chi_distribution) with $D - 1$ degrees of freedom.\n\nLet’s suppose that we’ve gone and chosen a sample from our distribution and by good fortune $\\zeta_1$ happens to be exactly equal to $|\\mu|$. In general the rest of the $\\zeta_i$ in the sample won’t be exactly zero and so $\\rho$ will be positive. If we plot $\\zeta_1$ on the $x$-axis and $\\rho$ on the $y$-axis, the situation will look like this:\n\n![](https:\/\/joe-antognini.github.io\/assets\/posts\/steins-paradox\/SteinsTriangle-1.png)\n\nThe sample from our distribution is the point at the top right of the triangle. If we just use this sample as our estimate of the mean then the loss of this estimate will simply be the squared distance between the point and the bottom right corner of the triangle, or $\\rho^2$.\n\n### How *can* we transform the sample?\nWe only have two points in this problem: the origin, and the sample $x$. By symmetry this means that the only way by which we can make an estimate is to choose a point somewhere on the line between the origin and $x$. \\[[2](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fn:5)\\] In other words, you don’t actually know what direction any of the $\\zeta_i$ are in so you can’t simply move your guess down the right side of the triangle to reduce $\\rho$. 
The only direction you are allowed to move your sample is along the hypotenuse of this triangle.\n\n### How *should* we transform the sample?\nThe beauty of the James-Stein estimator is that even though we are constrained to move the sample along the hypotenuse, we can nevertheless reduce the distance between our guess and the true mean by shrinking it until the direction between our guess and the mean is perpendicular to the hypotenuse.\n\n![](https:\/\/joe-antognini.github.io\/assets\/posts\/steins-paradox\/SteinsTriangle-2.png)\n\nSome simple geometry reveals that this new point is located at:\n\n$$\\left(1 - \\frac{\\rho^2}{|x|^2}\\right) x$$\n\nThis is looking like the beginnings of the James-Stein estimator\\!\n\nWhat exactly is $\\rho$? Unfortunately $\\rho$ is a random variable, but let’s represent it by some central point of its distribution. Now $\\rho$ follows a χ distribution with $D - 1$ degrees of freedom, and the mode of this distribution is $\\sqrt{\\max(D - 2, 0)}$, so the value of $\\rho^2$ at the mode is $\\max(D - 2, 0)$. So if we simply represent this distribution by its mode, our estimator becomes\n\n$$\\left(1 - \\frac{D - 2}{|x|^2}\\right) x,$$\n\nfor $D \\geq 3$ and is the naive estimator $x$ for $D < 3$, which is exactly the James-Stein estimator without the ReLU.\n\nTo be clear, this is a very hand-wavey argument. Representing the entire distribution by a single point is not particularly sophisticated, and there is no special reason to choose to use the mode instead of the mean or median except that it happens to give an answer that corresponds to the James-Stein estimator. 
\\[[3](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fn:2)\\]\n\n### Can we derive the James-Stein estimator rigorously?\nGiven how hand-wavey this argument was, you might wonder whether it’s possible to rigorously derive the James-Stein estimator by explicitly calculating the risk given the Gaussian and χ distributions, and then finding the function that minimizes the risk using the calculus of variations. However this approach will not work. The reason is that the James-Stein estimator does not actually minimize the risk! In fact, despite its fame, the James-Stein estimator is itself inadmissible.\n\nAs we’ll see below, we can improve our estimator by wrapping it in a ReLU function. But the ReLU function introduces some difficulties, because there is a very general theorem which states that any admissible estimator must be a Bayes estimator for some prior, or the limit of Bayes estimators. \\[[4](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fn:1)\\] This ends up implying that the estimator must be a smooth function and the ReLU function is not smooth. So while the James-Stein estimator beats the naive estimator $x$, there exists some other estimator which beats it\\!\n\n### Recovering the ReLU\nThere is one last piece to tackle: the ReLU function. Where does that come from?\n\nThe ReLU has no effect on our estimate as long as $|x|^2 \\geq D - 2$. This will generally be the case if $|\\mu|$ is large. But if $|\\mu|$ is small, then some of the points sampled will have $|x|^2 < D - 2$. If we were to just use the same scaling we derived above, the scale factor would be negative and we would reflect our estimate through the origin! This estimate has a worse loss than an estimate at the origin.\n\nThe reason for this is that once we have shrunk the sample all the way down to the origin we have already traded away all the variance for bias. 
In other words, we started with an estimate that had no bias and high variance, and ended up with an estimate that had high bias and no variance. But if we were to keep going past the origin, we’d continue to increase the bias but would start to increase the variance as well! This is strictly worse than keeping our estimate at the origin, so we are better off clamping the estimate at zero with a ReLU function.\n\n## Why does Stein’s paradox not hold in two dimensions?\nLet’s now turn to the second counterintuitive property of Stein’s paradox: what’s so special about three dimensions? We saw a hint that three dimensions is special since the modal value we used for $\\rho^2$ is 0 for one and two dimensions, but is $D - 2$ for higher numbers of dimensions. But as I pointed out, that came about from representing the entire distribution of $\\rho$ by its mode, and it’s not really obvious why we should pick the mode in particular rather than, say, the mean or median.\n\nThinking back to our geometric argument, it is not hard to see why the James-Stein estimator doesn’t help in one dimension: we arrived at the James-Stein estimator by separating the sample into two components, one along the direction to the true mean, and the other as a residual component perpendicular to the first. But in one dimension there is no residual component! By shrinking your estimate towards the origin you introduce some bias, but this is not counterbalanced by any reduction in variance.\n\nBut what about the two dimensional case? Here again we unfortunately must wave our hands. In the two dimensional case we do, in fact, reduce the variance by shrinking our estimate towards the origin, just not by enough to offset the bias we introduce. While the orthogonal component, $\\rho$, is not identically zero as it was in the one-dimensional case, it is strongly bunched up close to zero in the two dimensional case. 
The idea of the James-Stein estimator is to note that a certain proportion of a sample’s magnitude is due to a component orthogonal to the mean and to shrink the estimate to reduce that orthogonal component somewhat. But if the magnitude of the orthogonal component is too small, this won’t always work. There are some values of the mean for which the James-Stein estimator is better, but others where it is worse. It is not the better estimator no matter what.\n\n### A digression on random walks\nAs an aside, there is a [deep correspondence](http:\/\/stat.wharton.upenn.edu\/~lbrown\/Papers\/1971b%20Admissible%20estimators,%20recurrent%20diffusions,%20and%20insoluble%20boundary%20value%20problems.pdf) found by Lawrence Brown between random walks and the admissibility of the naive estimator, $x$. Brown showed that the naive estimator is admissible if and only if a random walk returns to the origin an infinite number of times. Random walks in one or two dimensions do this, but random walks in three or more dimensions do not. At a high level this is because the random walk drifts away from the origin at a rate linear in the dimensionality of the space, but the volume subtended by the origin at a fixed distance decays exponentially with dimension. In one or two dimensions the volume subtended by the origin is large enough that you are guaranteed to eventually make your way back to it. But in higher dimensions, the volume subtended is smaller, so that as time progresses you are less and less likely to make your way back and the probability approaches zero even in the limit of an infinite number of steps.\n\nWhile rigorously proving this correspondence is non-trivial, both problems have at their core a comparison between the distance between a point and the origin and the volume of the unit sphere in the space.\n\n## There be dragons far from the origin\nSo that, at a high level, is why the James-Stein estimator works. 
We’ve been focused in this post on the simplest form of Stein’s paradox, which is a bit of a toy problem. But the artificiality of the problem shouldn’t befog us as to the significance of these ideas. The artificiality eases the analysis, but in fact the concepts that Stein’s paradox illustrates run much deeper through statistics and machine learning.\n\nAs we discussed earlier on, while we usually don’t think that there is anything special about our choice of origin, we do nevertheless use it to smuggle in some priors, even if those priors are weak and we perhaps do so unwittingly. This is also true of machine learning models, maybe even more so. For most machine learning models the origin really *is* special. For example, at the origin most garden-variety neural networks will exhibit particularly simple behavior — they will output all zeros independent of the input. \\[[5](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fn:3)\\] Any movement away from this special point will introduce complexity into the functional behavior of the model. Neural networks are, of course, non-linear, so it is not always true that points further from the origin in parameter space are more complicated, but it is *generally* true.\n\nMost ML practitioners appreciate that techniques like L2 regularization have the effect of making models simpler, and therefore less likely to overfit. But Stein’s paradox dramatically illustrates the relationship between the risk of an estimate and the dimensionality of the space — in high dimensional spaces there is *much* more volume further away from the origin than closer to it. We are oftentimes better off introducing bias in order to reduce variance because in high dimensional spaces shrinking a small amount towards the origin reduces an *enormous* volume of parameter space. 
In other words, for a large machine learning model, there are *vastly* more ways to overfit than there are to underfit so we should bias our models towards underfitting because that is a simpler problem to solve than overfitting. (Or maybe to put it another way, underfitting is a small number of problems, whereas overfitting is a stupendous number of problems.)\n\nThis phenomenon can manifest itself during neural network training. Suppose we have converged on a local minimum in the loss landscape after training with some variant of stochastic gradient descent. At convergence there will be equality between the stochastic component of SGD, for which the random walk is causing the parameters to drift away from the minimum, and the true gradient, which is pushing the parameters towards the minimum. \\[[6](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fn:4)\\] At this point the neural network is wandering on some high-dimensional ellipsoid around the minimum. But by the nature of high dimensional spaces, the neural network will *always* be further from the origin than the minimum, and therefore on the side of too much complexity. Thus this converged model will always slightly overfit.\n\nThe further the model recedes from the origin, the more inexplicable its functional behavior can become. The moral of this story is that far from the origin there be dragons, and in a high dimensional space, moving just a little further away from the origin introduces lots and lots of dragons. The James-Stein estimator advises you to stay close to the origin to keep your risk down and the dragons at bay.\n***\n## Footnotes\n1. It’s important to note that this does not imply that our original estimate is biased. In fact the naive estimator is unbiased and shrinking it towards the origin introduces bias. It is the distance of the naive estimator from the origin that is biased high. But by correcting for *this* bias we happen to reduce the risk. 
[↩](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fnref:6)\n2. While there’s nothing stopping us from choosing some other random direction, this adds no new information and so is effectively the same as choosing a different origin. [↩](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fnref:5)\n3. The median and mean of the χ distribution are more complicated but are asymptotically equal to the mode in the limit of a large number of degrees of freedom. But the asymptotic limit does not help you prove the case for $D = 3$.\n   In fact Brown & Zhao (2012) chose to use the mean instead and derive a $(D - 1)\/|x|^2$ term instead of $(D - 2)\/|x|^2$ as we do from the mode. They then argue that the extra $-1\/|x|^2$ is due to the variance in $\\zeta_1$. I am not fully convinced of this argument; it seems to me that once one has represented the entire distribution by a central point one is already waving one’s hands. [↩](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fnref:2)\n4. A Bayes estimator is determined by a particular choice of prior. A “limit of Bayes estimators” is found by taking the limit of the Bayes estimators found from a sequence of priors as the sequence tends to infinity. [↩](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fnref:1)\n5. Of course the origin is not the only point that exhibits this behavior. I speculate that if we shrank to any point for which the model had functional simplicity we would observe the same benefits as shrinking towards the origin. [↩](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fnref:3)\n6. Learning rate decay also mitigates this issue. 
[↩](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fnref:4)\n***\n\n- [Contact Me](mailto:joe.antognini@gmail.com)\n- [@joe\\_antognini](http:\/\/twitter.com\/joe_antognini)\n\n© 2026 Joe Antognini. Powered by [Jekyll](http:\/\/jekyllrb.com\/) using the [Balzac](http:\/\/jekyll.gtat.me\/about) theme.","attrs_readable_markdown":"## The paradox\nStein’s paradox is among the most surprising results in statistics. The basic idea is easily stated, but it is difficult to understand how it could possibly be true. The premise is this: suppose that I have a Gaussian distribution with a variance of unity and some mean which I don’t tell you. I then draw a single sample from this distribution, give it to you, and ask you to guess the mean. What do you do? Well, you don’t have a lot of information to go on here, so you just guess that the mean is the number I gave you. This is a good guess! (We will make the notion of a “good guess” a little more precise later on.)\n\nNo big surprise there. Now we play again, but this time, my distribution is a *two*\\-dimensional Gaussian. The covariance is the identity matrix (so this is equivalent to sampling from two independent one-dimensional Gaussians). But again I have not told you the mean (which is now a two-dimensional vector). Once more I draw a single sample from the distribution, hand it over to you, and ask you to guess the mean. You simply guess that the mean is the sample I have given you. Once more you have guessed well\\!\n\nNow we do the same thing in three dimensions. I draw a single sample, hand it over to you, and ask you to guess the mean. Just as before, you guess that the mean is the sample I gave you. But this is no longer a good guess! 
Stein’s paradox is that if we play this game in three dimensions or more, a better guess is to say that the mean is this:\n\n$$\\hat{\\mu} = \\mathrm{ReLU}\\left(1 - \\frac{D - 2}{|x|^2}\\right) x,$$\n\nwhere D is the dimensionality of the Gaussian, and x is the sample drawn from the distribution. This is the so-called “James-Stein estimator.”\n\nWho would have thought! What is going on here?\n\n## What makes a guess good?\nBefore we go on, we should clarify exactly what we mean by a “good guess.” We are trying to do what is called “parameter estimation” in statistics: based on a sample from a distribution, we want to infer some underlying parameter (or parameters) of the distribution. (In this case the parameter we are interested in is the mean.) In order to quantify how good or bad our estimate is we choose a function called a “loss function.” There is some freedom in choosing a loss function, but the mean squared error is a common choice and has a lot of valuable properties. Stein’s paradox assumes that we are using the mean squared error. So if we guess that the mean is $\\hat{\\mu}$ and the true value of the mean is μ, then the loss is\n\n$$L = |\\hat{\\mu} - \\mu|^2.$$\n\nNow, of course, we need some rule to go from the sample x to the estimate $\\hat{\\mu}$. This rule is just a function of some kind, say, $f(x)$. This function has the special name of an “estimator.” We can choose whatever function we want here. Our original guess was to just use $f(x) = x$. But another choice here is to say $f(x) = x + 7$, or $f(x) = \\sin(x)\/x^{71}$, or even just $f(x) = 31$. It doesn’t take much imagination to see that there are an infinite number of possible choices. But presumably some of these choices are better than others. How do we know which ones are good?\n\nStatisticians use the concept of *risk* for this purpose. Risk is simply the expected value of your loss function. 
One thing that can be a little confusing is that the risk is a function of *both* your choice of estimator *and* the true value of the parameter itself. So in the original game where you’re guessing the mean of a one-dimensional Gaussian, the risk will be a function of whatever rule you decide to use and the actual, unknown value of the mean.\n\nThe fact that the risk is a function of the true value of the parameter makes things a little tricky. If you’re trying to decide between two estimators, you might find that one estimator works better for certain values that the parameter can take, and the other works better for others. As a dumb example, let’s go back to guessing the mean of a one-dimensional Gaussian. Our original estimator was $\\hat{\\mu} = x$. But another, perfectly valid, estimator is $\\hat{\\mu} = 7$. In other words we ignore the sample entirely and say that the mean is 7 no matter what. Generally this doesn’t seem like a smart thing to do. But if the mean turns out to actually be pretty close to 7, on average this will be the better guess! Specifically, the risk of our initial estimator is\n\n$$R_x = E\\left[(x - \\mu)^2\\right] = \\frac{1}{\\sqrt{2\\pi}} \\int (x - \\mu)^2 e^{-(x - \\mu)^2\/2} \\, dx = 1.$$\n\nAnd the risk of our second, dumb estimator is\n\n$$R_{\\mathrm{dumb}} = E\\left[(7 - \\mu)^2\\right] = \\frac{1}{\\sqrt{2\\pi}} \\int (7 - \\mu)^2 e^{-(x - \\mu)^2\/2} \\, dx = (7 - \\mu)^2.$$\n\nAs long as the true mean, μ, happens to be between 6 and 8, the dumb estimator of just saying 7 actually has lower risk\\!\n\nBased on this example it might seem that we’re stuck. Since we don’t know the true value of the mean, we can’t generally say if one estimator is better than another. And indeed this is often the case. But there are certain situations where this is not true. If we have two estimators and one of them has a lower risk *for any possible value the parameter can take*, we can say that one is definitively better than the other. 
In statistical parlance, we say that the worse estimator is “inadmissible.”\n\nIn more precise terms, Stein’s paradox states that in three dimensions or more, the naive estimator (just guessing that the mean is x) is inadmissible because the risk of the James-Stein estimator is lower for any possible mean I could choose.\n\n## What is the James-Stein estimator doing?\nBefore we can understand *why* the James-Stein estimator gives you a better guess than the naive estimator x, we should understand *what* exactly it’s doing. The idea is that we take the naive guess x and then we scale it towards the origin by some amount. The factor by which we scale it is:\n\n$$\\mathrm{ReLU}\\left(1 - \\frac{D - 2}{|x|^2}\\right).$$\n\nThe ReLU function simply takes the maximum of its argument and zero, so this scale factor will be either positive or zero. Let’s suppose that it’s positive. In this case we scale it towards the origin more if the magnitude of the sample is smaller and less if it is larger. In the limit of $|x| \\to \\infty$ we don’t change it at all and our guess reduces to the naive estimator x.\n\nIn the other limit, if $|x|$ is very small, then the ReLU function will kick in and just set the scale factor to zero. Hence anytime we get a sufficiently small sample we throw it out and just guess that the mean is zero instead. And all else being equal we will shrink our estimate more in higher dimensional spaces than in lower dimensional spaces.\n\nSo that is what the James-Stein estimator is doing. Why does it work? Before we can answer that we need to take a quick detour.\n\n## Samples in high dimensional spaces\nHigh dimensional spaces are counterintuitive. One of the counterintuitive properties of high dimensional distributions is this: a sample from a symmetric high dimensional distribution is highly likely to be further from the origin than the mean. 
Specifically, for an isotropic D\\-dimensional Gaussian, the difference between the average distance to a sample and the distance to the mean grows as ∼√D.\n\nIt’s a little strange to put it this way, but isn’t so surprising with a little bit of thought. Even in two dimensions we can see that this is the case just by drawing it:\n\n![](https:\/\/joe-antognini.github.io\/assets\/posts\/steins-paradox\/circle-origin.png)\n\nThe shaded area of the circle is less than half the area, so we are less likely to choose a sample closer to the origin than the mean. What perhaps makes this counterintuitive is that in two dimensions a fairly large fraction of the circle is still shaded. But as the dimensionality increases, this fraction decreases exponentially. Once the dimensionality is even moderately large we are highly unlikely to sample a point in this shaded region.\n\nOne caveat here is that this effect decreases the larger the mean is. You can imagine that as we move the circle further away from the origin, the shaded fraction gets closer and closer to ½. So long as the mean is sufficiently large, the probability of sampling a point closer to the origin than the mean can be close to ½ even in high dimensional spaces; it just requires a very large mean. We can start to see here the connection to the James-Stein estimator, which also gets very close to the naive estimator x as the mean (and hence \\| x \\|) gets very large.\n\nWhat we are doing by shrinking the estimate towards the origin is correcting for the tendency of the typical sample to be slightly further away from the origin than the mean. This correction allows us to reduce the overall risk of the estimate. \\[[1](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fn:6)\\]\n\n## How arbitrary is the origin, really?\nStein’s paradox is particularly strange because there are actually two counterintuitive things going on:\n\n1. 
The origin is arbitrary, so why does moving your estimate towards the origin help?\n2. Why does this not work in one or two dimensions?\n\nLet’s take a look at the first of these. A central principle in physics is that of relativity — coordinate systems are arbitrary, so the laws of physics must be valid in all of them. Surely this is also true in statistics as well. We can choose the origin to be wherever we like, so it cannot contain any information. But this sensible assertion is false. Statistics is not physics.\n\nIf we truly had no information about the mean, what value would the sample have? If it could really be *anything* then presumably its value would be exceedingly large. After all, there are a lot of numbers, almost all of them are too big to write down, and the mean could be absolutely *any* of them. But if we pick a sample and find that its distance from the origin is 3.72 there must be something fishy going on. Clearly we have, in fact, managed to embed *some* information in our choice of coordinate system. The only reason we have gotten out sensible values in our sample is because we have some prior as to where the mean is expected to be.\n\nIn fact, in the limit of no information \\| x \\| will be infinitely large and the James-Stein estimator will reduce to the naive estimator, x. So it was our clever choice of origin that smuggled information into our estimator and allowed us to do better than the naive estimate.\n\nIt’s important to distinguish between the *direction* of the origin and the *distance* between the origin and the mean. The direction of the origin is unimportant. In fact, we could shrink our estimate in any direction and we would still get the benefits of the James-Stein estimator. 
The important thing is the distance between the origin and the mean — this is what has encoded some prior information that we can exploit to make a better prediction.\n\n## The bias-variance tradeoff\nThe way the James-Stein estimator works can be understood by looking at the bias-variance tradeoff. The bias-variance tradeoff states that the risk of an estimator can be decomposed into two components: a constant “bias” term, which reflects how far off the average value of the estimate is from the correct value; and an unbiased “variance” term, which accounts for the randomness of the sample.\n\nThe naive estimator x is unbiased but has a high variance. One reason that Stein’s paradox seems unnatural is that we tend to confuse unbiased estimators with estimators that minimize risk. But in high dimensional spaces, samples from an isotropic Gaussian encompass a tremendous volume. Although our naive estimate is unbiased it has very high variance.\n\nWhat the James-Stein estimator does is scale the overall distribution towards the origin, thereby shrinking the volume of the distribution (and hence its variance), at the cost of introducing a little bit of bias. Although the estimator is now biased its overall risk is lower.\n\n## Deriving the James-Stein estimator geometrically\nAt this point we have some idea as to why shrinking the estimate towards the origin could be helpful, but we haven’t yet figured out why the specific form of the James-Stein estimator seems to work. To do this we’ll follow an argument presented by [Brown & Zhao (2012)](https:\/\/projecteuclid.org\/download\/pdfview_1\/euclid.ss\/1331729980).\n\nRemember that the direction of the coordinate system is arbitrary. So let’s rotate into a new one where one axis points directly toward the mean, and the other D − 1 axes are pointed in arbitrary (orthogonal) directions. (Of course, *you* can’t do this because you don’t know where the true mean is. But *I* do, and I’ve decided to help you out.) 
In this coordinate system the mean is just ( \\| μ \\| , 0 , … , 0 ). Let’s write the sample in this coordinate system as ( ζ 1 , ζ 2 , … , ζ D ).\n\nThe loss for our guess can now be broken into two orthogonal components: one component, ( ζ 1 − \\| μ \\| ) 2, which tells us how far off our estimate is from the true value when it’s projected onto the correct direction of the mean; and a second component, ∑ i \\= 2 D ζ i 2, which tells us how far off our estimate is from this correct direction. Let’s call this second component the “residual” component and define its magnitude as\n\nρ ≡ √ ( ∑ i \\= 2 D ζ i 2 ) .\n\nOur underlying distribution is isotropic, so rotating the coordinate system doesn’t change the fact that each coordinate ζ i is a sample from an independent one-dimensional Gaussian with unit variance. And if i \\> 1 then the mean of this distribution is zero as well. ρ therefore follows a [χ distribution](https:\/\/en.wikipedia.org\/wiki\/Chi_distribution) with D − 1 degrees of freedom.\n\nLet’s suppose that we’ve gone and chosen a sample from our distribution and by good fortune ζ 1 happens to be exactly equal to \\| μ \\|. In general the rest of the ζ i in the sample won’t be exactly zero and so ρ will be positive. If we plot ζ 1 on the x\\-axis and ρ on the y\\-axis, the situation will look like this:\n\n![](https:\/\/joe-antognini.github.io\/assets\/posts\/steins-paradox\/SteinsTriangle-1.png)\n\nThe sample from our distribution is the point at the top right of the triangle. If we just use this sample as our estimate of the mean then the loss of this estimate will simply be the squared distance between the point and the bottom right corner of the triangle, or ρ 2.\n\n### How *can* we transform the sample?\nWe only have two points in this problem: the origin, and the sample x. By symmetry this means that the only way by which we can make an estimate is to choose a point somewhere on the line between the origin and x. 
\\[[2](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fn:5)\\] In other words, you don’t actually know what direction any of the ζ i are in so you can’t simply move your guess down the right side of the triangle to reduce ρ. The only direction you are allowed to move your sample is along the hypotenuse of this triangle.\n\n### How *should* we transform the sample?\nThe beauty of the James-Stein estimator is that even though we are constrained to move the sample along the hypotenuse, we can nevertheless reduce the distance between our guess and the true mean by shrinking it until the direction between our guess and the mean is perpendicular to the hypotenuse.\n\n![](https:\/\/joe-antognini.github.io\/assets\/posts\/steins-paradox\/SteinsTriangle-2.png)\n\nSome simple geometry reveals that this new point is located at:\n\n( 1 − ρ 2 \/ \\| x \\| 2 ) x\n\nThis is looking like the beginnings of the James-Stein estimator\\!\n\nWhat exactly is ρ? Unfortunately ρ is a random variable, but let’s represent it by some central point of its distribution. Now ρ 2 follows a χ 2 distribution with D − 1 degrees of freedom and the mode of this distribution is max ( D − 3 , 0 ), which does not by itself yield the D − 2 coefficient below. So if we simply represent this distribution by its mode, our estimator becomes\n\n( 1 − ( D − 2 ) \/ \\| x \\| 2 ) x ,\n\nfor D ≥ 3 and is the naive estimator x for D \\< 3, which is exactly the James-Stein estimator without the ReLU.\n\nTo be clear, this is a very hand-wavey argument. Representing the entire distribution by a single point is not particularly sophisticated, and there is no special reason to choose to use the mode instead of the mean or median except that it happens to give an answer that corresponds to the James-Stein estimator. 
\\[[3](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fn:2)\\]\n\n### Can we derive the James-Stein estimator rigorously?\nGiven how hand-wavey this argument was, you might wonder whether it’s possible to rigorously derive the James-Stein estimator by explicitly calculating the risk given the Gaussian and χ distributions, and then finding the function that minimizes the risk using the calculus of variations. However this approach will not work. The reason is that the James-Stein estimator does not actually minimize the risk! In fact, despite its fame, the James-Stein estimator is itself inadmissible.\n\nAs we’ll see below, we can improve our estimator by wrapping it in a ReLU function. But the ReLU function introduces some difficulties, because there is a very general theorem which states that any admissible estimator must be a Bayes estimator for some prior, or the limit of Bayes estimators. \\[[4](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fn:1)\\] This ends up implying that the estimator must be a smooth function and the ReLU function is not smooth. So while the James-Stein estimator beats the naive estimator x, there exists some other estimator which beats it\\!\n\n### Recovering the ReLU\nThere is one last piece to tackle: the ReLU function. Where does that come from?\n\nThe ReLU has no effect on our estimate as long as \\| x \\| 2 ≥ D − 2. This will generally be the case if \\| μ \\| is large. But if \\| μ \\| is small, then some of the points sampled will have \\| x \\| 2 \\< D − 2. If we were to just use the same scaling we derived above, the scale factor would be negative and we would reflect our estimate through the origin! This estimate has a worse loss than an estimate at the origin.\n\nThe reason for this is that once we have shrunk the sample all the way down to the origin we have already traded away all the variance for bias. 
In other words, we started with an estimate that had no bias and high variance, and ended up with an estimate that had high bias and no variance. But if we were to keep going past the origin, we’d continue to increase the bias but would start to increase the variance as well! This is strictly worse than keeping our estimate at the origin, so we are better off clamping the estimate at zero with a ReLU function.\n\n## Why does Stein’s paradox not hold in two dimensions?\nLet’s now turn to the second counterintuitive property of Stein’s paradox: what’s so special about three dimensions? We saw a hint that three dimensions is special since the mode of the χ 2 distribution is 0 for one and two dimensions, but is D − 2 for higher numbers of dimensions. But as I pointed out, that came about from representing the entire ρ 2 distribution by its mode, and it’s not really obvious why we should pick the mode in particular rather than, say, the mean or median.\n\nThinking back to our geometric argument, it is not hard to see why the James-Stein estimator doesn’t help in one dimension — we arrived at the James-Stein estimator by separating the sample into two components, one along the direction to the true mean, and the other as a residual component perpendicular to the first. But in one dimension there is no residual component! By shrinking your estimate towards the origin you introduce some bias, but this is not counterbalanced by any reduction in variance.\n\nBut what about the two dimensional case? Here again we unfortunately must wave our hands. In the two dimensional case we do, in fact, reduce the variance by shrinking our estimate towards the origin, just not by enough to offset the bias we introduce. While the orthogonal component, ρ, is not identically zero as it was in the one-dimensional case, it is strongly bunched up close to zero in the two dimensional case. 
The idea of the James-Stein estimator is to note that a certain proportion of a sample’s magnitude is due to a component orthogonal to the mean and to shrink the estimate to reduce that orthogonal component somewhat. But if the magnitude of the orthogonal component is too small, this won’t always work. There are some values of the mean for which the James-Stein estimator is better, but others where it is worse. It is not the better estimator no matter what.\n\n### A digression on random walks\nAs an aside, there is a [deep correspondence](http:\/\/stat.wharton.upenn.edu\/~lbrown\/Papers\/1971b%20Admissible%20estimators,%20recurrent%20diffusions,%20and%20insoluble%20boundary%20value%20problems.pdf) found by Lawrence Brown between random walks and the admissibility of the naive estimator, x. Brown showed that the naive estimator is admissible if and only if a random walk returns to the origin an infinite number of times. Random walks in one or two dimensions do this, but random walks in three or more dimensions do not. At a high level this is because the random walk drifts away from the origin at a rate linear in the dimensionality of the space, but the volume subtended by the origin at a fixed distance decays exponentially with dimension. In one or two dimensions the volume subtended by the origin is large enough that you are guaranteed to eventually make your way back to it. But in higher dimensions, the volume subtended is smaller, so that as time progresses you are less and less likely to make your way back and the probability approaches zero even in the limit of an infinite number of steps.\n\nWhile rigorously proving this correspondence is non-trivial, both problems have at their core a comparison between the distance between a point and the origin and the volume of the unit sphere in the space.\n\n## There be dragons far from the origin\nSo that, at a high level, is why the James-Stein estimator works. 
We’ve been focused in this post on the simplest form of Stein’s paradox, which is a bit of a toy problem. But the artificiality of the problem shouldn’t befog us as to the significance of these ideas. The artificiality eases the analysis, but in fact the concepts that Stein’s paradox illustrates run much deeper through statistics and machine learning.\n\nAs we discussed earlier on, while we usually don’t think that there is anything special about our choice of origin, we do nevertheless use it to smuggle in some priors, even if those priors are weak and we perhaps do so unwittingly. This is also true of machine learning models, maybe even more so. For most machine learning models the origin really *is* special. For example, at the origin most garden-variety neural networks will exhibit particularly simple behavior — they will output all zeros independent of the input. \\[[5](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fn:3)\\] Any movement away from this special point will introduce complexity into the functional behavior of the model. Neural networks are, of course, non-linear, so it is not always true that points further from the origin in parameter space are more complicated, but it is *generally* true.\n\nMost ML practitioners appreciate that techniques like L2 regularization have the effect of making models simpler, and therefore less likely to overfit. But Stein’s paradox dramatically illustrates the relationship between the risk of an estimate and the dimensionality of the space — in high dimensional spaces there is *much* more volume further away from the origin than closer to it. We are oftentimes better off introducing bias in order to reduce variance because in high dimensional spaces shrinking a small amount towards the origin reduces an *enormous* volume of parameter space. 
In other words, for a large machine learning model, there are *vastly* more ways to overfit than there are to underfit so we should bias our models towards underfitting because that is a simpler problem to solve than overfitting. (Or maybe to put it another way, underfitting is a small number of problems, whereas overfitting is a stupendous number of problems.)\n\nThis phenomenon can manifest itself during neural network training. Suppose we have converged on a local minimum in the loss landscape after training with some variant of stochastic gradient descent. At convergence there will be equality between the stochastic component of SGD, for which the random walk is causing the parameters to drift away from the minimum, and the true gradient, which is pushing the parameters towards the minimum. \\[[6](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fn:4)\\] At this point the neural network is wandering on some high-dimensional ellipsoid around the minimum. But by the nature of high dimensional spaces, the neural network will *always* be further from the origin than the minimum, and therefore on the side of too much complexity. Thus this converged model will always slightly overfit.\n\nThe further the model recedes from the origin, the more inexplicable its functional behavior can become. The moral of this story is that far from the origin there be dragons, and in a high dimensional space, moving just a little further away from the origin introduces lots and lots of dragons. The James-Stein estimator advises you to stay close to the origin to keep your risk down and the dragons at bay.\n***\n## Footnotes\n1. It’s important to note that this does not imply that our original estimate is biased. In fact the naive estimator is unbiased and shrinking it towards the origin introduces bias. It is the distance of the naive estimator from the origin that is biased high. But by correcting for *this* bias we happen to reduce the risk. 
[↩](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fnref:6)\n2. While there’s nothing stopping us from choosing some other random direction, this adds no new information and so is effectively the same as choosing a different origin. [↩](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fnref:5)\n3. The median and mean of the χ distribution are more complicated but are asymptotically equal to the mode in the limit of a large number of degrees of freedom. But the asymptotic limit does not help you prove the case for D \\= 3.\n   In fact Brown & Zhao (2012) chose to use the mean instead and derive a ( D − 1 ) \/ \\| x \\| 2 term instead of ( D − 2 ) \/ \\| x \\| 2 as we do from the mode. They then argue that the extra − 1 \/ \\| x \\| 2 is due to the variance in ζ 1. I am not fully convinced of this argument; it seems to me that once one has represented the entire distribution by a central point one is already waving one’s hands. [↩](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fnref:2)\n4. A Bayes estimator is determined by a particular choice of prior. A “limit of Bayes estimators” is found by taking the limit of the Bayes estimators found from a sequence of priors as the sequence tends to infinity. [↩](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fnref:1)\n5. Of course the origin is not the only point that exhibits this behavior. I speculate that if we shrank to any point for which the model had functional simplicity we would observe the same benefits as shrinking towards the origin. [↩](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fnref:3)\n6. Learning rate decay also mitigates this issue. 
[↩](https:\/\/joe-antognini.github.io\/machine-learning\/steins-paradox#fnref:4)\n***","meta_canonical":null,"ml_categories_json":"{\"\/Science\":897,\"\/Science\/Mathematics\":894,\"\/Science\/Mathematics\/Statistics\":778}","ml_types_json":"{\"\/Article\":998,\"\/Article\/Tutorial_or_Guide\":891}","ml_intent_types_json":"{\"Informational\":999}","meta_language":"en","attrs_author":null,"attrs_publish_time":0,"attrs_original_publish_time":1609631921,"attrs_is_republished":0,"attrs_nr_words":"5260","attrs_boilerpipe_nr_words":"4817","body_ext_links_number":8,"body_int_links_number":7,"meta_nofollow":0,"meta_noarchive":0,"props_was_rendered":1,"src_redirect":"","download_time_msec":43,"download_ttfb_msec":43,"download_size":12706}

3. Robots.txt Check

Query:

Response:

4. Spam/Ban Check

Query:

Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄 INDEXABLE · ✅ CRAWLED (1 month ago) · 🤖 ROBOTS ALLOWED

Page Info Filters

Filter	Status	Condition	Details
HTTP status	PASS	`download_http_code = 200`	HTTP 200
Age cutoff	PASS	`download_stamp > now() - 6 MONTH`	1.2 months ago
History drop	PASS	`isNull(history_drop_reason)`	No drop reason
Spam/ban	PASS	`fh_dont_index != 1 AND ml_spam_score = 0`	ml_spam_score=0
Canonical	PASS	`meta_canonical IS NULL OR = '' OR = src_unparsed`	Not set

Page Details

Property	Value
URL	https://joe-antognini.github.io/machine-learning/steins-paradox
Last Crawled	2026-03-18 10:20:46 (1 month ago)
First Indexed	2021-01-02 23:58:41 (5 years ago)
HTTP Status Code	200

Content

Property	Value
Meta Title	Understanding Stein's paradox – Joe Antognini
Meta Description	An intuitive explanation of the James-Stein estimator.
Meta Canonical	null

Boilerpipe Text

The paradox Stein’s paradox is among the most surprising results in statistics. The basic idea is easily stated, but it is difficult to understand how it could possibly be true. The premise is this: suppose that I have a Gaussian distribution with a variance of unity and some mean which I don’t tell you. I then draw a single sample from this distribution, give it to you, and ask you to guess the mean. What do you do? Well, you don’t have a lot of information to go on here, so you just guess that the mean is the number I gave you. This is a good guess! (We will make the notion of a “good guess” a little more precise later on.) No big surprise there. Now we play again, but this time, my distribution is a two-dimensional Gaussian. The covariance is the identity matrix (so this is equivalent to sampling from two independent one-dimensional Gaussians). But again I have not told you the mean (which is now a two-dimensional vector). Once more I draw a single sample from the distribution, hand it over to you, and ask you to guess the mean. You simply guess that the mean is the sample I have given you. Once more you have guessed well! Now we do the same thing in three dimensions. I draw a single sample, hand it over to you, and ask you to guess the mean. Just as before, you guess that the mean is the sample I gave you. But this is no longer a good guess! Stein’s paradox is that if we play this game in three dimensions or more, a better guess is to say that the mean is this: μ ^ = ReLU ( 1 − ( D − 2 ) / | x | 2 ) x , where D is the dimensionality of the Gaussian, and x is the sample drawn from the distribution. This is the so-called “James-Stein estimator.” Who would have thought! What is going on here? What makes a guess good? 
Before we go on, we should clarify exactly what we mean by a “good guess.” We are trying to do what is called “parameter estimation” in statistics — based on a sample from a distribution, we want to infer some underlying parameter (or parameters) of the distribution. (In this case the parameter we are interested in is the mean.) In order to quantify how good or bad our estimate is we choose a function called a “loss function.” There is some freedom in choosing a loss function, but the mean squared error is a common choice and has a lot of valuable properties. Stein’s paradox assumes that we are using the mean squared error. So if we guess that the mean is μ ^ and the true value of the mean is μ , then the loss is L = | μ ^ − μ | 2 . Now, of course, we need some rule to go from the sample x to the estimate μ ^ . This rule is just a function of some kind, say, f ( x ) . This function has the special name of an “estimator.” We can choose whatever function we want here. Our original guess was to just use f ( x ) = x . But another choice here is to say f ( x ) = x + 7 , or f ( x ) = sin ⁡ ( x ) / x 71 , or even just f ( x ) = 31 . It doesn’t take much imagination to see that there are an infinite number of possible choices. But presumably some of these choices are better than others. How do we know which ones are good? Statisticians use the concept of risk for this purpose. Risk is simply the expected value of your loss function. One thing that can be a little confusing is that the risk is a function of both your choice of estimator and the true value of the parameter itself. So in the original game where you’re guessing the mean of a one-dimensional Gaussian, the risk will be a function of whatever rule you decide to use and the actual, unknown value of the mean. The fact that the risk is a function of the true value of the parameter makes things a little tricky. 
If you’re trying to decide between two estimators, you might find that one estimator works better for certain values that the parameter can take, and the other works better for others. As a dumb example, let’s go back to guessing the mean of a one-dimensional Gaussian. Our original estimator was μ ^ = x . But another, perfectly valid, estimator is μ ^ = 7 . In other words we ignore the sample entirely and say that the mean is 7 no matter what. Generally this doesn’t seem like a smart thing to do. But if the mean turns out to actually be pretty close to 7, on average this will be the better guess! Specifically, the risk of our initial estimator is R x = E [ ( x − μ ) 2 ] = ( 1 / √ ( 2 π ) ) ∫ ( x − μ ) 2 exp ( − ( x − μ ) 2 / 2 ) d x = 1. And the risk of our second, dumb estimator is R dumb = E [ ( 7 − μ ) 2 ] = ( 1 / √ ( 2 π ) ) ∫ ( 7 − μ ) 2 exp ( − ( x − μ ) 2 / 2 ) d x = ( 7 − μ ) 2 . As long as the true mean, μ , happens to be between 6 and 8, the dumb estimator of just saying 7 actually has lower risk! Based on this example it might seem that we’re stuck. Since we don’t know the true value of the mean, we can’t generally say if one estimator is better than another. And indeed this is often the case. But there are certain situations where this is not true. If we have two estimators and one of them has a lower risk for any possible value the parameter can take , we can say that one is definitively better than the other. In statistical parlance, we say that the worse estimator is “inadmissible.” In more precise terms, Stein’s paradox states that in three dimensions or more, the naive estimator (just guessing that the mean is x ) is inadmissible because the risk of the James-Stein estimator is lower for any possible mean I could choose. What is the James-Stein estimator doing? Before we can understand why the James-Stein estimator gives you a better guess than the naive estimator x , we should understand what exactly it’s doing. 
The idea is that we take the naive guess x and then we scale it towards the origin by some amount. The factor by which we scale it is: ReLU ( 1 − ( D − 2 ) / | x | 2 ) . The ReLU function simply takes the maximum of its argument and zero, so this scale factor will be either positive or zero. Let’s suppose that it’s positive. In this case we scale it towards the origin more if the magnitude of the sample is smaller and less if it is larger. In the limit of | x | → ∞ we don’t change it at all and our guess reduces to the naive estimator x . In the other limit, if | x | is very small, then the ReLU function will kick in and just set the scale factor to zero. Hence anytime we get a sufficiently small sample we throw it out and just guess that the mean is zero instead. And all else being equal we will shrink our estimate more in higher dimensional spaces than in lower dimensional spaces. So that is what the James-Stein estimator is doing. Why does it work? Before we can answer that we need to take a quick detour. Samples in high dimensional spaces High dimensional spaces are counterintuitive. One of the counterintuitive properties of high dimensional distributions is this: a sample from a symmetric high dimensional distribution is highly likely to be further from the origin than the mean. Specifically, for an isotropic D -dimensional Gaussian, the difference between the average distance to a sample and the distance to the mean grows as ∼ √D . It’s a little strange to put it this way, but isn’t so surprising with a little bit of thought. Even in two dimensions we can see that this is the case just by drawing it: The shaded area of the circle is less than half the area, so we are less likely to choose a sample closer to the origin than the mean. What perhaps makes this counterintuitive is that in two dimensions a fairly large fraction of the circle is still shaded. But as the dimensionality increases, this fraction decreases exponentially. 
Once the dimensionality is even moderately large we are highly unlikely to sample a point in this shaded region. One caveat here is that this effect decreases the larger the mean is. You can imagine that as we move the circle further away from the origin, the shaded fraction gets closer and closer to ½. So long as the mean is sufficiently large, the probability of sampling a point closer to the origin than the mean can be close to ½ even in high dimensional spaces; it just requires a very large mean. We can start to see here the connection to the James-Stein estimator, which also gets very close to the naive estimator x as the mean (and hence | x | ) gets very large. What we are doing by shrinking the estimate towards the origin is correcting for the tendency of the typical sample to be slightly further away from the origin than the mean. This correction allows us to reduce the overall risk of the estimate. [ 1 ] How arbitrary is the origin, really? Stein’s paradox is particularly strange because there are actually two counterintuitive things going on: The origin is arbitrary, so why does moving your estimate towards the origin help? Why does this not work in one or two dimensions? Let’s take a look at the first of these. A central principle in physics is that of relativity — coordinate systems are arbitrary, so the laws of physics must be valid in all of them. Surely this is also true in statistics as well. We can choose the origin to be wherever we like, so it cannot contain any information. But this sensible assertion is false. Statistics is not physics. If we truly had no information about the mean, what value would the sample have? If it could really be anything then presumably its value would be exceedingly large. After all, there are a lot of numbers, almost all of them are too big to write down, and the mean could be absolutely any of them. But if we pick a sample and find that its distance from the origin is 3.72 there must be something fishy going on. 
Clearly we have, in fact, managed to embed some information in our choice of coordinate system. The only reason we have gotten out sensible values in our sample is because we have some prior as to where the mean is expected to be. In fact, in the limit of no information | x | will be infinitely large and the James-Stein estimator will reduce to the naive estimator, x . So it was our clever choice of origin that smuggled information into our estimator and allowed us to do better than the naive estimate. It’s important to distinguish between the direction of the origin and the distance between the origin and the mean. The direction of the origin is unimportant. In fact, we could shrink our estimate in any direction and we would still get the benefits of the James-Stein estimator. The important thing is the distance between the origin and the mean — this is what has encoded some prior information that we can exploit to make a better prediction. The bias-variance tradeoff The way the James-Stein estimator works can be understood by looking at the bias-variance tradeoff. The bias-variance tradeoff states that the risk of an estimator can be decomposed into two components: a constant “bias” term, which reflects how far off the average value of the estimate is from the correct value; and an unbiased “variance” term, which accounts for the randomness of the sample. The naive estimator x is unbiased but has a high variance. One reason that Stein’s paradox seems unnatural is that we tend to confuse unbiased estimators with estimators that minimize risk. But in high dimensional spaces, samples from an isotropic Gaussian encompass a tremendous volume. Although our naive estimate is unbiased it has very high variance. What the James-Stein estimator does is scale the overall distribution towards the origin, thereby shrinking the volume of the distribution (and hence its variance), at the cost of introducing a little bit of bias. 
Although the estimator is now biased its overall risk is lower.

Deriving the James-Stein estimator geometrically

At this point we have some idea as to why shrinking the estimate towards the origin could be helpful, but we haven’t yet figured out why the specific form of the James-Stein estimator seems to work. To do this we’ll follow an argument presented by Brown & Zhao (2012). Remember that the direction of the coordinate system is arbitrary. So let’s rotate into a new one where one axis points directly toward the mean, and the other D − 1 axes are pointed in arbitrary (orthogonal) directions. (Of course, you can’t do this because you don’t know where the true mean is. But I do, and I’ve decided to help you out.) In this coordinate system the mean is just (|μ|, 0, …, 0). Let’s write the sample in this coordinate system as (ζ_1, ζ_2, …, ζ_D).

The loss for our guess can now be broken into two orthogonal components: one component, (ζ_1 − |μ|)², which tells us how far off our estimate is from the true value when it’s projected onto the correct direction of the mean; and a second component, Σ_{i=2..D} ζ_i², which tells us how far off our estimate is from this correct direction. Let’s call this second component the “residual” component and define it as ρ ≡ √(Σ_{i=2..D} ζ_i²).

Our underlying distribution is isotropic, so rotating the coordinate system doesn’t change the fact that each coordinate ζ_i is a sample from an independent one-dimensional Gaussian with unit variance. And if i > 1 then the mean of this distribution is zero as well. ρ therefore follows a χ distribution with D − 1 degrees of freedom.

Let’s suppose that we’ve gone and chosen a sample from our distribution and by good fortune ζ_1 happens to be exactly equal to |μ|. In general the rest of the ζ_i in the sample won’t be exactly zero and so ρ will be positive.
If we plot ζ_1 on the x-axis and ρ on the y-axis, the situation will look like this: The sample from our distribution is the point at the top right of the triangle. If we just use this sample as our estimate of the mean then the loss of this estimate will simply be the squared distance between the point and the bottom right corner of the triangle, or ρ².

How can we transform the sample?

We only have two points in this problem: the origin, and the sample x. By symmetry this means that the only way by which we can make an estimate is to choose a point somewhere on the line between the origin and x. [2] In other words, you don’t actually know what direction any of the ζ_i are in so you can’t simply move your guess down the right side of the triangle to reduce ρ. The only direction you are allowed to move your sample is along the hypotenuse of this triangle.

How should we transform the sample?

The beauty of the James-Stein estimator is that even though we are constrained to move the sample along the hypotenuse, we can nevertheless reduce the distance between our guess and the true mean by shrinking it until the direction between our guess and the mean is perpendicular to the hypotenuse. Some simple geometry reveals that this new point is located at:

(1 − ρ²/|x|²) x

This is looking like the beginnings of the James-Stein estimator! What exactly is ρ? Unfortunately ρ is a random variable, but let’s represent it by some central point of its distribution. Now ρ follows a χ distribution with D − 1 degrees of freedom, and the mode of this distribution is √max(D − 2, 0), so at the mode ρ² = max(D − 2, 0). So if we simply represent this distribution by its mode, our estimator becomes

(1 − (D − 2)/|x|²) x

for D ≥ 3 and is the naive estimator x for D < 3, which is exactly the James-Stein estimator without the ReLU.

To be clear, this is a very hand-wavey argument.
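The “simple geometry” step can be spelled out as an orthogonal projection. The following is only a sketch under the same assumption made in the text, namely that ζ_1 happens to equal |μ|; the scalar c is a symbol introduced here for illustration and does not appear in the original argument.

```latex
% Rotated coordinates, assuming \zeta_1 = |\mu|:
%   x \cdot \mu = \zeta_1 |\mu| = |\mu|^2, \qquad |x|^2 = |\mu|^2 + \rho^2 .
% The point on the line through the origin and x that is closest to \mu is c\,x, with
\begin{align*}
c &= \frac{x \cdot \mu}{|x|^2}
   = \frac{|\mu|^2}{|x|^2}
   = \frac{|x|^2 - \rho^2}{|x|^2}
   = 1 - \frac{\rho^2}{|x|^2}, \\
|c\,x - \mu|^2 &= |\mu|^2 - \frac{(x \cdot \mu)^2}{|x|^2}
   = \frac{|\mu|^2 \rho^2}{|x|^2}
   \le \rho^2 .
\end{align*}
```

The last inequality holds because |μ|² ≤ |x|² in this configuration, so the projected point is never further from the mean than the raw sample, whose loss is ρ².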
Representing the entire distribution by a single point is not particularly sophisticated, and there is no special reason to choose to use the mode instead of the mean or median except that it happens to give an answer that corresponds to the James-Stein estimator. [3]

Can we derive the James-Stein estimator rigorously?

Given how hand-wavey this argument was, you might wonder whether it’s possible to rigorously derive the James-Stein estimator by explicitly calculating the risk given the Gaussian and χ distributions, and then finding the function that minimizes the risk using the calculus of variations. However this approach will not work. The reason is that the James-Stein estimator does not actually minimize the risk! In fact, despite its fame, the James-Stein estimator is itself inadmissible. As we’ll see below, we can improve our estimator by wrapping it in a ReLU function. But the ReLU function introduces some difficulties, because there is a very general theorem which states that any admissible estimator must be a Bayes estimator for some prior, or the limit of Bayes estimators. [4] This ends up implying that the estimator must be a smooth function and the ReLU function is not smooth. So while the James-Stein estimator beats the naive estimator x, there exists some other estimator which beats it!

Recovering the ReLU

There is one last piece to tackle: the ReLU function. Where does that come from? The ReLU has no effect on our estimate as long as |x|² ≥ D − 2, since that is exactly when the scale factor 1 − (D − 2)/|x|² is non-negative. This will generally be the case if |μ| is large. But if |μ| is small, then some of the points sampled will have |x|² < D − 2. If we were to just use the same scaling we derived above, we would reflect our estimate through the origin and make it negative! This estimate has a worse loss than an estimate at the origin. The reason for this is that once we have shrunk the sample all the way down to the origin we have already traded away all the variance for bias.
In other words, we started with an estimate that had no bias and high variance, and ended up with an estimate that had high bias and no variance. But if we were to keep going past the origin, we’d continue to increase the bias but would start to increase the variance as well! This is strictly worse than keeping our estimate at the origin, so we are better off clamping the estimate at zero with a ReLU function.

Why does Stein’s paradox not hold in two dimensions?

Let’s now turn to the second counterintuitive property of Stein’s paradox: what’s so special about three dimensions? We saw a hint that three dimensions is special since the squared mode of ρ is 0 for one and two dimensions, but is D − 2 for higher numbers of dimensions. But as I pointed out, that came about from representing the entire distribution of ρ by its mode, and it’s not really obvious why we should pick the mode in particular rather than, say, the mean or median.

Thinking back to our geometric argument, it is not hard to see why the James-Stein estimator doesn’t help in one dimension: we arrived at the James-Stein estimator by separating the sample into two components, one along the direction to the true mean, and the other as a residual component perpendicular to the first. But in one dimension there is no residual component! By shrinking your estimate towards the origin you introduce some bias, but this is not counterbalanced by any reduction in variance.

But what about the two dimensional case? Here again we unfortunately must wave our hands. In the two dimensional case we do, in fact, reduce the variance by shrinking our estimate towards the origin, just not by enough to offset the bias we introduce. While the orthogonal component, ρ, is not identically zero as it was in the one-dimensional case, it is strongly bunched up close to zero in the two dimensional case.
The idea of the James-Stein estimator is to note that a certain proportion of a sample’s magnitude is due to a component orthogonal to the mean and to shrink the estimate to reduce that orthogonal component somewhat. But if the magnitude of the orthogonal component is too small, this won’t always work. There are some values of the mean for which the James-Stein estimator is better, but others where it is worse. It is not the better estimator no matter what. A digression on random walks As an aside, there is a deep correspondence found by Lawrence Brown between random walks and the admissibility of the naive estimator, x . Brown showed that the naive estimator is admissible if and only if a random walk returns to the origin an infinite number of times. Random walks in one or two dimensions do this, but random walks in three or more dimensions do not. At a high level this is because the random walk drifts away from the origin at a rate linear in the dimensionality of the space, but the volume subtended by the origin at a fixed distance decays exponentially with dimension. In one or two dimensions the volume subtended by the origin is large enough that you are guaranteed to eventually make your way back to it. But in higher dimensions, the volume subtended is smaller, so that as time progresses you are less and less likely to make your way back and the probability approaches zero even in the limit of an infinite number of steps. While rigorously proving this correspondence is non-trivial, both problems have at their core a comparison between the distance between a point and the origin and the volume of the unit sphere in the space. There be dragons far from the origin So that, at a high level, is why the James-Stein estimator works. We’ve been focused in this post on the simplest form of Stein’s paradox, which is a bit of a toy problem. But the artificiality of the problem shouldn’t befog us as to the significance of these ideas. 
The artificiality eases the analysis, but in fact the concepts that Stein’s paradox illustrates run much deeper through statistics and machine learning. As we discussed earlier on, while we usually don’t think that there is anything special about our choice of origin, we do nevertheless use it to smuggle in some priors, even if those priors are weak and we perhaps do so unwittingly. This is also true of machine learning models, maybe even more so. For most machine learning models the origin really is special. For example, at the origin most garden-variety neural networks will exhibit particularly simple behavior — they will output all zeros independent of the input. [ 5 ] Any movement away from this special point will introduce complexity into the functional behavior of the model. Neural networks are, of course, non-linear, so it is not always true that points further from the origin in parameter space are more complicated, but it is generally true. Most ML practitioners appreciate that techniques like L2 regularization have the effect of making models simpler, and therefore less likely to overfit. But Stein’s paradox dramatically illustrates the relationship between the risk of an estimate and the dimensionality of the space — in high dimensional spaces there is much more volume further away from the origin than closer to it. We are oftentimes better off introducing bias in order to reduce variance because in high dimensional spaces shrinking a small amount towards the origin reduces an enormous volume of parameter space. In other words, for a large machine learning model, there are vastly more ways to overfit than there are to underfit so we should bias our models towards underfitting because that is a simpler problem to solve than overfitting. (Or maybe to put it another way, underfitting is a small number of problems, whereas overfitting is a stupendous number of problems.) This phenomenon can manifest itself during neural network training. 
Suppose we have converged on a local minimum in the loss landscape after training with some variant of stochastic gradient descent. At convergence there will be equality between the stochastic component of SGD, for which the random walk is causing the parameters to drift away from the minimum, and the true gradient, which is pushing the parameters towards the minimum. [ 6 ] At this point the neural network is wandering on some high-dimensional ellipsoid around the minimum. But by the nature of high dimensional spaces, the neural network will always be further from the origin than the minimum, and therefore on the side of too much complexity. Thus this converged model will always slightly overfit. The further the model recedes from the origin, the more inexplicable its functional behavior can become. The moral of this story is that far from the origin there be dragons, and in a high dimensional space, moving just a little further away from the origin introduces lots and lots of dragons. The James-Stein estimator advises you to stay close to the origin to keep your risk down and the dragons at bay. Footnotes It’s important to note that this does not imply that our original estimate is biased. In fact the naive estimator is unbiased and shrinking it towards the origin introduces bias. It is the distance of the naive estimator from the origin that is biased high. But by correcting for this bias we happen to reduce the risk. ↩ While there’s nothing stopping us from choosing some other random direction, this adds no new information and so is effectively the same as choosing a different origin. ↩ The median and mean of the χ distribution are more complicated but are asymptotically equal to the mode in the limit of a large number of degrees of freedom. But the asymptotic limit does not help you prove the case for D = 3 . In fact Brown & Zhao (2012) chose to use the mean instead and derive a ( D − 1 ) / | x | 2 term instead of ( D − 2 ) / | x | 2 as we do from the mode. 
They then argue that the extra − 1 / | x | 2 is due to the variance in ζ 1 . I am not fully convinced of this argument; it seems to me that once one has represented the entire distribution by a central point one is already waving one’s hands. ↩ A Bayes estimator is determined by a particular choice of prior. A “limit of Bayes estimators” is found by taking the limit of the Bayes estimators found from a sequence of priors as the sequence tends to infinity. ↩ Of course the origin is not the only point that exhibits this behavior. I speculate that if we shrank to any point for which the model had functional simplicity we would observe the same benefits as shrinking towards the origin. ↩ Learning rate decay also mitigates this issue. ↩
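The central claim of the post can be checked numerically. The sketch below is not from the original article: it assumes unit variance and a single sample per guess as in the text, and the function names `james_stein` and `compare_risks` are made up for illustration. It estimates the risk (mean squared error) of the naive estimator and of the ReLU-clamped James-Stein estimator by Monte Carlo, using only Python’s standard library.

```python
import random


def james_stein(x):
    """ReLU-clamped James-Stein estimate of the mean from one unit-variance sample."""
    d = len(x)
    if d < 3:
        return list(x)  # for D < 3 the text falls back to the naive estimator
    sq_norm = sum(v * v for v in x)
    if sq_norm == 0.0:
        return [0.0] * d  # sample sits exactly at the origin; nothing to shrink
    scale = max(0.0, 1.0 - (d - 2) / sq_norm)  # ReLU(1 - (D - 2) / |x|^2)
    return [scale * v for v in x]


def compare_risks(mu, n_trials=20_000, seed=0):
    """Return Monte Carlo risks (naive, James-Stein) for a true mean vector mu."""
    rng = random.Random(seed)
    naive_total = 0.0
    js_total = 0.0
    for _ in range(n_trials):
        x = [rng.gauss(m, 1.0) for m in mu]
        est = james_stein(x)
        naive_total += sum((xi - m) ** 2 for xi, m in zip(x, mu))
        js_total += sum((ei - m) ** 2 for ei, m in zip(est, mu))
    return naive_total / n_trials, js_total / n_trials


if __name__ == "__main__":
    # Ten dimensions, with the true mean at increasing distances from the origin.
    for first_coord in (0.0, 3.0, 30.0):
        mu = [first_coord] + [0.0] * 9
        naive_risk, js_risk = compare_risks(mu)
        print(f"|mu| = {first_coord:5.1f}  naive = {naive_risk:.3f}  James-Stein = {js_risk:.3f}")
```

Under these assumptions the naive risk should stay near D, while the James-Stein risk should be lower, with the gap narrowing as the mean moves away from the origin.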

Markdown

# [Joe Antognini](https://joe-antognini.github.io/)

# Understanding Stein's paradox

January 2, 2021

## The paradox

Stein’s paradox is among the most surprising results in statistics. The basic idea is easily stated, but it is difficult to understand how it could possibly be true.

The premise is this: suppose that I have a Gaussian distribution with a variance of unity and some mean which I don’t tell you. I then draw a single sample from this distribution, give it to you, and ask you to guess the mean. What do you do? Well, you don’t have a lot of information to go on here, so you just guess that the mean is the number I gave you. This is a good guess! (We will make the notion of a “good guess” a little more precise later on.) No big surprise there.

Now we play again, but this time, my distribution is a *two*-dimensional Gaussian. The covariance is the identity matrix (so this is equivalent to sampling from two independent one-dimensional Gaussians). But again I have not told you the mean (which is now a two-dimensional vector). Once more I draw a single sample from the distribution, hand it over to you, and ask you to guess the mean. You simply guess that the mean is the sample I have given you. Once more you have guessed well!

Now we do the same thing in three dimensions. I draw a single sample, hand it over to you, and ask you to guess the mean. Just as before, you guess that the mean is the sample I gave you. But this is no longer a good guess!
Stein’s paradox is that if we play this game in three dimensions or more, a better guess is to say that the mean is this:

$$\hat{\mu} = \mathrm{ReLU}\left(1 - \frac{D - 2}{|x|^2}\right) x,$$

where $D$ is the dimensionality of the Gaussian, and $x$ is the sample drawn from the distribution. This is the so-called “James-Stein estimator.” Who would have thought! What is going on here?

## What makes a guess good?

Before we go on, we should clarify exactly what we mean by a “good guess.” We are trying to do what is called “parameter estimation” in statistics: based on a sample from a distribution, we want to infer some underlying parameter (or parameters) of the distribution. (In this case the parameter we are interested in is the mean.) In order to quantify how good or bad our estimate is we choose a function called a “loss function.” There is some freedom in choosing a loss function, but the mean squared error is a common choice and has a lot of valuable properties. Stein’s paradox assumes that we are using the mean squared error. So if we guess that the mean is $\hat{\mu}$ and the true value of the mean is $\mu$, then the loss is

$$L = |\hat{\mu} - \mu|^2.$$

Now, of course, we need some rule to go from the sample $x$ to the estimate $\hat{\mu}$. This rule is just a function of some kind, say, $f(x)$. This function has the special name of an “estimator.” We can choose whatever function we want here. Our original guess was to just use $f(x) = x$. But another choice here is to say $f(x) = x + 7$, or $f(x) = \sin(x) / x^{71}$, or even just $f(x) = 31$. It doesn’t take much imagination to see that there are an infinite number of possible choices. But presumably some of these choices are better than others. How do we know which ones are good?

Statisticians use the concept of *risk* for this purpose. Risk is simply the expected value of your loss function.
One thing that can be a little confusing is that the risk is a function of *both* your choice of estimator *and* the true value of the parameter itself. So in the original game where you’re guessing the mean of a one-dimensional Gaussian, the risk will be a function of whatever rule you decide to use and the actual, unknown value of the mean.

The fact that the risk is a function of the true value of the parameter makes things a little tricky. If you’re trying to decide between two estimators, you might find that one estimator works better for certain values that the parameter can take, and the other works better for others.

As a dumb example, let’s go back to guessing the mean of a one-dimensional Gaussian. Our original estimator was $\hat{\mu} = x$. But another, perfectly valid, estimator is $\hat{\mu} = 7$. In other words we ignore the sample entirely and say that the mean is 7 no matter what. Generally this doesn’t seem like a smart thing to do. But if the mean turns out to actually be pretty close to 7, on average this will be the better guess! Specifically, the risk of our initial estimator is

$$\begin{aligned}
R_x &= E\left[(x - \mu)^2\right] \\
&= \frac{1}{\sqrt{2\pi}} \int (x - \mu)^2 e^{-(x - \mu)^2 / 2} \, dx \\
&= 1.
\end{aligned}$$

And the risk of our second, dumb estimator is

$$\begin{aligned}
R_{\mathrm{dumb}} &= E\left[(7 - \mu)^2\right] \\
&= \frac{1}{\sqrt{2\pi}} \int (7 - \mu)^2 e^{-(x - \mu)^2 / 2} \, dx \\
&= (7 - \mu)^2.
\end{aligned}$$

As long as the true mean, $\mu$, happens to be between 6 and 8, the dumb estimator of just saying 7 actually has lower risk!

Based on this example it might seem that we’re stuck. Since we don’t know the true value of the mean, we can’t generally say if one estimator is better than another. And indeed this is often the case. But there are certain situations where this is not true. If we have two estimators and one of them has a lower risk *for any possible value the parameter can take*, we can say that one is definitively better than the other.
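The comparison between the naive estimator and the constant guess of 7 is easy to reproduce numerically. The following is a minimal sketch rather than anything from the original post; the helper name `empirical_risk` and the trial count are assumptions chosen for illustration. It approximates the risk of each estimator by averaging the squared error over many simulated samples.

```python
import random


def empirical_risk(estimator, mu, n_trials=100_000, seed=0):
    """Monte Carlo estimate of E[(estimator(x) - mu)^2] for x drawn from N(mu, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        x = rng.gauss(mu, 1.0)
        total += (estimator(x) - mu) ** 2
    return total / n_trials


def naive(x):
    """Guess that the mean is the sample itself."""
    return x


def dumb(x):
    """Ignore the sample and always guess 7."""
    return 7.0


if __name__ == "__main__":
    for mu in (6.5, 7.0, 9.0):
        print(f"mu = {mu}: naive = {empirical_risk(naive, mu):.3f}, "
              f"dumb = {empirical_risk(dumb, mu):.3f}")
```

The naive risk should hover around 1 regardless of the mean, while the constant estimator’s risk is $(7 - \mu)^2$, so which one wins depends on where the true mean happens to be.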
In statistical parlance, we say that the worse estimator is “inadmissible.” In more precise terms, Stein’s paradox states that in three dimensions or more, the naive estimator (just guessing that the mean is $x$) is inadmissible because the risk of the James-Stein estimator is lower for any possible mean I could choose.

## What is the James-Stein estimator doing?

Before we can understand *why* the James-Stein estimator gives you a better guess than the naive estimator $x$, we should understand *what* exactly it’s doing. The idea is that we take the naive guess $x$ and then we scale it towards the origin by some amount. The factor by which we scale it is:

$$\mathrm{ReLU}\left(1 - \frac{D - 2}{|x|^2}\right).$$

The ReLU function simply takes the maximum of its argument and zero, so this scale factor will be either positive or zero. Let’s suppose that it’s positive. In this case we scale it towards the origin more if the magnitude of the sample is smaller and less if it is larger. In the limit of $|x| \to \infty$ we don’t change it at all and our guess reduces to the naive estimator $x$. In the other limit, if $|x|$ is very small, then the ReLU function will kick in and just set the scale factor to zero. Hence anytime we get a sufficiently small sample we throw it out and just guess that the mean is zero instead. And all else being equal we will shrink our estimate more in higher dimensional spaces than in lower dimensional spaces.

So that is what the James-Stein estimator is doing. Why does it work? Before we can answer that we need to take a quick detour.

## Samples in high dimensional spaces

High dimensional spaces are counterintuitive. One of the counterintuitive properties of high dimensional distributions is this: a sample from a symmetric high dimensional distribution is highly likely to be further from the origin than the mean.
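This property can be seen in a quick simulation. The sketch below is an illustration under stated assumptions rather than part of the original post: it assumes an isotropic unit-variance Gaussian, and the function name `fraction_farther_than_mean` and the choice of a mean at distance 2 from the origin are made up for the example. It counts how often a sample lands further from the origin than the mean does.

```python
import random


def fraction_farther_than_mean(dim, mean_norm, n_trials=5_000, seed=0):
    """Fraction of samples from N(mu, I) that lie further from the origin than mu.

    By rotational symmetry only |mu| matters, so mu is placed on the first axis.
    """
    rng = random.Random(seed)
    farther = 0
    for _trial in range(n_trials):
        # Squared distance of the sample from the origin, built up axis by axis.
        sq_dist = (mean_norm + rng.gauss(0.0, 1.0)) ** 2
        for _axis in range(dim - 1):
            sq_dist += rng.gauss(0.0, 1.0) ** 2
        if sq_dist > mean_norm ** 2:
            farther += 1
    return farther / n_trials


if __name__ == "__main__":
    for dim in (1, 2, 10, 100):
        print(dim, fraction_farther_than_mean(dim, mean_norm=2.0))
```

In one dimension the fraction should sit near one half, and it should climb towards one as the number of dimensions grows.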
Specifically, for an isotropic $D$-dimensional Gaussian, the difference between the average distance to a sample and the distance to the mean grows as $\sim \sqrt{D}$. It’s a little strange to put it this way, but isn’t so surprising with a little bit of thought. Even in two dimensions we can see that this is the case just by drawing it:

![](https://joe-antognini.github.io/assets/posts/steins-paradox/circle-origin.png)

The shaded area of the circle is less than half the area, so we are less likely to choose a sample closer to the origin than the mean. What perhaps makes this counterintuitive is that in two dimensions a fairly large fraction of the circle is still shaded. But as the dimensionality increases, this fraction decreases exponentially. Once the dimensionality is even moderately large we are highly unlikely to sample a point in this shaded region.

One caveat here is that this effect decreases the larger the mean is. You can imagine that as we move the circle further away from the origin, the shaded fraction gets closer and closer to ½. So long as the mean is sufficiently large, the probability of sampling a point closer to the origin than the mean can be close to ½ even in high dimensional spaces; it just requires a very large mean. We can start to see here the connection to the James-Stein estimator, which also gets very close to the naive estimator $x$ as the mean (and hence $|x|$) gets very large.

What we are doing by shrinking the estimate towards the origin is correcting for the tendency of the typical sample to be slightly further away from the origin than the mean. This correction allows us to reduce the overall risk of the estimate. \[[1](https://joe-antognini.github.io/machine-learning/steins-paradox#fn:6)\]

## How arbitrary is the origin, really?

Stein’s paradox is particularly strange because there are actually two counterintuitive things going on: 1. The origin is arbitrary, so why does moving your estimate towards the origin help? 2.
Why does this not work in one or two dimensions?

Let’s take a look at the first of these. A central principle in physics is that of relativity: coordinate systems are arbitrary, so the laws of physics must be valid in all of them. Surely this is true in statistics as well. We can choose the origin to be wherever we like, so it cannot contain any information.

But this sensible assertion is false. Statistics is not physics. If we truly had no information about the mean, what value would the sample have? If it could really be *anything* then presumably its value would be exceedingly large. After all, there are a lot of numbers, almost all of them are too big to write down, and the mean could be absolutely *any* of them. But if we pick a sample and find that its distance from the origin is 3.72 there must be something fishy going on. Clearly we have, in fact, managed to embed *some* information in our choice of coordinate system. The only reason we have gotten out sensible values in our sample is because we have some prior as to where the mean is expected to be. In fact, in the limit of no information $|x|$ will be infinitely large and the James-Stein estimator will reduce to the naive estimator, $x$. So it was our clever choice of origin that smuggled information into our estimator and allowed us to do better than the naive estimate.

It’s important to distinguish between the *direction* of the origin and the *distance* between the origin and the mean. The direction of the origin is unimportant. In fact, we could shrink our estimate in any direction and we would still get the benefits of the James-Stein estimator. The important thing is the distance between the origin and the mean; this is what has encoded some prior information that we can exploit to make a better prediction.

## The bias-variance tradeoff

The way the James-Stein estimator works can be understood by looking at the bias-variance tradeoff.
The bias-variance tradeoff states that the risk of an estimator can be decomposed into two components: a constant “bias” term, which reflects how far off the average value of the estimate is from the correct value; and an unbiased “variance” term, which accounts for the randomness of the sample. The naive estimator $x$ is unbiased but has a high variance.

One reason that Stein’s paradox seems unnatural is that we tend to confuse unbiased estimators with estimators that minimize risk. But in high dimensional spaces, samples from an isotropic Gaussian encompass a tremendous volume. Although our naive estimate is unbiased it has very high variance. What the James-Stein estimator does is scale the overall distribution towards the origin, thereby shrinking the volume of the distribution (and hence its variance), at the cost of introducing a little bit of bias. Although the estimator is now biased its overall risk is lower.

## Deriving the James-Stein estimator geometrically

At this point we have some idea as to why shrinking the estimate towards the origin could be helpful, but we haven’t yet figured out why the specific form of the James-Stein estimator seems to work. To do this we’ll follow an argument presented by [Brown & Zhao (2012)](https://projecteuclid.org/download/pdfview_1/euclid.ss/1331729980).

Remember that the direction of the coordinate system is arbitrary. So let’s rotate into a new one where one axis points directly toward the mean, and the other $D - 1$ axes are pointed in arbitrary (orthogonal) directions. (Of course, *you* can’t do this because you don’t know where the true mean is. But *I* do, and I’ve decided to help you out.) In this coordinate system the mean is just $(|\mu|, 0, \ldots, 0)$. Let’s write the sample in this coordinate system as $(\zeta_1, \zeta_2, \ldots, \zeta_D)$.
The loss for our guess can now be broken into two orthogonal components: one component, $(\zeta_1 - |\mu|)^2$, which tells us how far off our estimate is from the true value when it’s projected onto the correct direction of the mean; and a second component, $\sum_{i=2}^{D} \zeta_i^2$, which tells us how far off our estimate is from this correct direction. Let’s call this second component the “residual” component and define it as

$$\rho \equiv \sqrt{\sum_{i=2}^{D} \zeta_i^2}.$$

Our underlying distribution is isotropic, so rotating the coordinate system doesn’t change the fact that each coordinate $\zeta_i$ is a sample from an independent one-dimensional Gaussian with unit variance. And if $i > 1$ then the mean of this distribution is zero as well. $\rho$ therefore follows a [$\chi$ distribution](https://en.wikipedia.org/wiki/Chi_distribution) with $D - 1$ degrees of freedom.

Let’s suppose that we’ve gone and chosen a sample from our distribution and by good fortune $\zeta_1$ happens to be exactly equal to $|\mu|$. In general the rest of the $\zeta_i$ in the sample won’t be exactly zero and so $\rho$ will be positive. If we plot $\zeta_1$ on the $x$-axis and $\rho$ on the $y$-axis, the situation will look like this:

![](https://joe-antognini.github.io/assets/posts/steins-paradox/SteinsTriangle-1.png)

The sample from our distribution is the point at the top right of the triangle. If we just use this sample as our estimate of the mean then the loss of this estimate will simply be the squared distance between the point and the bottom right corner of the triangle, or $\rho^2$.

### How *can* we transform the sample?

We only have two points in this problem: the origin, and the sample $x$. By symmetry this means that the only way by which we can make an estimate is to choose a point somewhere on the line between the origin and $x$.
\[[2](https://joe-antognini.github.io/machine-learning/steins-paradox#fn:5)\] In other words, you don’t actually know what direction any of the $\zeta_i$ are in so you can’t simply move your guess down the right side of the triangle to reduce $\rho$. The only direction you are allowed to move your sample is along the hypotenuse of this triangle.

### How *should* we transform the sample?

The beauty of the James-Stein estimator is that even though we are constrained to move the sample along the hypotenuse, we can nevertheless reduce the distance between our guess and the true mean by shrinking it until the direction between our guess and the mean is perpendicular to the hypotenuse.

![](https://joe-antognini.github.io/assets/posts/steins-paradox/SteinsTriangle-2.png)

Some simple geometry reveals that this new point is located at:

$$\left(1 - \frac{\rho^2}{|x|^2}\right) x$$

This is looking like the beginnings of the James-Stein estimator! What exactly is $\rho$? Unfortunately $\rho$ is a random variable, but let’s represent it by some central point of its distribution. Now $\rho$ follows a $\chi$ distribution with $D - 1$ degrees of freedom, and the mode of this distribution is $\sqrt{\max(D - 2, 0)}$, so at the mode $\rho^2 = \max(D - 2, 0)$. So if we simply represent this distribution by its mode, our estimator becomes

$$\left(1 - \frac{D - 2}{|x|^2}\right) x,$$

for $D \ge 3$ and is the naive estimator $x$ for $D < 3$, which is exactly the James-Stein estimator without the ReLU.

To be clear, this is a very hand-wavey argument. Representing the entire distribution by a single point is not particularly sophisticated, and there is no special reason to choose to use the mode instead of the mean or median except that it happens to give an answer that corresponds to the James-Stein estimator. \[[3](https://joe-antognini.github.io/machine-learning/steins-paradox#fn:2)\]

### Can we derive the James-Stein estimator rigorously?
Given how hand-wavey this argument was, you might wonder whether it’s possible to rigorously derive the James-Stein estimator by explicitly calculating the risk given the Gaussian and $\chi$ distributions, and then finding the function that minimizes the risk using the calculus of variations.

However this approach will not work. The reason is that the James-Stein estimator does not actually minimize the risk! In fact, despite its fame, the James-Stein estimator is itself inadmissible. As we’ll see below, we can improve our estimator by wrapping it in a ReLU function. But the ReLU function introduces some difficulties, because there is a very general theorem which states that any admissible estimator must be a Bayes estimator for some prior, or the limit of Bayes estimators. \[[4](https://joe-antognini.github.io/machine-learning/steins-paradox#fn:1)\] This ends up implying that the estimator must be a smooth function and the ReLU function is not smooth. So while the James-Stein estimator beats the naive estimator $x$, there exists some other estimator which beats it!

### Recovering the ReLU

There is one last piece to tackle: the ReLU function. Where does that come from? The ReLU has no effect on our estimate as long as $|x|^2 \ge D - 2$, since that is exactly when the scale factor $1 - (D - 2)/|x|^2$ is non-negative. This will generally be the case if $|\mu|$ is large. But if $|\mu|$ is small, then some of the points sampled will have $|x|^2 < D - 2$. If we were to just use the same scaling we derived above, we would reflect our estimate through the origin and make it negative! This estimate has a worse loss than an estimate at the origin.

The reason for this is that once we have shrunk the sample all the way down to the origin we have already traded away all the variance for bias. In other words, we started with an estimate that had no bias and high variance, and ended up with an estimate that had high bias and no variance.
But if we were to keep going past the origin, we’d continue to increase the bias but would start to increase the variance as well! This is strictly worse than keeping our estimate at the origin, so we are better off clamping the estimate at zero with a ReLU function.

## Why does Stein’s paradox not hold in two dimensions?

Let’s now turn to the second counterintuitive property of Stein’s paradox: what’s so special about three dimensions? We saw a hint that three dimensions is special since the value we substituted for $\rho^2$ is 0 for one and two dimensions, but is $D - 2$ for higher numbers of dimensions. But as I pointed out, that came about from representing the entire distribution by its mode, and it’s not really obvious why we should pick the mode in particular rather than, say, the mean or median.

Thinking back to our geometric argument, it is not hard to see why the James-Stein estimator doesn’t help in one dimension: we arrived at the James-Stein estimator by separating the sample into two components, one along the direction to the true mean, and the other as a residual component perpendicular to the first. But in one dimension there is no residual component! By shrinking your estimate towards the origin you introduce some bias, but this is not counterbalanced by any reduction in variance.

But what about the two dimensional case? Here again we unfortunately must wave our hands. In the two dimensional case we do, in fact, reduce the variance by shrinking our estimate towards the origin, just not by enough to offset the bias we introduce. While the orthogonal component, $\rho$, is not identically zero as it was in the one-dimensional case, it is strongly bunched up close to zero in the two dimensional case. The idea of the James-Stein estimator is to note that a certain proportion of a sample’s magnitude is due to a component orthogonal to the mean and to shrink the estimate to reduce that orthogonal component somewhat.
But if the magnitude of the orthogonal component is too small, this won’t always work. There are some values of the mean for which the James-Stein estimator is better, but others where it is worse. It is not the better estimator no matter what.

### A digression on random walks

As an aside, there is a [deep correspondence](http://stat.wharton.upenn.edu/~lbrown/Papers/1971b%20Admissible%20estimators,%20recurrent%20diffusions,%20and%20insoluble%20boundary%20value%20problems.pdf) found by Lawrence Brown between random walks and the admissibility of the naive estimator, $x$. Brown showed that the naive estimator is admissible if and only if a random walk returns to the origin an infinite number of times. Random walks in one or two dimensions do this, but random walks in three or more dimensions do not. At a high level this is because the random walk drifts away from the origin at a rate linear in the dimensionality of the space, but the volume subtended by the origin at a fixed distance decays exponentially with dimension. In one or two dimensions the volume subtended by the origin is large enough that you are guaranteed to eventually make your way back to it. But in higher dimensions, the volume subtended is smaller, so that as time progresses you are less and less likely to make your way back and the probability approaches zero even in the limit of an infinite number of steps. While rigorously proving this correspondence is non-trivial, both problems have at their core a comparison between the distance between a point and the origin and the volume of the unit sphere in the space.

## There be dragons far from the origin

So that, at a high level, is why the James-Stein estimator works. We’ve been focused in this post on the simplest form of Stein’s paradox, which is a bit of a toy problem. But the artificiality of the problem shouldn’t befog us as to the significance of these ideas.
The artificiality eases the analysis, but in fact the concepts that Stein’s paradox illustrates run much deeper through statistics and machine learning. As we discussed earlier on, while we usually don’t think that there is anything special about our choice of origin, we do nevertheless use it to smuggle in some priors, even if those priors are weak and we perhaps do so unwittingly. This is also true of machine learning models, maybe even more so. For most machine learning models the origin really *is* special. For example, at the origin most garden-variety neural networks will exhibit particularly simple behavior — they will output all zeros independent of the input. \[[5](https://joe-antognini.github.io/machine-learning/steins-paradox#fn:3)\] Any movement away from this special point will introduce complexity into the functional behavior of the model. Neural networks are, of course, non-linear, so it is not always true that points further from the origin in parameter space are more complicated, but it is *generally* true. Most ML practitioners appreciate that techniques like L2 regularization have the effect of making models simpler, and therefore less likely to overfit. But Stein’s paradox dramatically illustrates the relationship between the risk of an estimate and the dimensionality of the space — in high dimensional spaces there is *much* more volume further away from the origin than closer to it. We are oftentimes better off introducing bias in order to reduce variance because in high dimensional spaces shrinking a small amount towards the origin reduces an *enormous* volume of parameter space. In other words, for a large machine learning model, there are *vastly* more ways to overfit than there are to underfit so we should bias our models towards underfitting because that is a simpler problem to solve than overfitting. (Or maybe to put it another way, underfitting is a small number of problems, whereas overfitting is a stupendous number of problems.) 
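The claim that a small shrink towards the origin removes an enormous share of the volume can be checked with a little arithmetic. The sketch below relies only on the fact that the volume of a $D$-dimensional ball scales as the $D$-th power of its radius; the function name `inner_volume_fraction` and the 1% shrink are illustrative choices, not something taken from the post.

```python
def inner_volume_fraction(shrink, dim):
    """Fraction of a dim-dimensional ball's volume within (1 - shrink) of its radius.

    Volume scales as radius**dim, so shrinking the radius by a factor
    (1 - shrink) keeps a fraction (1 - shrink)**dim of the volume.
    """
    return (1.0 - shrink) ** dim


# Illustrative dimensions only: how much volume survives a 1% shrink.
for dim in (2, 10, 100, 1000):
    frac = inner_volume_fraction(0.01, dim)
    print(f"D = {dim:>4}: fraction of volume remaining = {frac:.6f}")
```

In two dimensions almost all of the volume survives the shrink, while in a thousand dimensions almost none of it does, which is the sense in which nearly all of the volume sits far from the origin.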
This phenomenon can manifest itself during neural network training. Suppose we have converged on a local minimum in the loss landscape after training with some variant of stochastic gradient descent. At convergence there will be equality between the stochastic component of SGD, for which the random walk is causing the parameters to drift away from the minimum, and the true gradient, which is pushing the parameters towards the minimum. \[[6](https://joe-antognini.github.io/machine-learning/steins-paradox#fn:4)\] At this point the neural network is wandering on some high-dimensional ellipsoid around the minimum. But by the nature of high dimensional spaces, the neural network will *always* be further from the origin than the minimum, and therefore on the side of too much complexity. Thus this converged model will always slightly overfit.

The further the model recedes from the origin, the more inexplicable its functional behavior can become. The moral of this story is that far from the origin there be dragons, and in a high dimensional space, moving just a little further away from the origin introduces lots and lots of dragons. The James-Stein estimator advises you to stay close to the origin to keep your risk down and the dragons at bay.

***

## Footnotes

1. It’s important to note that this does not imply that our original estimate is biased. In fact the naive estimator is unbiased and shrinking it towards the origin introduces bias. It is the distance of the naive estimator from the origin that is biased high. But by correcting for *this* bias we happen to reduce the risk. [↩](https://joe-antognini.github.io/machine-learning/steins-paradox#fnref:6)
2. While there’s nothing stopping us from choosing some other random direction, this adds no new information and so is effectively the same as choosing a different origin. [↩](https://joe-antognini.github.io/machine-learning/steins-paradox#fnref:5)
3. The median and mean of the $\chi$ distribution are more complicated but are asymptotically equal to the mode in the limit of a large number of degrees of freedom. But the asymptotic limit does not help you prove the case for $D = 3$. In fact Brown & Zhao (2012) chose to use the mean instead and derive a $(D - 1) / |x|^2$ term instead of $(D - 2) / |x|^2$ as we do from the mode. They then argue that the extra $-1 / |x|^2$ is due to the variance in $\zeta_1$. I am not fully convinced of this argument; it seems to me that once one has represented the entire distribution by a central point one is already waving one’s hands. [↩](https://joe-antognini.github.io/machine-learning/steins-paradox#fnref:2)
4. A Bayes estimator is determined by a particular choice of prior. A “limit of Bayes estimators” is found by taking the limit of the Bayes estimators found from a sequence of priors as the sequence tends to infinity. [↩](https://joe-antognini.github.io/machine-learning/steins-paradox#fnref:1)
5. Of course the origin is not the only point that exhibits this behavior. I speculate that if we shrank to any point for which the model had functional simplicity we would observe the same benefits as shrinking towards the origin. [↩](https://joe-antognini.github.io/machine-learning/steins-paradox#fnref:3)
6. Learning rate decay also mitigates this issue. [↩](https://joe-antognini.github.io/machine-learning/steins-paradox#fnref:4)

***

- [Contact Me](mailto:joe.antognini@gmail.com)
- [@joe\_antognini](http://twitter.com/joe_antognini)

© 2026 Joe Antognini. Powered by [Jekyll](http://jekyllrb.com/) using the [Balzac](http://jekyll.gtat.me/about) theme.

Readable Markdown

## The paradox

Stein’s paradox is among the most surprising results in statistics. The basic idea is easily stated, but it is difficult to understand how it could possibly be true. The premise is this: suppose that I have a Gaussian distribution with a variance of unity and some mean which I don’t tell you. I then draw a single sample from this distribution, give it to you, and ask you to guess the mean. What do you do? Well, you don’t have a lot of information to go on here, so you just guess that the mean is the number I gave you. This is a good guess! (We will make the notion of a “good guess” a little more precise later on.) No big surprise there.

Now we play again, but this time, my distribution is a *two*-dimensional Gaussian. The covariance is the identity matrix (so this is equivalent to sampling from two independent one-dimensional Gaussians). But again I have not told you the mean (which is now a two-dimensional vector). Once more I draw a single sample from the distribution, hand it over to you, and ask you to guess the mean. You simply guess that the mean is the sample I have given you. Once more you have guessed well!

Now we do the same thing in three dimensions. I draw a single sample, hand it over to you, and ask you to guess the mean. Just as before, you guess that the mean is the sample I gave you. But this is no longer a good guess! Stein’s paradox is that if we play this game in three dimensions or more, a better guess is to say that the mean is this:

$$\hat{\mu} = \mathrm{ReLU}\left( 1 - \frac{D - 2}{|x|^2} \right) x,$$

where $D$ is the dimensionality of the Gaussian, and $x$ is the sample drawn from the distribution. This is the so-called “James-Stein estimator.” Who would have thought! What is going on here?

## What makes a guess good?
Before we go on, we should clarify exactly what we mean by a “good guess.” We are trying to do what is called “parameter estimation” in statistics: based on a sample from a distribution, we want to infer some underlying parameter (or parameters) of the distribution. (In this case the parameter we are interested in is the mean.) In order to quantify how good or bad our estimate is we choose a function called a “loss function.” There is some freedom in choosing a loss function, but the mean squared error is a common choice and has a lot of valuable properties. Stein’s paradox assumes that we are using the mean squared error. So if we guess that the mean is $\hat{\mu}$ and the true value of the mean is $\mu$, then the loss is

$$L = |\hat{\mu} - \mu|^2.$$

Now, of course, we need some rule to go from the sample $x$ to the estimate $\hat{\mu}$. This rule is just a function of some kind, say, $f(x)$. This function has the special name of an “estimator.” We can choose whatever function we want here. Our original guess was to just use $f(x) = x$. But another choice here is to say $f(x) = x + 7$, or $f(x) = \sin(x) / x^{71}$, or even just $f(x) = 31$. It doesn’t take much imagination to see that there are an infinite number of possible choices. But presumably some of these choices are better than others. How do we know which ones are good?

Statisticians use the concept of *risk* for this purpose. Risk is simply the expected value of your loss function. One thing that can be a little confusing is that the risk is a function of *both* your choice of estimator *and* the true value of the parameter itself. So in the original game where you’re guessing the mean of a one-dimensional Gaussian, the risk will be a function of whatever rule you decide to use and the actual, unknown value of the mean. The fact that the risk is a function of the true value of the parameter makes things a little tricky.
If you’re trying to decide between two estimators, you might find that one estimator works better for certain values that the parameter can take, and the other works better for others.

As a dumb example, let’s go back to guessing the mean of a one-dimensional Gaussian. Our original estimator was $\hat{\mu} = x$. But another, perfectly valid, estimator is $\hat{\mu} = 7$. In other words we ignore the sample entirely and say that the mean is 7 no matter what. Generally this doesn’t seem like a smart thing to do. But if the mean turns out to actually be pretty close to 7, on average this will be the better guess! Specifically, the risk of our initial estimator is

$$R_x = E\left[ (x - \mu)^2 \right] = \frac{1}{\sqrt{2\pi}} \int (x - \mu)^2 e^{-(x - \mu)^2 / 2} \, dx = 1.$$

And the risk of our second, dumb estimator is

$$R_{\mathrm{dumb}} = E\left[ (7 - \mu)^2 \right] = \frac{1}{\sqrt{2\pi}} \int (7 - \mu)^2 e^{-(x - \mu)^2 / 2} \, dx = (7 - \mu)^2.$$

As long as the true mean, $\mu$, happens to be between 6 and 8, the dumb estimator of just saying 7 actually has lower risk!

Based on this example it might seem that we’re stuck. Since we don’t know the true value of the mean, we can’t generally say if one estimator is better than another. And indeed this is often the case. But there are certain situations where this is not true. If we have two estimators and one of them has a lower risk *for any possible value the parameter can take*, we can say that one is definitively better than the other. In statistical parlance, we say that the worse estimator is “inadmissible.” In more precise terms, Stein’s paradox states that in three dimensions or more, the naive estimator (just guessing that the mean is $x$) is inadmissible because the risk of the James-Stein estimator is lower for any possible mean I could choose.

## What is the James-Stein estimator doing?

Before we can understand *why* the James-Stein estimator gives you a better guess than the naive estimator $x$, we should understand *what* exactly it’s doing.
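As a concrete reference point for this section, here is a minimal sketch of the ReLU-wrapped estimator together with a Monte Carlo estimate of the risk of both estimators. The function names (`james_stein`, `naive`, `estimate_risk`), the particular mean vector, the trial count, and the seed are all illustrative assumptions rather than anything from the post, and for $D < 3$ the sketch simply returns the sample unchanged.

```python
import random


def james_stein(x):
    """ReLU-wrapped James-Stein estimate for a single sample x (list of floats)."""
    d = len(x)
    norm_sq = sum(v * v for v in x)
    if norm_sq == 0.0:
        return [0.0] * d
    shrink = max(d - 2, 0)  # no shrinkage below three dimensions
    scale = max(0.0, 1.0 - shrink / norm_sq)  # ReLU(1 - (D - 2) / |x|^2)
    return [scale * v for v in x]


def naive(x):
    """Naive estimator: guess that the mean is the sample itself."""
    return list(x)


def estimate_risk(estimator, mu, trials=20000, seed=0):
    """Monte Carlo estimate of E|estimator(x) - mu|^2 for x ~ N(mu, I)."""
    rng = random.Random(seed)  # same seed => both estimators see the same samples
    total = 0.0
    for _ in range(trials):
        x = [m + rng.gauss(0.0, 1.0) for m in mu]
        guess = estimator(x)
        total += sum((g - m) ** 2 for g, m in zip(guess, mu))
    return total / trials


mu = [1.0, -2.0, 0.5, 0.0, 3.0]  # arbitrary illustrative mean, D = 5
risk_naive = estimate_risk(naive, mu)
risk_js = estimate_risk(james_stein, mu)
print(f"naive risk: {risk_naive:.3f}, James-Stein risk: {risk_js:.3f}")
```

The risk of the naive estimator should come out close to $D$, with the James-Stein risk somewhat below it. Changing `mu` is a quick way to see that the gap shrinks as the mean moves away from the origin but does not change sign.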
The idea is that we take the naive guess $x$ and then we scale it towards the origin by some amount. The factor by which we scale it is:

$$\mathrm{ReLU}\left( 1 - \frac{D - 2}{|x|^2} \right).$$

The ReLU function simply takes the maximum of its argument and zero, so this scale factor will be either positive or zero. Let’s suppose that it’s positive. In this case we scale it towards the origin more if the magnitude of the sample is smaller and less if it is larger. In the limit of $|x| \to \infty$ we don’t change it at all and our guess reduces to the naive estimator $x$. In the other limit, if $|x|$ is very small, then the ReLU function will kick in and just set the scale factor to zero. Hence anytime we get a sufficiently small sample we throw it out and just guess that the mean is zero instead. And all else being equal we will shrink our estimate more in higher dimensional spaces than in lower dimensional spaces.

So that is what the James-Stein estimator is doing. Why does it work? Before we can answer that we need to take a quick detour.

## Samples in high dimensional spaces

High dimensional spaces are counterintuitive. One of the counterintuitive properties of high dimensional distributions is this: a sample from a symmetric high dimensional distribution is highly likely to be further from the origin than the mean. Specifically, for an isotropic $D$-dimensional Gaussian, the difference between the average distance to a sample and the distance to the mean grows as $\sim \sqrt{D}$. It’s a little strange to put it this way, but isn’t so surprising with a little bit of thought. Even in two dimensions we can see that this is the case just by drawing it:

![](https://joe-antognini.github.io/assets/posts/steins-paradox/circle-origin.png)

The shaded area of the circle is less than half the area, so we are less likely to choose a sample closer to the origin than the mean. What perhaps makes this counterintuitive is that in two dimensions a fairly large fraction of the circle is still shaded.
But as the dimensionality increases, this fraction decreases exponentially. Once the dimensionality is even moderately large we are highly unlikely to sample a point in this shaded region.

One caveat here is that this effect decreases the larger the mean is. You can imagine that as we move the circle further away from the origin, the shaded fraction gets closer and closer to ½. So long as the mean is sufficiently large, the probability of sampling a point closer to the origin than the mean can be close to ½ even in high dimensional spaces; it just requires a very large mean.

We can start to see here the connection to the James-Stein estimator, which also gets very close to the naive estimator $x$ as the mean (and hence $|x|$) gets very large. What we are doing by shrinking the estimate towards the origin is correcting for the tendency of the typical sample to be slightly further away from the origin than the mean. This correction allows us to reduce the overall risk of the estimate. \[[1](https://joe-antognini.github.io/machine-learning/steins-paradox#fn:6)\]

## How arbitrary is the origin, really?

Stein’s paradox is particularly strange because there are actually two counterintuitive things going on:

1. The origin is arbitrary, so why does moving your estimate towards the origin help?
2. Why does this not work in one or two dimensions?

Let’s take a look at the first of these. A central principle in physics is that of relativity: coordinate systems are arbitrary, so the laws of physics must be valid in all of them. Surely this is true in statistics as well. We can choose the origin to be wherever we like, so it cannot contain any information.

But this sensible assertion is false. Statistics is not physics. If we truly had no information about the mean, what value would the sample have? If it could really be *anything* then presumably its value would be exceedingly large.
After all, there are a lot of numbers, almost all of them are too big to write down, and the mean could be absolutely *any* of them. But if we pick a sample and find that its distance from the origin is 3.72 there must be something fishy going on. Clearly we have, in fact, managed to embed *some* information in our choice of coordinate system. The only reason we have gotten sensible values out of our sample is that we have some prior as to where the mean is expected to be. In fact, in the limit of no information $|x|$ will be infinitely large and the James-Stein estimator will reduce to the naive estimator, $x$. So it was our clever choice of origin that smuggled information into our estimator and allowed us to do better than the naive estimate.

It’s important to distinguish between the *direction* of the origin and the *distance* between the origin and the mean. The direction of the origin is unimportant. In fact, we could shrink our estimate in any direction and we would still get the benefits of the James-Stein estimator. The important thing is the distance between the origin and the mean: this is what has encoded some prior information that we can exploit to make a better prediction.

## The bias-variance tradeoff

The way the James-Stein estimator works can be understood by looking at the bias-variance tradeoff. The bias-variance tradeoff states that the risk of an estimator can be decomposed into two components: a constant “bias” term, which reflects how far off the average value of the estimate is from the correct value; and an unbiased “variance” term, which accounts for the randomness of the sample. The naive estimator $x$ is unbiased but has a high variance.

One reason that Stein’s paradox seems unnatural is that we tend to confuse unbiased estimators with estimators that minimize risk. But in high dimensional spaces, samples from an isotropic Gaussian encompass a tremendous volume. Although our naive estimate is unbiased it has very high variance.
What the James-Stein estimator does is scale the overall distribution towards the origin, thereby shrinking the volume of the distribution (and hence its variance), at the cost of introducing a little bit of bias. Although the estimator is now biased its overall risk is lower.

## Deriving the James-Stein estimator geometrically

At this point we have some idea as to why shrinking the estimate towards the origin could be helpful, but we haven’t yet figured out why the specific form of the James-Stein estimator seems to work. To do this we’ll follow an argument presented by [Brown & Zhao (2012)](https://projecteuclid.org/download/pdfview_1/euclid.ss/1331729980).

Remember that the direction of the coordinate system is arbitrary. So let’s rotate into a new one where one axis points directly toward the mean, and the other $D - 1$ axes are pointed in arbitrary (orthogonal) directions. (Of course, *you* can’t do this because you don’t know where the true mean is. But *I* do, and I’ve decided to help you out.) In this coordinate system the mean is just $(|\mu|, 0, \ldots, 0)$. Let’s write the sample in this coordinate system as $(\zeta_1, \zeta_2, \ldots, \zeta_D)$.

The loss for our guess can now be broken into two orthogonal components: one component, $(\zeta_1 - |\mu|)^2$, which tells us how far off our estimate is from the true value when it’s projected onto the correct direction of the mean; and a second component, $\sum_{i=2}^{D} \zeta_i^2$, which tells us how far off our estimate is from this correct direction. Let’s call this second component the “residual” component and define its magnitude as

$$\rho \equiv \sqrt{\sum_{i=2}^{D} \zeta_i^2}.$$

Our underlying distribution is isotropic, so rotating the coordinate system doesn’t change the fact that each coordinate $\zeta_i$ is a sample from an independent one-dimensional Gaussian with unit variance. And if $i > 1$ then the mean of this distribution is zero as well. $\rho$ therefore follows a [χ distribution](https://en.wikipedia.org/wiki/Chi_distribution) with $D - 1$ degrees of freedom.
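The decomposition above is easy to check numerically without building the rotated coordinate system explicitly: $\zeta_1$ is the projection of the sample onto the unit vector along the mean, and $\rho^2$ is whatever is left of $|x|^2$. The following is a minimal sketch under that reading; the helper name `split_sample`, the mean vector, the trial count, and the seed are hypothetical choices made for illustration, and the mean is assumed to be nonzero.

```python
import math
import random


def split_sample(x, mu):
    """Return (zeta_1, rho): the component of x along mu and the residual length.

    Assumes mu is not the zero vector, so that its direction is defined.
    """
    mu_norm = math.sqrt(sum(m * m for m in mu))
    zeta_1 = sum(a * m for a, m in zip(x, mu)) / mu_norm
    rho_sq = sum(a * a for a in x) - zeta_1 ** 2
    return zeta_1, math.sqrt(max(rho_sq, 0.0))  # guard against tiny negative round-off


rng = random.Random(1)
mu = [2.0, 0.0, -1.0, 1.0]  # illustrative mean, D = 4
trials = 20000
sum_zeta = 0.0
sum_rho_sq = 0.0
for _ in range(trials):
    x = [m + rng.gauss(0.0, 1.0) for m in mu]
    zeta_1, rho = split_sample(x, mu)
    sum_zeta += zeta_1
    sum_rho_sq += rho * rho

mean_zeta = sum_zeta / trials      # expected to be close to |mu|
mean_rho_sq = sum_rho_sq / trials  # expected to be close to D - 1
print(f"mean zeta_1 = {mean_zeta:.3f}, mean rho^2 = {mean_rho_sq:.3f}")
```

The average of $\rho^2$ landing near $D - 1$ matches the fact that $\rho^2$ is a sum of $D - 1$ squared standard Gaussians, while $\zeta_1$ averages out to $|\mu|$.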
Let’s suppose that we’ve gone and chosen a sample from our distribution and by good fortune $\zeta_1$ happens to be exactly equal to $|\mu|$. In general the rest of the $\zeta_i$ in the sample won’t be exactly zero and so $\rho$ will be positive. If we plot $\zeta_1$ on the x-axis and $\rho$ on the y-axis, the situation will look like this:

![](https://joe-antognini.github.io/assets/posts/steins-paradox/SteinsTriangle-1.png)

The sample from our distribution is the point at the top right of the triangle. If we just use this sample as our estimate of the mean then the loss of this estimate will simply be the squared distance between the point and the bottom right corner of the triangle, or $\rho^2$.

### How *can* we transform the sample?

We only have two points in this problem: the origin, and the sample $x$. By symmetry this means that the only way by which we can make an estimate is to choose a point somewhere on the line between the origin and $x$. \[[2](https://joe-antognini.github.io/machine-learning/steins-paradox#fn:5)\] In other words, you don’t actually know what direction any of the $\zeta_i$ are in, so you can’t simply move your guess down the right side of the triangle to reduce $\rho$. The only direction you are allowed to move your sample is along the hypotenuse of this triangle.

### How *should* we transform the sample?

The beauty of the James-Stein estimator is that even though we are constrained to move the sample along the hypotenuse, we can nevertheless reduce the distance between our guess and the true mean by shrinking it until the line from our guess to the mean is perpendicular to the hypotenuse.

![](https://joe-antognini.github.io/assets/posts/steins-paradox/SteinsTriangle-2.png)

Some simple geometry reveals that this new point is located at:

$$\left( 1 - \frac{\rho^2}{|x|^2} \right) x$$

This is looking like the beginnings of the James-Stein estimator! What exactly is $\rho$? Unfortunately $\rho$ is a random variable, but let’s represent it by some central point of its distribution.
Now $\rho$ follows a $\chi$ distribution with $D - 1$ degrees of freedom, and the mode of this distribution is $\sqrt{\max(D - 2, 0)}$. So if we simply represent $\rho$ by its mode, replacing $\rho^2$ with $\max(D - 2, 0)$, our estimator becomes

$$\left( 1 - \frac{D - 2}{|x|^2} \right) x$$

for $D \geq 3$ and is the naive estimator $x$ for $D < 3$, which is exactly the James-Stein estimator without the ReLU.

To be clear, this is a very hand-wavy argument. Representing the entire distribution by a single point is not particularly sophisticated, and there is no special reason to choose to use the mode instead of the mean or median except that it happens to give an answer that corresponds to the James-Stein estimator. \[[3](https://joe-antognini.github.io/machine-learning/steins-paradox#fn:2)\]

### Can we derive the James-Stein estimator rigorously?

Given how hand-wavy this argument was, you might wonder whether it’s possible to rigorously derive the James-Stein estimator by explicitly calculating the risk given the Gaussian and $\chi$ distributions, and then finding the function that minimizes the risk using the calculus of variations. However this approach will not work. The reason is that the James-Stein estimator does not actually minimize the risk! In fact, despite its fame, the James-Stein estimator is itself inadmissible. As we’ll see below, we can improve our estimator by wrapping it in a ReLU function. But the ReLU function introduces some difficulties, because there is a very general theorem which states that any admissible estimator must be a Bayes estimator for some prior, or the limit of Bayes estimators. \[[4](https://joe-antognini.github.io/machine-learning/steins-paradox#fn:1)\] This ends up implying that the estimator must be a smooth function and the ReLU function is not smooth. So the ReLU-wrapped estimator is inadmissible too: while it beats the naive estimator $x$, there exists some other estimator which beats it!

### Recovering the ReLU

There is one last piece to tackle: the ReLU function. Where does that come from?
The ReLU has no effect on our estimate as long as $|x| \geq \sqrt{D - 2}$. This will generally be the case if $|\mu|$ is large. But if $|\mu|$ is small, then some of the points sampled will have $|x| < \sqrt{D - 2}$. If we were to just use the same scaling we derived above, we would reflect our estimate through the origin, flipping its sign! This estimate has a worse loss than an estimate at the origin.

The reason for this is that once we have shrunk the sample all the way down to the origin we have already traded away all the variance for bias. In other words, we started with an estimate that had no bias and high variance, and ended up with an estimate that had high bias and no variance. But if we were to keep going past the origin, we’d continue to increase the bias but would start to increase the variance as well! This is strictly worse than keeping our estimate at the origin, so we are better off clamping the estimate at zero with a ReLU function.

## Why does Stein’s paradox not hold in two dimensions?

Let’s now turn to the second counterintuitive property of Stein’s paradox: what’s so special about three dimensions? We saw a hint that three dimensions is special since the value we substituted for $\rho^2$ is 0 for one and two dimensions, but is $D - 2$ for higher numbers of dimensions. But as I pointed out, that came about from representing the entire distribution by its mode, and it’s not really obvious why we should pick the mode in particular rather than, say, the mean or median.

Thinking back to our geometric argument, it is not hard to see why the James-Stein estimator doesn’t help in one dimension: we arrived at the James-Stein estimator by separating the sample into two components, one along the direction to the true mean, and the other as a residual component perpendicular to the first. But in one dimension there is no residual component!
By shrinking your estimate towards the origin you introduce some bias, but this is not counterbalanced by any reduction in variance.

But what about the two dimensional case? Here again we unfortunately must wave our hands. In the two dimensional case we do, in fact, reduce the variance by shrinking our estimate towards the origin, just not by enough to offset the bias we introduce. While the orthogonal component, $\rho$, is not identically zero as it was in the one-dimensional case, it is strongly bunched up close to zero in the two dimensional case. The idea of the James-Stein estimator is to note that a certain proportion of a sample’s magnitude is due to a component orthogonal to the mean and to shrink the estimate to reduce that orthogonal component somewhat. But if the magnitude of the orthogonal component is too small, this won’t always work. There are some values of the mean for which the James-Stein estimator is better, but others where it is worse. It is not the better estimator no matter what.

### A digression on random walks

As an aside, there is a [deep correspondence](http://stat.wharton.upenn.edu/~lbrown/Papers/1971b%20Admissible%20estimators,%20recurrent%20diffusions,%20and%20insoluble%20boundary%20value%20problems.pdf) found by Lawrence Brown between random walks and the admissibility of the naive estimator, $x$. Brown showed that the naive estimator is admissible if and only if a random walk returns to the origin an infinite number of times. Random walks in one or two dimensions do this, but random walks in three or more dimensions do not. At a high level this is because the random walk drifts away from the origin at a rate linear in the dimensionality of the space, but the volume subtended by the origin at a fixed distance decays exponentially with dimension. In one or two dimensions the volume subtended by the origin is large enough that you are guaranteed to eventually make your way back to it.
But in higher dimensions, the volume subtended is smaller, so that as time progresses you are less and less likely to make your way back and the probability approaches zero even in the limit of an infinite number of steps. While rigorously proving this correspondence is non-trivial, both problems have at their core a comparison between the distance between a point and the origin and the volume of the unit sphere in the space.

## There be dragons far from the origin

So that, at a high level, is why the James-Stein estimator works. We’ve been focused in this post on the simplest form of Stein’s paradox, which is a bit of a toy problem. But the artificiality of the problem shouldn’t befog us as to the significance of these ideas. The artificiality eases the analysis, but in fact the concepts that Stein’s paradox illustrates run much deeper through statistics and machine learning.

As we discussed earlier on, while we usually don’t think that there is anything special about our choice of origin, we do nevertheless use it to smuggle in some priors, even if those priors are weak and we perhaps do so unwittingly. This is also true of machine learning models, maybe even more so. For most machine learning models the origin really *is* special. For example, at the origin most garden-variety neural networks will exhibit particularly simple behavior: they will output all zeros independent of the input. \[[5](https://joe-antognini.github.io/machine-learning/steins-paradox#fn:3)\] Any movement away from this special point will introduce complexity into the functional behavior of the model. Neural networks are, of course, non-linear, so it is not always true that points further from the origin in parameter space are more complicated, but it is *generally* true.

Most ML practitioners appreciate that techniques like L2 regularization have the effect of making models simpler, and therefore less likely to overfit.
But Stein’s paradox dramatically illustrates the relationship between the risk of an estimate and the dimensionality of the space: in high-dimensional spaces there is *much* more volume further away from the origin than closer to it. We are oftentimes better off introducing bias in order to reduce variance because in high-dimensional spaces shrinking a small amount towards the origin excludes an *enormous* volume of parameter space. In other words, for a large machine learning model, there are *vastly* more ways to overfit than there are to underfit, so we should bias our models towards underfitting because that is a simpler problem to solve than overfitting. (Or maybe to put it another way, underfitting is a small number of problems, whereas overfitting is a stupendous number of problems.)

This phenomenon can manifest itself during neural network training. Suppose we have converged on a local minimum in the loss landscape after training with some variant of stochastic gradient descent. At convergence there will be a balance between the stochastic component of SGD, whose random walk causes the parameters to drift away from the minimum, and the true gradient, which pushes the parameters back towards the minimum. \[[6](https://joe-antognini.github.io/machine-learning/steins-paradox#fn:4)\] At this point the neural network is wandering on some high-dimensional ellipsoid around the minimum. But by the nature of high-dimensional spaces, the neural network will almost *always* be further from the origin than the minimum, and therefore on the side of too much complexity. Thus this converged model will tend to slightly overfit. The further the model recedes from the origin, the more inexplicable its functional behavior can become.

The moral of this story is that far from the origin there be dragons, and in a high-dimensional space, moving just a little further away from the origin introduces lots and lots of dragons.
The James-Stein estimator advises you to stay close to the origin to keep your risk down and the dragons at bay.

***

## Footnotes

1. It’s important to note that this does not imply that our original estimate is biased. In fact the naive estimator is unbiased and shrinking it towards the origin introduces bias. It is the distance of the naive estimator from the origin that is biased high. But by correcting for *this* bias we happen to reduce the risk. [↩](https://joe-antognini.github.io/machine-learning/steins-paradox#fnref:6)
2. While there’s nothing stopping us from choosing some other random direction, this adds no new information and so is effectively the same as choosing a different origin. [↩](https://joe-antognini.github.io/machine-learning/steins-paradox#fnref:5)
3. The median and mean of the χ distribution are more complicated but are asymptotically equal to the mode in the limit of a large number of degrees of freedom. But the asymptotic limit does not help you prove the case for $D = 3$. In fact Brown & Zhao (2012) chose to use the mean instead and derive a $(D - 1) / |x|^2$ term instead of $(D - 2) / |x|^2$ as we do from the mode. They then argue that the extra $-1 / |x|^2$ is due to the variance in $\zeta_1$. I am not fully convinced of this argument; it seems to me that once one has represented the entire distribution by a central point one is already waving one’s hands. [↩](https://joe-antognini.github.io/machine-learning/steins-paradox#fnref:2)
4. A Bayes estimator is determined by a particular choice of prior. A “limit of Bayes estimators” is found by taking the limit of the Bayes estimators found from a sequence of priors as the sequence tends to infinity. [↩](https://joe-antognini.github.io/machine-learning/steins-paradox#fnref:1)
5. Of course the origin is not the only point that exhibits this behavior.
I speculate that if we shrank to any point for which the model had functional simplicity we would observe the same benefits as shrinking towards the origin. [↩](https://joe-antognini.github.io/machine-learning/steins-paradox#fnref:3)
6. Learning rate decay also mitigates this issue. [↩](https://joe-antognini.github.io/machine-learning/steins-paradox#fnref:4)

***
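
The claim in the closing section, that a model jittering around a minimum tends to sit further from the origin than the minimum itself, can be checked numerically. What follows is only a rough sketch under simplifying assumptions that are not in the post: the noise is isotropic Gaussian rather than an ellipsoid shaped by the loss landscape, and the function name `fraction_farther_from_origin` and its parameters `radius` and `noise_scale` are hypothetical, chosen purely for illustration.

```python
import random


def fraction_farther_from_origin(dim, radius=1.0, noise_scale=0.1,
                                 trials=1000, seed=0):
    """Estimate P(|m + eps| > |m|) for isotropic Gaussian noise eps.

    m is a hypothetical "minimum" placed at distance `radius` from the
    origin along the first axis. eps has standard deviation `noise_scale`
    in each of the `dim` coordinates (an assumption made for simplicity).
    """
    rng = random.Random(seed)
    farther = 0
    for _ in range(trials):
        # Component along m: the minimum's own coordinate plus noise.
        first = radius + rng.gauss(0.0, noise_scale)
        sq_norm = first * first
        # Components orthogonal to m: pure noise, which can only add length.
        for _ in range(dim - 1):
            g = rng.gauss(0.0, noise_scale)
            sq_norm += g * g
        if sq_norm > radius * radius:
            farther += 1
    return farther / trials


if __name__ == "__main__":
    for dim in (1, 10, 100, 1000):
        frac = fraction_farther_from_origin(dim)
        print(f"D = {dim:4d}: fraction farther from origin = {frac:.3f}")
```

In one dimension the perturbed point lands further out about half the time, since the noise is as likely to point inward as outward. As the dimension grows, the orthogonal components dominate and the fraction approaches one, which is the sense in which the wandering parameters end up on the far side of the minimum.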

ML Classification

ML Categories

/Science		89.7%
/Science/Mathematics		89.4%
/Science/Mathematics/Statistics		77.8%

Raw JSON

{
    "/Science": 897,
    "/Science/Mathematics": 894,
    "/Science/Mathematics/Statistics": 778
}

ML Page Types

/Article		99.8%
/Article/Tutorial_or_Guide		89.1%

Raw JSON

{
    "/Article": 998,
    "/Article/Tutorial_or_Guide": 891
}

ML Intent Types

Informational

99.9%

Raw JSON

{
    "Informational": 999
}

Content Metadata

Language

Author

null

Publish Time

not set

Original Publish Time

2021-01-02 23:58:41 (5 years ago)

Republished

Word Count (Total)

5,260

Word Count (Content)

4,817

Links

External Links

Internal Links

Technical SEO

Meta Nofollow

Meta Noarchive

JS Rendered

Yes

Redirect Target

null

Performance

Download Time (ms)

TTFB (ms)

Download Size (bytes)

12,706

Shard

143 (laksa)

Root Hash

2566890010099092343

Unparsed URL

io,github!joe-antognini,/machine-learning/steins-paradox s443