🕷️ Crawler Inspector

URL Lookup

Direct Parameter Lookup

Raw Queries and Responses

1. Shard Calculation

Query:
Response:
Calculated Shard: 129 (from laksa094)

2. Crawled Status Check

Query:
Response:

3. Robots.txt Check

Query:
Response:

4. Spam/Ban Check

Query:
Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄 INDEXABLE
✅ CRAWLED (25 days ago)
🤖 ROBOTS ALLOWED

Page Info Filters

Filter | Status | Condition | Details
HTTP status | PASS | download_http_code = 200 | HTTP 200
Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.9 months ago
History drop | PASS | isNull(history_drop_reason) | No drop reason
Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0
Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set

Page Details

Property: Value
URL: https://link.springer.com/article/10.1007/s42081-023-00209-y
Last Crawled: 2026-03-24 04:11:39 (25 days ago)
First Indexed: 2023-09-20 15:47:31 (2 years ago)
HTTP Status Code: 200
Meta Title: Machine learning and the James–Stein estimator | Japanese Journal of Statistics and Data Science | Springer Nature Link
Meta Description: It is now 62 years since the publication of James and Stein’s seminal article on the estimation of a multivariate normal mean vector. The paper made
Meta Canonical: null
Boilerpipe Text
1 Introduction

By and large, the statistics world is one of heuristics, approximations, and asymptotics. The James–Stein estimator arrived in that world in 1961 on a note of startling specificity: unseen parameters $\mu_1, \mu_2, \ldots, \mu_n$ produce independent observations

$$x_i \overset{\text{ind}}{\sim} N(\mu_i, 1), \qquad i = 1, \ldots, n, \tag{1}$$

$n \ge 3$. The James–Stein rule in its simplest form proposed estimating the $\mu_i$ by

$$\hat\mu_i^{\mathrm{JS}} = \left(1 - \frac{n-2}{S}\right) x_i \qquad \left(S = \sum_{i=1}^{n} x_i^2\right). \tag{2}$$

Formula (2) looked implausible: the estimate of $\mu_i$ depended on the other observations $x_j$, $j \ne i$ (through $S$), as well as $x_i$, despite the independence assumption. Nevertheless, James and Stein showed that Rule (2) always beat the obvious maximum likelihood estimates

$$\hat\mu_i^{\mathrm{ML}} = \bar{x}_i \qquad (i = 1, \ldots, n) \tag{3}$$

in terms of total expected squared error

$$E\left\{\sum_{i=1}^{n} (\hat\mu_i - \mu_i)^2\right\}. \tag{4}$$

That “always” was the shocking part: two centuries of statistical theory, ANOVA, regression, multivariate analysis, etc., depended on maximum likelihood estimation. Did everything have to be rethought?

One path forward involved Bayesian thinking. If we assumed that the $\mu_i$ themselves came from a normal distribution,

$$\mu_i \overset{\text{ind}}{\sim} N(0, A) \qquad \text{for } i = 1, \ldots, n, \tag{5}$$

with variance $A \ge 0$, the Bayes estimates would be

$$\hat\mu_i^{\mathrm{Bayes}} = B x_i \qquad \left(B = A/(A+1)\right). \tag{6}$$

We don’t know $A$ or $B$ but

$$\hat{B} = 1 - (n-2)/S \tag{7}$$

is $B$’s unbiased estimate: we can rewrite (2) as

$$\hat\mu_i^{\mathrm{JS}} = \hat{B} x_i, \tag{8}$$

which at least looks more plausible. In the language introduced by Robbins (1956), formula (8) is an empirical Bayes estimator, another shocking post-war statistical innovation. Carl Morris and I wrote a series of papers in the 1970s exploring Bayesian roots of the James–Stein estimator (Efron and Morris, 1973). Something is lost in the empirical Bayes formulation, namely the frequentist “always” of expected squared error minimization, but a lot is gained in flexibility and scope, as discussed in Sect. 2.

Fig. 1. Prostate data: 6033 $x$ values; mean 0.003, sd = 1.135; curve is proportional to a $N(0, 1)$ density.

Figure 1 illustrates an example of simultaneous estimation pursued in Sect. 2.1 of Efron (2010). A microarray study has compared expression levels between prostate cancer patients and control subjects for $n = 6033$ genes. For each gene, a statistic $x_i$ has been calculated (essentially a “$z$-value”),

$$x_i \sim N(\mu_i, 1), \qquad i = 1, \ldots, n, \tag{9}$$

where $\mu_i$ measures the difference between cancer and control group levels. The solid curve in Fig. 1 is a $N(0, 1)$ density scaled to have the same area as the histogram of the 6033 $x$ values.

A bad result from the researchers’ point of view would be a perfect fit of curve to histogram, which would imply all the genes have $\mu_i = 0$, the “null” value of no difference between cancer patients and controls. That’s not what happened: the histogram has mildly heavy tails in both directions.

The researchers were hoping to find genes with large values of $\lVert \mu_i \rVert$ (ones that might be a clue to prostate cancer etiology), as suggested by the heavy tails. How encouraged should they be? Not very, according to the James–Stein rule. The 6033 $x_i$ values have mean 0.003, which I’ll take to be zero, and empirical variance

$$\hat\sigma^2 = 1.289. \tag{10}$$

With the mean taken as zero, $S \approx (n-1)\hat\sigma^2$, so the James–Stein estimate (2) is

$$\hat\mu_i^{\mathrm{JS}} = \left(1 - \frac{n-2}{n-1} \cdot \frac{1}{\hat\sigma^2}\right) x_i = 0.224 \cdot x_i, \tag{11}$$

so even $x_i = 5$ yields an estimate barely exceeding 1. Section 2 suggests a more optimistic analysis.

2 Tweedie’s formula

The impressive precision of the James–Stein theorem came at a cost in generality. Efforts to extend the theorem, say to Poisson rather than normal observations, or to measures of loss other than total squared error, gave encouraging asymptotic results but not the James–Stein kind of finite sample frequentist dominance. Better progress was possible on the empirical Bayes side of the street. Tweedie’s formula (Efron, 2011) has been particularly useful.

We wish to calculate Bayesian estimates

$$\mu_i^{\mathrm{Bayes}} = E\{\mu_i \mid x_i\}, \qquad i = 1, \ldots, n, \tag{12}$$

in the normal sampling model (1), starting from a given (possibly non-normal) prior $\pi(\mu)$, applying to all $n$ cases. Let $f(x)$ be the marginal density

$$f(x) = \int_{\mathcal{R}} \pi(\mu)\, \phi(x - \mu)\, d\mu, \tag{13}$$

with $\phi$ the standard $N(0, 1)$ density and $\mathcal{R}$ the range of $\mu$. (It isn’t necessary for $\pi(\cdot)$ to be a continuous distribution but it simplifies notation.) Tweedie’s formula provides an elegant statement for $\mu_i^{\mathrm{Bayes}}$, the posterior expectation of $\mu_i$ given $x_i$,

$$\mu_i^{\mathrm{Bayes}} = E\{\mu_i \mid x_i\} = x_i + l'(x_i) \qquad \text{with } l'(x_i) = \frac{d}{dx} \log\left(f(x_i)\right). \tag{14}$$

In the empirical Bayes situation (1), where the prior $\pi(\cdot)$ is unknown, we can use the observed data $x_1, \ldots, x_n$ to estimate the marginal density $f(x)$, say by $\hat{f}(x)$, giving empirical Bayes estimates

$$\hat\mu_i = x_i + \hat{l}'(x_i). \tag{15}$$

The Bayes estimate (14) can be thought of as the MLE $x_i$ plus a Bayesian correction term $l'(x_i)$. When the prior $\pi(\mu)$ is the $N(0, A)$ distribution (5), $\mu_i^{\mathrm{Bayes}}$ equals $B x_i$ (6). Simple formulas for $\mu_i^{\mathrm{Bayes}}$ give out for most other choices of $\pi(\mu)$ but now, in the machine learning era [Footnote 1] of statistical research, numerical methods provide useful ways forward, as discussed next.

The log polynomial class [Footnote 2] of marginal densities defines $f(x)$ by

$$\log\left(f_\beta(x)\right) = \beta_0 + \beta^\top c(x). \tag{16}$$

Here

$$c(x) = (x, x^2, \ldots, x^J)^\top \quad \text{and} \quad \beta = (\beta_1, \ldots, \beta_J)^\top, \tag{17}$$

with $\beta_0$ chosen to make $f_\beta(x)$ integrate to 1. The choice $J = 2$ gives normal marginals; larger values of $J$ allow for marginal non-normality.

Fig. 2. Prostate data: Tweedie’s estimate of $E\{\mu \mid x\}$, 5 degrees of freedom; dashed curve is James–Stein estimate.

The choice $J = 5$ was applied to the prostate cancer data of Fig. 1: Tweedie’s formula (14) gave $\hat\mu(x) = E\{\mu \mid x\}$, graphed as the solid curve in Fig. 2.
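
The normal-prior special case quoted above, where Tweedie’s formula (14) reduces to the Bayes rule (6), can be checked in a few lines. The derivation below is a sketch that uses only (5), (6), (13), and (14); the one added fact is that a $N(0, A)$ prior convolved with $N(0, 1)$ noise gives a $N(0, A+1)$ marginal.

```latex
\begin{aligned}
\pi(\mu) = N(0, A) \;&\Longrightarrow\; f(x) \text{ is the } N(0, A+1) \text{ density},\\
l(x) = \log f(x) &= -\frac{x^2}{2(A+1)} + \text{const},\\
l'(x) &= -\frac{x}{A+1},\\
x + l'(x) &= \left(1 - \frac{1}{A+1}\right) x = \frac{A}{A+1}\, x = B x .
\end{aligned}
```

Here $\log f$ is quadratic in $x$, which is the $J = 2$ member of the log polynomial class (16), with $\beta_1 = 0$ and $\beta_2 = -1/(2(A+1))$.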

It differs markedly from the James–Stein estimate ($J = 2$), the dashed line. At $x = 4$ for example, the $J = 5$ estimate is [Footnote 3]

$$E\{\mu \mid x = 4\} = 2.555 \tag{18}$$

compared to 0.901 for the James–Stein estimate. The estimated curve $E\{\mu \mid x\}$ is empirical Bayes in the same sense as (8): the parameter vector $\beta$ was selected by maximum likelihood, as discussed next. With $J = 5$, the prior was able to adapt to the “fishing expedition” nature of such microarray studies, where we expect most of the genes to be null or close to null, with $\mu_i$ nearly zero (corresponding here to the flat part of the curve for $x$ between $-2$ and 2) and, hopefully, a small proportion of interestingly large $\mu_i$s.

The sample size $n = 6033$ has much to do with Fig. 2. James and Stein (1961) was usually considered in terms of small samples, perhaps $n \le 20$, for which there would be little hope of seeing the detail in Fig. 2. The term “machine learning era” seems less fanciful when considering the scale of problems statisticians are now asked to deal with, as well as the tools they use to solve them.

It looks like it might be hard work computing Fig. 2 but it’s not. The histogram in Fig. 1 has 97 bins, with centerpoints

$$\mathtt{vv} = (-4.4, -4.3, \ldots, 5.1, 5.2). \tag{19}$$

Let $y_j$ be the count in bin $j$, that is, the number of the 6033 $x_i$ values falling into it, with the vector of counts being

$$\mathtt{yy} = (y_1, \ldots, y_{97}). \tag{20}$$

Then the single R command

$$\widehat{\mathtt{ll}} = \texttt{log(glm(yy \textasciitilde\ poly(vv, 5), poisson)\$fit)} \tag{21}$$

provides a close approximation to the MLE of $\log f(x)$ in (14); numerical differentiation of $\widehat{\mathtt{ll}}$ gives Tweedie’s estimate. Section 3.4 of Efron (2023) shows why Poisson regression (21) is appropriate here.

The James–Stein theorem depends on the independence assumption in (1), unlikely to be true in the microarray study, but the estimates (2) have a certain marginal validity even under dependence. This is clearer from the empirical Bayes point of view. The Tweedie estimate $x_i + \hat{l}'(x_i)$ requires only that $\hat{l}'(x)$ be close to $l'(x)$, not that it be estimated from independent $x_i$s. [Footnote 4]

3 Shrinkage estimators

James and Stein’s paper aroused excited interest in the statistics community when it arrived in 1961. Most of the excitement focused on the strict inadmissibility of the traditional maximum likelihood estimate demonstrated by the James–Stein rule. Other rules dominating the MLE were discovered, for instance the Bayes estimator of Strawderman (1971), which was itself admissible while rendering the MLE inadmissible.

Big new ideas can take a while to make their true impact felt. The James–Stein rule had an influential side effect on subsequent theory and practice in that it demonstrated, in an inarguable way, the virtues of shrinkage estimation: given an ensemble of problems, individual estimates are shrunk toward a central point; that is, a deliberate bias is introduced, pulling estimates away from their MLEs for the sake of better group performances. Admissibility and inadmissibility aren’t much in the air these days, while shrinkage estimation has gone on to play a major role in modern practice.

A spectacular success story is the lasso (Tibshirani, 1996). Lasso shrinkage is extreme, pulling some (often most) of the coefficient estimates all the way back to zero. Bayes and empirical Bayes rules tend to be strong shrinkers. Tweedie’s estimate in Fig. 2 ($J = 5$) shrinks the estimate of $E\{\mu \mid x = 4\}$ from its MLE value 4 down to 2.555. For $\mu$ between $-1$ and 1, the shrinkage is almost all the way to zero.

The reader may have been surprised to see that neither Tweedie’s formula (14) for $E\{\mu_i \mid x_i\}$ nor its empirical version (15) requires estimation of the prior $\pi(\mu)$. This is a special property of the posterior expectation $E\{\mu_i \mid x_i\}$ and isn’t available for, say, $\Pr\{\mu_i \ge 2 \mid x_i\}$, or most other Bayesian targets.
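
Because the empirical version (15) needs only an estimate of $\log f$ on a grid and its derivative, the numerical differentiation step described in Sect. 2 is short to write down. The following is a minimal Python sketch (the article itself uses R), not the author’s code. The function name `tweedie_estimates` is hypothetical, and because the prostate data are not reproduced here, the input `log_f` is an exact $N(0, V)$ log density standing in for the Poisson regression fit of (21). With that quadratic input (the $J = 2$ case) the result is linear shrinkage by the factor $1 - 1/V$, which for $V = 1.289$ agrees with the 0.224 of (11) to three decimals.

```python
def tweedie_estimates(grid, log_f):
    """Return (x, x + l'(x)) at interior grid points via central differences.

    grid  : equally spaced x values (for example histogram bin centers)
    log_f : estimated log marginal density at those points; an additive
            constant is harmless because only the derivative is used
    """
    h = grid[1] - grid[0]
    out = []
    for k in range(1, len(grid) - 1):
        slope = (log_f[k + 1] - log_f[k - 1]) / (2 * h)  # approximates l'(x_k)
        out.append((grid[k], grid[k] + slope))
    return out


# Stand-in input (an assumption): a normal marginal N(0, V), the J = 2 case.
V = 1.289                                              # empirical variance from (10)
grid = [round(-4.4 + 0.1 * k, 1) for k in range(97)]   # bin centers as in (19)
log_f = [-x * x / (2 * V) for x in grid]               # log density up to a constant

estimates = dict(tweedie_estimates(grid, log_f))
print(round(estimates[4.0], 3))  # 0.897, i.e. 4 * (1 - 1/V)
```

Replacing `log_f` with the logarithm of fitted counts from a degree-5 Poisson regression would give the curved $J = 5$ estimate instead; that fit is not attempted here.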

“Bayesian deconvolution” (Efron, 2016) uses low-dimensional parametric modeling of $\pi(\mu)$ for general empirical Bayes computations. It was applied to finding a prior density $\pi(\mu)$ that would give the distribution of $x$ seen in Fig. 1, assuming the normal sampling model (1). The deconvolution model for $\pi(\mu)$ used a delta function at $\mu = 0$ (for the “null” genes) and a natural spline function with four degrees of freedom for the non-null cases.

Fig. 3. Empirical Bayes conditional density of $\mu$ given $\mu$ not zero; $\Pr\{\mu = 0\}$ equals 0.825.

The estimated prior [Footnote 5] $\hat\pi(\mu)$ is shown in Fig. 3; it put probability 0.825 on $\mu = 0$, while the conditional distribution given $\mu \ne 0$ was a moderately heavy-tailed version of $N(0, 1.33^2)$. Based on $\hat\pi(\mu)$ we can form estimates of any Bayesian target, for instance $\widehat{\Pr}\{\mu_i \ge 2 \mid x_i = 4\} = 0.80$.

Figure 3 is a direct descendant of the James–Stein rule, now 60-plus years on. A less-direct descendant, but still on the family tree, arrived in 1995. The false discovery rate paper by Benjamini and Hochberg concerned simultaneous hypothesis testing. Looking at Fig. 1, which of the $n = 6033$ genes can confidently be labeled as non-null, that is as having $\mu_i \ne 0$?

Suppose for convenience that the $x_i$s are ordered from smallest to largest. The right-sided significance level for testing $\mu_i = 0$ is

$$S_0(x_i) = 1 - \Phi(x_i), \tag{22}$$

where $\Phi$ is the standard normal cumulative distribution function. Of the 6033 genes, 401 had $S_0(x_i) \le 0.05$, the usual rejection level for individual testing, but even if actually all of the genes were null we would expect 302 such rejections, so individual testing can’t be right.

Benjamini and Hochberg proposed a novel simultaneous testing rule that safely controls the number of “false discoveries” (genes falsely labeled “non-null”) while not being discouragingly strict. (My summary here won’t give the BH rule its full due; see Chapter 4 of Efron (2010) for a more complete description.) Let $\hat{S}(x)$ be the observed proportion of $x_i$s exceeding value $x$, and define

$$\widehat{\mathrm{Fdr}}(x) = \pi_0 S_0(x) / \hat{S}(x), \tag{23}$$

where $\pi_0$ is the proportion of null genes among all $n$. [Footnote 6] For a fixed control level $\alpha$, such as $\alpha = 0.1$, the BH rule says to reject the null hypothesis $\mu_i = 0$ for those genes having

$$\widehat{\mathrm{Fdr}}(x_i) \le \alpha. \tag{24}$$

The Benjamini–Hochberg theorem states that under independence assumptions like (1), the expected proportion of false discoveries by rule (24) is $\alpha$.

Fig. 4. Prostate data: left Fdr and right Fdr; dashes show 60 genes with $\mathrm{Fdr} < 0.1$.

Figure 4 shows $\widehat{\mathrm{Fdr}}$ for the prostate cancer data and also for the left-sided Fdr estimate, where significance is defined by $S_0(x_i) = \Phi(x_i)$ rather than (22). I applied the BH rule with $\alpha = 0.1$, which labeled 60 genes as non-null, 32 on the left and 28 on the right. The BH theorem says that we can expect 6 of the 60 to actually be null.

The fdr story has evolved very much along the lines of its James–Stein predecessor. Intense initial interest focused on the exact frequentist control of false discovery rates. The Bayes and empirical Bayes implications came later: as at (5), we assume that each $x_i$ is a realization of a random variable $x$ given by

$$\mu \sim \pi(\mu) \quad \text{and} \quad x \mid \mu \sim p(x \mid \mu), \tag{25}$$

where $p(x \mid \mu)$ is a known probability kernel which I’ll take here to be the normal sampling model (1). Then if $S(x)$ is 1 minus the cdf of the marginal density (13), Bayes rule gives

$$\Pr\{\mu = 0 \mid x\} = \pi_0 S_0(x) / S(x). \tag{26}$$

Comparing (26) with (23) says that the BH rule amounts to labeling case $i$ as non-null if its obvious empirical Bayes estimate of nullness is less than $\alpha$. This is less precise than the frequentist control theorem but, as with the James–Stein estimator, is more robust in not demanding independence among the $x_i$s.
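
The quantities in (22) to (24) are simple enough to compute directly. The sketch below is a hedged illustration in Python, not the author’s code: the function names are hypothetical, `pi0` defaults to 1 as a conservative assumption, and $\hat{S}(x)$ counts values greater than or equal to $x$ (an assumption that keeps the denominator positive at the largest observation). It implements rule (24) exactly as summarized above; the full step-up Benjamini–Hochberg procedure can reject slightly more cases. The toy input is made up for illustration and is not the prostate data.

```python
import math


def normal_sf(x):
    """Right-tail probability 1 - Phi(x) for the standard normal, as in (22)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def fdr_hat(x, xs, pi0=1.0):
    """Estimate (23): pi0 * S0(x) / S_hat(x).

    S_hat(x) is the observed proportion of values in xs that are >= x.
    """
    s_hat = sum(1 for v in xs if v >= x) / len(xs)
    return pi0 * normal_sf(x) / s_hat


def bh_reject(xs, alpha=0.1, pi0=1.0):
    """Values whose estimated Fdr is at most alpha, following rule (24)."""
    return [v for v in xs if fdr_hat(v, xs, pi0) <= alpha]


# Two numbers quoted in the text can be checked without any data.
print(round(6033 * 0.05))         # 302 expected rejections if every gene were null
print(round(normal_sf(3.0), 3))   # 0.001, individual significance level at x = 3

# Toy values (made up): only the extreme observation is flagged.
print(bh_reject([-1.0, 0.0, 0.5, 1.0, 4.0], alpha=0.1))  # [4.0]
```

The left-sided version mentioned for Fig. 4 would follow by applying the same functions to the negated values.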

The family resemblance between JS and BH is through shrinkage: in the BH case the shrinkage of significance levels. For instance, $x_i = 3$ has individual significance level 0.001 against nullness, whereas $\widehat{\mathrm{Fdr}} = 0.164$ for the prostate data, i.e., still with about a 1/6 chance of gene $i$ being null.

So what does machine learning have to do with the James–Stein estimator? Nothing to its birth but, as the articles in this volume show, a great deal to its downstream effects on statistical theory and practice. Charles Stein, who was a good applied statistician when he put his mind to it, might have enjoyed these developments, but maybe not; his heart was always with the mathematics.

References

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57(1), 289–300.
Efron, B. (2010). Large-scale inference: Empirical Bayes methods for estimation, testing, and prediction (Vol. 1). Cambridge: Cambridge University Press.
Efron, B. (2011). Tweedie’s formula and selection bias. Journal of the American Statistical Association, 106(496), 1602–1614. https://doi.org/10.1198/jasa.2011.tm11181
Efron, B. (2016). Empirical Bayes deconvolution estimates. Biometrika, 103(1), 1–20. https://doi.org/10.1093/biomet/asv068
Efron, B. (2023). Exponential Families in Theory and Practice. Cambridge: Cambridge University Press.
Efron, B., & Morris, C. (1973). Stein’s estimation rule and its competitors--An empirical Bayes approach. Journal of the American Statistical Association, 68, 117–130.
James, W., & Stein, C. (1961). Estimation with quadratic loss. In Proc. 4th Berkeley Sympos. Math. Statist. and Prob. (Vol. I, pp. 361–379). Berkeley: University of California Press.
Narasimhan, B., & Efron, B. (2020). deconvolveR: A G-Modeling Program for Deconvolution and Empirical Bayes Estimation. Journal of Statistical Software, 94(11), 1–20. https://doi.org/10.18637/jss.v094.i11
Robbins, H. (1956). An empirical Bayes approach to statistics. In Proc. 3rd Berkeley Sympos. Math. Statist. and Prob. (Vol. I, pp. 157–163). Berkeley: University of California Press.
Strawderman, W. E. (1971). Proper Bayes minimax estimators of the multivariate normal mean. Annals of Mathematical Statistics, 42(1), 385–388. https://doi.org/10.1214/aoms/1177693528
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58(1), 267–288.
Markdown
# Machine learning and the James–Stein estimator

- Original Paper
- Stein Estimation and Statistical Shrinkage Methods
- [Open access](https://www.springernature.com/gp/open-science/about/the-fundamentals-of-open-access-and-open-research)
- Published: 30 June 2023
- Volume 7, pages 257–266, (2024)
- Journal: [Japanese Journal of Statistics and Data Science](https://link.springer.com/journal/42081)
- Author: [Bradley Efron](https://link.springer.com/article/10.1007/s42081-023-00209-y#auth-Bradley-Efron-Aff1-Aff2)
- [Download PDF](https://link.springer.com/content/pdf/10.1007/s42081-023-00209-y.pdf)
- 9362 Accesses, 7 Citations, 24 Altmetric, 1 Mention

## Abstract

It is now 62 years since the publication of James and Stein’s seminal article on the estimation of a multivariate normal mean vector. The paper made a spectacular first impression on the statistical community through its demonstration of inadmissibility of the maximum likelihood estimator. It continues to be influential, but not for the initial reasons. Empirical Bayes shrinkage estimation, now a major topic, found its early justification in the James–Stein formula. Less obvious downstream topics include Tweedie’s formula and Benjamini and Hochberg’s false discovery rate algorithm. This is a short and mainly non-technical review of the James–Stein rule and its effects on the machine learning era of statistical innovation.
- [Bayesian Inference](https://link.springer.com/subjects/bayesian-inference) - [Empiricism](https://link.springer.com/subjects/empiricism) - [Machine Learning](https://link.springer.com/subjects/machine-learning) - [Non-parametric Inference](https://link.springer.com/subjects/non-parametric-inference) - [Statistical Learning](https://link.springer.com/subjects/statistical-learning) - [Statistical Theory and Methods](https://link.springer.com/subjects/statistical-theory-and-methods) - [Empirical Bayesian Methods in Statistical Inference](https://link.springer.com/subjects/empirical-bayesian-methods-in-statistical-inference) ## 1 Introduction By and large, the statistics world is one of heuristics, approximations, and asymptotics. The James–Stein estimator arrived in that world in 1961 on a note of startling specificity: unseen parameters μ 1 , μ 2 , … , μ n produce independent observations x i ∼ ind N ( μ i , 1 ) , i \= 1 , … , n , (1) n ≄ 3. The James–Stein rule in its simplest form proposed estimating the μ i by μ ^ i J S \= ( 1 āˆ’ n āˆ’ 2 S ) x i ( S \= āˆ‘ i \= 1 n x i 2 ) . (2) Formula ([2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ2)) looked implausible: the estimate of μ i depended on the *other* observations x j, j ≠ i (through *S*), as well as x i, despite the independence assumption. Nevertheless, James and Stein showed that Rule ([2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ2)) *always* beat the obvious maximum likelihood estimates μ ^ i M L \= x ĀÆ i ( i \= 1 , … , n ) (3) in terms of total expected squared error E { āˆ‘ i \= 1 n ( μ ^ i āˆ’ μ i ) 2 } . (4) That ā€œalwaysā€ was the shocking part: two centuries of statistical theory, ANOVA, regression, multivariate analysis, etc., depended on maximum likelihood estimation. Did everything have to be rethought? One path forward involved Bayesian thinking. 
If we assumed that the μ i themselves came from a normal distribution, μ i ∼ ind N ( 0 , A ) for i \= 1 , … , n , (5) with variance A ≄ 0, the Bayes estimates would be μ ^ i B a y e s \= B x i ( B \= A / ( A \+ 1 ) ) . (6) We don’t know *A* or *B* but B ^ \= 1 āˆ’ ( n āˆ’ 2 ) / S (7) is *B*’s unbiased estimate: we can rewrite ([2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ2)) as μ ^ i J S \= B ^ x i , (8) which at least looks more plausible. In the language introduced by Robbins ([1956](https://link.springer.com/article/10.1007/s42081-023-00209-y#ref-CR9 "Robbins, H. (1956). An empirical Bayes approach to statistics. In Proc. 3rd Berkeley Sympos. Math. Statist. and Prob. (Vol. I, pp. 157–163). Berkeley: University of California Press.")), formula ([8](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ8)) is an *empirical Bayes* estimator, another shocking post-war statistical innovation. Carl Morris and I wrote a series of papers in the 1970 s exploring Bayesian roots of the James–Stein estimator (Efron and Morris, [1973](https://link.springer.com/article/10.1007/s42081-023-00209-y#ref-CR6 "Efron, B., & Morris, C. (1973). Stein’s estimation rule and its competitors—An empirical Bayes approach. Journal of the American Statistical Association, 68, 117–130.")). Something is lost in the empirical Bayes formulation, namely the frequentist ā€œalwaysā€ of expected square error minimization, but a lot is gained in flexibility and scope, as discussed in Sect. [2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Sec2). **Fig. 1** [![Fig. 
1](https://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs42081-023-00209-y/MediaObjects/42081_2023_209_Fig1_HTML.png)](https://link.springer.com/article/10.1007/s42081-023-00209-y/figures/1) Prostate data: 6033 *x* values; mean 0.003, sd \= 1\.135; curve is proportional to a N ( 0 , 1 ) density [Full size image](https://link.springer.com/article/10.1007/s42081-023-00209-y/figures/1) Figure [1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig1) illustrates an example of simultaneous estimation pursued in Sect. 2.1 of Efron ([2010](https://link.springer.com/article/10.1007/s42081-023-00209-y#ref-CR2 "Efron, B. (2010). Large-scale inference: Empirical Bayes methods for estimation, testing, and prediction (Vol. 1). Cambridge: Cambridge University Press.")). A microarray study has compared expression levels between prostate cancer patients and control subjects for n \= 6033 genes. For each gene, a statistic x i has been calculated (essentially a ā€œ*z*\-valueā€), x i ∼ N ( μ i , 1 ) , i \= 1 , … , n , (9) where μ i measures the difference between cancer and control group levels. The solid curve in Fig. [1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig1) is a N ( 0 , 1 ) density scaled to have the same area as the histogram of the 6033 *x* values. A bad result from the researchers’ point of view would be a perfect fit of curve to histogram, which would imply all the genes have μ i \= 0, the ā€œnullā€ value of no difference between cancer patients and controls. That’s not what happened: the histogram has mildly heavy tails in both directions. The researchers were hoping to find genes with large values of ‖ μ i ‖—ones that might be a clue to prostate cancer etiology—as suggested by the heavy tails. How encouraged should they be? Not very, according to the James–Stein rule. The 6033 x i values have mean 0.003, which I’ll take to be zero, and empirical variance σ ^ 2 \= 1\.289. 
(10) The James–Stein estimate ([2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ2)) is μ ^ i J S \= ( 1 āˆ’ n āˆ’ 2 n āˆ’ 1 1 σ ^ 2 ) x i \= 0\.224 ā‹… x i , (11) so even x i \= 5 yields an estimate barely exceeding 1. Section [2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Sec2) suggests a more optimistic analysis. ## 2 Tweedie’s formula The impressive precision of the James–Stein theorem came at a cost in generality. Efforts to extend the theorem, say to Poisson rather than normal observations, or to measures of loss other than total squared error, gave encouraging asymptotic results but not the James–Stein kind of finite sample frequentist dominance. Better progress was possible on the empirical Bayes side of the street. *Tweedie’s formula* (Efron, [2011](https://link.springer.com/article/10.1007/s42081-023-00209-y#ref-CR3 "Efron, B. (2011). Tweedie’s formula and selection bias. Journal of the American Statistical Association, 106(496), 1602–1614. https://doi.org/10.1198/jasa.2011.tm11181 ")) has been particularly useful. We wish to calculate Bayesian estimates μ i B a y e s \= E { μ i ∣ x i } , i \= 1 , … , n , (12) in the normal sampling model ([1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ1)), starting from a given (possibly non-normal) prior Ļ€ ( μ ), applying to all *n* cases. Let *f*(*x*) be the marginal density f ( x ) \= ∫ R Ļ€ ( μ ) Ļ• ( x āˆ’ μ ) d μ , (13) with Ļ• the standard N ( 0 , 1 ) density and R the range of μ. (It isn’t necessary for Ļ€ ( ā‹… ) to be a continuous distribution but it simplifies notation.) Tweedie’s formula provides an elegant statement for μ i B a y e s, the posterior expectation of μ i given x i, μ i B a y e s \= E { μ i ∣ x i } \= x i \+ l ′ ( x i ) with l ′ ( x i ) \= d d x log ⁔ ( f ( x i ) ) . 
(14) In the empirical Bayes situation ([1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ1)), where the prior Ļ€ ( ā‹… ) is unknown, we can use the observed data x 1 , … , x n to estimate the marginal density *f*(*x*), say by f ^ ( x ), giving empirical Bayes estimates μ ^ i \= x i \+ l ^ ′ ( x i ) . (15) The Bayes estimate ([14](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ14)) can be thought of as the MLE x i plus a Bayesian correction term l ′ ( x i ). When the prior Ļ€ ( μ ) is the N ( 0 , A ) distribution ([5](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ5)), μ i B a y e s equals B x i ([6](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ6)). Simple formulas for μ i B a y e s give out for most other choices of Ļ€ ( μ ) but now, in the machine learning era[Footnote 1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fn1) of statistical research, numerical methods provide useful ways forward, as discussed next. The *log polynomial class*[Footnote 2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fn2) of marginal densities defines *f*(*x*) by log ⁔ ( f β ( x ) ) \= β 0 \+ β ⊤ c ( x ) . (16) Here c ( x ) \= ( x , x 2 , … , x J ) ⊤ and β \= ( β 1 , … , β J ) ⊤ , (17) with β 0 chosen to make f β ( x ) integrate to 1. The choice J \= 2 gives normal marginals; larger values of *J* allow for marginal non-normality. **Fig. 2** [![Fig. 2](https://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs42081-023-00209-y/MediaObjects/42081_2023_209_Fig2_HTML.png)](https://link.springer.com/article/10.1007/s42081-023-00209-y/figures/2) Prostate data: Tweedie’s estimate of E { μ ∣ x }, 5 degrees of freedom; dashed curve is James–Stein estimate [Full size image](https://link.springer.com/article/10.1007/s42081-023-00209-y/figures/2) The choice J \= 5 was applied to the prostate cancer data of Fig. 
[1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig1): Tweedie’s formula ([14](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ14)) gave $\hat{\mu}(x) = E\{\mu \mid x\}$, graphed as the solid curve in Fig. [2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig2). It differs markedly from the James–Stein estimate ($J = 2$), the dashed line. At $x = 4$ for example, the $J = 5$ estimate is[Footnote 3](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fn3)

$$E\{\mu \mid x = 4\} = 2.555 \tag{18}$$

compared to 0.901 for the James–Stein estimate.

The estimated curve $E\{\mu \mid x\}$ is *empirical Bayes* in the same sense as ([8](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ8)): the parameter vector $\beta$ was selected by maximum likelihood, as discussed next. With $J = 5$, the prior was able to adapt to the “fishing expedition” nature of such microarray studies, where we expect most of the genes to be null or close to null, with $\mu_i$ nearly zero (corresponding here to the flat part of the curve for $x$ between $-2$ and 2) and, hopefully, a small proportion of interestingly large $\mu_i$s.

The sample size $n = 6033$ has much to do with Fig. [2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig2). James and Stein ([1961](https://link.springer.com/article/10.1007/s42081-023-00209-y#ref-CR7)) was usually considered in terms of small samples, perhaps $n \le 20$, for which there would be little hope of seeing the detail in Fig. [2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig2). The term “machine learning era” seems less fanciful when considering the scale of problems statisticians are now asked to deal with, as well as the tools they use to solve them.

It looks like it might be hard work computing Fig.
[2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig2) but it’s not. The histogram in Fig. [1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig1) has 97 bins, with centerpoints

$$\mathtt{vv} = (-4.4, -4.3, \dots, 5.1, 5.2). \tag{19}$$

Let $y_j$ be the count in bin $j$, that is, the number of the 6033 $x_i$ values falling into it, with the vector of counts being

$$\mathtt{yy} = (y_1, \dots, y_{97}). \tag{20}$$

Then the single R command

$$\widehat{\mathtt{ll}} = \mathtt{log(glm(yy \sim poly(vv, 5),\ poisson)\$fit)} \tag{21}$$

provides a close approximation to the MLE of $\log f(x)$ in ([14](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ14)); numerical differentiation of $\widehat{\mathtt{ll}}$ gives Tweedie’s estimate. Section 3.4 of Efron ([2023](https://link.springer.com/article/10.1007/s42081-023-00209-y#ref-CR5)) shows why Poisson regression ([21](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ21)) is appropriate here.

The James–Stein theorem depends on the independence assumption in ([1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ1)), unlikely to be true in the microarray study, but the estimates ([2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ2)) have a certain marginal validity even under dependence. This is clearer from the empirical Bayes point of view. The Tweedie estimate $x_i + \hat{l}'(x_i)$ requires only that $\hat{l}'(x)$ be close to $l'(x)$, not that it be estimated from independent $x_i$s.[Footnote 4](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fn4)

## 3 Shrinkage estimators

James and Stein’s paper aroused excited interest in the statistics community when it arrived in 1961. Most of the excitement focused on the strict inadmissibility of the traditional maximum likelihood estimate demonstrated by the James–Stein rule.
Other rules dominating the MLE were discovered, for instance the Bayes estimator of Strawderman ([1971](https://link.springer.com/article/10.1007/s42081-023-00209-y#ref-CR10)), that was itself admissible while rendering the MLE inadmissible.

Big new ideas can take a while to make their true impact felt. The James–Stein rule had an influential side effect on subsequent theory and practice in that it demonstrated, in an inarguable way, the virtues of *shrinkage estimation*: given an ensemble of problems, individual estimates are shrunk toward a central point; that is, a deliberate bias is introduced, pulling estimates away from their MLEs for the sake of better group performance. Admissibility and inadmissibility aren’t much in the air these days, while shrinkage estimation has gone on to play a major role in modern practice. A spectacular success story is the lasso (Tibshirani, [1996](https://link.springer.com/article/10.1007/s42081-023-00209-y#ref-CR11)). Lasso shrinkage is extreme, pulling some (often most) of the coefficient estimates all the way back to zero.

Bayes and empirical Bayes rules tend to be strong shrinkers. Tweedie’s estimate in Fig. [2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig2) ($J = 5$) shrinks the estimate of $E\{\mu \mid x = 4\}$ from its MLE value 4 down to 2.555. For $\mu$ between $-1$ and 1, the shrinkage is almost all the way to zero.
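The shrinkage pattern described here can be illustrated with a small sketch of Tweedie’s formula, $E\{\mu \mid x\} = x + \frac{d}{dx}\log f(x)$. It is written in Python rather than the paper’s R, and it does not use the prostate data or the fitted log-polynomial model: the marginal density is computed exactly for an assumed prior that mixes $N(0, A_k)$ components, and the derivative is taken numerically. The function names and the spike-and-slab prior values below are illustrative assumptions, not quantities taken from the paper’s fit.

```python
import math


def normal_pdf(x, variance):
    """Density of N(0, variance) evaluated at x."""
    return math.exp(-0.5 * x * x / variance) / math.sqrt(2.0 * math.pi * variance)


def marginal_density(x, weights, prior_variances):
    """Marginal f(x) when mu ~ sum_k w_k N(0, A_k) and x | mu ~ N(mu, 1).

    Each prior component N(0, A_k) yields a N(0, A_k + 1) marginal;
    A_k = 0 represents a point mass at mu = 0.
    """
    return sum(w * normal_pdf(x, a + 1.0) for w, a in zip(weights, prior_variances))


def tweedie_estimate(x, weights, prior_variances, h=1e-5):
    """Tweedie's formula: x plus a central-difference derivative of log f(x)."""
    def log_f(t):
        return math.log(marginal_density(t, weights, prior_variances))
    return x + (log_f(x + h) - log_f(x - h)) / (2.0 * h)


# Check against the closed form: a single N(0, A) prior gives B * x, B = A / (A + 1).
A = 1.0
B = A / (A + 1.0)
print(tweedie_estimate(4.0, [1.0], [A]), B * 4.0)

# Hypothetical spike-and-slab prior (assumed values, for illustration only):
# most mass at mu = 0, the rest spread as N(0, 1.33^2).
weights = [0.825, 0.175]
prior_variances = [0.0, 1.33 ** 2]
for x in (0.5, 1.0, 2.0, 4.0):
    print(x, round(tweedie_estimate(x, weights, prior_variances), 3))
```

Under this assumed prior the estimate stays close to zero for small $|x|$ and moves toward $x$ only in the tails, which is the qualitative behavior the text attributes to the $J = 5$ curve. A single normal prior, by contrast, shrinks every $x$ by the same factor $B$.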
The reader may have been surprised to see that neither Tweedie’s formula ([14](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ14)) for $E\{\mu_i \mid x_i\}$ nor its empirical version ([15](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ15)) requires estimation of the prior $\pi(\mu)$. This is a special property of the posterior expectation $E\{\mu_i \mid x_i\}$ and isn’t available for, say, $\Pr\{\mu_i \ge 2 \mid x_i\}$, or most other Bayesian targets.

“Bayesian deconvolution” (Efron, [2016](https://link.springer.com/article/10.1007/s42081-023-00209-y#ref-CR4)) uses low-dimensional parametric modeling of $\pi(\mu)$ for general empirical Bayes computations. It was applied to finding a prior density $\pi(\mu)$ that would give the distribution of $x$ seen in Fig. [1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig1), assuming the normal sampling model ([1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ1)). The deconvolution model for $\pi(\mu)$ used a delta function at $\mu = 0$ (for the “null” genes) and a natural spline function with four degrees of freedom for the non-null cases.

**Fig. 3**

[![Fig. 3](https://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs42081-023-00209-y/MediaObjects/42081_2023_209_Fig3_HTML.png)](https://link.springer.com/article/10.1007/s42081-023-00209-y/figures/3)

Empirical Bayes conditional density of $\mu$ given $\mu$ not zero; $\Pr\{\mu = 0\}$ equals 0.825

The estimated prior[Footnote 5](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fn5) $\hat{\pi}(\mu)$ is shown in Fig.
[3](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig3); it put probability 0.825 on $\mu = 0$, while the conditional distribution given $\mu \ne 0$ was a moderately heavy-tailed version of $N(0, 1.33^2)$. Based on $\hat{\pi}(\mu)$ we can form estimates of *any* Bayesian target, for instance $\widehat{\Pr}\{\mu_i \ge 2 \mid x_i = 4\} = 0.80$.

Figure [3](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig3) is a direct descendant of the James–Stein rule, now 60-plus years on. A less-direct descendant, but still on the family tree, arrived in 1995. The *false discovery rate* paper by Benjamini and Hochberg concerned simultaneous hypothesis testing. Looking at Fig. [1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig1), which of the $n = 6033$ genes can confidently be labeled as non-null, that is as having $\mu_i \ne 0$?

Suppose for convenience that the $x_i$s are ordered from smallest to largest. The right-sided significance level for testing $\mu_i = 0$ is

$$S_0(x_i) = 1 - \Phi(x_i), \tag{22}$$

where $\Phi$ is the standard normal cumulative distribution function. Of the 6033 genes, 401 had $S_i \le 0.05$, the usual rejection level for individual testing, but even if actually *all* of the genes were null we would expect 302 such rejections, so individual testing can’t be right.

Benjamini and Hochberg proposed a novel simultaneous testing rule that safely controls the number of “false discoveries” (genes falsely labeled “non-null”) while not being discouragingly strict. (My summary here won’t give the BH rule its full due; see Chapter 4 of Efron ([2010](https://link.springer.com/article/10.1007/s42081-023-00209-y#ref-CR2)) for a more complete description.)
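The individual-testing arithmetic quoted above is easy to reproduce. The sketch below, in Python rather than the paper’s R, evaluates the right-sided level $S_0(x) = 1 - \Phi(x)$ through the complementary error function and computes the expected number of rejections at level 0.05 when every gene is null. `right_sided_level` is a hypothetical helper name introduced for this illustration.

```python
import math


def right_sided_level(x):
    """S_0(x) = 1 - Phi(x), using 1 - Phi(x) = erfc(x / sqrt(2)) / 2."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


n_genes = 6033
level = 0.05

# If all genes were null, each right-sided level would be uniform on (0, 1),
# so the expected count with S_0(x_i) <= 0.05 is simply n * 0.05.
expected_null_rejections = n_genes * level
print(round(expected_null_rejections))

# The x value at which an individual test just reaches the 0.05 level.
print(right_sided_level(1.645))
```

Because roughly three quarters of the 401 observed rejections would be expected from null genes alone, a rejection at the individual 0.05 level says little about any one gene.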
Let $\hat{S}(x)$ be the observed proportion of $x_i$s exceeding value $x$, and define

$$\widehat{\mathrm{Fdr}}(x) = \pi_0 S_0(x) / \hat{S}(x), \tag{23}$$

where $\pi_0$ is the proportion of null genes among all $n$.[Footnote 6](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fn6) For a fixed control level $\alpha$, such as $\alpha = 0.1$, the BH rule says to reject the null hypothesis $\mu_i = 0$ for those genes having

$$\widehat{\mathrm{Fdr}}(x_i) \le \alpha. \tag{24}$$

The Benjamini–Hochberg theorem states that under independence assumptions like ([1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ1)), the expected proportion of false discoveries by rule ([24](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ24)) is $\alpha$.

**Fig. 4**

[![Fig. 4](https://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs42081-023-00209-y/MediaObjects/42081_2023_209_Fig4_HTML.png)](https://link.springer.com/article/10.1007/s42081-023-00209-y/figures/4)

Prostate data: left Fdr and right Fdr; dashes show 60 genes with $\mathrm{Fdr} < 0.1$

Figure [4](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig4) shows $\widehat{\mathrm{Fdr}}$ for the prostate cancer data and also for the left-sided Fdr estimate, where significance is defined by $S_0(x_i) = \Phi(x_i)$ rather than ([22](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ22)). I applied the BH rule with $\alpha = 0.1$ which labeled 60 genes as non-null, 32 on the left and 28 on the right. The BH theorem says that we can expect 6 of the 60 to actually be null.

The fdr story has evolved very much along the lines of its James–Stein predecessor. Intense initial interest focused on the exact frequentist control of false discovery rates.
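Rule (24) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the full step-up Benjamini–Hochberg procedure and not the paper’s own computation: `fdr_hat` and `flag_non_null` are hypothetical names, $\hat{S}(x)$ is taken to be the proportion of observations at or above $x$ (so that it is never zero at an observed point), only the right-sided version is shown, and $\pi_0$ defaults to its upper bound 1.

```python
import math
from bisect import bisect_left


def right_sided_level(x):
    """S_0(x) = 1 - Phi(x) for a standard normal null."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def fdr_hat(xs, pi0=1.0):
    """Right-sided estimate pi0 * S_0(x_i) / S_hat(x_i) for each observation."""
    n = len(xs)
    sorted_xs = sorted(xs)
    estimates = []
    for x in xs:
        # Number of observations >= x; at least 1 because x itself is counted.
        n_at_or_above = n - bisect_left(sorted_xs, x)
        s_hat = n_at_or_above / n
        estimates.append(pi0 * right_sided_level(x) / s_hat)
    return estimates


def flag_non_null(xs, alpha=0.1, pi0=1.0):
    """Label case i as non-null when its estimated Fdr is at most alpha."""
    return [f <= alpha for f in fdr_hat(xs, pi0)]


# Tiny hand-made example (assumed data): one clear outlier among small values.
xs = [0.0, 0.5, -1.0, 5.0]
print(fdr_hat(xs))
print(flag_non_null(xs, alpha=0.1))
```

In the toy data only the observation at 5.0 is flagged: its null tail probability is tiny relative to the observed share of values that large, while for the other points the null explains most of what is seen.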
The Bayes and empirical Bayes implications came later: as at ([5](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ5)), we assume that each $x_i$ is a realization of a random variable $x$ given by

$$\mu \sim \pi(\mu) \quad\text{and}\quad x \mid \mu \sim p(x \mid \mu), \tag{25}$$

where $p(x \mid \mu)$ is a known probability kernel which I’ll take here to be the normal sampling model ([1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ1)). Then if $S(x)$ is 1 minus the cdf of the marginal density ([13](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ13)), Bayes rule gives

$$\Pr\{\mu = 0 \mid x\} = \pi_0 S_0(x) / S(x). \tag{26}$$

Comparing ([26](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ26)) with ([23](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ23)) says that the BH rule amounts to labeling case $i$ as non-null if its obvious empirical Bayes estimate of nullness is less than $\alpha$. This is less precise than the frequentist control theorem but, as with the James–Stein estimator, is more robust in not demanding independence among the $x_i$s.

The family resemblance between JS and BH is through shrinkage: in the BH case the shrinkage of significance levels. For instance, $x_i = 3$ has individual significance level 0.001 against nullness, whereas $\widehat{\mathrm{Fdr}} = 0.164$ for the prostate data, i.e., still with about a 1/6 chance of gene $i$ being null.

So what does machine learning have to do with the James–Stein estimator? Nothing to its birth but, as the articles in this volume show, a great deal to its downstream effects on statistical theory and practice. Charles Stein, who was a good applied statistician when he put his mind to it, might have enjoyed these developments, but maybe not; his heart was always with the mathematics.

## Notes

1. Where algorithms can substitute for theorems.
2. For general use, a natural spline basis is preferable to polynomials, to control the behavior of $\log \pi(\mu)$ at the extremes.
3. With an estimated bootstrap standard error of 0.192.
4. The accuracy of the Tweedie estimate *does* suffer under dependence, so the previously quoted bootstrap standard error is likely to be optimistic.
5. Estimated using the CRAN package deconvolveR (Narasimhan and Efron, [2020](https://link.springer.com/article/10.1007/s42081-023-00209-y#ref-CR8)).
6. $\pi_0$ can be estimated but in practice it is usually replaced by its upper bound 1 in applying rule ([24](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ24)). For cases like the prostate data where most of the genes are null, this doesn’t much affect the outcome.

## References

- Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. *Journal of the Royal Statistical Society Series B,* *57*(1), 289–300. <https://doi.org/10.1111/j.2517-6161.1995.tb02031.x>
- Efron, B. (2010). *Large-scale inference: Empirical Bayes methods for estimation, testing, and prediction* (Vol. 1). Cambridge: Cambridge University Press. <https://doi.org/10.1017/CBO9780511761362>
- Efron, B. (2011). Tweedie’s formula and selection bias. *Journal of the American Statistical Association,* *106*(496), 1602–1614. <https://doi.org/10.1198/jasa.2011.tm11181>
- Efron, B. (2016). Empirical Bayes deconvolution estimates. *Biometrika,* *103*(1), 1–20. <https://doi.org/10.1093/biomet/asv068>
- Efron, B. (2023). *Exponential Families in Theory and Practice*. Cambridge: Cambridge University Press.
- Efron, B., & Morris, C. (1973). Stein’s estimation rule and its competitors—An empirical Bayes approach. *Journal of the American Statistical Association,* *68*, 117–130.
- James, W., & Stein, C. (1961). Estimation with quadratic loss. In *Proc. 4th Berkeley Sympos. Math. Statist. and Prob.* (Vol. I, pp. 361–379). Berkeley: University of California Press.
- Narasimhan, B., & Efron, B. (2020). deconvolveR: A G-Modeling Program for Deconvolution and Empirical Bayes Estimation. *Journal of Statistical Software,* *94*(11), 1–20. <https://doi.org/10.18637/jss.v094.i11>
- Robbins, H. (1956). An empirical Bayes approach to statistics. In *Proc. 3rd Berkeley Sympos. Math. Statist. and Prob.* (Vol. I, pp. 157–163). Berkeley: University of California Press.
- Strawderman, W. E. (1971). Proper Bayes minimax estimators of the multivariate normal mean. *Annals of Mathematical Statistics,* *42*(1), 385–388. <https://doi.org/10.1214/aoms/1177693528>
- Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. *Journal of the Royal Statistical Society Series B,* *58*(1), 267–288. <https://doi.org/10.1111/j.2517-6161.1996.tb02080.x>

## Funding

No funds, grants, or other support was received.

## Author information

### Authors and Affiliations

1. Department of Statistics, Stanford University, 390 Jane Stanford Way, Stanford, CA, 94305, USA (Bradley Efron)
2. Department of Biomedical Data Science, Stanford School of Medicine, 1265 Welch Road, Stanford, CA, 94305, USA (Bradley Efron)

### Corresponding author

Correspondence to [Bradley Efron](mailto:efron@stanford.edu).

## Ethics declarations

### Conflict of interest

The author has no relevant financial or non-financial interests to disclose.

## Additional information

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Dedicated to the memory of Carl Morris.
## Rights and permissions **Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit <http://creativecommons.org/licenses/by/4.0/>. [Reprints and permissions](https://s100.copyright.com/AppDispatchServlet?title=Machine%20learning%20and%20the%20James%E2%80%93Stein%20estimator&author=Bradley%20Efron&contentID=10.1007%2Fs42081-023-00209-y&copyright=The%20Author%28s%29&publication=2520-8756&publicationDate=2023-06-30&publisherName=SpringerNature&orderBeanReset=true&oa=CC%20BY) ## About this article [![Check for updates. 
Verify currency and authenticity via CrossMark](data:image/svg+xml;base64,PHN2ZyBoZWlnaHQ9IjgxIiB3aWR0aD0iNTciIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyI+PGcgZmlsbD0ibm9uZSIgZmlsbC1ydWxlPSJldmVub2RkIj48cGF0aCBkPSJtMTcuMzUgMzUuNDUgMjEuMy0xNC4ydi0xNy4wM2gtMjEuMyIgZmlsbD0iIzk4OTg5OCIvPjxwYXRoIGQ9Im0zOC42NSAzNS40NS0yMS4zLTE0LjJ2LTE3LjAzaDIxLjMiIGZpbGw9IiM3NDc0NzQiLz48cGF0aCBkPSJtMjggLjVjLTEyLjk4IDAtMjMuNSAxMC41Mi0yMy41IDIzLjVzMTAuNTIgMjMuNSAyMy41IDIzLjUgMjMuNS0xMC41MiAyMy41LTIzLjVjMC02LjIzLTIuNDgtMTIuMjEtNi44OC0xNi42Mi00LjQxLTQuNC0xMC4zOS02Ljg4LTE2LjYyLTYuODh6bTAgNDEuMjVjLTkuOCAwLTE3Ljc1LTcuOTUtMTcuNzUtMTcuNzVzNy45NS0xNy43NSAxNy43NS0xNy43NSAxNy43NSA3Ljk1IDE3Ljc1IDE3Ljc1YzAgNC43MS0xLjg3IDkuMjItNS4yIDEyLjU1cy03Ljg0IDUuMi0xMi41NSA1LjJ6IiBmaWxsPSIjNTM1MzUzIi8+PHBhdGggZD0ibTQxIDM2Yy01LjgxIDYuMjMtMTUuMjMgNy40NS0yMi40MyAyLjktNy4yMS00LjU1LTEwLjE2LTEzLjU3LTcuMDMtMjEuNWwtNC45Mi0zLjExYy00Ljk1IDEwLjctMS4xOSAyMy40MiA4Ljc4IDI5LjcxIDkuOTcgNi4zIDIzLjA3IDQuMjIgMzAuNi00Ljg2eiIgZmlsbD0iIzljOWM5YyIvPjxwYXRoIGQ9Im0uMiA1OC40NWMwLS43NS4xMS0xLjQyLjMzLTIuMDFzLjUyLTEuMDkuOTEtMS41Yy4zOC0uNDEuODMtLjczIDEuMzQtLjk0LjUxLS4yMiAxLjA2LS4zMiAxLjY1LS4zMi41NiAwIDEuMDYuMTEgMS41MS4zNS40NC4yMy44MS41IDEuMS44MWwtLjkxIDEuMDFjLS4yNC0uMjQtLjQ5LS40Mi0uNzUtLjU2LS4yNy0uMTMtLjU4LS4yLS45My0uMi0uMzkgMC0uNzMuMDgtMS4wNS4yMy0uMzEuMTYtLjU4LjM3LS44MS42Ni0uMjMuMjgtLjQxLjYzLS41MyAxLjA0LS4xMy40MS0uMTkuODgtLjE5IDEuMzkgMCAxLjA0LjIzIDEuODYuNjggMi40Ni40NS41OSAxLjA2Ljg4IDEuODQuODguNDEgMCAuNzctLjA3IDEuMDctLjIzcy41OS0uMzkuODUtLjY4bC45MSAxYy0uMzguNDMtLjguNzYtMS4yOC45OS0uNDcuMjItMSAuMzQtMS41OC4zNC0uNTkgMC0xLjEzLS4xLTEuNjQtLjMxLS41LS4yLS45NC0uNTEtMS4zMS0uOTEtLjM4LS40LS42Ny0uOS0uODgtMS40OC0uMjItLjU5LS4zMy0xLjI2LS4zMy0yLjAyem04LjQtNS4zM2gxLjYxdjIuNTRsLS4wNSAxLjMzYy4yOS0uMjcuNjEtLjUxLjk2LS43MnMuNzYtLjMxIDEuMjQtLjMxYy43MyAwIDEuMjcuMjMgMS42MS43MS4zMy40Ny41IDEuMTQuNSAyLjAydjQuMzFoLTEuNjF2LTQuMWMwLS41Ny0uMDgtLjk3LS4yNS0xLjIxLS4xNy0uMjMtLjQ1LS4zNS0uODMtLjM1LS4zIDAtLjU2LjA4LS43OS4yMi0uMjMuMTUtLjQ5LjM2LS43OC42NHY0LjhoLTEuNjF6bTcuMzcgNi40NWMwLS
41Ni4wOS0xLjA2LjI2LTEuNTEuMTgtLjQ1LjQyLS44My43MS0xLjE0LjI5LS4zLjYzLS41NCAxLjAxLS43MS4zOS0uMTcuNzgtLjI1IDEuMTgtLjI1LjQ3IDAgLjg4LjA4IDEuMjMuMjQuMzYuMTYuNjUuMzguODkuNjdzLjQyLjYzLjU0IDEuMDNjLjEyLjQxLjE4Ljg0LjE4IDEuMzIgMCAuMzItLjAyLjU3LS4wNy43NmgtNC4zNmMuMDcuNjIuMjkgMS4xLjY1IDEuNDQuMzYuMzMuODIuNSAxLjM4LjUuMjkgMCAuNTctLjA0LjgzLS4xM3MuNTEtLjIxLjc2LS4zN2wuNTUgMS4wMWMtLjMzLjIxLS42OS4zOS0xLjA5LjUzLS40MS4xNC0uODMuMjEtMS4yNi4yMS0uNDggMC0uOTItLjA4LTEuMzQtLjI1LS40MS0uMTYtLjc2LS40LTEuMDctLjctLjMxLS4zMS0uNTUtLjY5LS43Mi0xLjEzLS4xOC0uNDQtLjI2LS45NS0uMjYtMS41MnptNC42LS42MmMwLS41NS0uMTEtLjk4LS4zNC0xLjI4LS4yMy0uMzEtLjU4LS40Ny0xLjA2LS40Ny0uNDEgMC0uNzcuMTUtMS4wNy40NS0uMzEuMjktLjUuNzMtLjU4IDEuM3ptMi41LjYyYzAtLjU3LjA5LTEuMDguMjgtMS41My4xOC0uNDQuNDMtLjgyLjc1LTEuMTNzLjY5LS41NCAxLjEtLjcxYy40Mi0uMTYuODUtLjI0IDEuMzEtLjI0LjQ1IDAgLjg0LjA4IDEuMTcuMjNzLjYxLjM0Ljg1LjU3bC0uNzcgMS4wMmMtLjE5LS4xNi0uMzgtLjI4LS41Ni0uMzctLjE5LS4wOS0uMzktLjE0LS42MS0uMTQtLjU2IDAtMS4wMS4yMS0xLjM1LjYzLS4zNS40MS0uNTIuOTctLjUyIDEuNjcgMCAuNjkuMTcgMS4yNC41MSAxLjY2LjM0LjQxLjc4LjYyIDEuMzIuNjIuMjggMCAuNTQtLjA2Ljc4LS4xNy4yNC0uMTIuNDUtLjI2LjY0LS40MmwuNjcgMS4wM2MtLjMzLjI5LS42OS41MS0xLjA4LjY1LS4zOS4xNS0uNzguMjMtMS4xOC4yMy0uNDYgMC0uOS0uMDgtMS4zMS0uMjQtLjQtLjE2LS43NS0uMzktMS4wNS0uN3MtLjUzLS42OS0uNy0xLjEzYy0uMTctLjQ1LS4yNS0uOTYtLjI1LTEuNTN6bTYuOTEtNi40NWgxLjU4djYuMTdoLjA1bDIuNTQtMy4xNmgxLjc3bC0yLjM1IDIuOCAyLjU5IDQuMDdoLTEuNzVsLTEuNzctMi45OC0xLjA4IDEuMjN2MS43NWgtMS41OHptMTMuNjkgMS4yN2MtLjI1LS4xMS0uNS0uMTctLjc1LS4xNy0uNTggMC0uODcuMzktLjg3IDEuMTZ2Ljc1aDEuMzR2MS4yN2gtMS4zNHY1LjZoLTEuNjF2LTUuNmgtLjkydi0xLjJsLjkyLS4wN3YtLjcyYzAtLjM1LjA0LS42OC4xMy0uOTguMDgtLjMxLjIxLS41Ny40LS43OXMuNDItLjM5LjcxLS41MWMuMjgtLjEyLjYzLS4xOCAxLjA0LS4xOC4yNCAwIC40OC4wMi42OS4wNy4yMi4wNS40MS4xLjU3LjE3em0uNDggNS4xOGMwLS41Ny4wOS0xLjA4LjI3LTEuNTMuMTctLjQ0LjQxLS44Mi43Mi0xLjEzLjMtLjMxLjY1LS41NCAxLjA0LS43MS4zOS0uMTYuOC0uMjQgMS4yMy0uMjRzLjg0LjA4IDEuMjQuMjRjLjQuMTcuNzQuNCAxLjA0Ljcxcy41NC42OS43MiAxLjEzYy4xOS40NS4yOC45Ni4yOCAxLjUzcy0uMDkgMS4wOC0uMjggMS41M2MtLjE4LjQ0LS40Mi44Mi0uNzIgMS
4xM3MtLjY0LjU0LTEuMDQuNy0uODEuMjQtMS4yNC4yNC0uODQtLjA4LTEuMjMtLjI0LS43NC0uMzktMS4wNC0uN2MtLjMxLS4zMS0uNTUtLjY5LS43Mi0xLjEzLS4xOC0uNDUtLjI3LS45Ni0uMjctMS41M3ptMS42NSAwYzAgLjY5LjE0IDEuMjQuNDMgMS42Ni4yOC40MS42OC42MiAxLjE4LjYyLjUxIDAgLjktLjIxIDEuMTktLjYyLjI5LS40Mi40NC0uOTcuNDQtMS42NiAwLS43LS4xNS0xLjI2LS40NC0xLjY3LS4yOS0uNDItLjY4LS42My0xLjE5LS42My0uNSAwLS45LjIxLTEuMTguNjMtLjI5LjQxLS40My45Ny0uNDMgMS42N3ptNi40OC0zLjQ0aDEuMzNsLjEyIDEuMjFoLjA1Yy4yNC0uNDQuNTQtLjc5Ljg4LTEuMDIuMzUtLjI0LjctLjM2IDEuMDctLjM2LjMyIDAgLjU5LjA1Ljc4LjE0bC0uMjggMS40LS4zMy0uMDljLS4xMS0uMDEtLjIzLS4wMi0uMzgtLjAyLS4yNyAwLS41Ni4xLS44Ni4zMXMtLjU1LjU4LS43NyAxLjF2NC4yaC0xLjYxem0tNDcuODcgMTVoMS42MXY0LjFjMCAuNTcuMDguOTcuMjUgMS4yLjE3LjI0LjQ0LjM1LjgxLjM1LjMgMCAuNTctLjA3LjgtLjIyLjIyLS4xNS40Ny0uMzkuNzMtLjczdi00LjdoMS42MXY2Ljg3aC0xLjMybC0uMTItMS4wMWgtLjA0Yy0uMy4zNi0uNjMuNjQtLjk4Ljg2LS4zNS4yMS0uNzYuMzItMS4yNC4zMi0uNzMgMC0xLjI3LS4yNC0xLjYxLS43MS0uMzMtLjQ3LS41LTEuMTQtLjUtMi4wMnptOS40NiA3LjQzdjIuMTZoLTEuNjF2LTkuNTloMS4zM2wuMTIuNzJoLjA1Yy4yOS0uMjQuNjEtLjQ1Ljk3LS42My4zNS0uMTcuNzItLjI2IDEuMS0uMjYuNDMgMCAuODEuMDggMS4xNS4yNC4zMy4xNy42MS40Ljg0LjcxLjI0LjMxLjQxLjY4LjUzIDEuMTEuMTMuNDIuMTkuOTEuMTkgMS40NCAwIC41OS0uMDkgMS4xMS0uMjUgMS41Ny0uMTYuNDctLjM4Ljg1LS42NSAxLjE2LS4yNy4zMi0uNTguNTYtLjk0LjczLS4zNS4xNi0uNzIuMjUtMS4xLjI1LS4zIDAtLjYtLjA3LS45LS4ycy0uNTktLjMxLS44Ny0uNTZ6bTAtMi4zYy4yNi4yMi41LjM3LjczLjQ1LjI0LjA5LjQ2LjEzLjY2LjEzLjQ2IDAgLjg0LS4yIDEuMTUtLjYuMzEtLjM5LjQ2LS45OC40Ni0xLjc3IDAtLjY5LS4xMi0xLjIyLS4zNS0xLjYxLS4yMy0uMzgtLjYxLS41Ny0xLjEzLS41Ny0uNDkgMC0uOTkuMjYtMS41Mi43N3ptNS44Ny0xLjY5YzAtLjU2LjA4LTEuMDYuMjUtMS41MS4xNi0uNDUuMzctLjgzLjY1LTEuMTQuMjctLjMuNTgtLjU0LjkzLS43MXMuNzEtLjI1IDEuMDgtLjI1Yy4zOSAwIC43My4wNyAxIC4yLjI3LjE0LjU0LjMyLjgxLjU1bC0uMDYtMS4xdi0yLjQ5aDEuNjF2OS44OGgtMS4zM2wtLjExLS43NGgtLjA2Yy0uMjUuMjUtLjU0LjQ2LS44OC42NC0uMzMuMTgtLjY5LjI3LTEuMDYuMjctLjg3IDAtMS41Ni0uMzItMi4wNy0uOTVzLS43Ni0xLjUxLS43Ni0yLjY1em0xLjY3LS4wMWMwIC43NC4xMyAxLjMxLjQgMS43LjI2LjM4LjY1LjU4IDEuMTUuNTguNTEgMCAuOTktLjI2IDEuNDQtLjc3di0zLjIxYy0uMjQtLjIxLS40OC0uMz
YtLjctLjQ1LS4yMy0uMDgtLjQ2LS4xMi0uNy0uMTItLjQ1IDAtLjgyLjE5LTEuMTMuNTktLjMxLjM5LS40Ni45NS0uNDYgMS42OHptNi4zNSAxLjU5YzAtLjczLjMyLTEuMy45Ny0xLjcxLjY0LS40IDEuNjctLjY4IDMuMDgtLjg0IDAtLjE3LS4wMi0uMzQtLjA3LS41MS0uMDUtLjE2LS4xMi0uMy0uMjItLjQzcy0uMjItLjIyLS4zOC0uM2MtLjE1LS4wNi0uMzQtLjEtLjU4LS4xLS4zNCAwLS42OC4wNy0xIC4ycy0uNjMuMjktLjkzLjQ3bC0uNTktMS4wOGMuMzktLjI0LjgxLS40NSAxLjI4LS42My40Ny0uMTcuOTktLjI2IDEuNTQtLjI2Ljg2IDAgMS41MS4yNSAxLjkzLjc2cy42MyAxLjI1LjYzIDIuMjF2NC4wN2gtMS4zMmwtLjEyLS43NmgtLjA1Yy0uMy4yNy0uNjMuNDgtLjk4LjY2cy0uNzMuMjctMS4xNC4yN2MtLjYxIDAtMS4xLS4xOS0xLjQ4LS41Ni0uMzgtLjM2LS41Ny0uODUtLjU3LTEuNDZ6bTEuNTctLjEyYzAgLjMuMDkuNTMuMjcuNjcuMTkuMTQuNDIuMjEuNzEuMjEuMjggMCAuNTQtLjA3Ljc3LS4ycy40OC0uMzEuNzMtLjU2di0xLjU0Yy0uNDcuMDYtLjg2LjEzLTEuMTguMjMtLjMxLjA5LS41Ny4xOS0uNzYuMzFzLS4zMy4yNS0uNDEuNGMtLjA5LjE1LS4xMy4zMS0uMTMuNDh6bTYuMjktMy42M2gtLjk4di0xLjJsMS4wNi0uMDcuMi0xLjg4aDEuMzR2MS44OGgxLjc1djEuMjdoLTEuNzV2My4yOGMwIC44LjMyIDEuMi45NyAxLjIuMTIgMCAuMjQtLjAxLjM3LS4wNC4xMi0uMDMuMjQtLjA3LjM0LS4xMWwuMjggMS4xOWMtLjE5LjA2LS40LjEyLS42NC4xNy0uMjMuMDUtLjQ5LjA4LS43Ni4wOC0uNCAwLS43NC0uMDYtMS4wMi0uMTgtLjI3LS4xMy0uNDktLjMtLjY3LS41Mi0uMTctLjIxLS4zLS40OC0uMzctLjc4LS4wOC0uMy0uMTItLjY0LS4xMi0xLjAxem00LjM2IDIuMTdjMC0uNTYuMDktMS4wNi4yNy0xLjUxcy40MS0uODMuNzEtMS4xNGMuMjktLjMuNjMtLjU0IDEuMDEtLjcxLjM5LS4xNy43OC0uMjUgMS4xOC0uMjUuNDcgMCAuODguMDggMS4yMy4yNC4zNi4xNi42NS4zOC44OS42N3MuNDIuNjMuNTQgMS4wM2MuMTIuNDEuMTguODQuMTggMS4zMiAwIC4zMi0uMDIuNTctLjA3Ljc2aC00LjM3Yy4wOC42Mi4yOSAxLjEuNjUgMS40NC4zNi4zMy44Mi41IDEuMzguNS4zIDAgLjU4LS4wNC44NC0uMTMuMjUtLjA5LjUxLS4yMS43Ni0uMzdsLjU0IDEuMDFjLS4zMi4yMS0uNjkuMzktMS4wOS41M3MtLjgyLjIxLTEuMjYuMjFjLS40NyAwLS45Mi0uMDgtMS4zMy0uMjUtLjQxLS4xNi0uNzctLjQtMS4wOC0uNy0uMy0uMzEtLjU0LS42OS0uNzItMS4xMy0uMTctLjQ0LS4yNi0uOTUtLjI2LTEuNTJ6bTQuNjEtLjYyYzAtLjU1LS4xMS0uOTgtLjM0LTEuMjgtLjIzLS4zMS0uNTgtLjQ3LTEuMDYtLjQ3LS40MSAwLS43Ny4xNS0xLjA4LjQ1LS4zMS4yOS0uNS43My0uNTcgMS4zem0zLjAxIDIuMjNjLjMxLjI0LjYxLjQzLjkyLjU3LjMuMTMuNjMuMi45OC4yLjM4IDAgLjY1LS4wOC44My0uMjNzLjI3LS4zNS4yNy0uNmMwLS4xNC0uMDUtLj
I2LS4xMy0uMzctLjA4LS4xLS4yLS4yLS4zNC0uMjgtLjE0LS4wOS0uMjktLjE2LS40Ny0uMjNsLS41My0uMjJjLS4yMy0uMDktLjQ2LS4xOC0uNjktLjMtLjIzLS4xMS0uNDQtLjI0LS42Mi0uNHMtLjMzLS4zNS0uNDUtLjU1Yy0uMTItLjIxLS4xOC0uNDYtLjE4LS43NSAwLS42MS4yMy0xLjEuNjgtMS40OS40NC0uMzggMS4wNi0uNTcgMS44My0uNTcuNDggMCAuOTEuMDggMS4yOS4yNXMuNzEuMzYuOTkuNTdsLS43NC45OGMtLjI0LS4xNy0uNDktLjMyLS43My0uNDItLjI1LS4xMS0uNTEtLjE2LS43OC0uMTYtLjM1IDAtLjYuMDctLjc2LjIxLS4xNy4xNS0uMjUuMzMtLjI1LjU0IDAgLjE0LjA0LjI2LjEyLjM2cy4xOC4xOC4zMS4yNmMuMTQuMDcuMjkuMTQuNDYuMjFsLjU0LjE5Yy4yMy4wOS40Ny4xOC43LjI5cy40NC4yNC42NC40Yy4xOS4xNi4zNC4zNS40Ni41OC4xMS4yMy4xNy41LjE3LjgyIDAgLjMtLjA2LjU4LS4xNy44My0uMTIuMjYtLjI5LjQ4LS41MS42OC0uMjMuMTktLjUxLjM0LS44NC40NS0uMzQuMTEtLjcyLjE3LTEuMTUuMTctLjQ4IDAtLjk1LS4wOS0xLjQxLS4yNy0uNDYtLjE5LS44Ni0uNDEtMS4yLS42OHoiIGZpbGw9IiM1MzUzNTMiLz48L2c+PC9zdmc+)](https://crossmark.crossref.org/dialog/?doi=10.1007/s42081-023-00209-y) ### Cite this article Efron, B. Machine learning and the James–Stein estimator. *Jpn J Stat Data Sci* **7**, 257–266 (2024). https://doi.org/10.1007/s42081-023-00209-y [Download citation](https://citation-needed.springer.com/v2/references/10.1007/s42081-023-00209-y?format=refman&flavour=citation) - Received: 24 March 2023 - Accepted: 13 May 2023 - Published: 30 June 2023 - Version of record: 30 June 2023 - Issue date: June 2024 - DOI: https://doi.org/10.1007/s42081-023-00209-y ### Share this article Anyone you share the following link with will be able to read this content: Get shareable link Sorry, a shareable link is not currently available for this article. 
### Keywords

- Empirical Bayes
- Shrinkage
- Tweedie’s formula
- Benjamini–Hochberg algorithm
Readable Markdown
## 1 Introduction

By and large, the statistics world is one of heuristics, approximations, and asymptotics. The James–Stein estimator arrived in that world in 1961 on a note of startling specificity: unseen parameters $\mu_1, \mu_2, \dots, \mu_n$ produce independent observations

$$x_i \overset{\text{ind}}{\sim} N(\mu_i, 1), \qquad i = 1, \dots, n, \tag{1}$$

with $n \ge 3$. The James–Stein rule in its simplest form proposed estimating the $\mu_i$ by

$$\hat\mu_i^{\mathrm{JS}} = \left(1 - \frac{n-2}{S}\right) x_i \qquad \left(S = \sum_{i=1}^{n} x_i^2\right). \tag{2}$$

Formula ([2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ2)) looked implausible: the estimate of $\mu_i$ depended on the *other* observations $x_j$, $j \ne i$ (through *S*), as well as $x_i$, despite the independence assumption. Nevertheless, James and Stein showed that Rule ([2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ2)) *always* beat the obvious maximum likelihood estimates

$$\hat\mu_i^{\mathrm{ML}} = x_i \qquad (i = 1, \dots, n) \tag{3}$$

in terms of total expected squared error

$$E\left\{\sum_{i=1}^{n} (\hat\mu_i - \mu_i)^2\right\}. \tag{4}$$

That “always” was the shocking part: two centuries of statistical theory, ANOVA, regression, multivariate analysis, etc., depended on maximum likelihood estimation. Did everything have to be rethought?

One path forward involved Bayesian thinking. If we assumed that the $\mu_i$ themselves came from a normal distribution,

$$\mu_i \overset{\text{ind}}{\sim} N(0, A) \qquad \text{for } i = 1, \dots, n, \tag{5}$$

with variance $A \ge 0$, the Bayes estimates would be

$$\hat\mu_i^{\mathrm{Bayes}} = B x_i \qquad \bigl(B = A/(A+1)\bigr). \tag{6}$$

We don’t know *A* or *B* but

$$\hat B = 1 - (n-2)/S \tag{7}$$

is an unbiased estimate of *B*: we can rewrite ([2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ2)) as

$$\hat\mu_i^{\mathrm{JS}} = \hat B x_i, \tag{8}$$

which at least looks more plausible. In the language introduced by Robbins ([1956](https://link.springer.com/article/10.1007/s42081-023-00209-y#ref-CR9 "Robbins, H. (1956). An empirical Bayes approach to statistics. In Proc. 3rd Berkeley Sympos. Math. Statist. and Prob. (Vol. I, pp. 157–163).
Berkeley: University of California Press.")), formula ([8](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ8)) is an *empirical Bayes* estimator, another shocking post-war statistical innovation. Carl Morris and I wrote a series of papers in the 1970s exploring Bayesian roots of the James–Stein estimator (Efron and Morris, [1973](https://link.springer.com/article/10.1007/s42081-023-00209-y#ref-CR6 "Efron, B., & Morris, C. (1973). Stein’s estimation rule and its competitors—An empirical Bayes approach. Journal of the American Statistical Association, 68, 117–130.")). Something is lost in the empirical Bayes formulation, namely the frequentist “always” of expected squared error minimization, but a lot is gained in flexibility and scope, as discussed in Sect. [2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Sec2).

**Fig. 1** [![Fig. 1](https://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs42081-023-00209-y/MediaObjects/42081_2023_209_Fig1_HTML.png)](https://link.springer.com/article/10.1007/s42081-023-00209-y/figures/1) Prostate data: 6033 *x* values; mean 0.003, sd = 1.135; curve is proportional to a $N(0, 1)$ density

Figure [1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig1) illustrates an example of simultaneous estimation pursued in Sect. 2.1 of Efron ([2010](https://link.springer.com/article/10.1007/s42081-023-00209-y#ref-CR2 "Efron, B. (2010). Large-scale inference: Empirical Bayes methods for estimation, testing, and prediction (Vol. 1). Cambridge: Cambridge University Press.")). A microarray study has compared expression levels between prostate cancer patients and control subjects for $n = 6033$ genes.
For each gene, a statistic $x_i$ has been calculated (essentially a “*z*-value”),

$$x_i \sim N(\mu_i, 1), \qquad i = 1, \dots, n, \tag{9}$$

where $\mu_i$ measures the difference between cancer and control group levels. The solid curve in Fig. [1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig1) is a $N(0, 1)$ density scaled to have the same area as the histogram of the 6033 *x* values. A bad result from the researchers’ point of view would be a perfect fit of curve to histogram, which would imply all the genes have $\mu_i = 0$, the “null” value of no difference between cancer patients and controls. That’s not what happened: the histogram has mildly heavy tails in both directions.

The researchers were hoping to find genes with large values of $\|\mu_i\|$ (ones that might be a clue to prostate cancer etiology), as suggested by the heavy tails. How encouraged should they be? Not very, according to the James–Stein rule. The 6033 $x_i$ values have mean 0.003, which I’ll take to be zero, and empirical variance

$$\hat\sigma^2 = 1.289. \tag{10}$$

With the mean taken as zero, $S = (n-1)\hat\sigma^2$, so the James–Stein estimate ([2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ2)) is

$$\hat\mu_i^{\mathrm{JS}} = \left(1 - \frac{n-2}{n-1}\,\frac{1}{\hat\sigma^2}\right) x_i = 0.224 \cdot x_i, \tag{11}$$

so even $x_i = 5$ yields an estimate barely exceeding 1. Section [2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Sec2) suggests a more optimistic analysis.

## 2 Tweedie’s formula

The impressive precision of the James–Stein theorem came at a cost in generality. Efforts to extend the theorem, say to Poisson rather than normal observations, or to measures of loss other than total squared error, gave encouraging asymptotic results but not the James–Stein kind of finite sample frequentist dominance.

Better progress was possible on the empirical Bayes side of the street. *Tweedie’s formula* (Efron, [2011](https://link.springer.com/article/10.1007/s42081-023-00209-y#ref-CR3 "Efron, B. (2011). Tweedie’s formula and selection bias.
Journal of the American Statistical Association, 106(496), 1602–1614. https://doi.org/10.1198/jasa.2011.tm11181 ")) has been particularly useful. We wish to calculate Bayesian estimates

$$\mu_i^{\mathrm{Bayes}} = E\{\mu_i \mid x_i\}, \qquad i = 1, \dots, n, \tag{12}$$

in the normal sampling model ([1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ1)), starting from a given (possibly non-normal) prior $\pi(\mu)$, applying to all *n* cases. Let *f*(*x*) be the marginal density

$$f(x) = \int_{\mathcal{R}} \pi(\mu)\,\phi(x - \mu)\,d\mu, \tag{13}$$

with $\phi$ the standard $N(0, 1)$ density and $\mathcal{R}$ the range of $\mu$. (It isn’t necessary for $\pi(\cdot)$ to be a continuous distribution but it simplifies notation.) Tweedie’s formula provides an elegant statement for $\mu_i^{\mathrm{Bayes}}$, the posterior expectation of $\mu_i$ given $x_i$,

$$\mu_i^{\mathrm{Bayes}} = E\{\mu_i \mid x_i\} = x_i + l'(x_i) \qquad \text{with} \qquad l'(x_i) = \frac{d}{dx} \log f(x) \Big|_{x = x_i}. \tag{14}$$

In the empirical Bayes situation ([1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ1)), where the prior $\pi(\cdot)$ is unknown, we can use the observed data $x_1, \dots, x_n$ to estimate the marginal density *f*(*x*), say by $\hat f(x)$, giving empirical Bayes estimates

$$\hat\mu_i = x_i + \hat l'(x_i). \tag{15}$$

The Bayes estimate ([14](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ14)) can be thought of as the MLE $x_i$ plus a Bayesian correction term $l'(x_i)$. When the prior $\pi(\mu)$ is the $N(0, A)$ distribution ([5](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ5)), $\mu_i^{\mathrm{Bayes}}$ equals $B x_i$ ([6](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ6)). Simple formulas for $\mu_i^{\mathrm{Bayes}}$ give out for most other choices of $\pi(\mu)$ but now, in the machine learning era ([Footnote 1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fn1)) of statistical research, numerical methods provide useful ways forward, as discussed next.
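As a small numerical check of Tweedie’s formula (14), the sketch below differentiates $\log f(x)$ by central differences and adds the result to $x$. For the $N(0, A)$ prior (5) the marginal is $N(0, A+1)$, so the output should agree with $Bx$ from (6). The function names, the value $A = 0.289$, and the two-component mixture are illustrative assumptions, not taken from the paper; the mixture borrows the numbers 0.825 and 1.33 quoted later for Fig. 3 but uses an exactly normal non-null component, which the paper does not claim.

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_f_normal_prior(x, A):
    """log f(x) when mu ~ N(0, A) and x | mu ~ N(mu, 1): the marginal is N(0, A + 1)."""
    return math.log(normal_pdf(x, A + 1.0))

def log_f_mixture(x, p0=0.825, tau=1.33):
    """Hypothetical marginal: atom of mass p0 at mu = 0, else mu ~ N(0, tau^2)."""
    return math.log(p0 * normal_pdf(x, 1.0) + (1.0 - p0) * normal_pdf(x, 1.0 + tau * tau))

def tweedie(x, log_f, h=1e-4):
    """Tweedie's formula: E{mu | x} = x + d/dx log f(x), slope by central difference."""
    return x + (log_f(x + h) - log_f(x - h)) / (2.0 * h)

A = 0.289                      # assumed prior variance, so that A + 1 = 1.289
B = A / (A + 1.0)
for x in (-2.0, 1.0, 4.0):
    normal_case = tweedie(x, lambda u: log_f_normal_prior(u, A))
    mixture_case = tweedie(x, log_f_mixture)
    print(f"x={x:5.1f}  normal prior: {normal_case:.4f} (B*x={B * x:.4f})  mixture: {mixture_case:.4f}")
```

Under the normal prior the correction term is linear in $x$; under the mixture it is nearly $-x$ close to zero and much smaller in the tails, which is the kind of behavior a closed-form $Bx$ rule cannot express.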
The *log polynomial class* ([Footnote 2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fn2)) of marginal densities defines *f*(*x*) by

$$\log f_\beta(x) = \beta_0 + \beta^\top c(x). \tag{16}$$

Here

$$c(x) = (x, x^2, \dots, x^J)^\top \qquad \text{and} \qquad \beta = (\beta_1, \dots, \beta_J)^\top, \tag{17}$$

with $\beta_0$ chosen to make $f_\beta(x)$ integrate to 1. The choice $J = 2$ gives normal marginals; larger values of *J* allow for marginal non-normality.

**Fig. 2** [![Fig. 2](https://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs42081-023-00209-y/MediaObjects/42081_2023_209_Fig2_HTML.png)](https://link.springer.com/article/10.1007/s42081-023-00209-y/figures/2) Prostate data: Tweedie’s estimate of $E\{\mu \mid x\}$, 5 degrees of freedom; dashed curve is James–Stein estimate

The choice $J = 5$ was applied to the prostate cancer data of Fig. [1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig1): Tweedie’s formula ([14](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ14)) gave $\hat\mu(x) = E\{\mu \mid x\}$, graphed as the solid curve in Fig. [2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig2). It differs markedly from the James–Stein estimate (the $J = 2$ case), the dashed line. At $x = 4$ for example, the $J = 5$ estimate is ([Footnote 3](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fn3))

$$E\{\mu \mid x = 4\} = 2.555 \tag{18}$$

compared to 0.901 for the James–Stein estimate. The estimated curve $E\{\mu \mid x\}$ is *empirical Bayes* in the same sense as ([8](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ8)): the parameter vector $\beta$ was selected by maximum likelihood, as discussed next.
With $J = 5$, the prior was able to adapt to the “fishing expedition” nature of such microarray studies, where we expect most of the genes to be null or close to null, with $\mu_i$ nearly zero (corresponding here to the flat part of the curve for *x* between −2 and 2) and, hopefully, a small proportion of interestingly large $\mu_i$s.

The sample size $n = 6033$ has much to do with Fig. [2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig2). James and Stein ([1961](https://link.springer.com/article/10.1007/s42081-023-00209-y#ref-CR7 "James, W., & Stein, C. (1961). Estimation with quadratic loss. In Proc. 4th Berkeley Sympos. Math. Statist. and Prob. (Vol. I, pp. 361–379). Berkeley: University of California Press.")) was usually considered in terms of small samples, perhaps $n \le 20$, for which there would be little hope of seeing the detail in Fig. [2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig2). The term “machine learning era” seems less fanciful when considering the scale of problems statisticians are now asked to deal with, as well as the tools they use to solve them.

It looks like it might be hard work computing Fig. [2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig2) but it’s not. The histogram in Fig. [1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig1) has 97 bins, with centerpoints

$$vv = (-4.4, -4.3, \dots, 5.1, 5.2). \tag{19}$$

Let $y_j$ be the count in bin *j*, that is, the number of the 6033 $x_i$ values falling into it, with the vector of counts being

$$yy = (y_1, \dots, y_{97}). \tag{20}$$

Then the single R command

$$\widehat{ll} = \log\bigl(\mathtt{glm}(yy \sim \mathtt{poly}(vv, 5),\ \mathtt{poisson})\mathtt{\$fit}\bigr) \tag{21}$$

provides a close approximation to the MLE of $\log f(x)$ in ([14](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ14)); numerical differentiation of $\widehat{ll}$ gives Tweedie’s estimate.
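The recipe in (19)–(21) can be sketched without R. The Python below fits the same kind of Poisson regression of bin counts on a degree-$J$ polynomial in the bin centers by iteratively reweighted least squares, then differentiates the fitted log counts numerically. It is a minimal illustration under stated assumptions, not the computation behind Fig. 2: the helper names (`solve`, `fit_log_poly`, `tweedie_curve`) are invented here, a raw rescaled polynomial basis stands in for R’s orthogonal `poly`, and the counts are synthetic expected counts from an exactly normal $N(0, 1.289)$ marginal rather than the prostate data.

```python
import math

def solve(M, b):
    """Solve the linear system M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (aug[r][n] - sum(aug[r][k] * z[k] for k in range(r + 1, n))) / aug[r][r]
    return z

def fit_log_poly(v, y, J, iters=50):
    """Poisson regression (log link) of counts y on a degree-J polynomial in v,
    fitted by IRLS. Returns the fitted log means at the bin centers."""
    scale = max(abs(u) for u in v)                 # rescale to keep powers well conditioned
    X = [[(u / scale) ** k for k in range(J + 1)] for u in v]
    mu = [c + 0.1 for c in y]                      # starting values away from zero
    eta = [math.log(m) for m in mu]
    for _ in range(iters):
        z = [e + (c - m) / m for e, c, m in zip(eta, y, mu)]   # working response
        XtWX = [[sum(m * r[p] * r[q] for r, m in zip(X, mu)) for q in range(J + 1)]
                for p in range(J + 1)]
        XtWz = [sum(m * r[p] * zi for r, m, zi in zip(X, mu, z)) for p in range(J + 1)]
        beta = solve(XtWX, XtWz)
        eta = [sum(bk * xk for bk, xk in zip(beta, r)) for r in X]
        mu = [math.exp(e) for e in eta]
    return eta

def tweedie_curve(v, ll):
    """Pairs (x, x + d/dx log f(x)) at interior bin centers, by central differences."""
    return [(v[j], v[j] + (ll[j + 1] - ll[j - 1]) / (v[j + 1] - v[j - 1]))
            for j in range(1, len(v) - 1)]

# Synthetic stand-in for the histogram: expected counts under an N(0, 1.289) marginal.
sigma2 = 1.289
v = [round(-4.4 + 0.1 * j, 1) for j in range(97)]
y = [6033 * 0.1 * math.exp(-u * u / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2) for u in v]

ll_hat = fit_log_poly(v, y, J=5)
curve = dict(tweedie_curve(v, ll_hat))
print("Tweedie estimate at x = 4:", round(curve[4.0], 3))
print("linear rule (1 - 1/sigma2) * 4:", round(4.0 * (1 - 1 / sigma2), 3))
```

The fitted log means differ from $\log f$ only by an additive constant (the log of sample size times bin width), which disappears on differentiation. Because the synthetic marginal is exactly normal, the fitted curve should reduce to the straight line $(1 - 1/\hat\sigma^2)\,x$; with real histogram counts in `y`, the higher-order terms are what let the curve bend as in Fig. 2.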
Section 3.4 of Efron ([2023](https://link.springer.com/article/10.1007/s42081-023-00209-y#ref-CR5 "Efron, B. (2023). Exponential Families in Theory and Practice. Cambridge: Cambridge University Press.")) shows why Poisson regression ([21](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ21)) is appropriate here.

The James–Stein theorem depends on the independence assumption in ([1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ1)), unlikely to be true in the microarray study, but the estimates ([2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ2)) have a certain marginal validity even under dependence. This is clearer from the empirical Bayes point of view. The Tweedie estimate $x_i + \hat l'(x_i)$ requires only that $\hat l'(x)$ be close to $l'(x)$, not that it be estimated from independent $x_i$s. ([Footnote 4](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fn4))

## 3 Shrinkage estimators

James and Stein’s paper aroused excited interest in the statistics community when it arrived in 1961. Most of the excitement focused on the strict inadmissibility of the traditional maximum likelihood estimate demonstrated by the James–Stein rule. Other rules dominating the MLE were discovered, for instance the Bayes estimator of Strawderman ([1971](https://link.springer.com/article/10.1007/s42081-023-00209-y#ref-CR10 "Strawderman, W. E. (1971). Proper Bayes minimax estimators of the multivariate normal mean. Annals of Mathematical Statistics, 42(1), 385–388. https://doi.org/10.1214/aoms/1177693528 ")), which was itself admissible while rendering the MLE inadmissible.

Big new ideas can take a while to make their true impact felt.
The James–Stein rule had an influential side effect on subsequent theory and practice in that it demonstrated, in an inarguable way, the virtues of *shrinkage estimation*: given an ensemble of problems, individual estimates are shrunk toward a central point; that is, a deliberate bias is introduced, pulling estimates away from their MLEs for the sake of better group performance. Admissibility and inadmissibility aren’t much in the air these days, while shrinkage estimation has gone on to play a major role in modern practice. A spectacular success story is the lasso (Tibshirani, [1996](https://link.springer.com/article/10.1007/s42081-023-00209-y#ref-CR11 "Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58(1), 267–288.")). Lasso shrinkage is extreme, pulling some (often most) of the coefficient estimates all the way back to zero.

Bayes and empirical Bayes rules tend to be strong shrinkers. Tweedie’s estimate in Fig. [2](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig2) ($J = 5$) shrinks the estimate of $E\{\mu \mid x = 4\}$ from its MLE value 4 down to 2.555. For $\mu$ between −1 and 1, the shrinkage is almost all the way to zero.

The reader may have been surprised to see that neither Tweedie’s formula ([14](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ14)) for $E\{\mu_i \mid x_i\}$ nor its empirical version ([15](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ15)) requires estimation of the prior $\pi(\mu)$. This is a special property of the posterior expectation $E\{\mu_i \mid x_i\}$ and isn’t available for, say, $\Pr\{\mu_i \ge 2 \mid x_i\}$, or most other Bayesian targets.

“Bayesian deconvolution” (Efron, [2016](https://link.springer.com/article/10.1007/s42081-023-00209-y#ref-CR4 "Efron, B. (2016). Empirical Bayes deconvolution estimates. Biometrika, 103(1), 1–20.
https://doi.org/10.1093/biomet/asv068 ")) uses low-dimensional parametric modeling of $\pi(\mu)$ for general empirical Bayes computations. It was applied to finding a prior density $\pi(\mu)$ that would give the distribution of *x* seen in Fig. [1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig1), assuming the normal sampling model ([1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ1)). The deconvolution model for $\pi(\mu)$ used a delta function at $\mu = 0$ (for the “null” genes) and a natural spline function with four degrees of freedom for the non-null cases.

**Fig. 3** [![Fig. 3](https://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs42081-023-00209-y/MediaObjects/42081_2023_209_Fig3_HTML.png)](https://link.springer.com/article/10.1007/s42081-023-00209-y/figures/3) Empirical Bayes conditional density of $\mu$ given $\mu$ not zero; $\Pr\{\mu = 0\}$ equals 0.825

The estimated prior ([Footnote 5](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fn5)) $\hat\pi(\mu)$ is shown in Fig. [3](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig3); it put probability 0.825 on $\mu = 0$, while the conditional distribution given $\mu \ne 0$ was a moderately heavy-tailed version of $N(0, 1.33^2)$. Based on $\hat\pi(\mu)$ we can form estimates of *any* Bayesian target, for instance $\widehat{\Pr}\{\mu_i \ge 2 \mid x_i = 4\} = 0.80$.

Figure [3](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig3) is a direct descendant of the James–Stein rule, now 60-plus years on. A less-direct descendant, but still on the family tree, arrived in 1995. The *false discovery rate* paper by Benjamini and Hochberg concerned simultaneous hypothesis testing. Looking at Fig. [1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig1), which of the $n = 6033$ genes can confidently be labeled as non-null, that is as having $\mu_i \ne 0$?
Suppose for convenience that the $x_i$s are ordered from smallest to largest. The right-sided significance level for testing $\mu_i = 0$ is

$$S_0(x_i) = 1 - \Phi(x_i), \tag{22}$$

where $\Phi$ is the standard normal cumulative distribution function. Of the 6033 genes, 401 had $S_0(x_i) \le 0.05$, the usual rejection level for individual testing, but even if actually *all* of the genes were null we would expect 302 such rejections, so individual testing can’t be right.

Benjamini and Hochberg proposed a novel simultaneous testing rule that safely controls the number of “false discoveries” (genes falsely labeled “non-null”) while not being discouragingly strict. (My summary here won’t give the BH rule its full due; see Chapter 4 of Efron ([2010](https://link.springer.com/article/10.1007/s42081-023-00209-y#ref-CR2 "Efron, B. (2010). Large-scale inference: Empirical Bayes methods for estimation, testing, and prediction (Vol. 1). Cambridge: Cambridge University Press.")) for a more complete description.)

Let $\hat S(x)$ be the observed proportion of $x_i$s exceeding value *x*, and define

$$\widehat{\mathrm{Fdr}}(x) = \pi_0 S_0(x) / \hat S(x), \tag{23}$$

where $\pi_0$ is the proportion of null genes among all *n*. ([Footnote 6](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fn6)) For a fixed control level $\alpha$, such as $\alpha = 0.1$, the BH rule says to reject the null hypothesis $\mu_i = 0$ for those genes having

$$\widehat{\mathrm{Fdr}}(x_i) \le \alpha. \tag{24}$$

The Benjamini–Hochberg theorem states that under independence assumptions like ([1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ1)), the expected proportion of false discoveries by rule ([24](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ24)) is $\alpha$.

**Fig. 4** [![Fig.
4](https://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs42081-023-00209-y/MediaObjects/42081_2023_209_Fig4_HTML.png)](https://link.springer.com/article/10.1007/s42081-023-00209-y/figures/4) Prostate data: left Fdr and right Fdr; dashes show 60 genes with $\mathrm{Fdr} < 0.1$

Figure [4](https://link.springer.com/article/10.1007/s42081-023-00209-y#Fig4) shows $\widehat{\mathrm{Fdr}}$ for the prostate cancer data and also for the left-sided Fdr estimate, where significance is defined by $S_0(x_i) = \Phi(x_i)$ rather than ([22](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ22)). I applied the BH rule with $\alpha = 0.1$, which labeled 60 genes as non-null, 32 on the left and 28 on the right. The BH theorem says that we can expect 6 of the 60 to actually be null.

The Fdr story has evolved very much along the lines of its James–Stein predecessor. Intense initial interest focused on the exact frequentist control of false discovery rates. The Bayes and empirical Bayes implications came later: as at ([5](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ5)), we assume that each $x_i$ is a realization of a random variable *x* given by

$$\mu \sim \pi(\mu) \qquad \text{and} \qquad x \mid \mu \sim p(x \mid \mu), \tag{25}$$

where $p(x \mid \mu)$ is a known probability kernel which I’ll take here to be the normal sampling model ([1](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ1)). Then if *S*(*x*) is 1 minus the cdf of the marginal density ([13](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ13)), Bayes rule gives

$$\Pr\{\mu = 0 \mid x\} = \pi_0 S_0(x) / S(x). \tag{26}$$

Comparing ([26](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ26)) with ([23](https://link.springer.com/article/10.1007/s42081-023-00209-y#Equ23)) says that the BH rule amounts to labeling case *i* as non-null if its obvious empirical Bayes estimate of nullness is less than $\alpha$.
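The estimate (23) and the threshold rule (24) are simple enough to sketch directly. The Python below is illustrative only: the function names are invented, $\pi_0$ defaults to 1 as a conservative assumption, $\hat S(x_i)$ counts observations at or above $x_i$ so that the largest value never divides by zero, and the rule is applied exactly as stated in (24) without the step-up refinement of the full BH procedure. The simulated data (950 nulls, 50 cases centered at 3.5) are made up and are not the prostate data.

```python
import math
import random

def right_tail_null(x):
    """S_0(x) = 1 - Phi(x), the right-sided null significance level of eq. (22)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def fdr_hat(xs, pi0=1.0):
    """Right-sided estimate (23) at each x_i: pi0 * S_0(x_i) / S_hat(x_i), where
    S_hat(x_i) is the proportion of observations at or above x_i (ties ignored)."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i], reverse=True)
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        out[i] = pi0 * right_tail_null(xs[i]) / (rank / n)
    return out

def flag_non_null(xs, alpha=0.1, pi0=1.0):
    """Indices of cases with Fdr_hat(x_i) <= alpha, i.e. rule (24) as written."""
    return [i for i, f in enumerate(fdr_hat(xs, pi0)) if f <= alpha]

# Made-up example: the first 950 cases are null, the last 50 have mu = 3.5.
random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(950)] + [random.gauss(3.5, 1.0) for _ in range(50)]
flagged = flag_non_null(xs, alpha=0.1)
print(len(flagged), "cases flagged;", sum(i >= 950 for i in flagged), "of them truly non-null")
```

A left-sided version follows by applying the same functions to the negated values, since $\Phi(x) = 1 - \Phi(-x)$.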
This empirical Bayes interpretation is less precise than the frequentist control theorem but, as with the James–Stein estimator, is more robust in not demanding independence among the $x_i$s.

The family resemblance between JS and BH is through shrinkage: in the BH case the shrinkage of significance levels. For instance, $x_i = 3$ has individual significance level 0.001 against nullness, whereas $\widehat{\mathrm{Fdr}} = 0.164$ for the prostate data, i.e., still with about a 1/6 chance of gene *i* being null.

So what does machine learning have to do with the James–Stein estimator? Nothing to its birth but, as the articles in this volume show, a great deal to its downstream effects on statistical theory and practice. Charles Stein, who was a good applied statistician when he put his mind to it, might have enjoyed these developments, but maybe not; his heart was always with the mathematics.

## References

- Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. *Journal of the Royal Statistical Society Series B,* *57*(1), 289–300. <https://doi.org/10.1111/j.2517-6161.1995.tb02031.x>
- Efron, B. (2010). *Large-scale inference: Empirical Bayes methods for estimation, testing, and prediction* (Vol. 1). Cambridge: Cambridge University Press.
<https://doi.org/10.1017/CBO9780511761362>
- Efron, B. (2011). Tweedie’s formula and selection bias. *Journal of the American Statistical Association,* *106*(496), 1602–1614. <https://doi.org/10.1198/jasa.2011.tm11181>
- Efron, B. (2016). Empirical Bayes deconvolution estimates. *Biometrika,* *103*(1), 1–20. <https://doi.org/10.1093/biomet/asv068>
- Efron, B. (2023). *Exponential Families in Theory and Practice*. Cambridge: Cambridge University Press.
- Efron, B., & Morris, C. (1973). Stein’s estimation rule and its competitors—An empirical Bayes approach. *Journal of the American Statistical Association,* *68*, 117–130.
- James, W., & Stein, C. (1961). Estimation with quadratic loss. In *Proc. 4th Berkeley Sympos. Math. Statist. and Prob.* (Vol. I, pp. 361–379). Berkeley: University of California Press.
- Narasimhan, B., & Efron, B. (2020). deconvolveR: A G-Modeling Program for Deconvolution and Empirical Bayes Estimation. *Journal of Statistical Software,* *94*(11), 1–20. <https://doi.org/10.18637/jss.v094.i11>
- Robbins, H. (1956). An empirical Bayes approach to statistics. In *Proc. 3rd Berkeley Sympos. Math. Statist. and Prob.* (Vol. I, pp. 157–163). Berkeley: University of California Press.
- Strawderman, W. E. (1971). Proper Bayes minimax estimators of the multivariate normal mean. *Annals of Mathematical Statistics,* *42*(1), 385–388. <https://doi.org/10.1214/aoms/1177693528>
- Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. *Journal of the Royal Statistical Society Series B,* *58*(1), 267–288. <https://doi.org/10.1111/j.2517-6161.1996.tb02080.x>
Shard129 (laksa)
Root Hash17645177711233004329
Unparsed URLcom,springer!link,/article/10.1007/s42081-023-00209-y s443