ℹ️ Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.4 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
| Property | Value |
|---|---|
| URL | https://www.britannica.com/science/probability-theory/The-central-limit-theorem |
| Last Crawled | 2026-04-03 23:36:53 (10 days ago) |
| First Indexed | 2018-02-19 19:49:50 (8 years ago) |
| HTTP Status Code | 200 |
| Meta Title | Probability theory - Central Limit, Statistics, Mathematics \| Britannica |
| Meta Description | Probability theory - Central Limit, Statistics, Mathematics: The desired useful approximation is given by the central limit theorem, which in the special case of the binomial distribution was first discovered by Abraham de Moivre about 1730. Let X1,…, Xn be independent random variables having a common distribution with expectation μ and variance σ2. The law of large numbers implies that the distribution of the random variable X̄n = n−1(X1 +⋯+ Xn) is essentially just the degenerate distribution of the constant μ, because E(X̄n) = μ and Var(X̄n) = σ2/n → 0 as n → ∞. The standardized random variable (X̄n − μ)/(σ/n) has mean 0 and variance |
| Meta Canonical | null |
| Boilerpipe Text | The strong law of large numbers
The mathematical relation between these two experiments was recognized in 1909 by the French mathematician Émile Borel, who used the then new ideas of measure theory to give a precise mathematical model and to formulate what is now called the strong law of large numbers for fair coin tossing. His results can be described as follows. Let e denote a number chosen at random from [0, 1], and let Xₖ(e) be the kth coordinate in the expansion of e to the base 2. Then X₁, X₂,… are an infinite sequence of independent random variables taking the values 0 or 1 with probability 1/2 each. Moreover, the subset of [0, 1] consisting of those e for which the sequence n⁻¹[X₁(e) +⋯+ Xₙ(e)] tends to 1/2 as n → ∞ has probability 1. Symbolically:
The weak law of large numbers given in equation (11) says that for any ε > 0, for each sufficiently large value of n, there is only a small probability of observing a deviation of X̄ₙ = n⁻¹(X₁ +⋯+ Xₙ) from 1/2 which is larger than ε; nevertheless, it leaves open the possibility that sooner or later this rare event will occur if one continues to toss the coin and observe the sequence for a sufficiently long time. The strong law, however, asserts that the occurrence of even one value of Xₖ for k ≥ n that differs from 1/2 by more than ε is an event of arbitrarily small probability provided n is large enough. The proof of equation (14) and various subsequent generalizations is much more difficult than that of the weak law of large numbers. The adjectives “strong” and “weak” refer to the fact that the truth of a result such as equation (14) implies the truth of the corresponding version of equation (11), but not conversely.
Measure theory
During the two decades following 1909, measure theory was used in many concrete problems of probability theory, notably in the American mathematician Norbert Wiener’s treatment (1923) of the mathematical theory of Brownian motion, but the notion that all problems of probability theory could be formulated in terms of measure is customarily attributed to the Soviet mathematician Andrey Nikolayevich Kolmogorov in 1933.
The fundamental quantities of the measure theoretic foundation of probability theory are the sample space S, which as before is just the set of all possible outcomes of an experiment, and a distinguished class M of subsets of S, called events. Unlike the case of finite S, in general not every subset of S is an event. The class M must have certain properties described below. Each event is assigned a probability, which means mathematically that a probability is a function P mapping M into the real numbers that satisfies certain conditions derived from one’s physical ideas about probability.
The properties of M are as follows: (i) S ∊ M; (ii) if A ∊ M, then Aᶜ ∊ M; (iii) if A₁, A₂,… ∊ M, then A₁ ∪ A₂ ∪ ⋯ ∊ M. Recalling that M is the domain of definition of the probability P, one can interpret (i) as saying that P(S) is defined, (ii) as saying that, if the probability of A is defined, then the probability of “not A” is also defined, and (iii) as saying that, if one can speak of the probability of each of a sequence of events Aₙ individually, then one can speak of the probability that at least one of the Aₙ occurs. A class of subsets of any set that has properties (i)–(iii) is called a σ-field. From these properties one can prove others. For example, it follows at once from (i) and (ii) that Ø (the empty set) belongs to the class M. Since the intersection of any class of sets can be expressed as the complement of the union of the complements of those sets (DeMorgan’s law), it follows from (ii) and (iii) that, if A₁, A₂,… ∊ M, then A₁ ∩ A₂ ∩ ⋯ ∊ M.
Given a set S and a σ-field M of subsets of S, a probability measure is a function P that assigns to each set A ∊ M a nonnegative real number and that has the following two properties: (a) P(S) = 1 and (b) if A₁, A₂,… ∊ M and Aᵢ ∩ Aⱼ = Ø for all i ≠ j, then P(A₁ ∪ A₂ ∪ ⋯) = P(A₁) + P(A₂) +⋯. Property (b) is called the axiom of countable additivity. It is clearly motivated by equation (1), which suffices for finite sample spaces because there are only finitely many events. In infinite sample spaces it implies, but is not implied by, equation (1). There is, however, nothing in one’s intuitive notion of probability that requires the acceptance of this property. Indeed, a few mathematicians have developed probability theory with only the weaker axiom of finite additivity, but the absence of interesting models that fail to satisfy the axiom of countable additivity has led to its virtually universal acceptance.
To get a better feeling for this distinction, consider the experiment of tossing a biased coin having probability p of heads and q = 1 − p of tails until heads first appears. To be consistent with the idea that the tosses are independent, the probability that exactly n tosses are required equals qⁿ⁻¹p, since the first n − 1 tosses must be tails, and they must be followed by a head. One can imagine that this experiment never terminates—i.e., that the coin continues to turn up tails forever. By the axiom of countable additivity, however, the probability that heads occurs at some finite value of n equals p + qp + q²p + ⋯ = p/(1 − q) = 1, by the formula for the sum of an infinite geometric series. Hence, the probability that the experiment goes on forever equals 0. Similarly, one can compute the probability that the number of tosses is odd, as p + q²p + q⁴p + ⋯ = p/(1 − q²) = 1/(1 + q). On the other hand, if only finite additivity were required, it would be possible to define the following admittedly bizarre probability. The sample space S is the set of all natural numbers, and the σ-field M is the class of all subsets of S. If an event A contains finitely many elements, P(A) = 0, and, if the complement of A contains finitely many elements, P(A) = 1. As a consequence of the deceptively innocuous axiom of choice (which says that, given any collection C of nonempty sets, there exists a rule for selecting a unique point from each set in C), one can show that many finitely additive probabilities consistent with these requirements exist. However, one cannot be certain what the probability of getting an odd number is, because that set is neither finite nor its complement finite, nor can it be expressed as a finite disjoint union of sets whose probability is already defined.
It is a basic problem, and by no means a simple one, to show that the intuitive notion of choosing a number at random from [0, 1], as described above, is consistent with the preceding definitions. Since the probability of an interval is to be its length, the class of events M must contain all intervals; but in order to be a σ-field it must contain other sets, many of which are difficult to describe in an elementary way. One example is the event in equation (14), which must belong to M in order that one can talk about its probability. Also, although it seems clear that the length of a finite disjoint union of intervals is just the sum of their lengths, a rather subtle argument is required to show that length has the property of countable additivity. A basic theorem says that there is a suitable σ-field containing all the intervals and a unique probability defined on this σ-field for which the probability of an interval is its length. The σ-field is called the class of Lebesgue-measurable sets, and the probability is called the Lebesgue measure, after the French mathematician and principal architect of measure theory, Henri-Léon Lebesgue.
In general, a σ-field need not be all subsets of the sample space S. The question of whether all subsets of [0, 1] are Lebesgue-measurable turns out to be a difficult problem that is intimately connected with the foundations of mathematics and in particular with the axiom of choice.
Probability density functions
For random variables having a continuum of possible values, the function that plays the same role as the probability distribution of a discrete random variable is called a probability density function. If the random variable is denoted by X, its probability density function f has the property that P{a < X ≤ b} = ∫ₐᵇ f(x) dx for every interval (a, b]; i.e., the probability that X falls in (a, b] is the area under the graph of f between a and b (see the figure). For example, if X denotes the outcome of selecting a number at random from the interval [r, s], the probability density function of X is given by f(x) = 1/(s − r) for r < x < s and f(x) = 0 for x < r or x > s. The function F(x) defined by F(x) = P{X ≤ x} is called the distribution function, or cumulative distribution function, of X. If X has a probability density function f(x), the relation between f and F is F′(x) = f(x) or equivalently F(x) = ∫₋∞ˣ f(u) du.
The distribution function F of a discrete random variable should not be confused with its probability distribution f. In this case the relation between F and f is F(x) = ∑ f(xᵢ), the sum extending over all values xᵢ ≤ x.
If a random variable X has a probability density function f(x), its “expectation” can be defined by E(X) = ∫₋∞^∞ x f(x) dx, provided that this integral is convergent. It turns out to be simpler, however, not only to use Lebesgue’s theory of measure to define probabilities but also to use his theory of integration to define expectation. Accordingly, for any random variable X, E(X) is defined to be the Lebesgue integral of X with respect to the probability measure P, provided that the integral exists. In this way it is possible to provide a unified theory in which all random variables, both discrete and continuous, can be treated simultaneously. In order to follow this path, it is necessary to restrict the class of those functions X defined on S that are to be called random variables, just as it was necessary to restrict the class of subsets of S that are called events. The appropriate restriction is that a random variable must be a measurable function. The definition is taken over directly from the Lebesgue theory of integration and will not be discussed here. It can be shown that, whenever X has a probability density function, its expectation (provided it exists) is given by equation (15), which remains a useful formula for calculating E(X).
Some important probability density functions are the following:
The cumulative distribution function of the normal distribution with mean 0 and variance 1 has already appeared as the function G defined following equation (12). The law of large numbers and the central limit theorem continue to hold for random variables on infinite sample spaces. A useful interpretation of the central limit theorem stated formally in equation (12) is as follows: The probability that the average (or sum) of a large number of independent, identically distributed random variables with finite variance falls in an interval (c₁, c₂] equals approximately the area between c₁ and c₂ underneath the graph of a normal density function chosen to have the same expectation and variance as the given average (or sum). The figure illustrates the normal approximation to the binomial distribution with n = 10 and p = 1/2.
The exponential distribution arises naturally in the study of the Poisson distribution introduced in equation (13). If Tₖ denotes the time interval between the emission of the k − 1st and kth particle, then T₁, T₂,… are independent random variables having an exponential distribution with parameter μ. This is obvious for T₁ from the observation that {T₁ > t} = {N(t) = 0}. Hence, P{T₁ ≤ t} = 1 − P{N(t) = 0} = 1 − exp(−μt), and by differentiation one obtains the exponential density function.
The Cauchy distribution does not have a mean value or a variance, because the integral (15) does not converge. As a result, it has a number of unusual properties. For example, if X₁, X₂,…, Xₙ are independent random variables having a Cauchy distribution, the average (X₁ +⋯+ Xₙ)/n also has a Cauchy distribution. The variability of the average is exactly the same as that of a single observation. Another random variable that does not have an expectation is the waiting time until the number of heads first equals the number of tails in tossing a fair coin. |
| Markdown | # [The law of large numbers, the central limit theorem, and the Poisson approximation](https://www.britannica.com/science/probability-theory/An-alternative-interpretation-of-probability#ref32777)
## The [central limit theorem](https://www.britannica.com/science/central-limit-theorem) in [probability theory](https://www.britannica.com/science/probability-theory)
Written by [David O. Siegmund](https://www.britannica.com/contributor/David-O-Siegmund/3760), Professor of Statistics, Stanford University; fact-checked by [the editors of Encyclopaedia Britannica](https://www.britannica.com/editor/The-Editors-of-Encyclopaedia-Britannica/4419). Last updated Mar. 1, 2026.
The desired useful approximation is given by the [central limit theorem](https://www.britannica.com/science/central-limit-theorem), which in the special case of the [binomial distribution](https://www.britannica.com/science/binomial-distribution) was first discovered by [Abraham de Moivre](https://www.britannica.com/biography/Abraham-de-Moivre) about 1730. Let *X*1,…, *X**n* be independent random variables having a common distribution with [expectation](https://www.britannica.com/topic/expected-value) μ and [variance](https://www.britannica.com/topic/variance) σ². The [law of large numbers](https://www.britannica.com/science/law-of-large-numbers) implies that the distribution of the random variable X̄*n* = *n*⁻¹(*X*1 +⋯+ *X**n*) is essentially just the [degenerate](https://www.britannica.com/dictionary/degenerate) distribution of the [constant](https://www.britannica.com/topic/constant) μ, because *E*(X̄*n*) = μ and Var(X̄*n*) = σ²/*n* → 0 as *n* → ∞. The standardized random variable (X̄*n* − μ)/(σ/√*n*) has mean 0 and variance 1. The central limit theorem gives the remarkable result that, for any real numbers *a* and *b*, as *n* → ∞, *P*{*a* \< (X̄*n* − μ)/(σ/√*n*) ≤ *b*} → *G*(*b*) − *G*(*a*), (12) where *G*(*x*) = (1/√(2π)) ∫₋∞ˣ exp(−*u*²/2) d*u* is the cumulative distribution function of the standard normal distribution.
Thus, if *n* is large, the standardized average has a distribution that is approximately the same, regardless of the original distribution of the *X*s. The [equation](https://www.britannica.com/science/equation) also illustrates clearly the square root law: the accuracy of X̄*n* as an estimator of μ is inversely proportional to the [square root](https://www.britannica.com/science/square-root) of the sample size *n*.
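To make the square-root law concrete, the limit can be checked by simulation. The following is a minimal Python sketch (illustrative, not part of the article; it assumes NumPy is available) that standardizes averages of uniformly distributed variables and shows that the result looks standard normal whatever the sample size:

```python
import numpy as np

rng = np.random.default_rng(0)

# X_i ~ Uniform(0, 1), so mu = 1/2 and sigma^2 = 1/12 (illustrative choice).
mu, sigma = 0.5, (1 / 12) ** 0.5

for n in (10, 100, 1000):
    # 100,000 replications of the average of n independent draws.
    averages = rng.random((100_000, n)).mean(axis=1)
    z = (averages - mu) / (sigma / np.sqrt(n))  # standardized average
    # By the central limit theorem, z is approximately standard normal,
    # so P{-2 < z <= 2} should be close to G(2) - G(-2), about 0.95.
    print(n, z.mean().round(3), z.std().round(3),
          np.mean((z > -2) & (z <= 2)).round(3))
```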
Use of equation (12) to evaluate approximately the probability on the left-hand side of equation (11), by setting *b* = −*a* = ε√*n*/σ, yields the approximation *G*(ε√*n*/σ) − *G*(−ε√*n*/σ). Since *G*(2) − *G*(−2) is approximately 0.95, *n* must be about 4σ²/ε² in order that the difference \|X̄*n* − μ\| will be less than ε with probability 0.95. For the special case of the binomial distribution, one can again use the [inequality](https://www.britannica.com/science/inequality) σ² = *p*(1 − *p*) ≤ 1/4 and now conclude that about 1,100 balls must be drawn from the urn in order that the [empirical](https://www.merriam-webster.com/dictionary/empirical) proportion of red balls drawn will be within 0.03 of the true proportion of red balls with probability about 0.95. The frequently appearing statement in newspapers in the [United States](https://www.britannica.com/place/United-States) that a given [opinion poll](https://www.britannica.com/topic/public-opinion-poll) involving a sample of about 1,100 persons has a sampling [error](https://www.britannica.com/science/error-mathematics) of no more than 3 percent is based on this kind of calculation. The qualification that this 3 percent [sampling error](https://www.britannica.com/science/sampling-error) may be exceeded in about 5 percent of the cases is often omitted. (The actual situation in opinion polls or sample surveys generally is more complicated. The sample is drawn without replacement, so, strictly speaking, the binomial distribution is not applicable. However, the “urn”—i.e., the population from which the sample is drawn—is extremely large, in many cases infinitely large for practical purposes. Hence, the [composition](https://www.merriam-webster.com/dictionary/composition) of the urn is effectively the same throughout the sampling process, and the binomial distribution applies as an approximation. Also, the population is usually stratified into relatively [homogeneous](https://www.merriam-webster.com/dictionary/homogeneous) groups, and the survey is designed to take advantage of this stratification. To pursue the [analogy](https://www.merriam-webster.com/dictionary/analogy) with urn models, one can imagine the balls to be in several urns in varying proportions, and one must decide how to [allocate](https://www.merriam-webster.com/dictionary/allocate) the *n* draws from the various urns so as to estimate efficiently the overall proportion of red balls.)
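The arithmetic behind the "1,100 persons, 3 percent" figure is a one-line calculation; a sketch under the article's worst-case assumption σ² = *p*(1 − *p*) ≤ 1/4:

```python
# n must be about 4*sigma^2 / eps^2 for 95 percent coverage (G(2) - G(-2)).
eps = 0.03            # desired sampling error
sigma_sq = 0.25       # worst case of p(1 - p)
n = 4 * sigma_sq / eps**2
print(n)              # 1111.1..., i.e., "about 1,100" respondents
```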
Considerable effort has been put into generalizing both the law of large numbers and the central limit theorem, so that it is unnecessary for the variables to be either independent or identically distributed.
The [law of large numbers](https://www.britannica.com/science/law-of-large-numbers) discussed above is often called the “weak law of large numbers,” to distinguish it from the “strong law,” a conceptually different result discussed below in the section on [infinite](https://www.merriam-webster.com/dictionary/infinite) probability spaces.
## The Poisson approximation
The weak law of large numbers and the central limit theorem give information about the distribution of the proportion of successes in a large number of independent trials when the probability of success on each trial is *p*. In the mathematical formulation of these results, it is assumed that *p* is an arbitrary, but fixed, number in the interval (0, 1) and *n* → ∞, so that the expected number of successes in the *n* trials, *n**p*, also increases toward +∞ with *n*. A rather different kind of approximation is of interest when *n* is large and the probability *p* of success on a single trial is inversely proportional to *n*, so that *n**p* = μ is a fixed number even though *n* → ∞. An example is the following simple model of [radioactive decay](https://www.britannica.com/science/radioactivity) of a source consisting of a large number of atoms, which independently of one another [decay](https://www.britannica.com/dictionary/decay) by spontaneously emitting a particle. The time scale is divided into a large number of very small intervals of equal lengths, and in each interval, independently of what happens in the other intervals, the source emits one or no particle with probability *p* or *q* = 1 − *p* respectively. It is assumed that the intervals are so small that the probability of two or more particles being emitted in a single interval is negligible. One now imagines that the size of the intervals shrinks to 0, so that the number of trials up to any fixed time *t* becomes infinite. It is reasonable to assume that the probability of emission during a short time interval is proportional to the length of the interval. The result is a different kind of approximation to the binomial distribution, called the [Poisson distribution](https://www.britannica.com/topic/Poisson-distribution) (after the French mathematician [Siméon-Denis Poisson](https://www.britannica.com/biography/Simeon-Denis-Poisson)) or the law of small numbers.
Assume, then, that a [biased](https://www.merriam-webster.com/dictionary/biased) [coin](https://www.britannica.com/money/coin) having probability *p* = μδ of heads is tossed once in each time interval of length δ, so that by time *t* the total number of tosses is an integer *n* approximately equal to *t*/δ. Introducing these values into the binomial equation and passing to the limit as δ → 0 gives as the distribution for *N*(*t*), the number of radioactive particles emitted in time *t*: *P*{*N*(*t*) = *k*} = e^(−μ*t*)(μ*t*)^*k*/*k*!, *k* = 0, 1, 2,…. (13)
The right-hand side of this equation is the [Poisson distribution](https://www.britannica.com/topic/Poisson-distribution). Its mean and variance are both equal to μ*t*. Although the Poisson approximation is not comparable to the central limit theorem in importance, it nevertheless provides one of the basic building blocks in the theory of [stochastic processes](https://www.britannica.com/science/stochastic-process).
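The passage to the limit can also be checked numerically. A minimal sketch (the rate μ*t* = 2 is an illustrative choice) that compares the binomial probabilities for ever finer subdivisions of the time axis with the Poisson limit:

```python
from math import comb, exp, factorial

mu_t = 2.0  # expected number of particles by time t (illustrative value)

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, m):
    return exp(-m) * m**k / factorial(k)

for n in (10, 100, 10_000):  # number of small intervals, n ~ t/delta
    p = mu_t / n             # p = mu*delta, so n*p = mu*t stays fixed
    gap = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, mu_t)) for k in range(8))
    print(n, gap)            # the largest discrepancy shrinks as n grows
```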
## Infinite sample spaces and axiomatic probability
## Infinite sample spaces
The experiments described in the preceding discussion involve finite sample spaces for the most part, although the central limit theorem and the Poisson approximation involve limiting operations and hence lead to [integrals](https://www.merriam-webster.com/dictionary/integrals) and [infinite series](https://www.britannica.com/science/infinite-series). In a finite sample space, calculation of the probability of an event *A* is conceptually straightforward because the principle of additivity tells one to calculate the probability of a complicated event as the sum of the probabilities of the individual experimental outcomes whose union defines the event.
Experiments having a [continuum](https://www.merriam-webster.com/dictionary/continuum) of possible outcomes—for example, that of selecting a number at random from the interval \[*r*, *s*\]—involve subtle mathematical difficulties that were not satisfactorily resolved until the 20th century. If one chooses a number at random from \[*r*, *s*\], the probability that the number falls in any interval \[*x*, *y*\] must be proportional to the length of that interval; and, since the probability of the entire sample space \[*r*, *s*\] equals 1, the constant of [proportionality](https://www.britannica.com/science/proportionality) equals 1/(*s* − *r*). Hence, the probability of obtaining a number in the interval \[*x*, *y*\] equals (*y* − *x*)/(*s* − *r*). From this and the principle of additivity one can determine the probability of any event that can be expressed as a [finite](https://www.britannica.com/dictionary/finite) union of intervals. There are, however, very complicated sets having no simple relation to the intervals—e.g., the rational numbers—and it is not immediately clear what the probabilities of these sets should be. Also, the probability of selecting exactly the number *x* must be 0, because the [set](https://www.britannica.com/topic/set-mathematics-and-logic) consisting of *x* alone is contained in the interval \[*x*, *x* + 1/*n*\] for all *n* and hence must have no larger probability than 1/\[*n*(*s* − *r*)\], no matter how large *n* is. Consequently, it makes no sense to try to compute the probability of an event by “adding” the probabilities of the individual outcomes making up the event, because each individual outcome has probability 0.
A closely related experiment, although at first there appears to be no connection, arises as follows. Suppose that a coin is tossed *n* times, and let *X**k* = 1 or 0 according as the outcome of the *k*th toss is heads or tails. The weak law of large numbers given above says that a certain sequence of numbers—namely the sequence of probabilities given in equation (11) and defined in terms of these *n* *X*s—converges to 1 as *n* → ∞. In order to formulate this result, it is only necessary to imagine that one can toss the coin *n* times and that this finite number of tosses can be arbitrarily large. In other words, there is a sequence of experiments, but each one involves a finite sample space. It is also natural to ask whether the sequence of [random variables](https://www.britannica.com/topic/random-variable) (*X*1 +⋯+ *X**n*)/*n* converges as *n* → ∞. However, this question cannot even be formulated mathematically unless infinitely many *X*s can be defined on the same sample space, which in turn requires that the underlying experiment involve an actual [infinity](https://www.britannica.com/science/infinity-mathematics) of coin tosses.
For the [conceptual](https://www.merriam-webster.com/dictionary/conceptual) experiment of tossing a fair coin infinitely many times, the sequence of zeros and ones, (*X*1, *X*2,…), can be identified with that [real number](https://www.britannica.com/science/real-number) that has the *X*s as the coefficients of its expansion in the base 2, namely *X*1/2^1 + *X*2/2^2 + *X*3/2^3 +⋯. For example, the outcome of getting heads on the first two tosses and tails thereafter corresponds to the real number 1/2 + 1/4 + 0/8 +⋯ = 3/4. (There are some technical mathematical difficulties that arise from the fact that some numbers have two representations. Obviously 1/2 = 1/2 + 0/4 +⋯, and the formula for the sum of an [infinite](https://www.britannica.com/dictionary/infinite) [geometric series](https://www.britannica.com/science/geometric-series) shows that it also equals 0/2 + 1/4 + 1/8 +⋯. It can be shown that these difficulties do not pose a serious problem, and they are ignored in the subsequent discussion.) For any particular specification *i*1, *i*2,…, *i**n* of zeros and ones, the event {*X*1 = *i*1, *X*2 = *i*2,…, *X**n* = *i**n*} must have probability 1/2^*n* in order to be consistent with the experiment of tossing the coin only *n* times. Moreover, this event corresponds to the interval of real numbers \[*i*1/2^1 + *i*2/2^2 +⋯+ *i**n*/2^*n*, *i*1/2^1 + *i*2/2^2 +⋯+ *i**n*/2^*n* + 1/2^*n*\] of length 1/2^*n*, since any continuation *X**n*+1, *X**n*+2,… corresponds to a number that is at least 0 and at most 1/2^(*n*+1) + 1/2^(*n*+2) +⋯ = 1/2^*n* by the formula for an infinite geometric series. It follows that the [mathematical model](https://www.britannica.com/science/mathematical-model) for choosing a number at random from \[0, 1\] and that of tossing a fair coin infinitely many times assign the same probabilities to all intervals of the form \[*k*/2^*n*, (*k* + 1)/2^*n*\].
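The correspondence between toss sequences and dyadic intervals is easy to compute explicitly; a minimal sketch (the function name is hypothetical):

```python
def dyadic_interval(tosses):
    """Map a finite prefix of coin tosses (1 = heads, 0 = tails) to the
    interval of numbers in [0, 1] whose base-2 expansion begins that way."""
    left = sum(bit / 2 ** (k + 1) for k, bit in enumerate(tosses))
    return left, left + 1 / 2 ** len(tosses)

# Heads on the first two tosses pins the number into an interval of
# length 1/2^2 = 1/4, matching the probability of that prefix:
print(dyadic_interval((1, 1)))     # (0.75, 1.0)
print(dyadic_interval((1, 0, 1)))  # (0.625, 0.75), length 1/8
```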
# The strong law of large numbers
The mathematical relation between these two experiments was recognized in 1909 by the French mathematician [Émile Borel](https://www.britannica.com/biography/Emile-Borel), who used the then new ideas of [measure theory](https://www.britannica.com/science/analysis-mathematics/Some-key-ideas-of-complex-analysis#ref218295) to give a precise [mathematical model](https://www.britannica.com/science/mathematical-model) and to formulate what is now called the strong [law of large numbers](https://www.britannica.com/science/law-of-large-numbers) for fair [coin](https://www.britannica.com/money/coin) tossing. His results can be described as follows. Let *e* denote a number chosen at random from \[0, 1\], and let *X**k*(*e*) be the *k*th coordinate in the expansion of *e* to the base 2. Then *X*1, *X*2,… are an [infinite](https://www.merriam-webster.com/dictionary/infinite) sequence of independent random variables taking the values 0 or 1 with probability 1/2 each. Moreover, the subset of \[0, 1\] consisting of those *e* for which the sequence *n*−1\[*X*1(*e*) +⋯+ *X**n*(*e*)\] tends to 1/2 as *n* → ∞ has probability 1. Symbolically:
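Borel's theorem can be watched in action by drawing *e* uniformly and reading off its binary digits; a simulation sketch (assumes NumPy; 50 digits stay within double-precision accuracy):

```python
import numpy as np

rng = np.random.default_rng(1)

def binary_digits(e, n):
    """First n base-2 digits X_1(e), ..., X_n(e) of a number e in [0, 1)."""
    digits = []
    for _ in range(n):
        e *= 2
        d = int(e)
        digits.append(d)
        e -= d
    return digits

e = rng.random()              # a number chosen at random from [0, 1)
X = binary_digits(e, 50)
for n in (10, 25, 50):
    print(n, sum(X[:n]) / n)  # n^-1 [X_1(e) + ... + X_n(e)] -> 1/2
```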
The weak law of large numbers given in [equation](https://www.britannica.com/science/equation) (11) says that for any ε \> 0, for each sufficiently large value of *n*, there is only a small probability of observing a deviation of X̄*n* = *n*⁻¹(*X*1 +⋯+ *X**n*) from 1/2 which is larger than ε; nevertheless, it leaves open the possibility that sooner or later this rare event will occur if one continues to toss the coin and observe the sequence for a sufficiently long time. The strong law, however, [asserts](https://www.britannica.com/dictionary/asserts) that the occurrence of even one value of *X**k* for *k* ≥ *n* that differs from 1/2 by more than ε is an event of arbitrarily small probability provided *n* is large enough. The [proof](https://www.britannica.com/topic/proof-logic) of equation ([14](https://www.britannica.com/science/probability-theory/The-central-limit-theorem#ref-14390)) and various subsequent generalizations is much more difficult than that of the weak law of large numbers. The adjectives “strong” and “weak” refer to the fact that the truth of a result such as equation ([14](https://www.britannica.com/science/probability-theory/The-central-limit-theorem#ref-14390)) implies the truth of the corresponding version of equation ([11](https://www.britannica.com/science/probability-theory/The-central-limit-theorem#ref-14393)), but not conversely.
## [Measure theory](https://www.britannica.com/science/measure-theory)
During the two decades following 1909, measure theory was used in many concrete problems of probability theory, notably in the American mathematician [Norbert Wiener](https://www.britannica.com/biography/Norbert-Wiener)’s [treatment](https://www.britannica.com/dictionary/treatment) (1923) of the mathematical theory of [Brownian motion](https://www.britannica.com/science/Brownian-motion), but the notion that all problems of probability theory could be formulated in terms of measure is customarily attributed to the Soviet mathematician [Andrey Nikolayevich Kolmogorov](https://www.britannica.com/biography/Andrey-Nikolayevich-Kolmogorov) in 1933.
The fundamental quantities of the measure theoretic foundation of probability theory are the sample space *S*, which as before is just the [set](https://www.britannica.com/topic/set-mathematics-and-logic) of all possible outcomes of an experiment, and a distinguished class *M* of subsets of *S*, called events. Unlike the case of finite *S*, in general not every subset of *S* is an event. The class *M* must have certain properties described below. Each event is assigned a probability, which means mathematically that a probability is a [function](https://www.britannica.com/science/function-mathematics) *P* [mapping](https://www.britannica.com/science/mapping) *M* into the real numbers that satisfies certain conditions [derived](https://www.britannica.com/dictionary/derived) from one’s physical ideas about probability.
The properties of *M* are as follows: (i) *S* ∊ *M*; (ii) if *A* ∊ *M*, then *A**c* ∊ *M*; (iii) if *A*1, *A*2,… ∊ *M*, then *A*1 ∪ *A*2 ∪ ⋯ ∊ *M*. Recalling that *M* is the domain of definition of the probability *P*, one can interpret (i) as saying that *P*(*S*) is defined, (ii) as saying that, if the probability of *A* is defined, then the probability of “not *A*” is also defined, and (iii) as saying that, if one can speak of the probability of each of a sequence of events *A**n* individually, then one can speak of the probability that at least one of the *A**n* occurs. A class of subsets of any set that has properties (i)–(iii) is called a σ-field. From these properties one can prove others. For example, it follows at once from (i) and (ii) that Ø (the empty set) belongs to the class *M*. Since the intersection of any class of sets can be expressed as the [complement](https://www.britannica.com/dictionary/complement) of the union of the complements of those sets ([DeMorgan’s law](https://www.britannica.com/topic/De-Morgan-laws)), it follows from (ii) and (iii) that, if *A*1, *A*2,… ∊ *M*, then *A*1 ∩ *A*2 ∩ ⋯ ∊ *M*.
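For a finite sample space the three closure properties, and consequences such as De Morgan's law, can be verified mechanically. A toy sketch with *M* taken to be the power set of a three-element *S* (finite unions standing in for the countable ones):

```python
from itertools import chain, combinations

S = frozenset({1, 2, 3})
# M = the power set of S, the largest sigma-field of subsets of S.
M = {frozenset(c) for c in chain.from_iterable(
        combinations(S, r) for r in range(len(S) + 1))}

assert S in M                                  # property (i)
assert all(S - A in M for A in M)              # property (ii): complements
assert all(A | B in M for A in M for B in M)   # property (iii): unions
# De Morgan: an intersection is the complement of a union of complements.
assert all(A & B == S - ((S - A) | (S - B)) for A in M for B in M)
print("the power set of S satisfies (i)-(iii)")
```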
Given a set *S* and a σ-field *M* of subsets of *S*, a probability measure is a function *P* that assigns to each set *A* ∊ *M* a nonnegative [real number](https://www.britannica.com/science/real-number) and that has the following two properties: (*a*) *P*(*S*) = 1 and (*b*) if *A*1, *A*2,… ∊ *M* and *A**i* ∩ *A**j* = Ø for all *i* ≠ *j*, then *P*(*A*1 ∪ *A*2 ∪ ⋯) = *P*(*A*1) + *P*(*A*2) +⋯. Property (*b*) is called the [axiom](https://www.britannica.com/topic/axiom) of countable additivity. It is clearly motivated by equation (1), which [suffices](https://www.merriam-webster.com/dictionary/suffices) for finite sample spaces because there are only finitely many events. In infinite sample spaces it implies, but is not implied by, equation (1). There is, however, nothing in one’s intuitive notion of probability that requires the acceptance of this property. Indeed, a few mathematicians have developed probability theory with only the weaker axiom of finite additivity, but the absence of interesting models that fail to satisfy the axiom of countable additivity has led to its virtually universal acceptance.
To get a better feeling for this distinction, consider the experiment of tossing a [biased](https://www.merriam-webster.com/dictionary/biased) coin having probability *p* of heads and *q* = 1 − *p* of tails until heads first appears. To be consistent with the idea that the tosses are independent, the probability that exactly *n* tosses are required equals *q*^(*n*−1)*p*, since the first *n* − 1 tosses must be tails, and they must be followed by a head. One can imagine that this experiment never terminates—i.e., that the coin continues to turn up tails forever. By the axiom of countable additivity, however, the probability that heads occurs at some finite value of *n* equals *p* + *qp* + *q*²*p* + ⋯ = *p*/(1 − *q*) = 1, by the formula for the sum of an [infinite](https://www.britannica.com/dictionary/infinite) [geometric series](https://www.britannica.com/science/geometric-series). Hence, the probability that the experiment goes on forever equals 0. Similarly, one can compute the probability that the number of tosses is odd, as *p* + *q*²*p* + *q*⁴*p* + ⋯ = *p*/(1 − *q*²) = 1/(1 + *q*). On the other hand, if only finite additivity were required, it would be possible to define the following admittedly bizarre probability. The sample space *S* is the set of all natural numbers, and the σ-field *M* is the class of all subsets of *S*. If an event *A* contains finitely many elements, *P*(*A*) = 0, and, if the complement of *A* contains finitely many elements, *P*(*A*) = 1. As a consequence of the deceptively [innocuous](https://www.merriam-webster.com/dictionary/innocuous) [axiom of choice](https://www.britannica.com/science/axiom-of-choice) (which says that, given any collection *C* of nonempty sets, there exists a rule for selecting a unique point from each set in *C*), one can show that many finitely additive probabilities consistent with these requirements exist. However, one cannot be certain what the probability of getting an odd number is, because that set is neither finite nor its complement finite, nor can it be expressed as a finite disjoint union of sets whose probability is already defined.
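Both series are plain geometric sums and can be checked numerically; a sketch with an illustrative bias *p* = 0.3:

```python
p = 0.3      # illustrative probability of heads
q = 1 - p

# P(heads eventually appears) = p + qp + q^2 p + ... = 1:
print(sum(q ** (n - 1) * p for n in range(1, 200)))     # ~1.0
# P(the number of tosses is odd) = p + q^2 p + q^4 p + ...:
print(sum(q ** (n - 1) * p for n in range(1, 200, 2)))  # ~0.5882
print(1 / (1 + q))                                      # the same value
```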
It is a basic problem, and by no means a simple one, to show that the [intuitive](https://www.britannica.com/dictionary/intuitive) notion of choosing a number at random from \[0, 1\], as described above, is consistent with the preceding definitions. Since the probability of an interval is to be its length, the class of events *M* must contain all intervals; but in order to be a σ-field it must contain other sets, many of which are difficult to describe in an elementary way. One example is the event in equation (14), which must belong to *M* in order that one can talk about its probability. Also, although it seems clear that the length of a finite disjoint union of intervals is just the sum of their lengths, a rather subtle argument is required to show that length has the property of countable additivity. A basic [theorem](https://www.britannica.com/topic/theorem) says that there is a suitable σ-field containing all the intervals and a unique probability defined on this σ-field for which the probability of an interval is its length. The σ-field is called the class of Lebesgue-measurable sets, and the probability is called the [Lebesgue measure](https://www.britannica.com/science/Lebesgue-measure), after the French mathematician and principal architect of measure theory, [Henri-Léon Lebesgue](https://www.britannica.com/biography/Henri-Leon-Lebesgue).
In general, a σ-field need not be all subsets of the sample space *S*. The question of whether all subsets of \[0, 1\] are Lebesgue-measurable turns out to be a difficult problem that is intimately connected with the [foundations of mathematics](https://www.britannica.com/science/foundations-of-mathematics) and in particular with the axiom of choice.
## [Probability density functions](https://www.britannica.com/science/density-function)
[Figure: probability density function](https://cdn.britannica.com/30/3630-004-89103408/Probability-density-function.jpg)
For random variables having a [continuum](https://www.merriam-webster.com/dictionary/continuum) of possible values, the function that plays the same role as the probability distribution of a discrete [random variable](https://www.britannica.com/topic/random-variable) is called a probability density function. If the random variable is denoted by *X*, its probability density function *f* has the property that *P*{*a* \< *X* ≤ *b*} = ∫ₐᵇ *f*(*x*) d*x* for every interval (*a*, *b*\]; i.e., the probability that *X* falls in (*a*, *b*\] is the area under the [graph](https://www.britannica.com/science/graph-mathematics) of *f* between *a* and *b* (*see* the figure). For example, if *X* denotes the outcome of selecting a number at random from the interval \[*r*, *s*\], the probability density function of *X* is given by *f*(*x*) = 1/(*s* − *r*) for *r* \< *x* \< *s* and *f*(*x*) = 0 for *x* \< *r* or *x* \> *s*. The function *F*(*x*) defined by *F*(*x*) = *P*{*X* ≤ *x*} is called the distribution function, or [cumulative](https://www.merriam-webster.com/dictionary/cumulative) [distribution function](https://www.britannica.com/science/distribution-function), of *X*. If *X* has a probability density function *f*(*x*), the relation between *f* and *F* is *F*′(*x*) = *f*(*x*) or equivalently *F*(*x*) = ∫₋∞ˣ *f*(*u*) d*u*.
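For the uniform density the area interpretation is easy to verify numerically; a minimal sketch (the endpoints *r*, *s* and the interval (*a*, *b*] are illustrative; assumes NumPy):

```python
import numpy as np

r, s = 2.0, 5.0             # illustrative endpoints of [r, s]

def f(x):                   # density of the uniform distribution on [r, s]
    return np.where((r < x) & (x < s), 1 / (s - r), 0.0)

def F(x):                   # its cumulative distribution function
    return np.clip((x - r) / (s - r), 0.0, 1.0)

a, b = 2.5, 4.0
xs = np.linspace(a, b, 100_001)
area = np.trapz(f(xs), xs)  # numerical integral of f over (a, b]
print(area, F(b) - F(a))    # both equal (b - a)/(s - r) = 0.5
```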
The distribution function *F* of a discrete random variable should not be confused with its probability distribution *f*. In this case the relation between *F* and *f* is *F*(*x*) = ∑ *f*(*x*ᵢ), the sum extending over all values *x*ᵢ ≤ *x*.
If a random variable *X* has a probability density function *f*(*x*), its “expectation” can be defined by *E*(*X*) = ∫₋∞^∞ *x f*(*x*) d*x* (15), provided that this [integral](https://www.britannica.com/science/integral-mathematics) is convergent. It turns out to be simpler, however, not only to use Lebesgue’s theory of measure to define probabilities but also to use his theory of [integration](https://www.britannica.com/science/integration-mathematics) to define [expectation](https://www.britannica.com/topic/expected-value). Accordingly, for any random variable *X*, *E*(*X*) is defined to be the [Lebesgue integral](https://www.britannica.com/science/Lebesgue-integral) of *X* with respect to the probability measure *P*, provided that the [integral](https://www.merriam-webster.com/dictionary/integral) exists. In this way it is possible to provide a unified theory in which all random variables, both discrete and continuous, can be treated simultaneously. In order to follow this path, it is necessary to restrict the class of those functions *X* defined on *S* that are to be called random variables, just as it was necessary to restrict the class of subsets of *S* that are called events. The appropriate restriction is that a random variable must be a measurable function. The definition is taken over directly from the Lebesgue theory of [integration](https://www.merriam-webster.com/dictionary/integration) and will not be discussed here. It can be shown that, whenever *X* has a probability density function, its expectation (provided it exists) is given by equation ([15](https://www.britannica.com/science/probability-theory/The-central-limit-theorem#ref-14389)), which remains a useful formula for calculating *E*(*X*).
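Equation (15) can likewise be checked numerically for the uniform density above, where the expectation should be the midpoint (*r* + *s*)/2:

```python
import numpy as np

r, s = 2.0, 5.0                    # the same illustrative uniform density
xs = np.linspace(r, s, 1_000_001)
fx = np.full_like(xs, 1 / (s - r))

E = np.trapz(xs * fx, xs)          # equation (15): integral of x f(x) dx
print(E, (r + s) / 2)              # both 3.5
```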
Some important probability density functions are the following:
[Figure: normal approximation to the binomial distribution](https://cdn.britannica.com/31/3631-004-96A99F98/approximation-distribution.jpg)
The cumulative distribution function of the [normal distribution](https://www.britannica.com/topic/normal-distribution) with mean 0 and [variance](https://www.britannica.com/topic/variance) 1 has already appeared as the function *G* defined following [equation (12](https://www.britannica.com/science/probability-theory/The-central-limit-theorem#ref-14392)). The law of large numbers and the [central limit theorem](https://www.britannica.com/science/central-limit-theorem) continue to hold for random variables on [infinite](https://www.britannica.com/science/infinite-set) sample spaces. A useful interpretation of the central limit theorem stated formally in equation ([12](https://www.britannica.com/science/probability-theory/The-central-limit-theorem#ref-14392)) is as follows: The probability that the average (or sum) of a large number of independent, identically distributed random variables with finite variance falls in an interval (*c*1, *c*2\] equals approximately the area between *c*1 and *c*2 underneath the graph of a normal density function chosen to have the same expectation and variance as the given average (or sum). The figure illustrates the normal approximation to the [binomial distribution](https://www.britannica.com/science/binomial-distribution) with *n* = 10 and *p* = 1/2.
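The approximation shown in the figure can be reproduced directly; a sketch comparing the exact binomial probabilities for *n* = 10, *p* = 1/2 with the matching normal areas (the half-unit shift is the usual continuity correction, an assumption beyond the article's statement):

```python
from math import comb, erf, sqrt

n, p = 10, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))  # matching mean and variance

def G(x):  # standard normal cumulative distribution function, via erf
    return 0.5 * (1 + erf(x / sqrt(2)))

for k in range(n + 1):
    exact = comb(n, k) * p**k * (1 - p) ** (n - k)
    # Normal area over (k - 1/2, k + 1/2], continuity-corrected:
    approx = G((k + 0.5 - mu) / sigma) - G((k - 0.5 - mu) / sigma)
    print(k, round(exact, 4), round(approx, 4))
```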
The [exponential distribution](https://www.britannica.com/science/exponential-distribution) arises naturally in the study of the [Poisson distribution](https://www.britannica.com/topic/Poisson-distribution) introduced in equation ([13](https://www.britannica.com/science/probability-theory/The-central-limit-theorem#ref-14391)). If *T**k* denotes the time interval between the emission of the *k* − 1st and *k*th particle, then *T*1, *T*2,… are independent random variables having an exponential distribution with [parameter](https://www.merriam-webster.com/dictionary/parameter) μ. This is obvious for *T*1 from the observation that {*T*1 \> *t*} = {*N*(*t*) = 0}. Hence, *P*{*T*1 ≤ *t*} = 1 − *P*{*N*(*t*) = 0} = 1 − exp(−μ*t*), and by [differentiation](https://www.britannica.com/science/differentiation-mathematics) one obtains the exponential density function.
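The exponential law for *T*1 can be recovered by simulating the coin-tossing scheme itself, without assuming the answer; a sketch with illustrative rate μ and interval length δ (assumes NumPy):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, delta = 1.5, 1e-3  # illustrative emission rate and interval length

# In each interval of length delta a particle is emitted with p = mu*delta,
# so the index of the first emission is geometric and T_1 ~ index * delta.
T1 = rng.geometric(mu * delta, size=100_000) * delta

for t in (0.5, 1.0, 2.0):
    print(t, np.mean(T1 <= t).round(3),    # empirical P{T_1 <= t}
          (1 - np.exp(-mu * t)).round(3))  # 1 - exp(-mu t)
```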
The [Cauchy distribution](https://www.britannica.com/science/Cauchy-distribution) does not have a mean value or a variance, because the integral ([15](https://www.britannica.com/science/probability-theory/The-central-limit-theorem#ref-14389)) does not converge. As a result, it has a number of unusual properties. For example, if *X*1, *X*2,…, *X**n* are independent random variables having a Cauchy distribution, the average (*X*1 +⋯+ *X**n*)/*n* also has a Cauchy distribution. The variability of the average is exactly the same as that of a single observation. Another random variable that does not have an expectation is the waiting time until the number of heads first equals the number of tails in tossing a fair coin.
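The failure of averaging to reduce variability can be seen by simulation; a sketch (assumes NumPy) comparing the spread of Cauchy averages as *n* grows:

```python
import numpy as np

rng = np.random.default_rng(3)

for n in (1, 10, 1000):
    # 50,000 replications of the average of n standard Cauchy variables.
    avg = rng.standard_cauchy((50_000, n)).mean(axis=1)
    q25, q75 = np.percentile(avg, [25, 75])
    # The interquartile range stays near 2 for every n; for a finite-variance
    # distribution it would shrink like 1/sqrt(n).
    print(n, round(q75 - q25, 3))
```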
|
| Readable Markdown | ## The strong law of large numbers
The mathematical relation between these two experiments was recognized in 1909 by the French mathematician [Émile Borel](https://www.britannica.com/biography/Emile-Borel), who used the then new ideas of [measure theory](https://www.britannica.com/science/analysis-mathematics/Some-key-ideas-of-complex-analysis#ref218295) to give a precise [mathematical model](https://www.britannica.com/science/mathematical-model) and to formulate what is now called the strong [law of large numbers](https://www.britannica.com/science/law-of-large-numbers) for fair [coin](https://www.britannica.com/money/coin) tossing. His results can be described as follows. Let *e* denote a number chosen at random from \[0, 1\], and let *X**k*(*e*) be the *k*th coordinate in the expansion of *e* to the base 2. Then *X*1, *X*2,… are an [infinite](https://www.merriam-webster.com/dictionary/infinite) sequence of independent random variables taking the values 0 or 1 with probability 1/2 each. Moreover, the subset of \[0, 1\] consisting of those *e* for which the sequence *n*−1\[*X*1(*e*) +⋯+ *X**n*(*e*)\] tends to 1/2 as *n* → ∞ has probability 1. Symbolically:
The weak law of large numbers given in [equation](https://www.britannica.com/science/equation) (11) says that for any ε \> 0, for each sufficiently large value of *n*, there is only a small probability of observing a deviation of *X**n* = *n*−1(*X*1 +⋯+ *X**n*) from 1/2 which is larger than ε; nevertheless, it leaves open the possibility that sooner or later this rare event will occur if one continues to toss the coin and observe the sequence for a sufficiently long time. The strong law, however, [asserts](https://www.britannica.com/dictionary/asserts) that the occurrence of even one value of *X**k* for *k* ≥ *n* that differs from 1/2 by more than ε is an event of arbitrarily small probability provided *n* is large enough. The [proof](https://www.britannica.com/topic/proof-logic) of equation ([14](https://www.britannica.com/science/probability-theory/The-central-limit-theorem#ref-14390)) and various subsequent generalizations is much more difficult than that of the weak law of large numbers. The adjectives “strong” and “weak” refer to the fact that the truth of a result such as equation ([14](https://www.britannica.com/science/probability-theory/The-central-limit-theorem#ref-14390)) implies the truth of the corresponding version of equation ([11](https://www.britannica.com/science/probability-theory/The-central-limit-theorem#ref-14393)), but not conversely.
## [Measure theory](https://www.britannica.com/science/measure-theory)
During the two decades following 1909, measure theory was used in many concrete problems of probability theory, notably in the American mathematician [Norbert Wiener](https://www.britannica.com/biography/Norbert-Wiener)’s [treatment](https://www.britannica.com/dictionary/treatment) (1923) of the mathematical theory of [Brownian motion](https://www.britannica.com/science/Brownian-motion), but the notion that all problems of probability theory could be formulated in terms of measure is customarily attributed to the Soviet mathematician [Andrey Nikolayevich Kolmogorov](https://www.britannica.com/biography/Andrey-Nikolayevich-Kolmogorov) in 1933.
The fundamental quantities of the measure theoretic foundation of probability theory are the sample space *S*, which as before is just the [set](https://www.britannica.com/topic/set-mathematics-and-logic) of all possible outcomes of an experiment, and a distinguished class *M* of subsets of *S*, called events. Unlike the case of finite *S*, in general not every subset of *S* is an event. The class *M* must have certain properties described below. Each event is assigned a probability, which means mathematically that a probability is a [function](https://www.britannica.com/science/function-mathematics) *P* [mapping](https://www.britannica.com/science/mapping) *M* into the real numbers that satisfies certain conditions [derived](https://www.britannica.com/dictionary/derived) from one’s physical ideas about probability.
The properties of *M* are as follows: (i) $S \in M$; (ii) if $A \in M$, then $A^c \in M$; (iii) if $A_1, A_2,\ldots \in M$, then $A_1 \cup A_2 \cup \cdots \in M$. Recalling that *M* is the domain of definition of the probability *P*, one can interpret (i) as saying that *P*(*S*) is defined, (ii) as saying that, if the probability of *A* is defined, then the probability of “not *A*” is also defined, and (iii) as saying that, if one can speak of the probability of each of a sequence of events $A_n$ individually, then one can speak of the probability that at least one of the $A_n$ occurs. A class of subsets of any set that has properties (i)–(iii) is called a σ-field. From these properties one can prove others. For example, it follows at once from (i) and (ii) that Ø (the empty set) belongs to the class *M*. Since the intersection of any class of sets can be expressed as the complement of the union of the complements of those sets ([De Morgan's law](https://www.britannica.com/topic/De-Morgan-laws)), it follows from (ii) and (iii) that, if $A_1, A_2,\ldots \in M$, then $A_1 \cap A_2 \cap \cdots \in M$.
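On a finite sample space countable unions reduce to finite ones, so properties (i)–(iii) can be checked by brute force. A minimal Python sketch, with all function and variable names my own:

```python
from itertools import combinations

def powerset(s):
    """All subsets of s, as frozensets."""
    s = list(s)
    return [frozenset(c) for r in range(len(s) + 1)
            for c in combinations(s, r)]

def is_sigma_field(sample_space, events):
    """Check properties (i)-(iii); finite unions suffice on a finite space."""
    events, S = set(events), frozenset(sample_space)
    if S not in events:                                   # (i) S is an event
        return False
    if any(S - A not in events for A in events):          # (ii) complements
        return False
    if any(A | B not in events
           for A in events for B in events):              # (iii) unions
        return False
    return True

S = {1, 2, 3, 4}
print(is_sigma_field(S, powerset(S)))                        # True: all subsets
print(is_sigma_field(S, [frozenset(), frozenset({1, 2}),
                         frozenset({3, 4}), frozenset(S)]))  # True
print(is_sigma_field(S, [frozenset(S), frozenset({1})]))     # False
```

The last example fails precisely because the complement {2, 3, 4} of {1} is missing, illustrating why not every class of subsets qualifies as a class of events.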
Given a set *S* and a σ-field *M* of subsets of *S*, a probability measure is a function *P* that assigns to each set $A \in M$ a nonnegative [real number](https://www.britannica.com/science/real-number) and that has the following two properties: (*a*) *P*(*S*) = 1 and (*b*) if $A_1, A_2,\ldots \in M$ and $A_i \cap A_j = \varnothing$ for all $i \ne j$, then $P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots$. Property (*b*) is called the [axiom](https://www.britannica.com/topic/axiom) of countable additivity. It is clearly motivated by equation (1), which suffices for finite sample spaces because there are only finitely many events. In infinite sample spaces it implies, but is not implied by, equation (1). There is, however, nothing in one’s intuitive notion of probability that requires the acceptance of this property. Indeed, a few mathematicians have developed probability theory with only the weaker axiom of finite additivity, but the absence of interesting models that fail to satisfy the axiom of countable additivity has led to its virtually universal acceptance.
To get a better feeling for this distinction, consider the experiment of tossing a biased coin having probability *p* of heads and *q* = 1 − *p* of tails until heads first appears. To be consistent with the idea that the tosses are independent, the probability that exactly *n* tosses are required equals $q^{n-1}p$, since the first *n* − 1 tosses must be tails, and they must be followed by a head. One can imagine that this experiment never terminates—i.e., that the coin continues to turn up tails forever. By the axiom of countable additivity, however, the probability that heads occurs at some finite value of *n* equals $p + qp + q^2p + \cdots = p/(1 - q) = 1$, by the formula for the sum of an infinite [geometric series](https://www.britannica.com/science/geometric-series). Hence, the probability that the experiment goes on forever equals 0. Similarly, one can compute the probability that the number of tosses is odd as $p + q^2p + q^4p + \cdots = p/(1 - q^2) = 1/(1 + q)$. On the other hand, if only finite additivity were required, it would be possible to define the following admittedly bizarre probability. The sample space *S* is the set of all natural numbers, and the σ-field *M* is the class of all subsets of *S*. If an event *A* contains finitely many elements, *P*(*A*) = 0, and, if the complement of *A* contains finitely many elements, *P*(*A*) = 1. As a consequence of the deceptively innocuous [axiom of choice](https://www.britannica.com/science/axiom-of-choice) (which says that, given any collection *C* of nonempty sets, there exists a rule for selecting a unique point from each set in *C*), one can show that many finitely additive probabilities consistent with these requirements exist. However, one cannot say what the probability of getting an odd number is, because the set of odd numbers is neither finite nor of finite complement, nor can it be expressed as a finite disjoint union of sets whose probability is already defined.
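The countably additive computation is easy to confirm numerically. A quick sketch, where the value of *p* is an arbitrary example of mine, not from the text:

```python
p = 0.3
q = 1 - p
N = 10_000   # truncation point; the tail beyond it is negligible

# P(heads first appears on toss n) = q**(n-1) * p
total = sum(q**(n - 1) * p for n in range(1, N))
odd   = sum(q**(n - 1) * p for n in range(1, N, 2))

print(total, 1.0)          # ~1: heads eventually appears
print(odd, 1 / (1 + q))    # ~0.588: matches p / (1 - q**2)
```

Under the finitely additive “bizarre” measure, by contrast, no such computation is available: the odd numbers form a set whose probability the axioms simply do not determine.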
It is a basic problem, and by no means a simple one, to show that the [intuitive](https://www.britannica.com/dictionary/intuitive) notion of choosing a number at random from \[0, 1\], as described above, is consistent with the preceding definitions. Since the probability of an interval is to be its length, the class of events *M* must contain all intervals; but in order to be a σ-field it must contain other sets, many of which are difficult to describe in an elementary way. One example is the event in equation (14), which must belong to *M* in order that one can talk about its probability. Also, although it seems clear that the length of a finite disjoint union of intervals is just the sum of their lengths, a rather subtle argument is required to show that length has the property of countable additivity. A basic [theorem](https://www.britannica.com/topic/theorem) says that there is a suitable σ-field containing all the intervals and a unique probability defined on this σ-field for which the probability of an interval is its length. The σ-field is called the class of Lebesgue-measurable sets, and the probability is called the [Lebesgue measure](https://www.britannica.com/science/Lebesgue-measure), after the French mathematician and principal architect of measure theory, [Henri-Léon Lebesgue](https://www.britannica.com/biography/Henri-Leon-Lebesgue).
In general, a σ-field need not be all subsets of the sample space *S*. The question of whether all subsets of \[0, 1\] are Lebesgue-measurable turns out to be a difficult problem that is intimately connected with the [foundations of mathematics](https://www.britannica.com/science/foundations-of-mathematics) and in particular with the axiom of choice.
## [Probability density functions](https://www.britannica.com/science/density-function)
For random variables having a continuum of possible values, the function that plays the same role as the probability distribution of a discrete [random variable](https://www.britannica.com/topic/random-variable) is called a probability density function. If the random variable is denoted by *X*, its probability density function *f* has the property that

$$P\{a < X \le b\} = \int_a^b f(x)\,dx$$

for every interval $(a, b]$; i.e., the probability that *X* falls in $(a, b]$ is the area under the [graph](https://www.britannica.com/science/graph-mathematics) of *f* between *a* and *b* (*see* the figure). For example, if *X* denotes the outcome of selecting a number at random from the interval $[r, s]$, the probability density function of *X* is given by $f(x) = 1/(s - r)$ for $r < x < s$ and $f(x) = 0$ for $x < r$ or $x > s$. The function *F*(*x*) defined by $F(x) = P\{X \le x\}$ is called the distribution function, or cumulative [distribution function](https://www.britannica.com/science/distribution-function), of *X*. If *X* has a probability density function *f*(*x*), the relation between *f* and *F* is $F'(x) = f(x)$ or equivalently

$$F(x) = \int_{-\infty}^{x} f(y)\,dy.$$
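For the uniform example just given, both sides of the defining property can be computed directly. A Python sketch in which the endpoint values and the helper names are mine:

```python
# X is uniform on [r, s].
r, s = 2.0, 5.0

def pdf(x):
    return 1.0 / (s - r) if r < x < s else 0.0

def cdf(x):
    return min(max((x - r) / (s - r), 0.0), 1.0)

def integrate(f, a, b, steps=100_000):
    # Midpoint Riemann sum; adequate for this piecewise-constant example.
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

a, b = 2.5, 4.0
print(integrate(pdf, a, b))   # ~0.5: P(a < X <= b) as the area under f
print(cdf(b) - cdf(a))        # 0.5 exactly, via F(b) - F(a)
```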
The distribution function *F* of a discrete random variable should not be confused with its probability distribution *f*. In this case the relation between *F* and *f* is

$$F(x) = \sum_{x_i \le x} f(x_i),$$

where the sum extends over all possible values $x_i$ of the random variable that do not exceed *x*.
If a random variable *X* has a probability density function *f*(*x*), its “expectation” can be defined by

$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx, \tag{15}$$

provided that this [integral](https://www.britannica.com/science/integral-mathematics) is convergent. It turns out to be simpler, however, not only to use Lebesgue’s theory of measure to define probabilities but also to use his theory of [integration](https://www.britannica.com/science/integration-mathematics) to define [expectation](https://www.britannica.com/topic/expected-value). Accordingly, for any random variable *X*, *E*(*X*) is defined to be the [Lebesgue integral](https://www.britannica.com/science/Lebesgue-integral) of *X* with respect to the probability measure *P*, provided that the integral exists. In this way it is possible to provide a unified theory in which all random variables, both discrete and continuous, can be treated simultaneously. In order to follow this path, it is necessary to restrict the class of those functions *X* defined on *S* that are to be called random variables, just as it was necessary to restrict the class of subsets of *S* that are called events. The appropriate restriction is that a random variable must be a measurable function. The definition is taken over directly from the Lebesgue theory of integration and will not be discussed here. It can be shown that, whenever *X* has a probability density function, its expectation (provided it exists) is given by equation ([15](https://www.britannica.com/science/probability-theory/The-central-limit-theorem#ref-14389)), which remains a useful formula for calculating *E*(*X*).
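Equation (15) is straightforward to evaluate numerically for a concrete density. A sketch using the exponential density that appears below, where the parameter value and the truncation point are arbitrary choices of mine:

```python
import math

mu = 2.0
f = lambda x: mu * math.exp(-mu * x)    # exponential density for x > 0

def integrate(g, a, b, steps=200_000):
    # Midpoint Riemann sum over a truncated range; the tail of this
    # density beyond b is negligible.
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h

# Equation (15): E(X) = integral of x * f(x) dx
print(integrate(lambda x: x * f(x), 0.0, 50.0))   # ~0.5 = 1/mu
```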
Some important probability density functions are the following:

- the normal density with mean μ and variance σ²: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\big(-(x - \mu)^2/(2\sigma^2)\big)$ for −∞ < *x* < ∞;
- the exponential density with parameter μ: $f(x) = \mu e^{-\mu x}$ for *x* > 0, and $f(x) = 0$ otherwise;
- the Cauchy density: $f(x) = \frac{1}{\pi(1 + x^2)}$ for −∞ < *x* < ∞.
The cumulative distribution function of the [normal distribution](https://www.britannica.com/topic/normal-distribution) with mean 0 and [variance](https://www.britannica.com/topic/variance) 1 has already appeared as the function *G* defined following [equation (12)](https://www.britannica.com/science/probability-theory/The-central-limit-theorem#ref-14392). The law of large numbers and the [central limit theorem](https://www.britannica.com/science/central-limit-theorem) continue to hold for random variables on [infinite](https://www.britannica.com/science/infinite-set) sample spaces. A useful interpretation of the central limit theorem stated formally in equation ([12](https://www.britannica.com/science/probability-theory/The-central-limit-theorem#ref-14392)) is as follows: the probability that the average (or sum) of a large number of independent, identically distributed random variables with finite variance falls in an interval $(c_1, c_2]$ equals approximately the area between $c_1$ and $c_2$ underneath the graph of a normal density function chosen to have the same expectation and variance as the given average (or sum). The figure illustrates the normal approximation to the [binomial distribution](https://www.britannica.com/science/binomial-distribution) with *n* = 10 and *p* = 1/2.
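The comparison shown in the figure is easy to tabulate directly. A Python sketch, with helper names my own, that sets the binomial probabilities for *n* = 10, *p* = 1/2 beside the normal density with the same mean *np* and variance *np*(1 − *p*):

```python
import math

n, p = 10, 0.5
mean = n * p                      # 5.0
sd = math.sqrt(n * p * (1 - p))   # sqrt(2.5)

def binom_pmf(k):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x):
    return math.exp(-(x - mean)**2 / (2 * sd**2)) / (sd * math.sqrt(2 * math.pi))

for k in range(n + 1):
    print(k, round(binom_pmf(k), 4), round(normal_pdf(k), 4))
```

At *k* = 5, for instance, the exact probability 252/1024 ≈ 0.246 sits close to the normal density value ≈ 0.252, even for *n* as small as 10.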
The [exponential distribution](https://www.britannica.com/science/exponential-distribution) arises naturally in the study of the [Poisson distribution](https://www.britannica.com/topic/Poisson-distribution) introduced in equation ([13](https://www.britannica.com/science/probability-theory/The-central-limit-theorem#ref-14391)). If $T_k$ denotes the time interval between the emission of the (*k* − 1)st and the *k*th particle, then $T_1, T_2,\ldots$ are independent random variables having an exponential distribution with parameter μ. This is obvious for $T_1$ from the observation that $\{T_1 > t\} = \{N(t) = 0\}$. Hence, $P\{T_1 \le t\} = 1 - P\{N(t) = 0\} = 1 - \exp(-\mu t)$, and by [differentiation](https://www.britannica.com/science/differentiation-mathematics) one obtains the exponential density function.
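The connection can be checked by simulation: summing independent exponential gaps with parameter μ and counting how many arrivals fall in [0, *t*] should reproduce Poisson counts *N*(*t*). A sketch in which the parameter values, seed, and names are my choices:

```python
import math, random

mu, t, trials = 2.0, 3.0, 100_000
random.seed(0)

counts = []
for _ in range(trials):
    elapsed, arrivals = 0.0, 0
    while True:
        elapsed += random.expovariate(mu)   # one exponential gap T_k
        if elapsed > t:
            break
        arrivals += 1
    counts.append(arrivals)

print(sum(counts) / trials)                       # ~ mu * t = 6.0
print(sum(1 for c in counts if c == 0) / trials,  # empirical P(N(t) = 0)
      math.exp(-mu * t))                          # vs exp(-mu t) ~ 0.0025
```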
The [Cauchy distribution](https://www.britannica.com/science/Cauchy-distribution) does not have a mean value or a variance, because the integral ([15](https://www.britannica.com/science/probability-theory/The-central-limit-theorem#ref-14389)) does not converge. As a result, it has a number of unusual properties. For example, if $X_1, X_2,\ldots, X_n$ are independent random variables having a Cauchy distribution, the average $(X_1 + \cdots + X_n)/n$ also has a Cauchy distribution. The variability of the average is exactly the same as that of a single observation. Another random variable that does not have an expectation is the waiting time until the number of heads first equals the number of tails in tossing a fair coin.
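This failure of averaging is easy to see in simulation. A sketch in which the seed, the sample sizes, and the inverse-CDF construction $\tan(\pi(U - \tfrac{1}{2}))$ are my choices, since Python's standard library has no built-in Cauchy generator:

```python
import math, random

random.seed(0)

def cauchy():
    # Standard Cauchy variate by inverting its distribution function.
    return math.tan(math.pi * (random.random() - 0.5))

for n in (10, 1_000, 100_000):
    avg = sum(cauchy() for _ in range(n)) / n
    print(n, avg)   # no settling down: each average is itself Cauchy
```

However large *n* is made, the printed averages fluctuate as wildly as a single observation, in sharp contrast to the law-of-large-numbers behavior seen earlier for coin tossing. |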
| Shard | 62 (laksa) |
| Root Hash | 5455945239613777662 |
| Unparsed URL | com,britannica!www,/science/probability-theory/The-central-limit-theorem s443 |