πŸ•·οΈ Crawler Inspector

URL Lookup

Direct Parameter Lookup

Raw Queries and Responses

1. Shard Calculation

Query:
Response:
Calculated Shard: 143 (from laksa157)

2. Crawled Status Check

Query:
Response:

3. Robots.txt Check

Query:
Response:

4. Spam/Ban Check

Query:
Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

πŸ“„ INDEXABLE Β· βœ… CRAWLED (20 days ago) Β· πŸ€– ROBOTS ALLOWED

Page Info Filters

| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.7 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |

Page Details

| Property | Value |
|---|---|
| URL | https://chrispiech.github.io/probabilityForComputerScientists/en/part4/beta/ |
| Last Crawled | 2026-03-28 05:01:11 (20 days ago) |
| First Indexed | 2021-01-11 02:55:37 (5 years ago) |
| HTTP Status Code | 200 |
| Meta Title | Beta Distribution |
| Meta Description | null |
| Meta Canonical | null |
Boilerpipe Text
# Beta Distribution

The Beta distribution is the distribution most often used as the distribution of probabilities. In this section we are going to have a very meta discussion about how we represent probabilities. Until now probabilities have just been numbers in the range 0 to 1. However, if we have uncertainty about our probability, it would make sense to represent our probabilities as random variables (and thus articulate the relative likelihood of our belief).

**Beta Random Variable**

| | |
|---|---|
| Notation: | $X \sim \text{Beta}(a, b)$ |
| Description: | A belief distribution over the value of a probability $p$ from a Binomial distribution, after observing $a - 1$ successes and $b - 1$ fails. |
| Parameters: | $a > 0$, the number of successes + 1; $b > 0$, the number of fails + 1 |
| Support: | $x \in [0, 1]$ |
| PDF equation: | $f(x) = \frac{1}{B(a, b)} \cdot x^{a-1} \cdot (1 - x)^{b-1}$, where $B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}$ |
| CDF equation: | No closed form |
| Expectation: | $E[X] = \frac{a}{a + b}$ |
| Variance: | $\text{Var}(X) = \frac{ab}{(a + b)^2(a + b + 1)}$ |

## What is your Belief in p After 9 Heads in 10 Flips?

Imagine we have a coin and we would like to know its true probability of coming up heads, $p$. We flip the coin 10 times and observe 9 heads and 1 tail. What is your belief in $p$ based on this evidence? Using the definition of probability we could guess that $p \approx \frac{9}{10}$. That number is a very rough estimate, especially since it is only based on 10 coin flips. Moreover, the "point-value" $\frac{9}{10}$ does not have the ability to articulate how uncertain it is. Could we instead have a random variable for the true probability?

Formally, let $X$ represent the true probability of the coin coming up heads. We don't use the symbol $P$ for random variables, so $X$ will have to do. If $X = 0.7$ then the probability of heads is 0.7. $X$ must be a continuous random variable with support $[0, 1]$, since probabilities are continuous values which must be between 0 and 1.
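The quantities in the table above can be checked directly in code. Below is a minimal standard-library sketch of the Beta PDF, expectation, and variance (the function names are our own; in practice a library routine such as `scipy.stats.beta` provides the same quantities):

```python
from math import gamma

def beta_fn(a: float, b: float) -> float:
    """Normalizing constant B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)."""
    return gamma(a) * gamma(b) / gamma(a + b)

def beta_pdf(x: float, a: float, b: float) -> float:
    """PDF of X ~ Beta(a, b) evaluated at x; zero outside (0, 1)."""
    if not 0 < x < 1:
        return 0.0
    return x ** (a - 1) * (1 - x) ** (b - 1) / beta_fn(a, b)

def beta_mean(a: float, b: float) -> float:
    return a / (a + b)

def beta_var(a: float, b: float) -> float:
    return a * b / ((a + b) ** 2 * (a + b + 1))

# Belief after 9 heads and 1 tail: a = 9 + 1 = 10, b = 1 + 1 = 2
print(beta_mean(10, 2))  # E[X] = 10/12, about 0.833
print(beta_var(10, 2))
```

Note that `beta_pdf(x, 1, 1)` is 1 for every `x` in (0, 1), which previews the fact, used later in the chapter, that Beta(1, 1) is the uniform distribution.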
Before flipping the coin, we could say that our belief about the coin's heads probability is uniform: $X \sim \text{Uni}(0, 1)$. Let $H$ be a random variable for the number of heads and let $T$ be a random variable for the number of tails observed. What is $P(X = x \mid H = 9, T = 1)$? That probability is hard to think about! However, it is much easier to reason about the probability with the condition reversed: $P(H = 9, T = 1 \mid X = x)$. This term asks the question: what is the probability of seeing 9 heads and 1 tail in 10 coin flips, given that the true probability of a heads is $x$? Convince yourself that this probability is just a binomial probability mass function with $n = 10$ experiments and $p = x$, evaluated at $k = 9$ heads:

$$P(H = 9, T = 1 \mid X = x) = \binom{10}{9} x^9 (1 - x)^1$$

We are presented with a perfect context for Bayes' theorem with random variables. We know a conditional probability in one direction and we would like to know it in the other:

$$
\begin{aligned}
f(X = x \mid H = 9, T = 1) &= \frac{P(H = 9, T = 1 \mid X = x) \cdot f(X = x)}{P(H = 9, T = 1)} && \text{Bayes' theorem} \\
&= \frac{\binom{10}{9} x^9 (1 - x)^1 \cdot f(X = x)}{P(H = 9, T = 1)} && \text{Binomial PMF} \\
&= \frac{\binom{10}{9} x^9 (1 - x)^1 \cdot 1}{P(H = 9, T = 1)} && \text{Uniform PDF} \\
&= \frac{\binom{10}{9}}{P(H = 9, T = 1)} \, x^9 (1 - x)^1 && \text{Constants to front} \\
&= K \cdot x^9 (1 - x)^1 && \text{Rename constant}
\end{aligned}
$$

Let's take a look at that function. For now we can let $K = 110$. Regardless of $K$ we will get the same shape, just scaled. What a beautiful image. It tells us the relative likelihood over the probability that is governing our coin flips. Here are a few observations from this chart:

1. Even after only 10 coin flips we are very confident that the true probability is $> 0.5$.
2. It is almost 10 times more likely that $X = 0.9$ as it is that $X = 0.6$.
3. $f(X = 1) = 0$, which makes sense. How could we have flipped that one tail if the probability of heads was 1?

***Wait but why?***
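The value $K = 110$ and the "almost 10 times more likely" observation can both be recovered numerically. A quick sketch using a simple midpoint-rule integration (not a production integrator):

```python
from math import comb

def g(x: float) -> float:
    """Unnormalized posterior: C(10, 9) * x^9 * (1 - x)^1."""
    return comb(10, 9) * x ** 9 * (1 - x)

# Midpoint-rule estimate of P(H = 9, T = 1) = integral of g over [0, 1]
n = 100_000
total = sum(g((i + 0.5) / n) for i in range(n)) / n

K = comb(10, 9) / total
print(round(K))          # 110, matching the text

# Observation 2 from the chart: X = 0.9 vs X = 0.6
print(g(0.9) / g(0.6))   # about 9.6, i.e. "almost 10 times"
```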
In the derivation above for $f(X = x \mid H = 9, T = 1)$ we made the claim that $P(H = 9, T = 1)$ is a constant. A lot of folks find that hard to believe. Why is that the case? It may be helpful to juxtapose $P(H = 9, T = 1)$ with $P(H = 9, T = 1 \mid X = x)$. The latter says "what is the probability of 9 heads, given the true probability is $x$". The former says "what is the probability of 9 heads, under all possible assignments of $x$". If you wanted to calculate $P(H = 9, T = 1)$ you could use the law of total probability:

$$P(H = 9, T = 1) = \int_{y=0}^{1} P(H = 9, T = 1 \mid X = y) \, f(X = y) \, dy$$

That is a hard number to calculate, but it is in fact a constant with respect to $x$.

## Beta Derivation

Let's generalize the derivation from the previous section, using $h$ for the number of observed heads and $t$ for the number of observed tails. Let $H = h$ be the event that we saw $h$ heads, and let $T = t$ be the event that we saw $t$ tails, in $h + t$ coin flips. We want to calculate the probability density function $f(X = x \mid H = h, T = t)$. We can use the exact same series of steps, starting with Bayes' theorem:

$$
\begin{aligned}
f(X = x \mid H = h, T = t) &= \frac{P(H = h, T = t \mid X = x) \, f(X = x)}{P(H = h, T = t)} && \text{Bayes' theorem} \\
&= \frac{\binom{h+t}{h} x^h (1 - x)^t}{P(H = h, T = t)} && \text{Binomial PMF, Uniform PDF} \\
&= \frac{\binom{h+t}{h}}{P(H = h, T = t)} \, x^h (1 - x)^t && \text{Moving terms around} \\
&= \frac{1}{c} \cdot x^h (1 - x)^t && \text{where } c = \int_0^1 x^h (1 - x)^t \, dx
\end{aligned}
$$

The equation that we arrived at when using a Bayesian approach to estimating our probability defines a probability density function, and thus a random variable. The random variable is called a Beta distribution, and it is defined as follows. The Probability Density Function (PDF) for $X \sim \text{Beta}(a, b)$ is:

$$f(X = x) = \begin{cases} \frac{1}{B(a, b)} x^{a-1} (1 - x)^{b-1} & \text{if } 0 < x < 1 \\ 0 & \text{otherwise} \end{cases} \qquad \text{where } B(a, b) = \int_0^1 x^{a-1} (1 - x)^{b-1} \, dx$$

A Beta distribution has $E[X] = \frac{a}{a + b}$ and $\text{Var}(X) = \frac{ab}{(a + b)^2(a + b + 1)}$.
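The normalizing constant $c = \int_0^1 x^h (1 - x)^t \, dx$ in the derivation is exactly $B(h + 1, t + 1)$, which is how the derivation lands on a $\text{Beta}(h + 1, t + 1)$ density. A standard-library sketch that checks this numerically for the 9-heads, 1-tail example:

```python
from math import gamma

def beta_fn(a: float, b: float) -> float:
    """B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)."""
    return gamma(a) * gamma(b) / gamma(a + b)

h, t = 9, 1

# Midpoint-rule estimate of c = integral of x^h (1 - x)^t over [0, 1]
n = 200_000
c_numeric = sum(((i + 0.5) / n) ** h * (1 - (i + 0.5) / n) ** t
                for i in range(n)) / n

print(c_numeric)              # about 1/110
print(beta_fn(h + 1, t + 1))  # B(10, 2) = 1/110
```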
All modern programming languages have a package for calculating Beta CDFs. You will not be expected to compute the CDF by hand in CS109. To model our estimate of the probability of a coin coming up heads, set $a = h + 1$ and $b = t + 1$.

Beta is used as a random variable to represent a belief distribution of probabilities in contexts beyond estimating coin flips. For example, perhaps a drug has been given to 6 patients, 4 of whom have been cured. We could express our belief in the probability that the drug can cure patients as $X \sim \text{Beta}(a = 5, b = 3)$. Notice how the most likely belief for the probability of curing a patient is $4/6$, the fraction of patients cured. This distribution shows that we hold a non-zero belief that the probability could be something other than $4/6$. It is unlikely that the probability is 0.01 or 0.09, but reasonably likely that it could be 0.5.

## Beta as a Prior

You can set $X \sim \text{Beta}(a, b)$ as a prior to reflect how biased you think the coin is, a priori, before flipping it. This is a subjective judgment that represents $a + b - 2$ "imaginary" trials with $a - 1$ heads and $b - 1$ tails. If you then observe $h + t$ real trials with $h$ heads, you can update your belief. Your new belief would be $X \sim \text{Beta}(a + h, b + t)$. Using the prior $\text{Beta}(1, 1) = \text{Uni}(0, 1)$ is the same as saying we haven't seen any "imaginary" trials, so a priori we know nothing about the coin.
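To make the drug example concrete: for integer parameters, the Beta CDF can be written as a finite binomial sum via the order-statistic identity $P(X \le x) = P(\text{Bin}(a + b - 1, x) \ge a)$. The sketch below uses that identity with only the standard library; the function name is our own, and in practice you would simply call a library routine such as `scipy.stats.beta.cdf`:

```python
from math import comb

def beta_cdf_int(x: float, a: int, b: int) -> float:
    """CDF of Beta(a, b) for integer a, b >= 1, via
    P(X <= x) = P(Binomial(a + b - 1, x) >= a)."""
    n = a + b - 1
    return sum(comb(n, k) * x ** k * (1 - x) ** (n - k)
               for k in range(a, n + 1))

# Belief after 4 cures in 6 patients: X ~ Beta(5, 3).
# How likely is it that the true cure probability exceeds 0.5?
print(1 - beta_cdf_int(0.5, 5, 3))  # 99/128, about 0.773
```

So under this belief, there is roughly a 77% chance the drug cures more than half of patients, even though the point estimate is $4/6$.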
Here is the proof for the distribution of $X$ when the prior is a Beta too. If our prior belief is $X \sim \text{Beta}(a, b)$, then our posterior is $\text{Beta}(a + h, b + t)$:

$$
\begin{aligned}
f(X = x \mid H = h, T = t) &= \frac{P(H = h, T = t \mid X = x) \, f(X = x)}{P(H = h, T = t)} && \text{Bayes' theorem} \\
&= \frac{\binom{h+t}{h} x^h (1 - x)^t \cdot \frac{1}{c} \cdot x^{a-1} (1 - x)^{b-1}}{P(H = h, T = t)} && \text{Binomial PMF, Beta PDF} \\
&= K \cdot x^h (1 - x)^t \cdot x^{a-1} (1 - x)^{b-1} && \text{Combine constants} \\
&= K \cdot x^{a+h-1} (1 - x)^{b+t-1} && \text{Combine like bases}
\end{aligned}
$$

which is the PDF of $\text{Beta}(a + h, b + t)$.

It is pretty convenient that if we have a Beta prior belief, then our posterior belief is also Beta. This makes Betas especially convenient to work with, in code and in proof, if there are many updates that you will make to your belief over time. This property, where the type of distribution is the same before and after an observation, is called a conjugate prior.

Quick question: are you allowed to just make up priors and imaginary trials? Some folks think that is fine (they are called Bayesians) and some folks think that you shouldn't make up prior beliefs (they are called frequentists). In general, for small data, being able to come up with a good prior belief can make you much better at making predictions.

Observation: there is a deep connection between the Beta prior and the uniform prior (which we used initially). It turns out that $\text{Beta}(1, 1) = \text{Uni}(0, 1)$. Recall that $\text{Beta}(1, 1)$ means 0 imaginary heads and 0 imaginary tails.
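Because of conjugacy, a Beta belief can be carried around in code as just two counts, and updating on new flips is a pair of additions. A minimal sketch (the helper name is our own):

```python
def update(a: float, b: float, heads: int, tails: int) -> tuple:
    """Posterior parameters (a + h, b + t) after observing
    h heads and t tails with a Beta(a, b) prior."""
    return a + heads, b + tails

prior = (1, 1)  # Beta(1, 1) = Uni(0, 1): no imaginary trials
posterior = update(*prior, heads=9, tails=1)
print(posterior)  # (10, 2), i.e. Beta(10, 2)

# Updates compose: two batches of flips give the same posterior
# as one combined batch.
step1 = update(*prior, heads=4, tails=0)
step2 = update(*step1, heads=5, tails=1)
print(step2)      # (10, 2) again
```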
Markdown

Β© Chris Piech, Stanford University
Readable Markdown: null
Shard: 143 (laksa)
Root Hash: 2566890010099092343
Unparsed URL: io,github!chrispiech,/probabilityForComputerScientists/en/part4/beta/ s443