11  Bayesian Methods


All of the methods we have developed and used thus far in this course have been developed using what statisticians would call a “frequentist” approach. In this section, we revisit some of those methods using what statisticians would call a “Bayesian” approach.

In case you are wondering, the field of Bayesian statistics is aptly named after the Reverend Thomas Bayes.

We’d need at least an entire semester, and perhaps more, to truly do any justice to the study of Bayesian statistics. The best we can do in this course is to dabble a bit and in doing so take just a tiny peek at the world through the eyes of a Bayesian. First, we’ll take a look at an example that illustrates how a Bayesian might assign a (subjective) probability to an event given the information he or she has available. Then, we’ll investigate how a Bayesian might estimate a parameter. Finally, we’ll explore how a Bayesian might update the prior information he or she has about the value of a parameter.

Objectives

Upon completion of this lesson, you should be able to:

  1. Analytically find Bayesian posterior distributions for single parameter models with a given prior distribution, both in situations where data values are given and in situations where the data are seen as unobservable random variables, and
  2. Calculate a credible interval for an unknown parameter \(\theta\).

11.1 Subjective Probability

Let’s go through an example that illustrates how a Bayesian might use available data to assign probabilities to particular events.

Example 11.1 The following amounts, in dollars, are bet on horses A, B, C, and D to win a local race:

Horse Amount
A 10,000
B 30,000
C 22,000
D 38,000
Total 100,000

Determine the payout for winning with a $2 bet on each of the four horses, if the track wants to take 17%, or $17,000, from the amount collected.

Solution

The first thing we need to do is to determine the likelihood, that is, the probability, that each of the horses wins. We could assign the probabilities based on our own “gut feeling.” Geez…. I think Horse A is gonna win today. In general, that’s probably not a good idea. We’d probably be much better served if we used the information we have available, namely the “gut feeling” of all of the bettors! That is, if $38,000 worth of bets came in for Horse D and only $10,000 came in for Horse A, that tells us that more people think that Horse D is going to win than Horse A. It would behoove us, therefore, to use the amounts bet to assign the probabilities. With that approach, the probability that Horse A will win is:

\[\begin{align*} P(A)=\frac{10000}{100000}=0.10 \end{align*}\]

And, the probability that Horse B will win is:

\[\begin{align*} P(B)=\frac{30000}{100000}=0.30 \end{align*}\]

And, the probability that Horse C will win is:

\[\begin{align*} P(C)=\frac{22000}{100000}=0.22 \end{align*}\]

And, the probability that Horse D will win is:

\[\begin{align*} P(D)=\frac{38000}{100000}=0.38 \end{align*}\]

Now, having assigned a probability to the event that Horse A wins, we can determine the odds against Horse A winning. It is:

\[\begin{align*} \text{Odds against }A=\frac{1-0.1}{0.1}=9 \text{ to }1 \end{align*}\]

The odds against A tell us that the track will pay out $9 on a $1 bet. Now, the odds against Horse B winning are:

\[\begin{align*} \text{Odds against }B=\frac{1-0.3}{0.3}=\frac{7}{3} \text{ to }1 \end{align*}\]

The odds against B tell us that the track will pay out \(\frac{\$7}{3}=\$2.33\) on a $1 bet. Now, the odds against Horse C winning are:

\[\begin{align*} \text{Odds against }C=\frac{1-0.22}{0.22}=\frac{39}{11} \text{ to }1 \end{align*}\]

The odds against C tell us that the track will pay out \(\frac{\$39}{11}=\$3.55\) on a $1 bet. Finally, the odds against Horse D winning are:

\[\begin{align*} \text{Odds against }D=\frac{1-0.38}{0.38}=\frac{31}{19} \text{ to }1 \end{align*}\]

The odds against D tell us that the track will pay out \(\frac{\$31}{19}=\$1.63\) on a $1 bet. Now, using the odds against Horse A winning, we can determine that the amount the track pays out on a $2 bet on Horse A is:

\[\begin{align*} \text{Payout for }A=\$2+\$9(\$2)=\$20 \end{align*}\]

That is, the bettor gets $9 for each $1 bet, that is, $18, plus the $2 he or she originally bet. Now, using the odds against Horse B winning, we can determine that the amount the track pays out on a $2 bet on Horse B is:

\[\begin{align*} \text{Payout for }B=\$2+\frac{\$7}{3}(\$2)=\$6.67 \end{align*}\]

And, using the odds against Horse C winning, we can determine that the amount the track pays out on a $2 bet on Horse C is:

\[\begin{align*} \text{Payout for }C=\$2+\frac{\$39}{11}(\$2)=\$9.09 \end{align*}\]

And, using the odds against Horse D winning, we can determine that the amount the track pays out on a $2 bet on Horse D is:

\[\begin{align*} \text{Payout for }D=\$2+\frac{\$31}{19}(\$2)=\$5.26 \end{align*}\]

We’re almost there. We just need to skim the 17% off for the race track to make sure that it makes some money off of the bettors. Multiplying each of the calculated payouts by 0.83, we get:

\[\begin{align*} & \text{Absolute Payout for }A=\$20.00(0.83)=\$16.60\\ & \text{Absolute Payout for }B=\$6.67(0.83)=\$5.54\\ & \text{Absolute Payout for }C=\$9.09(0.83)=\$7.54\\ & \text{Absolute Payout for }D=\$5.26(0.83)=\$4.37\\ \end{align*}\]

We can always check our work, too. Suppose, for example, that Horse A wins. Then the track would pay out $16.60 to the 5,000 bettors who bet $2 on Horse A, costing the track $16.60(5000) = $83,000. Given that the track brought in $100,000 in bets, the track would indeed make its desired $17,000. A similar check of our work can be made for Horses B, C, and D, although there may be some round-off errors because we did do some rounding in our payout calculations. The total should still be very close to $83,000 in each case, though!
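The arithmetic above can be sketched in a few lines (a Python sketch of our own, not part of the original solution; small differences from the dollar figures above come from rounding):

```python
# Bet totals become subjective win probabilities; odds and payouts follow.
bets = {"A": 10_000, "B": 30_000, "C": 22_000, "D": 38_000}
total = sum(bets.values())      # $100,000 collected
track_take = 0.17               # track keeps 17%

results = {}
for horse, amount in bets.items():
    p = amount / total                  # subjective P(horse wins)
    odds_against = (1 - p) / p          # "odds_against to 1"
    payout = 2 + 2 * odds_against       # $2 stake returned + winnings
    results[horse] = round(payout * (1 - track_take), 2)

print(results)
```

The payouts after the track's take line up with the "Absolute Payout" values computed by hand above, up to rounding.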

11.2 Bayesian Estimation

There’s one key difference between frequentist statisticians and Bayesian statisticians that we first need to acknowledge before we can even begin to talk about how a Bayesian might estimate a population parameter \(\theta\). The difference has to do with whether a statistician thinks of a parameter as some unknown constant or as a random variable. Let’s take a look at a simple example in an attempt to emphasize the difference.

Example 11.2 A traffic control engineer believes that the cars passing through a particular intersection arrive at a mean rate \(\lambda\) equal to either 3 or 5 for a given time interval. Prior to collecting any data, the engineer believes that it is much more likely that the rate \(\lambda=3\) than \(\lambda=5\). In fact, the engineer believes that the prior probabilities are:

\[\begin{align*} P(\lambda=3)=0.7\;\text{and }P(\lambda=5)=0.3 \end{align*}\]

One day, during a randomly selected time interval, the engineer observes \(x=7\) cars pass through the intersection. In light of the engineer’s observation, what is the probability that \(\lambda=3\)? And what is the probability that \(\lambda=5\)?

Solution

The first thing you should notice in this example is that we are talking about finding the probability that a parameter \(\lambda\) takes on a particular value. In one fell swoop, we’ve just turned everything that we’ve learned in Stat 414 and Stat 415 on its head! The next thing you should notice, after recovering from the dizziness of your headstand, is that we already have the tools necessary to calculate the desired probabilities. Just stick your hand in your probability tool box, and pull out Bayes’ Theorem.

Now, simply by using the definition of conditional probability, we know that the probability that \(\lambda=3\) given that \(X=7\) is:

\[\begin{align*} P(\lambda=3|X=7)=\frac{P(\lambda=3, X=7)}{P(X=7)} \end{align*}\]

which can be written using Bayes’ Theorem as:

\[\begin{align*} P(\lambda=3|X=7)=\frac{P(X=7|\lambda=3)P(\lambda=3)}{P(X=7|\lambda =3)P(\lambda=3)+P(X=7|\lambda=5)P(\lambda=5)} \end{align*}\]

We can use the Poisson cumulative probability table in the back of our textbook to find

\[\begin{align*} &P(X=7|\lambda=3)\\ & P(X=7|\lambda=5) \end{align*}\] They are:

\[\begin{align*} &P(X=7|\lambda=3)=0.988-0.966=0.022\\ & P(X=7|\lambda=5)=0.867-0.762=0.105 \end{align*}\]

Now, we have everything we need to finalize our calculation of the desired probability:

\[\begin{align*} P(\lambda=3|X=7)=\frac{0.7(0.022)}{0.7(0.022)+0.3(0.105)}=\frac{0.0154}{0.0154+0.0315}=0.328 \end{align*}\]

Hmmm. Let’s summarize. The initial probability, in this case, \(P(\lambda=3)=0.7\) is called the prior probability. That’s because it is the probability that the parameter takes on a particular value prior to taking into account any new information. The newly calculated probability, that is:

\[\begin{align*} P(\lambda=3|X=7) \end{align*}\]

is called the posterior probability. That’s because it is the probability that the parameter takes on a particular value posterior to, that is, after, taking into account the new information. In this case, we have seen that the probability that \(\lambda=3\) has decreased from 0.7 (the prior probability) to 0.328 (the posterior probability) with the information obtained from the observation \(x=7\).

A similar calculation can be made in finding \(P(\lambda=5|X=7)\). In doing so, we see:

\[\begin{align*} P(\lambda=5|X=7)=\frac{0.3(0.105)}{0.7(0.022)+0.3(0.105)}=\frac{0.0315}{0.0154+0.0315}=0.672 \end{align*}\]

In this case, we see that the probability that \(\lambda=5\) has increased from 0.3 (the prior probability) to 0.672 (the posterior probability) with the information obtained from the observation \(x=7\).
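We can reproduce both posterior probabilities directly from the Poisson p.m.f. rather than the cumulative table (a Python sketch of our own; the exact p.m.f. values give about 0.326 and 0.674 rather than the table-rounded 0.328 and 0.672):

```python
import math

def poisson_pmf(x, lam):
    # P(X = x) for a Poisson random variable with rate lam
    return math.exp(-lam) * lam**x / math.factorial(x)

prior = {3: 0.7, 5: 0.3}     # engineer's prior on lambda
x = 7                        # observed count

# Bayes' theorem over the two candidate rates
numer = {lam: poisson_pmf(x, lam) * p for lam, p in prior.items()}
evidence = sum(numer.values())
posterior = {lam: v / evidence for lam, v in numer.items()}
print(posterior)
```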

That last example is good for illustrating the distinction between prior probabilities and posterior probabilities, but it falls a bit short as a practical example in the real world. That’s because the parameter in the example is assumed to take on only two possible values, namely \(\lambda=3\) or \(\lambda=5\). In the case where the parameter space for a parameter \(\theta\) takes on an infinite number of possible values, a Bayesian must specify a prior probability density function \(h(\theta)\), say. Entire courses have been devoted to the topic of choosing a good prior p.d.f., so naturally, we won’t go there! We’ll instead assume we are given a good prior p.d.f. \(h(\theta)\) and focus our attention instead on how to find a posterior probability density function \(k(\theta|y)\), say, if we know the probability density function \(g(y|\theta)\) of the statistic \(Y\).

Well, if we know \(h(\theta)\) and \(g(y|\theta)\), we can treat: \[k(y, \theta)=g(y|\theta)h(\theta)\] as the joint p.d.f. of the statistic \(Y\) and parameter \(\theta\). Then, we can find the marginal distribution of \(Y\) from the joint distribution \(k(y, \theta)\) by integrating over the parameter space of \(\theta\):

\[k_1(y)=\int_{-\infty}^\infty k(y, \theta)d\theta=\int_{-\infty}^\infty g(y|\theta)h(\theta)d\theta\]

And then, we can find the posterior p.d.f. of \(\theta\), given that \(Y=y\), by using Bayes’ theorem. That is:

\[k(\theta|y)=\frac{k(y, \theta)}{k_1(y)}=\frac{g(y|\theta)h(\theta)}{k_1(y)}\]

Let’s make this discussion more concrete by taking a look at an example.

Example 11.3 Suppose that \(Y\) follows a binomial distribution with parameters \(n\) and \(p=\theta\), so that the p.m.f. of \(Y\) given \(\theta\) is:

\[g(y|\theta)={n\choose y}\theta^y(1-\theta)^{n-y}\]

for \(y=0, 1, \ldots, n\). Suppose that the prior p.d.f. of the parameter \(\theta\) is the beta p.d.f., that is:

\[h(\theta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}\]

for \(0<\theta<1\). Find the posterior p.d.f. of \(\theta\), given \(Y=y\). That is, find \(k(\theta|y)\).

Solution

First, we find the joint p.d.f. of the statistic \(Y\) and the parameter \(\theta\) by multiplying the prior p.d.f. \(h(\theta)\) by the conditional p.m.f. \(g(y|\theta)\): \[k(y, \theta)={n\choose y}\left(\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\right)\theta^{y+\alpha-1}(1-\theta)^{n-y+\beta-1}\]

over the support \(y=0, 1, 2, \ldots, n\) and \(0<\theta<1\).

Then, we find the marginal p.d.f. of \(Y\) by integrating \(k(y, \theta)\) over the parameter space of \(\theta\):

\[k_1(y)=\int_0^1 k(y, \theta)d\theta={n\choose y}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\int_0^1 \theta^{y+\alpha-1}(1-\theta)^{n-y+\beta-1}d\theta\]

Now, if we multiply the integrand by 1 in a special way:

\[k_1(y)={n\choose y}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\left(\frac{\Gamma(y+\alpha)\Gamma(n-y+\beta)}{\Gamma(y+\alpha+n-y+\beta)}\right)\int_0^1 \left(\frac{\Gamma(y+\alpha+n-y+\beta)}{\Gamma(y+\alpha)\Gamma(n-y+\beta)}\right)\theta^{y+\alpha-1}(1-\theta)^{n-y+\beta-1}d\theta\]

we see that we get a beta p.d.f. with parameters \(y+\alpha\) and \(n-y+\beta\), which, by the definition of a valid p.d.f., must integrate to 1. Simplifying, we therefore get that the marginal p.d.f. of \(Y\) is:

\[k_1(y)={n\choose y}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\left(\frac{\Gamma(y+\alpha)\Gamma(n-y+\beta)}{\Gamma(y+\alpha+n-y+\beta)}\right)\]

on the support \(y=0, 1, 2, \ldots, n\). Then, the posterior p.d.f. of \(\theta\), given that \(Y=y\) is:

\[\begin{align*} k(\theta|y)=\frac{{n\choose y}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{y+\alpha-1}(1-\theta)^{n-y+\beta-1}}{{n\choose y}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\left(\frac{\Gamma(y+\alpha)\Gamma(n-y+\beta)}{\Gamma(y+\alpha+n-y+\beta)}\right)} \end{align*}\]

Because some things cancel out, we see that the posterior p.d.f. of \(\theta\), given \(Y=y\) is:

\[k(\theta|y)=\frac{\Gamma(n+\alpha+\beta)}{\Gamma(\alpha+y)\Gamma(n+\beta-y)}\theta^{y+\alpha-1}(1-\theta)^{n-y+\beta-1}\]

for \(0<\theta<1\), which you might recognize as a beta p.d.f. with parameters \(y+\alpha\) and \(n-y+\beta\).
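The conjugate update we just derived is easy to check numerically. The sketch below (with hypothetical values \(n=20\), \(y=13\), \(\alpha=\beta=2\), not from the text) forms the posterior parameters and verifies that the resulting beta p.d.f. integrates to 1:

```python
import math

# Hypothetical numbers (not from the text): y successes in n trials, Beta(a, b) prior
n, y = 20, 13
a, b = 2, 2
post_a, post_b = y + a, n - y + b   # posterior is Beta(y + alpha, n - y + beta)

def post_pdf(t):
    # the posterior p.d.f. k(theta | y) derived above
    c = math.gamma(post_a + post_b) / (math.gamma(post_a) * math.gamma(post_b))
    return c * t**(post_a - 1) * (1 - t)**(post_b - 1)

# crude midpoint-rule check that the posterior integrates to 1 over (0, 1)
steps = 10_000
area = sum(post_pdf((i + 0.5) / steps) for i in range(steps)) / steps
print(post_a, post_b, round(area, 6))
```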

Let’s consider another example.

Example 11.4 For fixed \(p\), suppose we observe a single observation \(Y=y\) from a distribution with pmf \(p(1-p)^y\), for \(y=0, 1, \ldots\). Also, prior information on \(p\) indicates that \(p\) has pdf \(3p^2\) for \(0<p<1\). Find the posterior distribution of \(p\) given \(Y=y\).

Solution

We need to find the posterior distribution of \(p\) given that \(Y=y\). We can find the joint distribution by:

\[\begin{align*} k(p, y)=g(y|p)h(p)=p(1-p)^y\left(3p^2\right)=3p^3(1-p)^y \end{align*}\]

The next step would be to find \(k_1(y)\) by:

\[\begin{align*} k_1(y)=\int_0^1 3p^3(1-p)^ydp \end{align*}\]

Instead of doing this, let’s recognize that \(k_1(y)\) is a constant with respect to \(p\). Let’s denote \(k_1(y)\) by \(c\).

Now, we need to find \(k(p|y)=\frac{k(p, y)}{k_1(y)}\).

\[\begin{align*}k(p|y)=\frac{k(p, y)}{k_1(y)}=\frac{3p^3(1-p)^y}{c}=\left(\frac{3}{c}\right)p^3(1-p)^y \end{align*}\]

The term in the parentheses, \(\left(\frac{3}{c}\right)\), is a constant with respect to \(p\). Let’s denote that constant as \(\kappa\). Therefore, we have:

\[k(p|y)=\kappa p^3(1-p)^y\]

We know that \(k(p|y)\) is a valid pdf. We also know that \(p\) has support \(0<p<1\) and that \(k(p|y)\) integrates to 1 over the support. Therefore, the posterior distribution of \(p\) given \(Y=y\) has to be a beta distribution with parameters \(\alpha=4\) and \(\beta=y+1\). In other words,

\[\begin{align*} k(p|y)=\frac{\Gamma(5+y)}{\Gamma(4)\Gamma(y+1)}p^{4-1}(1-p)^{(y+1)-1} \end{align*}\]

We found the posterior distribution of \(p\) given \(Y=y\) without any integration! We just lumped together all of the constants with respect to \(p\) and used our knowledge of what makes a function a valid pdf.

Note! Since \(\kappa\) is a constant with respect to \(p\), we can say that \(k(p|y)\) is proportional to \(p^3(1-p)^y\). We denote this as:

\[k(p|y)\propto p^3(1-p)^y\]

This technique is helpful for finding posteriors more quickly. However, it only works if we can recognize the posterior distribution, just as we recognized that \(k(p|y)\) is a beta distribution.

To demonstrate the long way, test your calculus and algebra skills by finding

\[\begin{align*} c=k_1(y)=\int_0^1 3p^3(1-p)^ydp=\frac{18}{(y+1)(y+2)(y+3)(y+4)} \end{align*}\]

The constant, \(\kappa\), is equal to

\[\begin{align*} \frac{3}{c}=\frac{3}{\frac{18}{(y+1)(y+2)(y+3)(y+4)}}=\frac{(y+1)(y+2)(y+3)(y+4)}{6} \end{align*}\]

Compare this to the constant of the beta distribution we found as the posterior:

\[\begin{align*} \frac{\Gamma(5+y)}{\Gamma(4)\Gamma(y+1)}=\frac{(y+4)(y+3)(y+2)(y+1)y!}{3!(y!)}=\frac{(y+4)(y+3)(y+2)(y+1)}{6} \end{align*}\]

They are exactly the same! Again, this is because we recognized that the posterior distribution is a beta, and only one constant can make the pdf of a beta distribution integrate to one over the support.
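If you’d rather let a computer do the calculus check, a crude midpoint-rule integration (our own sketch, with the hypothetical value \(y=3\)) confirms the closed form for \(c\):

```python
# Numerical check of c = 18 / ((y+1)(y+2)(y+3)(y+4)) for one value of y
y = 3
closed_form = 18 / ((y + 1) * (y + 2) * (y + 3) * (y + 4))

# midpoint-rule approximation of  c = integral_0^1 3 p^3 (1-p)^y dp
steps = 100_000
approx = sum(3 * ((i + 0.5) / steps)**3 * (1 - (i + 0.5) / steps)**y
             for i in range(steps)) / steps
print(closed_form, round(approx, 8))
```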

In the examples so far, we have considered only one observation. More generally, we can begin with a sample of observations, say \(X_1, X_2, \ldots, X_n\), or with a sample statistic.

If we begin with a sample, we could replace \(g(y|\theta)\) by the likelihood function

\[\begin{align*} L(\theta)=f(x_1|\theta)f(x_2|\theta)\cdots f(x_n|\theta) \end{align*}\]

which is just the joint distribution of \(X_1, X_2, \ldots, X_n\) given the parameter \(\theta\). Therefore, we can find

\[\begin{align*} k(\theta|x_1, \ldots, x_n)\propto h(\theta)L(\theta) \end{align*}\]

If we begin with a statistic, say \(W\), then we need to find its distribution \(g(w|\theta)\) and proceed as usual.

Example 11.5 Let \(X_1, X_2, \ldots, X_n\) be a random sample from a Bernoulli distribution with pmf \[\begin{align*} f(x|\theta)=\theta^x(1-\theta)^{1-x}, \qquad x=0, 1 \end{align*}\] Prior information tells us \(\theta\) follows a beta distribution with parameters \(\alpha\) and \(\beta\). Find the posterior distribution of \(\theta\) given the data.

Solution

First, we need to find the likelihood function:

\[\begin{align*} L(\theta)=\prod_{i=1}^n f(x_i|\theta)=\prod_{i=1}^n\theta^{x_i}(1-\theta)^{1-x_i}=\theta^{\sum x_i}(1-\theta)^{n-\sum x_i} \end{align*}\]

The next step is to multiply the prior by the likelihood; the posterior is proportional to the product:

\[\begin{align*}k(\theta|x_1, \ldots, x_n)\propto h(\theta)L(\theta)=\theta^{\alpha-1}(1-\theta)^{\beta-1}\theta^{\sum x_i}(1-\theta)^{n-\sum x_i} \end{align*}\]

Remember, when we say \(k(\theta|x_1, \ldots, x_n)\) is proportional to the expression on the right hand side, we are ignoring all the terms that are constant with respect to \(\theta\). Simplifying, we get:

\[\begin{align*} k(\theta|x_1, \ldots, x_n)\propto \theta^{\alpha+\sum x_i -1}(1-\theta)^{n-\sum x_i+\beta-1} \end{align*}\]

We can recognize that \(\theta^{\alpha+\sum x_i -1}(1-\theta)^{n-\sum x_i+\beta-1}\) is proportional to a beta distribution with parameters \(\alpha^*=\alpha+\sum x_i\) and \(\beta^*=n-\sum x_i +\beta\).
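As a quick illustration (our own sketch with simulated data, not from the text), the update amounts to two additions:

```python
import random

# Simulated Bernoulli data with a hypothetical true theta, for illustration only
random.seed(7)
theta_true = 0.35
n = 50
xs = [1 if random.random() < theta_true else 0 for _ in range(n)]

a, b = 1, 1                   # a uniform Beta(1, 1) prior, chosen for illustration
s = sum(xs)
post_a = a + s                # alpha* = alpha + sum x_i
post_b = b + n - s            # beta*  = beta + n - sum x_i
print(post_a, post_b)
```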

Okay now, are you scratching your head wondering what this all has to do with Bayesian estimation, as the title of this page suggests it should? Well, let’s talk about that then! Bayesians believe that everything you need to know about a parameter \(\theta\) can be found in its posterior p.d.f. \(k(\theta|y)\). So, if a Bayesian is asked to make a point estimate of \(\theta\), she is going to naturally turn to \(k(\theta|y)\) for the answer. But how? Well, the logical thing to do would be to use \(k(\theta|y)\) to calculate the mean or median of \(\theta\), as either would be a reasonable guess of the value of \(\theta\). But, hmmm! Should she calculate the mean or the median? Well, that depends on what it will cost her for using either. Huh? Cost? We’re talking about estimating a parameter, not buying groceries! Well, let’s suppose she gets charged a certain amount for estimating the real value of the parameter \(\theta\) with her guess \(w(y)\). Naturally, she would want the cost of her error to be as small as possible. Suppose she is charged the square of the error between \(\theta\) and her guess \(w(y)\). That is, suppose her cost is:

\[(\theta-w(y))^2\]

Aha… there we have it! Because she wants her cost to be as small as possible, she should make her guess \(w(y)\) be the conditional mean \(E(\theta|y)\). That’s because, way back in STAT 414, we showed that if \(Z\) is a random variable, then the expected squared error \(E[(Z-b)^2]\) is minimized at \(b=E(Z)\). In her case, the \(Z\) is \(\theta\) and the \(b\) is \(w(y)=E(\theta|y)\).

On the other hand, if she is charged the absolute value of the error between \(\theta\) and her guess \(w(y)\), that is: \[|\theta-w(y)|\] then in order to make her cost be as small as possible, she’ll want to make her guess \(w(y)\) be the conditional median. That’s because, again, way back in STAT 414, we showed that if \(Z\) is a random variable, then the expected value of the absolute value of the error, that is \(E[|Z-b|]\) is minimized when \(b\) equals the median of the distribution.

Let’s make this discussion concrete by returning to our binomial example.

Example 11.6 Suppose that \(Y\) follows a binomial distribution with parameters \(n\) and \(p=\theta\), so that the pmf of \(Y\) given \(\theta\) is:

\[g(y|\theta)={n\choose y}\theta^y(1-\theta)^{n-y}\] for \(y=0, 1, \ldots, n\). Suppose that the prior pdf of the parameter \(\theta\) is the beta pdf, that is:

\[h(\theta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}\] for \(0<\theta<1\). Estimate the parameter \(\theta\) using the squared error loss function.

Solution

We’ve previously shown that the posterior p.d.f. of \(\theta\) given \(Y=y\) is the beta p.d.f. with parameters \(y+\alpha\) and \(n-y+\beta\). The previous discussion tells us that in order to minimize the squared error loss function:

\[\begin{align*} (\theta-w(y))^2 \end{align*}\]

we should use the conditional mean \(w(y)=E(\theta|y)\) as an estimate of the parameter \(\theta\). Well, in general, the mean of a beta p.d.f. with parameters \(\alpha\) and \(\beta\) is:

\[\begin{align*} \frac{\alpha}{\alpha+\beta} \end{align*}\]

In our case, the posterior p.d.f. of \(\theta\) given \(Y=y\) is the beta p.d.f. with parameters \(y+\alpha\) and \(n-y+\beta\). Therefore, the conditional mean is:

\[\begin{align*}w(y)=E(\theta|y)=\frac{\alpha+y}{\alpha+y+n-y+\beta}=\frac{\alpha+y}{\alpha+n+\beta} \end{align*}\]

And, it serves, in this situation, as a Bayesian’s best estimate of \(\theta\) when using the squared error loss function.
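Numerically, this estimator pulls the sample proportion toward the prior mean. A sketch with hypothetical values (\(n=20\), \(y=13\), \(\alpha=\beta=2\), not from the text):

```python
# Hypothetical numbers: y = 13 successes in n = 20 trials, Beta(2, 2) prior
n, y, a, b = 20, 13, 2, 2

bayes_estimate = (a + y) / (a + n + b)   # posterior mean w(y)
mle = y / n                              # frequentist estimate, for comparison
prior_mean = a / (a + b)
print(bayes_estimate, mle, prior_mean)
```

The Bayesian estimate lands between the prior mean and the sample proportion, which is the shrinkage behavior the formula \(w(y)=\frac{\alpha+y}{\alpha+n+\beta}\) implies.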

11.3 Credible Intervals

Previously, when we assumed the unknown parameter was a constant, we found a point estimate or point estimator for \(\theta\). We have done something similar in the Bayesian setting in this section by finding the value that minimizes the squared error loss function or the absolute error loss function.

However, in the frequentist setting, we did not stop at a point estimate. We constructed (using our probability tools) confidence intervals for the unknown parameter. Can we do something similar here with the Bayesian estimate?

The answer is yes. We can find what is called a credible interval. We can construct the credible interval by finding two values \(a(y)\) and \(b(y)\), which depend on the data \(y\), such that:

\[\begin{align*} \int_{a(y)}^{b(y)}k(\theta|y)d\theta=1-\alpha \end{align*}\]

where \(\alpha\) is small, say \(\alpha=0.05\). The resulting interval from \(a(y)\) to \(b(y)\) could be an interval estimate for \(\theta\) such that the posterior probability of \(\theta\) being in that interval is \(1-\alpha\).

It is common for the interval to have equal probability on each tail. Therefore, the interval should be set up such that: \[\begin{align*} & \frac{\alpha}{2}=\int_{-\infty}^{a(y)}k(\theta|y)d\theta\\ & \frac{\alpha}{2}=\int_{b(y)}^\infty k(\theta|y)d\theta \end{align*}\]

Example 11.7 Consider the previous example where we have one observation, \(Y\), from a distribution with pmf \(p(1-p)^y\) for \(y=0, 1, \ldots\) and an unknown parameter \(p\) with prior pdf \(3p^2\), for \(0<p<1\). Find a 90% credible interval for \(p\) with \(Y=3\).

Solution

We know the posterior distribution is a beta distribution with parameters \(\alpha=4\) and \(\beta=y+1\). In this case, \(Y=3\), making \(\beta=4\).

To find the 90% credible interval, we need to find \(a(y)\) and \(b(y)\) such that

\[0.90=\int_{a(y)}^{b(y)}\frac{\Gamma(8)}{\Gamma(4)\Gamma(4)}p^3(1-p)^3dp=\int_{a(y)}^{b(y)}140\,p^3(1-p)^3dp\]

We can use R to help us find this interval. We can find the \(5^{th}\) percentile and the \(95^{th}\) percentile of a beta distribution with parameters \(\alpha=4\) and \(\beta=4\) by using the following commands.

qbeta(0.05, shape1=4, shape2=4) 
[1] 0.2253216
qbeta(0.95, shape1=4, shape2=4)
[1] 0.7746784

Therefore, the 90% credible interval for \(p\), given \(Y=3\), is \((0.2253, 0.7747)\).
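If R is not at hand, the same quantiles can be approximated from scratch with numerical integration and bisection (a Python sketch of our own, not part of the original solution):

```python
import math

# The posterior from Example 11.7 is Beta(4, 4); the helper functions below
# are our own, not a library API.
A, B = 4, 4
C = math.gamma(A + B) / (math.gamma(A) * math.gamma(B))   # normalizing constant, = 140

def beta_cdf(x, steps=5000):
    # midpoint-rule approximation of the posterior's c.d.f. at x
    total = sum(C * (x * (i + 0.5) / steps) ** (A - 1)
                  * (1 - x * (i + 0.5) / steps) ** (B - 1)
                for i in range(steps))
    return total * x / steps

def beta_quantile(q):
    # invert the c.d.f. by bisection on [0, 1]
    lo, hi = 0.0, 1.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if beta_cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

lower, upper = beta_quantile(0.05), beta_quantile(0.95)
print(round(lower, 4), round(upper, 4))   # close to qbeta's 0.2253 and 0.7747
```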

11.3.1 Summary

In this lesson, we shifted perspectives from frequentist to Bayesian thinking. While frequentist methods treat parameters as fixed but unknown values, Bayesian methods treat parameters as random variables with probability distributions. This shift allows us to incorporate prior beliefs and update them with data using Bayes’ Theorem.

We started by understanding subjective probability, where a Bayesian uses available information (like betting amounts on horses) to assign probabilities to uncertain events. Then we explored how Bayesians estimate parameters using a prior distribution and likelihood to derive a posterior distribution.

We also saw that Bayesian estimation often leads to posterior distributions with familiar forms—like a beta distribution for binomial data with a beta prior. A Bayesian point estimate depends on the chosen loss function. If minimizing squared error, the best estimate is the posterior mean.

This lesson introduced the Bayesian view of learning from data: combining prior beliefs with observed evidence to make probabilistic statements about unknown parameters. While this was just a glimpse into the Bayesian world, it showed how flexible and intuitive the approach can be.

Key Takeaways:

  • Mathematically, if \(\theta\) is the parameter and \(y\) is the data, the posterior distribution is: \(k(\theta|y)=\frac{g(y|\theta)h(\theta)}{k_1(y)}\)
  • Credible intervals, which are Bayesian counterparts to confidence intervals, satisfy: \(\int_{a(y)}^{b(y)} k(\theta|y)d\theta=1-\alpha\)