3 Estimation (Part II)
Building on the foundational concepts of point estimation from Lesson 2, this lesson delves deeper into the critical concept of sufficiency and introduces the method of moments as a technique for deriving estimators. You will explore how sufficient statistics capture all relevant information about a population parameter from sample data, using tools like the Factorization Theorem and the Exponential Criterion. Through examples involving Bernoulli, Poisson, normal, and exponential distributions, you will learn to identify sufficient statistics for both single and multiple parameters. Additionally, you will be introduced to the method of moments, a practical approach to estimating parameters by equating sample moments with their theoretical counterparts. By the end of this lesson, you will be equipped to evaluate and construct estimators with a deeper understanding of their properties, preparing you for advanced statistical inference techniques.
Objectives
Upon completion of this lesson, you should be able to:
- Apply the Factorization Theorem to identify a sufficient statistic for distributions with one parameter,
- Apply the Exponential Criterion to identify a sufficient statistic for distributions with one parameter, and
- Find an estimator of a population parameter (or parameters) using the Method of Moments.
3.1 Sufficiency
In the process of estimating a population parameter, we summarize, or reduce, the information in a sample of size \(n\), \(X_1, X_2, \ldots, X_n\), to a single number, such as the sample mean \(\bar{X}.\) The actual sample values are no longer important to us. That is, if we use a sample mean of 3 to estimate the population mean \(\mu\), it doesn’t matter if the original data values were (1, 3, 5) or (2, 3, 4). Has this process of reducing the \(n\) data points to a single number retained all the information about \(\mu\) that was contained in the original \(n\) data points? Or has some information about the parameters been lost through the process of summarizing the data? In this lesson, we will learn how to find statistics that summarize all of the information in a sample about the desired parameter. Such statistics are called sufficient statistics, and hence the name of this lesson.
Definition of Sufficiency
Sufficiency is the kind of topic in which it is probably best to just jump right in and state its definition. Let’s do that!
Def. 3.1 (Sufficient) Let \(X_1, X_2, \ldots, X_n\) be a random sample from a probability distribution with unknown parameter \(\theta.\) Then, the statistic:
\[Y = u(X_1, X_2, ... , X_n)\]
is said to be sufficient for \(\theta\) if the conditional distribution of \(X_1, X_2, \ldots, X_n\), given the statistic \(Y\), does not depend on the parameter \(\theta.\)
Example 3.1 Let \(X_1, X_2, \ldots, X_n\) be a random sample of \(n\) Bernoulli trials in which:
- \(X_i=1\) if the \(i^{th}\) subject likes Pepsi
- \(X_i=0\) if the \(i^{th}\) subject does not like Pepsi
If \(p\) is the probability that subject \(i\) likes Pepsi, for \(i = 1, 2,\ldots,n\), then:
- \(X_i=1\) with probability \(p\)
- \(X_i=0\) with probability \(q = 1 − p\)
Suppose, in a random sample of \(n=40\) people, that \(Y = \sum_{i=1}^{n}X_i =22\) people like Pepsi. If we know the value of \(Y\), the number of successes in \(n\) trials, can we gain any further information about the parameter \(p\) by considering other functions of the data \(X_1, X_2, \ldots, X_n\)? That is, is \(Y\) sufficient for \(p\)?
Solution
The definition of sufficiency tells us that if the conditional distribution of \(X_1, X_2, \ldots, X_n\), given the statistic \(Y\), does not depend on \(p\), then \(Y\) is a sufficient statistic for \(p.\) The conditional distribution of \(X_1, X_2, \ldots, X_n\), given \(Y\), is by definition:

\[P(X_1 = x_1, X_2 = x_2, \ldots , X_n = x_n | Y = y) = \dfrac{P(X_1 = x_1, X_2 = x_2, \ldots , X_n = x_n, Y = y)}{P(Y = y)} \tag{3.1}\]
Now, for the sake of concreteness, suppose we were to observe a random sample of size \(n=3\) in which \(x_1=1, x_2=0, \text{ and }x_3=1.\) In this case:
\[P(X_1 = 1, X_2 = 0, X_3 =1, Y=1)=0\]
because the sum of the data values, \(\sum_{i=1}^{n}X_i\), is \(1 + 0 + 1 = 2\), but \(Y\), which is defined to be the sum of the \(X_i\)’s, is 1. That is, because \(2\ne 1\), the event in the numerator of Equation 3.1 is an impossible event, and therefore its probability is 0.
Now, let’s consider an event that is possible, namely ( \(X_1=1, X_2=0, X_3=1, Y=2\)). In that case, we have, by independence:
\[P(X_1 = 1, X_2 = 0, X_3 =1, Y=2) = p(1-p) p=p^2(1-p)\]
So, in general:
\[P(X_1 = x_1, X_2 = x_2, ... , X_n = x_n, Y = y) = 0 \text{ if } \sum_{i=1}^{n}x_i \ne y\]
and:
\[P(X_1 = x_1, X_2 = x_2, ... , X_n = x_n, Y = y) = p^y(1-p)^{n-y} \text{ if } \sum_{i=1}^{n}x_i = y \]
Now, the denominator of Equation 3.1 is the binomial probability of getting exactly \(y\) successes in \(n\) trials with probability of success \(p.\) That is, the denominator is:
\[P(Y=y) = \binom{n}{y} p^y(1-p)^{n-y}\]
for \(y = 0, 1, 2,\ldots, n.\) Putting the numerator and denominator together, we get, if \(y=0, 1, 2, \ldots, n\), that the conditional probability is:
\[P(X_1 = x_1, ... , X_n = x_n |Y = y) = \dfrac{p^y(1-p)^{n-y}}{\binom{n}{y} p^y(1-p)^{n-y}} =\dfrac{1}{\binom{n}{y}} \text{ if } \sum_{i=1}^{n}x_i = y\]
and:
\[P(X_1 = x_1, ... , X_n = x_n |Y = y) = 0 \text{ if } \sum_{i=1}^{n}x_i \ne y\]
Aha! We have just shown that the conditional distribution of \(X_1, X_2, \ldots, X_n\) given \(Y\) does not depend on \(p.\) Therefore, \(Y\) is indeed sufficient for \(p.\) That is, once the value of \(Y\) is known, no other function of \(X_1, X_2, \ldots, X_n\) will provide any additional information about the possible value of \(p.\)
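The cancellation above is easy to verify numerically. The following sketch (not part of the original text; `cond_prob` is a helper name chosen here) computes the conditional probability in Equation 3.1 for Bernoulli data and confirms that it equals \(1/\binom{n}{y}\) for every value of \(p\):

```python
from math import comb, isclose

def cond_prob(x, p):
    """P(X1=x1, ..., Xn=xn | Y = sum of the x's) for i.i.d. Bernoulli(p) data."""
    n, y = len(x), sum(x)
    joint = p ** y * (1 - p) ** (n - y)                  # numerator of Equation 3.1
    marginal = comb(n, y) * p ** y * (1 - p) ** (n - y)  # P(Y = y), binomial
    return joint / marginal

# The p's cancel: the conditional probability is 1/C(n, y) no matter what p is.
for p in (0.2, 0.5, 0.9):
    assert isclose(cond_prob((1, 0, 1), p), 1 / comb(3, 2))
```

Because the result is free of \(p\), knowing the individual data values beyond \(Y\) tells us nothing more about \(p\).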
Factorization Theorem
While the definition of sufficiency provided on the previous page may make sense intuitively, it is not always all that easy to find the conditional distribution of \(X_1, X_2, \ldots, X_n\) given \(Y.\) Not to mention that we’d have to find the conditional distribution of \(X_1, X_2, \ldots, X_n\) given \(Y\) for every \(Y\) that we’d want to consider a possible sufficient statistic! Therefore, using the formal definition of sufficiency as a way of identifying a sufficient statistic for a parameter \(\theta\) can often be a daunting road to follow. Thankfully, a theorem often referred to as the Factorization Theorem provides an easier alternative! We state it here without proof.

Let \(X_1, X_2, \ldots, X_n\) denote random variables with a joint p.d.f. (or joint p.m.f.) \(f(x_1, x_2, \ldots, x_n; \theta)\), which depends on the parameter \(\theta.\) Then, the statistic \(Y=u(X_1, X_2, \ldots, X_n)\) is sufficient for \(\theta\) if and only if:

\[f(x_1, x_2, \ldots , x_n;\theta) =\phi\left[u(x_1, x_2, \ldots , x_n);\theta \right] h(x_1, x_2, \ldots , x_n)\]

where:

- \(\phi\) is a function that depends on the data \((x_1, x_2, \ldots, x_n)\) only through the function \(u(x_1, x_2, \ldots, x_n)\), and
- the function \(h(x_1, x_2, \ldots, x_n)\) does not depend on the parameter \(\theta.\)
Let’s put the theorem to work on a few examples!
Example 3.2 Let \(X_1, X_2, \ldots, X_n\) denote a random sample from a Poisson distribution with parameter \(\lambda>0.\) Find a sufficient statistic for the parameter \(\lambda.\)
Solution
Because \(X_1, X_2, \ldots, X_n\) is a random sample, the joint probability mass function of \(X_1, X_2, \ldots, X_n\) is, by independence:
\[f(x_1, x_2, ... , x_n;\lambda) = f(x_1;\lambda) \times f(x_2;\lambda) \times ... \times f(x_n;\lambda)\]
Inserting what we know to be the probability mass function of a Poisson random variable with parameter \(\lambda\), the joint p.m.f. is therefore:
\[f(x_1, x_2, ... , x_n;\lambda) = \dfrac{e^{-\lambda}\lambda^{x_1}}{x_1!} \times\dfrac{e^{-\lambda}\lambda^{x_2}}{x_2!} \times ... \times \dfrac{e^{-\lambda}\lambda^{x_n}}{x_n!}\]
Now, simplifying, by adding up the \(n\) \(\lambda\)s in the exponents of \(e\), as well as the \(n\) \(x_i\)’s in the exponents of \(\lambda\), we get:
\[f(x_1, x_2, ... , x_n;\lambda) = \left(e^{-n\lambda}\lambda^{\Sigma x_i} \right) \times \left( \dfrac{1}{x_1! x_2! ... x_n!} \right)\]
Hey, look at that! We just factored the joint p.m.f. into two functions, one (\(\phi\)) being only a function of the statistic \(Y=\sum_{i=1}^{n}X_i\) and the other (\(h\)) not depending on the parameter \(\lambda\):
\[f(x_1, x_2, \ldots, x_n;\lambda) = {\color{blue}\underbrace{\color{black} \left(e^{-n\lambda}\lambda^{\Sigma x_i} \right)}_{\textstyle \color{blue} {\phi[u(\Sigma x_i);\lambda]}}} \times {\color{red}\underbrace{\color{black}\left( \frac{1}{x_1! x_2! \ldots x_n!} \right)}_{\textstyle \color{red} {h(x_1, x_2, \ldots, x_n)}}}\]
Therefore, the Factorization Theorem tells us that \(Y=\sum_{i=1}^{n}X_i\) is a sufficient statistic for \(\lambda.\) But, wait a second! We can also write the joint p.m.f. as:
\[f(x_1, x_2, ... , x_n;\lambda) = \left(e^{-n\lambda}\lambda^{n\bar{x}} \right) \times \left( \dfrac{1}{x_1! x_2! ... x_n!} \right)\]
Therefore, the Factorization Theorem tells us that \(Y = \bar{X}\) is also a sufficient statistic for \(\lambda\)!
If you think about it, it makes sense that \(Y = \bar{X}\) and \(Y=\sum_{i=1}^{n}X_i\) are both sufficient statistics, because if we know \(Y = \bar{X}\), we can easily find \(Y=\sum_{i=1}^{n}X_i.\) And, if we know \(Y=\sum_{i=1}^{n}X_i\), we can easily find \(Y = \bar{X}.\)
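One way to see what sufficiency buys us here is to compare two samples with the same sum: the ratio of their joint p.m.f.s involves only the \(h\) factors (the factorials), never \(\lambda.\) A quick numerical check of this (the sample values and function names below are chosen for illustration):

```python
from math import exp, factorial, prod, isclose

def poisson_joint(xs, lam):
    """Joint p.m.f. of an i.i.d. Poisson(lam) sample."""
    return prod(exp(-lam) * lam ** x / factorial(x) for x in xs)

a, b = [2, 3, 1], [1, 1, 4]   # two samples with the same sum: 6

for lam in (0.5, 2.0, 7.3):
    ratio = poisson_joint(a, lam) / poisson_joint(b, lam)
    # The lambda-dependent factor e^{-n*lam} * lam^{sum} cancels; only the
    # h(x) = 1/(x1! x2! ... xn!) factors remain in the ratio.
    assert isclose(ratio, (factorial(1) * factorial(1) * factorial(4)) /
                          (factorial(2) * factorial(3) * factorial(1)))
```

Since the ratio is free of \(\lambda\), once \(\sum x_i\) is known, the particular data values carry no further information about \(\lambda.\)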
The previous example suggests that there can be more than one sufficient statistic for a parameter \(\theta.\) In general, if \(Y\) is a sufficient statistic for a parameter \(\theta\), then every one-to-one function of \(Y\) not involving \(\theta\) is also a sufficient statistic for \(\theta.\) Let’s take a look at another example.
Example 3.3 Let \(X_1, X_2, \ldots, X_n\) be a random sample from a normal distribution with mean \(\mu\) and variance 1. Find a sufficient statistic for the parameter \(\mu.\)
Solution
Because \(X_1, X_2, \ldots, X_n\) is a random sample, the joint probability density function of \(X_1, X_2, \ldots, X_n\) is, by independence:
\[f(x_1, x_2, ... , x_n;\mu) = f(x_1;\mu) \times f(x_2;\mu) \times ... \times f(x_n;\mu)\]
Inserting what we know to be the probability density function of a normal random variable with mean \(\mu\) and variance 1, the joint p.d.f. is:
\[\begin{align} &f(x_1, x_2, ... , x_n;\mu) = \\&\dfrac{1}{(2\pi)^{1/2}} exp \left[ -\dfrac{1}{2}(x_1 - \mu)^2 \right] \times \dfrac{1}{(2\pi)^{1/2}} exp \left[ -\dfrac{1}{2}(x_2 - \mu)^2 \right] \times ... \times \dfrac{1}{(2\pi)^{1/2}} exp \left[ -\dfrac{1}{2}(x_n - \mu)^2 \right]\end{align}\]
Collecting like terms, we get:
\[f(x_1, x_2, ... , x_n;\mu) = \dfrac{1}{(2\pi)^{n/2}} exp \left[ -\dfrac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^2 \right]\]
A trick to making the factoring of the joint p.d.f. an easier task is to add 0 to the quantity in parentheses in the summation. That is:
\[f(x_1, x_2, ... , x_n;\mu) = \dfrac{1}{(2\pi)^{n/2}} exp \left[ -\dfrac{1}{2}\sum_{i=1}^{n}(x_i {\color{red}\underbrace{\color{black}{-\bar{x}+\bar{x}}}_{\textstyle \color{red}0}}- \mu)^2 \right]\]
Now, squaring the quantity in parentheses, we get:
\[f(x_1, x_2, ... , x_n;\mu) = \dfrac{1}{(2\pi)^{n/2}} exp \left[ -\dfrac{1}{2}\sum_{i=1}^{n}\left[ (x_i - \bar{x})^2 +2(x_i - \bar{x}) (\bar{x}-\mu)+ (\bar{x}-\mu)^2\right] \right]\]
And then distributing the summation, we get:
\[f(x_1, x_2, ... , x_n;\mu) = \dfrac{1}{(2\pi)^{n/2}} exp \left[ -\dfrac{1}{2}\sum_{i=1}^{n} (x_i - \bar{x})^2 - (\bar{x}-\mu) \sum_{i=1}^{n}(x_i - \bar{x}) -\dfrac{1}{2}\sum_{i=1}^{n}(\bar{x}-\mu)^2\right]\]
But, the middle term in the exponent is 0, and the last term, because it doesn’t depend on the index \(i\), can be added up \(n\) times:
\[f(x_1, x_2, ... , x_n;\mu) = \dfrac{1}{(2\pi)^{n/2}} exp \left[ -\dfrac{1}{2} \sum_{i=1}^{n} (x_i - \bar{x})^2 - (\bar{x}-\mu) {\color{red}{\underbrace{\color{black}{\sum_{i=1}^{n}(x_i - \bar{x})}}_{\textstyle\color{red}0}}} -\dfrac{1}{2}{\color{red}\underbrace{\color{black}{\sum_{i=1}^{n}(\bar{x}-\mu)^2}}_{\textstyle\color{red}{n(\bar{x}-\mu)^2}}}\right]\]
So, simplifying, we get:
\[f(x_1, x_2, ... , x_n;\mu) = \left\{ exp \left[ -\dfrac{n}{2} (\bar{x}-\mu)^2 \right] \right\}\times \left\{ \dfrac{1}{(2\pi)^{n/2}} exp \left[ -\dfrac{1}{2}\sum_{i=1}^{n} (x_i - \bar{x})^2 \right] \right\}\]
In summary, we have factored the joint p.d.f. into two functions, one (\(\phi\)) being only a function of the statistic \(Y = \bar{X}\) and the other (\(h\)) not depending on the parameter \(\mu\):
\[f(x_1, x_2, ... , x_n;\mu) = {\color{blue}{\underbrace{\color{black}{\left\{ exp \left[ -\dfrac{n}{2} (\bar{x}-\mu)^2 \right] \right\}}}_{\textstyle \color{blue}{\phi[u(\bar{x});\mu]}}}} \times \color{red}{\underbrace{\color{black}{\left\{ \dfrac{1}{(2\pi)^{n/2}} exp \left[ -\dfrac{1}{2}\sum_{i=1}^{n} (x_i - \bar{x})^2 \right] \right\}}}_{\textstyle\color{red}{h(x_1, x_2,...,x_n)}}}\]
Therefore, the Factorization Theorem tells us that \(Y = \bar{X}\) is a sufficient statistic for \(\mu.\) Now, \(Y = \bar{X}^3\) is also sufficient for \(\mu\), because if we are given the value of \(\bar{X}^3\), we can easily get the value of \(\bar{X}\) through the one-to-one function \(w=y^{1/3}.\) That is:
\[W=(\bar{X}^3)^{1/3}=\bar{X}\]
On the other hand, \(Y = \bar{X}^2\) is not a sufficient statistic for \(\mu\), because it is not a one-to-one function. That is, if we are given the value of \(\bar{X}^2\), using the inverse function:
\[w=y^{1/2}\]
we get two possible values, namely:
\[-\bar{X} \text{ and } +\bar{X}\]
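The algebraic factorization above can also be checked numerically. The sketch below (data values and helper names are illustrative, not from the text) evaluates the \(N(\mu, 1)\) joint density directly and via the factored form \(\phi(\bar{x};\mu)\,h(x_1,\ldots,x_n)\), and confirms they agree for several values of \(\mu\):

```python
from math import exp, pi, isclose

def normal_joint(xs, mu):
    """Joint N(mu, 1) density, computed term by term."""
    f = 1.0
    for x in xs:
        f *= exp(-0.5 * (x - mu) ** 2) / (2 * pi) ** 0.5
    return f

def phi_times_h(xs, mu):
    """The factored form: phi depends on the data only through x-bar."""
    n = len(xs)
    xbar = sum(xs) / n
    phi = exp(-0.5 * n * (xbar - mu) ** 2)
    h = exp(-0.5 * sum((x - xbar) ** 2 for x in xs)) / (2 * pi) ** (n / 2)
    return phi * h

xs = [1.2, -0.4, 2.5, 0.9]
for mu in (-1.0, 0.0, 2.2):
    assert isclose(normal_joint(xs, mu), phi_times_h(xs, mu))
```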
We’re getting so good at this, let’s take a look at one more example!
Example 3.4 Let \(X_1, X_2, \ldots, X_n\) be a random sample from an exponential distribution with parameter \(\theta.\) Find a sufficient statistic for the parameter \(\theta.\)
Solution
Because \(X_1, X_2, \ldots, X_n\) is a random sample, the joint probability density function of \(X_1, X_2, \ldots, X_n\) is, by independence:
\[f(x_1, x_2, ... , x_n;\theta) = f(x_1;\theta) \times f(x_2;\theta) \times ... \times f(x_n;\theta)\]
Inserting what we know to be the probability density function of an exponential random variable with parameter \(\theta\), the joint p.d.f. is:
\[f(x_1, x_2, ... , x_n;\theta) =\dfrac{1}{\theta}exp\left( \dfrac{-x_1}{\theta}\right) \times \dfrac{1}{\theta}exp\left( \dfrac{-x_2}{\theta}\right) \times ... \times \dfrac{1}{\theta}exp\left( \dfrac{-x_n}{\theta} \right)\]
Now, simplifying, by multiplying the \(n\) factors of \(1/\theta\) together and adding up the \(n\) \(x_i\)’s in the exponents, we get:
\[f(x_1, x_2, ... , x_n;\theta) =\dfrac{1}{\theta^n}exp\left( - \dfrac{1}{\theta} \sum_{i=1}^{n} x_i\right)\]
We have again factored the joint p.d.f. into two functions, one (\(\phi\)) being only a function of the statistic \(Y=\sum_{i=1}^{n}X_i\) and the other (h) not depending on the parameter \(\theta\):
\[f(x_1, x_2, ... , x_n;\theta) ={\color{blue}{\underbrace{\color{black}{\dfrac{1}{\theta^n}exp\left( - \dfrac{1}{\theta} \sum_{i=1}^{n} x_i\right)}}_{\textstyle \color{blue}{\phi[u(\sum x_i);\theta]}}}}\times{\color{red}{\underbrace{\color{black}1}_{\textstyle \color{red}h(x_1,x_2,...,x_n)}}}\]
Therefore, the Factorization Theorem tells us that \(Y=\sum_{i=1}^{n}X_i\) is a sufficient statistic for \(\theta.\) And, since \(Y = \bar{X}\) is a one-to-one function of \(Y=\sum_{i=1}^{n}X_i\), it implies that \(Y = \bar{X}\) is also a sufficient statistic for \(\theta.\)
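As with the normal example, the exponential factorization (here with \(h \equiv 1\)) is easy to sanity-check numerically. A brief sketch, with illustrative data values and helper names:

```python
from math import exp, isclose

def exp_joint(xs, theta):
    """Joint density of an i.i.d. exponential(theta) sample, term by term."""
    f = 1.0
    for x in xs:
        f *= exp(-x / theta) / theta
    return f

def factored(xs, theta):
    """The factored form: phi[sum x_i; theta] * h, with h = 1."""
    n = len(xs)
    return theta ** -n * exp(-sum(xs) / theta)

xs = [0.7, 2.1, 1.3]
for theta in (0.5, 1.0, 3.0):
    assert isclose(exp_joint(xs, theta), factored(xs, theta))
```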
Exponential Form
You might not have noticed that in all of the examples we have considered so far in this lesson, every p.d.f. or p.m.f. could be written in what is often called exponential form, that is:
\[f(x;\theta) =exp\left[K(x)p(\theta) + S(x) + q(\theta) \right] \]
with:

- \(K(x)\) and \(S(x)\) being functions only of \(x\),
- \(p(\theta)\) and \(q(\theta)\) being functions only of the parameter \(\theta\), and
- the support being free of the parameter \(\theta.\)
First, we had Bernoulli random variables with p.m.f. written in exponential form as:
\[f(x;p) =p^x(1-p)^{1-x}=exp[{\color{blue}\underbrace{\color{black}x}_{\textstyle k(x) }}{\color{red}\underbrace{\color{black}\text{ln}\left( \frac{p}{1-p}\right)}_{\textstyle p(p)}} + {\color{brown}\underbrace{\color{black}\text{ln}(1)}_{\textstyle s(x)}} + {\color{green}\underbrace{\color{black}\text{ln}(1-p)}_{\textstyle q(p)}}]\]
with:

- \(K(x)\) and \(S(x)\) being functions only of \(x\),
- \(p(p)\) and \(q(p)\) being functions only of the parameter \(p\), and
- the support \(x=0, 1\) not depending on the parameter \(p.\)
Okay, we just skipped a lot of steps in that second equality sign, that is, in getting from point A (the typical p.m.f.) to point B (the p.m.f. written in exponential form). So, let’s take a look at that more closely. We start with:
\[f(x;p) =p^x(1-p)^{1-x}\]
Is the p.m.f. in exponential form? Doesn’t look like it to me! We clearly need an “exp” to appear upfront. The only way we are going to get that without changing the underlying function is to apply the exponential function and its inverse, the natural log (“ln”), at the same time. Doing so, we get:
\[f(x;p) =exp\left[\text{ln}(p^x(1-p)^{1-x}) \right]\]
Is the p.m.f. now in exponential form? Nope, not yet, but at least it’s looking more hopeful. All of the steps that follow now involve using what we know about the properties of logarithms. Recognizing that the natural log of a product is the sum of the natural logs, we get:
\[f(x;p) =exp\left[\text{ln}(p^x) + \text{ln}(1-p)^{1-x} \right]\]
Is the p.m.f. now in exponential form? Nope, still not yet, because \(K(x)\), \(p(p)\), \(S(x)\), and \(q(p)\) can’t yet be identified as following exponential form, but we are certainly getting closer. Recognizing that the log of a power is the power times the log of the base, we get:
\[f(x;p) =exp\left[x\text{ln}(p) + (1-x)\text{ln}(1-p) \right]\]
This is getting tiring. Is the p.m.f. in exponential form yet? Nope, afraid not yet. Let’s distribute that \((1-x)\) in that last term. Doing so, we get:
\[f(x;p) =exp\left[x\text{ln}(p) + \text{ln}(1-p) - x\text{ln}(1-p) \right]\]
Is the p.m.f. now in exponential form? Let’s take a closer look. Well, in the first term, we can identify the \(K(x)p(p)\) and in the middle term, we see a function that depends only on the parameter \(p\):
\[f(x;p) =exp[{\color{blue}\underbrace{\color{black}x}_{\textstyle k(x) }}{\color{red}\underbrace{\color{black}\text{ln}(p)}_{\textstyle p(p)}} + {\color{green}\underbrace{\color{black}\text{ln}(1-p)}_{\textstyle q(p)}} - {\color{brown}\underbrace{\color{black}x\text{ln}(1-p)}_{\textstyle s(x,p)}}]\]
Now, all we need is the last term to depend only on \(x\) and we’re as good as gold. Oh, rats! The last term depends on both \(x\) and \(p.\) So back to work some more! Recognizing that the log of a quotient is the difference between the logs of the numerator and denominator, we get:
\[f(x;p) =exp\left[x\text{ln}\left( \frac{p}{1-p}\right) + \text{ln}(1-p) \right]\]
Is the p.m.f. now in exponential form? So close! Let’s just add 0 in (by way of the natural log of 1) to make it obvious. Doing so, we get:
\[f(x;p) =exp\left[x\text{ln}\left( \frac{p}{1-p}\right) + \text{ln}(1) + \text{ln}(1-p) \right]\]
Yes, we have finally written the Bernoulli p.m.f. in exponential form:
\[f(x;p) =exp[{\color{blue}\underbrace{\color{black}x}_{\textstyle k(x) }}{\color{red}\underbrace{\color{black}\text{ln}\left( \frac{p}{1-p}\right)}_{\textstyle p(p)}} + {\color{brown}\underbrace{\color{black}\text{ln}(1)}_{\textstyle s(x)}} + {\color{green}\underbrace{\color{black}\text{ln}(1-p)}_{\textstyle q(p)}}]\]
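Since the algebra took several steps, it is worth confirming numerically that the exponential form really is the same function as the original p.m.f. A quick check (function names are chosen here for illustration):

```python
from math import log, exp, isclose

def bern_pmf(x, p):
    """The usual Bernoulli p.m.f.: p^x (1-p)^(1-x)."""
    return p ** x * (1 - p) ** (1 - x)

def bern_exp_form(x, p):
    """Exponential form: K(x)=x, p(p)=ln(p/(1-p)), S(x)=ln(1), q(p)=ln(1-p)."""
    return exp(x * log(p / (1 - p)) + log(1) + log(1 - p))

for p in (0.1, 0.5, 0.8):
    for x in (0, 1):
        assert isclose(bern_pmf(x, p), bern_exp_form(x, p))
```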
Whew! So, we’ve fully explored writing the Bernoulli p.m.f. in exponential form! Let’s get back to reviewing all of the p.m.f.’s we’ve encountered in this lesson. We had Poisson random variables whose p.m.f. can be written in exponential form as:
\[f(x;\lambda) =\frac{e^{-\lambda}\lambda^x}{x!}=exp[{\color{blue}\underbrace{\color{black}x}_{\textstyle k(x) }}{\color{red}\underbrace{\color{black}\text{ln}\lambda}_{\textstyle p(\lambda)}} - {\color{brown}\underbrace{\color{black}\text{ln}(x!)}_{\textstyle s(x)}} - {\color{green}\underbrace{\color{black}\lambda}_{\textstyle q(\lambda)}}]\]
with:

- \(K(x)\) and \(S(x)\) being functions only of \(x\),
- \(p(\lambda)\) and \(q(\lambda)\) being functions only of the parameter \(\lambda\), and
- the support \(x = 0, 1, 2, \ldots\) not depending on the parameter \(\lambda.\)
Then, we had \(N(\mu, 1)\) random variables whose p.d.f. can be written in exponential form as:
\[f(x;\mu) =\frac{1}{\sqrt{2\pi}}e^{-(x-\mu)^2/2}=exp\left\{{\color{blue}\underbrace{\color{black}x}_{\textstyle k(x) }}{\color{red}\underbrace{\color{black}\mu}_{\textstyle p(\mu)}} - {\color{brown}\underbrace{\color{black}\frac{x^2}{2}}_{\textstyle s(x)}} - {\color{green}\underbrace{\color{black}\left(\frac{\mu^2}{2}+\frac{1}{2}\text{ln}(2\pi)\right)}_{\textstyle q(\mu)}}\right\}\]
with:

- \(K(x)\) and \(S(x)\) being functions only of \(x\),
- \(p(\mu)\) and \(q(\mu)\) being functions only of the parameter \(\mu\), and
- the support \(-\infty<x<\infty\) not depending on the parameter \(\mu.\)
Then, we had exponential random variables whose p.d.f. can be written in exponential form as:
\[f(x;\theta) =\frac{1}{\theta}e^{-x/\theta}=exp\left\{{\color{blue}\underbrace{\color{black}-x}_{\textstyle k(x) }}{\color{red}\underbrace{\color{black}\left(\frac{1}{\theta}\right)}_{\textstyle p(\theta)}} + {\color{brown}\underbrace{\color{black}ln(1)}_{\textstyle s(x)}} - {\color{green}\underbrace{\color{black}ln\theta}_{\textstyle q(\theta)}}\right\}\]
with:

- \(K(x)\) and \(S(x)\) being functions only of \(x\),
- \(p(\theta)\) and \(q(\theta)\) being functions only of the parameter \(\theta\), and
- the support \(x\ge 0\) not depending on the parameter \(\theta.\)
Happily, it turns out that writing p.d.f.s and p.m.f.s in exponential form provides us yet a third way of identifying sufficient statistics for our parameters. The following theorem, often called the Exponential Criterion, tells us how. We state it here without proof.

Let \(X_1, X_2, \ldots, X_n\) be a random sample from a distribution with a p.d.f. (or p.m.f.) that can be written in exponential form:

\[f(x;\theta) =exp\left[K(x)p(\theta) + S(x) + q(\theta) \right]\]

with a support that does not depend on \(\theta.\) Then, the statistic:

\[Y=\sum_{i=1}^{n}K(X_i)\]

is sufficient for \(\theta.\)
Let’s try the Exponential Criterion out on an example.
Example 3.5 Let \(X_1, X_2, \ldots, X_n\) be a random sample from a geometric distribution with parameter \(p.\) Find a sufficient statistic for the parameter \(p.\)
Solution
The probability mass function of a geometric random variable is:
\[f(x;p) = (1-p)^{x-1}p\]
for \(x=1, 2, 3, \ldots\) The p.m.f. can be written in exponential form as:
\[f(x;p) = \text{exp}\left[ x\text{log}(1-p)+\text{log}(1)+\text{log}\left( \frac{p}{1-p} \right)\right]\]
Therefore, \(Y=\sum_{i=1}^{n}X_i\) is sufficient for \(p.\) Easy as pie!
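As a sanity check on the exponential form above (the data values and function names here are illustrative), we can verify numerically that it reproduces the geometric p.m.f., which is what licenses reading off \(K(x)=x\) and hence \(Y=\sum X_i\):

```python
from math import log, exp, isclose

def geom_pmf(x, p):
    """Geometric p.m.f.: (1-p)^(x-1) p, for x = 1, 2, 3, ..."""
    return (1 - p) ** (x - 1) * p

def geom_exp_form(x, p):
    """Exponential form: K(x)=x, p(p)=log(1-p), S(x)=log(1), q(p)=log(p/(1-p))."""
    return exp(x * log(1 - p) + log(1) + log(p / (1 - p)))

for p in (0.2, 0.6):
    for x in (1, 2, 5):
        assert isclose(geom_pmf(x, p), geom_exp_form(x, p))
```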
By the way, you might want to note that almost every p.m.f. or p.d.f. we encounter in this course can be written in exponential form. With that noted, you might want to make the Exponential Criterion the first tool you grab out of your toolbox when trying to find a sufficient statistic for a parameter.
Two or More Parameters
In each of the examples we considered so far in this lesson, there is one and only one parameter. What happens if a probability distribution has two parameters, \(\theta_1\) and \(\theta_2\), say, for which we want to find sufficient statistics, \(Y_1\) and \(Y_2\)? Fortunately, the definitions of sufficiency can easily be extended to accommodate two (or more) parameters. Let’s start by extending the Factorization Theorem.
3.1.1 Factorization Theorem
Let \(X_1, X_2, \ldots, X_n\) denote random variables with a joint p.d.f. (or joint p.m.f.):
\[f(x_1,x_2, ... ,x_n; \theta_1, \theta_2)\]
which depends on the parameters \(\theta_1\) and \(\theta_2.\) Then, the statistics \(Y_1=u_1(X_1, X_2, ... , X_n)\) and \(Y_2=u_2(X_1, X_2, ... , X_n)\) are joint sufficient statistics for \(\theta_1\) and \(\theta_2\) if and only if:
\[f(x_1, x_2, ... , x_n;\theta_1, \theta_2) =\phi\left[u_1(x_1, ... , x_n), u_2(x_1, ... , x_n);\theta_1, \theta_2 \right] h(x_1, ... , x_n)\]
where:

- \(\phi\) is a function that depends on the data \((x_1, x_2, ... , x_n)\) only through the functions \(u_1(x_1, x_2, ... , x_n)\) and \(u_2(x_1, x_2, ... , x_n)\), and
- the function \(h(x_1, ... , x_n)\) does not depend on either of the parameters \(\theta_1\) or \(\theta_2.\)
Let’s try the extended theorem out for size on an example.
Example 3.6 Let \(X_1, X_2, \ldots, X_n\) denote a random sample from a normal distribution \(N(\theta_1, \theta_2).\) That is, \(\theta_1\) denotes the mean \(\mu\) and \(\theta_2\) denotes the variance \(\sigma^2.\) Use the Factorization Theorem to find joint sufficient statistics for \(\theta_1\) and \(\theta_2.\)
Solution
Because \(X_1, X_2, \ldots, X_n\) is a random sample, the joint probability density function of \(X_1, X_2, \ldots, X_n\) is, by independence:
\[f(x_1, x_2, ... , x_n;\theta_1, \theta_2) = f(x_1;\theta_1, \theta_2) \times f(x_2;\theta_1, \theta_2) \times ... \times f(x_n;\theta_1, \theta_2)\]
Inserting what we know to be the probability density function of a normal random variable with mean \(\theta_1\) and variance \(\theta_2\), the joint p.d.f. is:
\[f(x_1, x_2, ... , x_n;\theta_1, \theta_2) = \dfrac{1}{\sqrt{2\pi\theta_2}} \text{exp} \left[-\dfrac{1}{2}\dfrac{(x_1-\theta_1)^2}{\theta_2} \right] \times ... \times \dfrac{1}{\sqrt{2\pi\theta_2}} \text{exp} \left[-\dfrac{1}{2}\dfrac{(x_n-\theta_1)^2}{\theta_2} \right]\]
Simplifying by collecting like terms, we get:
\[f(x_1, x_2, ... , x_n;\theta_1, \theta_2) = \left(\dfrac{1}{\sqrt{2\pi\theta_2}}\right)^n \text{exp} \left[-\dfrac{1}{2}\dfrac{\sum_{i=1}^{n}(x_i-\theta_1)^2}{\theta_2} \right]\]
Rewriting the first factor, squaring the quantity in parentheses, and distributing the summation, in the second factor, we get:
\[f(x_1, x_2, ... , x_n;\theta_1, \theta_2) = \text{exp} \left[\text{log}\left(\dfrac{1}{\sqrt{2\pi\theta_2}}\right)^n\right] \text{exp} \left[-\dfrac{1}{2\theta_2}\left\{ \sum_{i=1}^{n}x_{i}^{2} -2\theta_1\sum_{i=1}^{n}x_{i} +\sum_{i=1}^{n}\theta_{1}^{2} \right\}\right]\]
Simplifying yet more, we get:
\[f(x_1, x_2, ... , x_n;\theta_1, \theta_2) = \text{exp} \left[ -\dfrac{1}{2\theta_2}\sum_{i=1}^{n}x_{i}^{2}+\dfrac{\theta_1}{\theta_2}\sum_{i=1}^{n}x_{i} -\dfrac{n\theta_{1}^{2}}{2\theta_2}-n\text{log}\sqrt{2\pi\theta_2} \right]\]
Look at that! We have factored the joint p.d.f. into two functions, one (\(\phi\)) being only a function of the statistics \(Y_1=\sum_{i=1}^{n}X^{2}_{i}\) and \(Y_2=\sum_{i=1}^{n}X_i\), and the other (h) not depending on the parameters \(\theta_1\) and \(\theta_2\):
\[f(x_1, x_2, ... , x_n;\theta_1, \theta_2) = \color{blue}{\underbrace{{\color{black}{\text{exp} \left[ -\dfrac{1}{2\theta_2}\sum_{i=1}^{n}x_{i}^{2}+\dfrac{\theta_1}{\theta_2}\sum_{i=1}^{n}x_{i} -\dfrac{n\theta_{1}^{2}}{2\theta_2}-n\text{log}\sqrt{2\pi\theta_2} \right]}}}_{\textstyle \phi[u_1(\sum x_i^2),u_2(\sum x_i);\theta_1,\theta_2]}}\times \color{red}\underbrace{\color{black}{1}}_{\textstyle h(x_1,...,x_n)}\]
Therefore, the Factorization Theorem tells us that \(Y_1=\sum_{i=1}^{n}X^{2}_{i}\) and \(Y_2=\sum_{i=1}^{n}X_i\) are joint sufficient statistics for \(\theta_1\) and \(\theta_2.\) And, the one-to-one functions of \(Y_1\) and \(Y_2\), namely:
\[\begin{align}\bar{X} &=\dfrac{Y_2}{n}=\dfrac{1}{n}\sum_{i=1}^{n}X_i \\ &\text{ and }\\ S^2&=\dfrac{Y_1-(Y_{2}^{2}/n)}{n-1}\\&=\dfrac{1}{n-1} \left[\sum_{i=1}^{n}X_{i}^{2}-n\bar{X}^2 \right]\end{align}\]
are also joint sufficient statistics for \(\theta_1\) and \(\theta_2.\) Aha! We have just shown that the intuitive estimators of \(\mu\) and \(\sigma^2\) are also sufficient estimators. That is, the data contain no more information than the estimators \(\bar{X}\) and \(S^2\) do about the parameters \(\mu\) and \(\sigma^2\)! That seems like a good thing!
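The one-to-one correspondence between \((Y_1, Y_2)\) and \((\bar{X}, S^2)\) is easy to demonstrate on data. The sketch below (sample values chosen for illustration) recovers \(\bar{X}\) and \(S^2\) from the two joint sufficient statistics and checks them against Python's standard-library computations:

```python
from statistics import mean, variance
from math import isclose

xs = [4.1, 5.7, 3.3, 6.0, 5.2]
n = len(xs)

y1 = sum(x * x for x in xs)   # Y1 = sum of squares
y2 = sum(xs)                  # Y2 = sum

# Recover x-bar and S^2 from (Y1, Y2) alone, using the formulas in the text.
xbar = y2 / n
s2 = (y1 - y2 ** 2 / n) / (n - 1)

assert isclose(xbar, mean(xs))
assert isclose(s2, variance(xs))   # statistics.variance uses the n-1 divisor
```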
We have just extended the Factorization Theorem. Now, the Exponential Criterion can also be extended to accommodate two (or more) parameters. It is stated here without proof.

Let \(X_1, X_2, \ldots, X_n\) be a random sample from a distribution with a p.d.f. (or p.m.f.) that can be written in exponential form:

\[f(x;\theta_1,\theta_2) =exp\left[K_1(x)p_1(\theta_1,\theta_2) + K_2(x)p_2(\theta_1,\theta_2) + S(x) + q(\theta_1,\theta_2) \right]\]

with a support that does not depend on either parameter. Then, the statistics \(Y_1=\sum_{i=1}^{n}K_1(X_i)\) and \(Y_2=\sum_{i=1}^{n}K_2(X_i)\) are joint sufficient statistics for \(\theta_1\) and \(\theta_2.\)
Let’s try applying the extended exponential criterion to our previous example.
Example 3.7 Let \(X_1, X_2, \ldots, X_n\) denote a random sample from a normal distribution \(N(\theta_1, \theta_2).\) That is, \(\theta_1\) denotes the mean \(\mu\) and \(\theta_2\) denotes the variance \(\sigma^2.\) Use the Exponential Criterion to find joint sufficient statistics for \(\theta_1\) and \(\theta_2.\)
Solution
The probability density function of a normal random variable with mean \(\theta_1\) and variance \(\theta_2\) can be written in exponential form as:
\[f(x;\theta_1,\theta_2) =exp[\frac{-1}{2\theta_2}{\color{blue}\underbrace{\color{black}x^2}_{\textstyle K_1(x) }}+\frac{\theta_1}{\theta_2}{\color{blue}\underbrace{\color{black}x}_{\textstyle K_2(x)}}+ {\color{brown}\underbrace{\color{black}\text{log}(1)}_{\textstyle S(x)}} - {\color{green}\underbrace{\color{black}\left(\frac{\theta_1^2}{2\theta_2}+\text{log}\sqrt{2\pi\theta_2}\right)}_{\textstyle q(\theta_1, \theta_2)}}]\]
Therefore, the statistics \(Y_1=\sum_{i=1}^{n}X^{2}_{i}\) and \(Y_2=\sum_{i=1}^{n}X_i\) are joint sufficient statistics for \(\theta_1\) and \(\theta_2.\)
3.2 Method of Moments
So far, we have been provided with an estimator and we calculated properties associated with the estimator. What if we do not have an estimator? How would we go about finding one?
In this lesson, we introduce a crude method of finding an estimator, the method of moments. In the next section, we dive into detail for finding maximum likelihood estimators.
In short, the method of moments involves equating sample moments with theoretical moments. So, let’s start by making sure we recall the definitions of theoretical moments, as well as learn the definitions of sample moments.
Definitions
- \(E(X^k)\) is the \(k^{th}\) (theoretical) moment of the distribution (about the origin), for \(k=1, 2, \ldots\)
- \(E\left[(X-\mu)^k\right]\) is the \(k^{th}\) (theoretical) moment of the distribution (about the mean), for \(k=1, 2, \ldots\)
- \(M_k=\dfrac{1}{n}\sum\limits_{i=1}^n X_i^k\) is the \(k^{th}\) sample moment, for \(k=1, 2, \ldots\)
- \(M_k^\ast =\dfrac{1}{n}\sum\limits_{i=1}^n (X_i-\bar{X})^k\) is the \(k^{th}\) sample moment about the mean, for \(k=1, 2, \ldots\)
One Form of the Method
The basic idea behind this form of the method is to:
1. Equate the first sample moment about the origin \(M_1=\dfrac{1}{n}\sum\limits_{i=1}^n X_i=\bar{X}\) to the first theoretical moment \(E(X).\)
2. Equate the second sample moment about the origin \(M_2=\dfrac{1}{n}\sum\limits_{i=1}^n X_i^2\) to the second theoretical moment \(E(X^2).\)
3. Continue equating sample moments about the origin, \(M_k\), with the corresponding theoretical moments \(E(X^k), \; k=3, 4, \ldots\) until you have as many equations as you have parameters.
4. Solve for the parameters.
The resulting values are called method of moments estimators. It seems reasonable that this method would provide good estimates since the empirical distribution converges in some sense to the probability distribution. Therefore, the corresponding moments should be about equal.
Example 3.8 Let \(X_1, X_2, \ldots, X_n\) be Bernoulli random variables with parameter \(p.\) What is the method of moments estimator of \(p\)?
Solution
Here, the first theoretical moment about the origin is:
\[E(X_i)=p\]
We have just one parameter for which we are trying to derive the method of moments estimator. Therefore, we need just one equation. Equating the first theoretical moment about the origin with the corresponding sample moment, we get:
\[p=\dfrac{1}{n}\sum\limits_{i=1}^n X_i\]
Now, we just have to solve for \(p.\) Whoops! In this case, the equation is already solved for \(p.\) Our work is done! We just need to put a hat (^) on the parameter to make it clear that it is an estimator. We can also subscript the estimator with an “MM” to indicate that the estimator is the method of moments estimator:
\[\hat{p}_{MM}=\dfrac{1}{n}\sum\limits_{i=1}^n X_i\]
So, in this case, the method of moments estimator is the same as the maximum likelihood estimator, namely, the sample proportion.
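The estimator reduces to a one-liner on data. A minimal sketch, with a hypothetical Bernoulli sample:

```python
from math import isclose

xs = [1, 0, 1, 1, 0, 1, 0, 1]   # hypothetical 0/1 (Bernoulli) data

# Method of moments: set E(X) = p equal to the first sample moment, x-bar.
p_hat = sum(xs) / len(xs)

assert isclose(p_hat, 5 / 8)    # 5 successes out of 8 trials
```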
Example 3.9 Let \(X_1, X_2, \ldots, X_n\) be normal random variables with mean \(\mu\) and variance \(\sigma^2.\) What are the method of moments estimators of the mean \(\mu\) and variance \(\sigma^2\)?
Solution
The first and second theoretical moments about the origin are:
\[E(X_i)=\mu\qquad E(X_i^2)=\sigma^2+\mu^2\]
(Incidentally, in case it’s not obvious, that second moment can be derived from manipulating the shortcut formula for the variance.) In this case, we have two parameters for which we are trying to derive the method of moments estimators. Therefore, we need two equations here. Equating the first theoretical moment about the origin with the corresponding sample moment, we get:
\[E(X)=\mu=\dfrac{1}{n}\sum\limits_{i=1}^n X_i\]
And, equating the second theoretical moment about the origin with the corresponding sample moment, we get:
\[E(X^2)=\sigma^2+\mu^2=\dfrac{1}{n}\sum\limits_{i=1}^n X_i^2\]
Now, the first equation tells us that the method of moments estimator for the mean \(\mu\) is the sample mean:
\[\hat{\mu}_{MM}=\dfrac{1}{n}\sum\limits_{i=1}^n X_i=\bar{X}\]
And, substituting the sample mean for \(\mu\) in the second equation and solving for \(\sigma^2\), we get that the method of moments estimator for the variance \(\sigma^2\) is:
\[\hat{\sigma}^2_{MM}=\dfrac{1}{n}\sum\limits_{i=1}^n X_i^2-\bar{X}^2\]
which can be rewritten as:
\[\hat{\sigma}^2_{MM}=\dfrac{1}{n}\sum\limits_{i=1}^n( X_i-\bar{X})^2\]
Again, for this example, the method of moments estimators are the same as the maximum likelihood estimators.
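The two-equation derivation for the normal distribution can be sketched numerically. Here the true values \(\mu = 5\) and \(\sigma = 2\) are hypothetical, chosen only so we can compare the estimates against something:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical true parameters for the simulation.
mu_true, sigma_true = 5.0, 2.0
x = rng.normal(mu_true, sigma_true, size=10_000)

# Sample moments about the origin.
m1 = x.mean()          # matches E(X) = mu
m2 = np.mean(x**2)     # matches E(X^2) = sigma^2 + mu^2

# Solve the two moment equations for mu and sigma^2.
mu_hat_mm = m1
sigma2_hat_mm = m2 - m1**2

print(mu_hat_mm, sigma2_hat_mm)
```

Note that \(m_2 - m_1^2\) is algebraically the same as \(\frac{1}{n}\sum (X_i - \bar{X})^2\), the (biased) sample variance with divisor \(n\).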
In some cases, rather than using the sample moments about the origin, it is easier to use the sample moments about the mean. Doing so provides us with an alternative form of the method of moments.
Another Form of the Method
The basic idea behind this form of the method is to:
1. Equate the first sample moment about the origin, \(M_1=\dfrac{1}{n}\sum\limits_{i=1}^n X_i=\bar{X}\), to the first theoretical moment \(E(X).\)
2. Equate the second sample moment about the mean, \(M_2^\ast=\dfrac{1}{n}\sum\limits_{i=1}^n (X_i-\bar{X})^2\), to the second theoretical moment about the mean \(E[(X-\mu)^2].\)
3. Continue equating sample moments about the mean \(M^\ast_k\) with the corresponding theoretical moments about the mean \(E[(X-\mu)^k]\), \(k=3, 4, \ldots\), until you have as many equations as you have parameters.
4. Solve for the parameters.
Again, the resulting values are called method of moments estimators.
Example 3.10 Let \(X_1, X_2, \dots, X_n\) be gamma random variables with parameters \(\alpha\) and \(\theta\), so that the probability density function is:
\[f(x_i)=\dfrac{1}{\Gamma(\alpha) \theta^\alpha}x_i^{\alpha-1}e^{-x_i/\theta}\]
for \(x_i>0.\) Therefore, the likelihood function:
\[L(\alpha,\theta)=\left(\dfrac{1}{\Gamma(\alpha) \theta^\alpha}\right)^n (x_1x_2\ldots x_n)^{\alpha-1}\text{exp}\left[-\dfrac{1}{\theta}\sum x_i\right]\]
is difficult to differentiate because of the gamma function \(\Gamma(\alpha).\) So, rather than finding the maximum likelihood estimators, what are the method of moments estimators of \(\alpha\) and \(\theta\)?
Solution
The first theoretical moment about the origin is:
\[E(X_i)=\alpha\theta\]
And the second theoretical moment about the mean is:
\[\text{Var}(X_i)=E\left[(X_i-\mu)^2\right]=\alpha\theta^2\]
Again, since we have two parameters for which we are trying to derive method of moments estimators, we need two equations. Equating the first theoretical moment about the origin with the corresponding sample moment, we get:
\[E(X)=\alpha\theta=\dfrac{1}{n}\sum\limits_{i=1}^n X_i=\bar{X}\]
And, equating the second theoretical moment about the mean with the corresponding sample moment, we get:
\[\text{Var}(X)=\alpha\theta^2=\dfrac{1}{n}\sum\limits_{i=1}^n (X_i-\bar{X})^2\]
Now, we just have to solve for the two parameters \(\alpha\) and \(\theta.\) Let’s start by solving for \(\alpha\) in the first equation \((E(X)).\) Doing so, we get:
\[\alpha=\dfrac{\bar{X}}{\theta}\]
Now, substituting \(\alpha=\dfrac{\bar{X}}{\theta}\) into the second equation (\(\text{Var}(X)\)), we get:
\[\alpha\theta^2=\left(\dfrac{\bar{X}}{\theta}\right)\theta^2=\bar{X}\theta=\dfrac{1}{n}\sum\limits_{i=1}^n (X_i-\bar{X})^2\]
Now, solving for \(\theta\) in that last equation, and putting on its hat, we get that the method of moments estimator for \(\theta\) is:
\[\hat{\theta}_{MM}=\dfrac{1}{n\bar{X}}\sum\limits_{i=1}^n (X_i-\bar{X})^2\]
And, substituting that value of \(\theta\) back into the equation we have for \(\alpha\), and putting on its hat, we get that the method of moments estimator for \(\alpha\) is:
\[\hat{\alpha}_{MM}=\dfrac{\bar{X}}{\hat{\theta}_{MM}}=\dfrac{\bar{X}}{(1/n\bar{X})\sum\limits_{i=1}^n (X_i-\bar{X})^2}=\dfrac{n\bar{X}^2}{\sum\limits_{i=1}^n (X_i-\bar{X})^2}\]
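The gamma derivation above translates directly into code. In this sketch the true shape \(\alpha = 3\) and scale \(\theta = 2\) are hypothetical values used only to generate a sample we can check the estimators against:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical true shape and scale for the simulation.
alpha_true, theta_true = 3.0, 2.0
x = rng.gamma(shape=alpha_true, scale=theta_true, size=50_000)

xbar = x.mean()                 # matches E(X) = alpha * theta
v = np.mean((x - xbar) ** 2)    # matches Var(X) = alpha * theta^2

# Solve alpha*theta = xbar and alpha*theta^2 = v for theta and alpha.
theta_hat_mm = v / xbar         # = (1/(n*xbar)) * sum (x_i - xbar)^2
alpha_hat_mm = xbar**2 / v      # = n * xbar^2 / sum (x_i - xbar)^2

print(alpha_hat_mm, theta_hat_mm)
```

The two closed-form estimators require no iterative optimization, which is exactly the advantage over maximum likelihood here, where the gamma function \(\Gamma(\alpha)\) makes the likelihood equations intractable in closed form.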
Example 3.11 Let’s return to the example in which \(X_1, X_2, \ldots, X_n\) are normal random variables with mean \(\mu\) and variance \(\sigma^2.\) What are the method of moments estimators of the mean \(\mu\) and variance \(\sigma^2\)?
Solution
The first theoretical moment about the origin is:
\[E(X_i)=\mu\]
And, the second theoretical moment about the mean is:
\[\text{Var}(X_i)=E\left[(X_i-\mu)^2\right]=\sigma^2\]
Again, since we have two parameters for which we are trying to derive method of moments estimators, we need two equations. Equating the first theoretical moment about the origin with the corresponding sample moment, we get:
\[E(X)=\mu=\dfrac{1}{n}\sum\limits_{i=1}^n X_i\]
And, equating the second theoretical moment about the mean with the corresponding sample moment, we get:
\[\sigma^2=\dfrac{1}{n}\sum\limits_{i=1}^n (X_i-\bar{X})^2\]
Now, we just have to solve for the two parameters. Oh! Well, in this case, the equations are already solved for \(\mu\) and \(\sigma^2.\) Our work is done! We just need to put a hat (^) on the parameters to make it clear that they are estimators. Doing so, we get that the method of moments estimator of \(\mu\) is:
\[\hat{\mu}_{MM}=\bar{X}\]
(which we know, from our previous work, is unbiased). The method of moments estimator of \(\sigma^2\) is:
\[\hat{\sigma}^2_{MM}=\dfrac{1}{n}\sum\limits_{i=1}^n (X_i-\bar{X})^2\]
(which we know, from our previous work, is biased). This example, in conjunction with Example 3.9, illustrates how the two forms of the method can require different amounts of work depending on the situation.
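As a final check, the two forms of the method should give identical answers for the normal variance. This sketch (with hypothetical parameters \(\mu = 10\), \(\sigma = 3\)) computes \(\hat{\sigma}^2_{MM}\) both ways:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(10.0, 3.0, size=5_000)   # hypothetical sample

# Form 1: moments about the origin (Example 3.9).
m1 = x.mean()
m2 = np.mean(x**2)
sigma2_form1 = m2 - m1**2

# Form 2: second sample moment about the mean (Example 3.11).
sigma2_form2 = np.mean((x - m1) ** 2)

print(sigma2_form1, sigma2_form2)
```

Both forms yield the same estimate; the second form simply arrives at it with less algebra.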
3.3 Summary
In this lesson, we explored how estimators are derived and evaluated. We developed a deeper understanding of sufficiency, learning how to identify statistics that capture all the information about a parameter using the Factorization Theorem and the Exponential Criterion. We also introduced the method of moments, which matches sample moments to theoretical moments to estimate parameters. These tools help simplify complex data while preserving what's essential for inference.
Key Takeaways
- The method of moments provides a straightforward way to find estimators.
- A sufficient statistic contains all information needed to estimate a parameter and allows us to reduce the data without losing inferential power.
- The Factorization Theorem helps determine whether a statistic is sufficient.
- The Exponential Criterion is a shortcut for checking sufficiency in exponential family models.