Asymptotic Distribution of MLE (Part I)
Overview
Let’s revisit maximum likelihood estimators (MLEs) to explore their asymptotic properties and their application in constructing confidence intervals. This lesson introduces the theoretical foundations of MLEs, including their consistency, equivariance, and asymptotic normality under regularity conditions. You will learn to derive the Fisher Information for single and multiple parameter models and use it to construct asymptotic confidence intervals. The lesson also covers numerical approximations of these intervals using R’s optim function. Through examples involving the Bernoulli, exponential, normal, and geometric distributions, you will gain practical skills in applying these techniques, preparing you for advanced statistical inference. Let us dive in.
Objectives
Upon completion of this lesson, you should be able to:
1. Apply the properties of the MLE to find an asymptotic confidence interval for \(\theta\),
2. Analytically find asymptotic confidence intervals using MLEs, and
3. Use R to numerically find an asymptotic confidence interval for \(\theta\) using the MLE.
7.1 Properties of MLEs
Maximum likelihood estimation is popular because there are a number of provable properties that apply to ANY MLE!
The following result holds for any MLE, subject to some “regularity conditions”. The regularity conditions include
The underlying probability function, \(f(x|\theta)\), is “smooth”.
The estimator, \(\hat{\theta}_{mle}\), is NOT on the edge of the allowable values; for example, the MLE does not lie on the boundary of the parameter space.
We will not show the full list of conditions nor proofs of the following properties. You can find them in the Wasserman text, chapter 9.13.
Property 1: The estimator, \(\hat{\theta}_{mle}\) is consistent.
This means that as the sample size \(n\) approaches infinity, \(\hat{\theta}_{mle}\) approaches the true \(\theta\). A corollary to this is that as \(n\) approaches infinity, \(E(\hat{\theta}_{mle})\) approaches the true \(\theta\).
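Consistency is easy to see by simulation. The following sketch (illustrative only; the true parameter value, seed, and sample sizes are arbitrary choices, not from the lesson) draws Bernoulli samples of increasing size and watches the MLE \(\hat{p}=\bar{x}\) settle toward the true \(p\):

```r
## Illustrative sketch: consistency of the Bernoulli MLE p.hat = mean(x).
## The true p, the seed, and the sample sizes are arbitrary choices.
set.seed(42)
p.true <- 0.3
for (n in c(10, 100, 10000)) {
  x <- rbinom(n, size = 1, prob = p.true)   # n Bernoulli(p.true) draws
  p.hat <- mean(x)                          # the MLE of p
  cat("n =", n, " p.hat =", p.hat, "\n")
}
```

As \(n\) grows, the printed estimates concentrate around 0.3.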
Property 2: The estimator, \(\hat{\theta}_{mle}\), is equivariant.
If \(\hat{\theta}\) is the MLE of \(\theta\) and \(g(\theta)\) is an invertible function, then the MLE of \(\tau=g(\theta)\) is \(\hat{\tau}=g(\hat{\theta})\).
This means that the MLE of a function of a parameter is the function of the MLE of the parameter, as long as the function is invertible.
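A quick numerical check of equivariance (a sketch with an arbitrary simulated Normal sample, not part of the original lesson): maximizing the likelihood over \(\sigma\) directly gives the same answer as transforming the MLE of \(\sigma^2\).

```r
## Illustrative check of equivariance for Normal data (arbitrary simulated sample).
set.seed(1)
x <- rnorm(50, mean = 2, sd = 3)
## MLE of sigma^2 in closed form, then transform via g(v) = sqrt(v):
sigma2.hat <- mean((x - mean(x))^2)
sigma.hat.transformed <- sqrt(sigma2.hat)
## Maximize the likelihood over sigma directly (mu fixed at its MLE, mean(x)):
nll <- function(s) -sum(dnorm(x, mean = mean(x), sd = s, log = TRUE))
sigma.hat.direct <- optimize(nll, interval = c(0.01, 10))$minimum
## The two estimates agree up to numerical tolerance:
c(sigma.hat.transformed, sigma.hat.direct)
```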
Example 7.1 Let \(X_i\sim N(\mu, \sigma^2)\). That is, let \(X_1, X_2, \ldots, X_n\) be iid Normal random variables with mean \(\mu\) and variance \(\sigma^2\). We know that \(\hat{\mu}=\bar{x}\) and \(\hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2\) are the MLEs of \(\mu\) and \(\sigma^2\), respectively. What is the MLE of the standard deviation, \(\sigma\)?
Solution
We know \(\sigma=g(\sigma^2)=\sqrt{\sigma^2}\). Therefore, based on the equivariant property, the MLE of \(\sigma\) is: \[\begin{align*}
\hat{\sigma}_{mle}=\sqrt{\hat{\sigma}^2_{mle}}=\sqrt{\frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^2}
\end{align*}\]
Example 7.2 Let \(X_i\sim Bin(5, p)\), for \(i=1, \ldots, n\). What is the MLE of the odds ratio, \(\text{OR}=\frac{p}{1-p}\)?
Solution
The MLE of \(p\) is \(\hat{p}=\dfrac{\sum_{i=1}^n x_i}{5n}\). Therefore, the MLE of \(\text{OR}\) is \[\begin{align*}
\hat{\text{OR}}_{mle}=\dfrac{\hat{p}}{1-\hat{p}}=\dfrac{\dfrac{\sum_{i=1}^n x_i}{5n}}{1-\dfrac{\sum_{i=1}^n x_i}{5n}}
\end{align*}\]
Property 3: MLEs are Asymptotically Normal.
For \(X_i\sim f(x_i|\theta)\), \(i=1, \ldots, n\), the MLE is \(\hat{\theta}=\text{argmax}_{\theta}(L(\theta))\). As \(n\) approaches infinity, the MLE \(\hat{\theta}\) is approximately normal: \[\begin{align*}
\hat{\theta}\approx N\left(\theta_{\text{true}}, \frac{1}{I(\hat{\theta}_{mle})}\right)
\end{align*}\]
Here “\(\approx\)” means “is approximately distributed as”. The function \(I(\theta)\) is called the “Fisher Information” or the “Information” function. It is defined as \[\begin{align*}
I(\theta)=-E\left(\frac{d^2}{d\theta^2}\ell(\theta)\right)
\end{align*}\]
Example 7.3 Let \(X_1, \ldots, X_n\) be an i.i.d. sample from a Bernoulli distribution with parameter \(p\). Find the MLE of \(p\) and find \(I(\hat{p})\). Then find the approximate normal distribution of \(\hat{p}\).
Solution
The pmf is \(f(x|p)=p^x(1-p)^{1-x}\). The likelihood function is: \[\begin{align*}
L(p)=\prod_{i=1}^n p^{x_i}(1-p)^{1-x_i}=p^{\sum x_i}(1-p)^{n-\sum x_i}
\end{align*}\]
The log-likelihood function is: \[\begin{align*}
\ell(p)=\sum x_i \ln(p)+\left(n-\sum x_i\right)\ln (1-p)
\end{align*}\]
The derivative of the log-likelihood function is: \[\begin{align*}
\frac{d}{dp}\ell(p)=\frac{\sum x_i}{p}-\frac{n-\sum x_i}{1-p}
\end{align*}\]
Setting the derivative equal to zero and solving gives \(0=\sum x_i-np\), so the MLE of \(p\) is \(\hat{p}=\dfrac{\sum x_i}{n}\). Next, let’s find \(I(p)=-E\left(\frac{d^2}{dp^2}\ell(p)\right)\). The second derivative is \[\begin{align*}
\frac{d^2}{dp^2}\ell(p) = -\frac{\sum x_i}{p^2}-\frac{n-\sum x_i}{(1-p)^2}
\end{align*}\]
Next, we need to find the negative expectation of the second derivative. \[\begin{align*}
I(p)&=-E\left(\frac{d^2}{dp^2}\ell(p)\right)\\ &=-E\left(-\frac{\sum x_i}{p^2}-\frac{n-\sum x_i}{(1-p)^2}\right)\\ &=E\left(\frac{\sum x_i}{p^2}+\frac{n-\sum x_i}{(1-p)^2}\right)\\
&=\frac{\sum E(x_i)}{p^2}+\frac{n-\sum E(x_i)}{(1-p)^2}
\end{align*}\]
For a Bernoulli random variable \(E(X)=p\). Therefore, \[\begin{align*}
I(p)&=\frac{\sum E(x_i)}{p^2}+\frac{n-\sum E(x_i)}{(1-p)^2}\\ &=\frac{np}{p^2}+\frac{n-np}{(1-p)^2}\\ &=\frac{np(1-p)^2+np^2-np^3}{p^2(1-p)^2}\\
& =\frac{np((1-p)^2+p(1-p))}{p^2(1-p)^2}\\ &=\frac{np(1-p)(1-p+p)}{p^2(1-p)^2}\\ &=\frac{n}{p(1-p)}
\end{align*}\]
If \(I(p)=\frac{n}{p(1-p)}\), then \(I(\hat{p})=\dfrac{n}{\hat{p}(1-\hat{p})}=\dfrac{n}{\dfrac{\sum x_i}{n}\left(1-\dfrac{\sum x_i}{n}\right)}=\dfrac{n}{\bar{x}(1-\bar{x})}\).
Putting this all together, we have \(\hat{p}\approx N\left(p_{true}, \dfrac{1}{I(\hat{p})}\right)\), which is equivalent to \(\hat{p}\approx N\left(p_{true}, \dfrac{\bar{x}(1-\bar{x})}{n}\right)\).
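The normal approximation can be checked by simulation. This sketch (not part of the original lesson; the true \(p\), sample size, and seed are arbitrary) draws many Bernoulli samples and compares the empirical variance of \(\hat{p}\) to the asymptotic variance \(p(1-p)/n\):

```r
## Illustrative simulation: sampling distribution of the Bernoulli MLE.
set.seed(7)
n <- 200; p <- 0.4; reps <- 5000
p.hats <- replicate(reps, mean(rbinom(n, 1, p)))   # 5000 realizations of p.hat
## Empirical variance vs. the asymptotic variance p(1-p)/n:
c(empirical = var(p.hats), asymptotic = p * (1 - p) / n)
```

Both numbers come out close to 0.0012, and a histogram of `p.hats` looks bell-shaped around 0.4.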
Now that we have determined an asymptotic distribution of MLEs (under certain conditions), we can use this information (with large sample size) to construct interval estimates for our unknown parameter, \(\theta\).
7.2 Asymptotic Confidence Intervals
7.2.1 Single Parameter Case
In this section, we put together the Properties discussed in the previous section to construct an asymptotic confidence interval for the unknown parameter, \(\theta\), using the MLE.
The normal approximation to the distribution of any MLE gives us an easy way to find an asymptotic confidence interval for the unknown parameter.
Property 4: Approximate Confidence Interval
Let \(X_i\) follow a distribution with probability distribution \(f(x_i|\theta)\). Let \(\hat{\theta}\) be the MLE of \(\theta\). We have \(I(\theta)=-E\left(\frac{d^2}{d\theta^2} \ell(\theta)\right)\). Then an asymptotic 95% confidence interval for \(\theta\) is \[\begin{align*}
\hat{\theta}\pm 1.96 \sqrt{\frac{1}{I(\hat{\theta})}}
\end{align*}\]
Example 7.4 Continuing the previous example, we know \(\hat{p}=\bar{x}\), \(I(\hat{p})=\frac{n}{\bar{x}(1-\bar{x})}\), and \(\hat{p}\approx N\left(p_{true}, \frac{\bar{x}(1-\bar{x})}{n}\right)\). Find a 95% asymptotic confidence interval for \(p\).
Solution
An approximate 95% confidence interval for \(p\) is: \[\begin{align*}
\bar{x}\pm 1.96 \sqrt{\frac{\bar{x}(1-\bar{x})}{n}}
\end{align*}\]
Example 7.5 Suppose we have the following 10 observations from a Bernoulli distribution with parameter \(p\). Calculate a 95% asymptotic confidence interval for \(p\).
1, 1, 0, 1, 0, 0, 0, 1, 1, 1
Solution
Continuing the previous example, we know \(\hat{p}=\bar{x}\), \(I(\hat{p})=\dfrac{n}{\bar{x}(1-\bar{x})}\), and \(\hat{p}\approx N\left(p_{true}, \dfrac{\bar{x}(1-\bar{x})}{n}\right)\).
We can calculate \(\bar{x}=0.6\) and know \(n=10\). Therefore,
An approximate 95% confidence interval for \(p\) is: \[\begin{align*}
\bar{x}\pm 1.96 \sqrt{\frac{\bar{x}(1-\bar{x})}{n}}=0.6\pm 1.96\sqrt{\frac{0.6(1-0.6)}{10}}=0.6\pm 0.3036=(0.2964, 0.9036)
\end{align*}\]
Note!
The sample size is not large in this example, but it was chosen for practice purposes.
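As a sanity check, the hand calculation above can be reproduced in R using the lesson's ten observations:

```r
## Check of the hand calculation for the 10 Bernoulli observations.
x <- c(1, 1, 0, 1, 0, 0, 0, 1, 1, 1)
n <- length(x)
p.hat <- mean(x)                              # 0.6
me <- 1.96 * sqrt(p.hat * (1 - p.hat) / n)    # margin of error, approx 0.3036
c(lower = p.hat - me, upper = p.hat + me)     # approx (0.2964, 0.9036)
```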
Example 7.6 Let \(X_1, X_2, \ldots, X_n\) be a random sample from \(\text{Exp}(\theta)\). Find a 95% asymptotic CI for \(\theta\), using the MLE of \(\theta\).
Solution
The pdf is \(f(x_i|\theta)=\frac{1}{\theta}e^{-\frac{x_i}{\theta}}\) and \(E(X_i)=\theta\).
Step 1: Find the likelihood function. \[\begin{align*}
L(\theta)=\prod_{i=1}^n \frac{1}{\theta}e^{-\frac{x_i}{\theta}}=\theta^{-n}e^{-\frac{\sum x_i}{\theta}}
\end{align*}\]
Step 2: Find the log-likelihood function. \[\begin{align*}
\ell(\theta)=-n\log \theta -\frac{\sum x_i}{\theta}
\end{align*}\]
Step 3: Find the derivative of the log-likelihood function with respect to \(\theta\). \[\begin{align*}
\frac{d}{d\theta}\ell(\theta)=-\frac{n}{\theta}+\frac{\sum x_i}{\theta^2}
\end{align*}\]
Step 4: Set the derivative equal to 0 and solve for \(\theta\). \[\begin{align*}
0=\frac{-n\theta+\sum x_i}{\theta^2} \qquad \Rightarrow \qquad 0=-n\theta+\sum x_i \qquad \Rightarrow \qquad \hat{\theta}=\frac{\sum x_i}{n}=\bar{x}
\end{align*}\]
Step 5: Find \(I(\hat{\theta})=-E\left(\frac{d^2}{d\theta^2}\ell(\theta)\right)\Big|_{\theta=\hat{\theta}}\). \[\begin{align*}
\frac{d^2}{d\theta^2}\ell(\theta)&=\frac{n}{\theta^2}-\frac{2\sum x_i}{\theta^3}=\frac{n\theta -2\sum x_i}{\theta^3}\\
I(\hat{\theta})&=-E\left[\frac{n\theta -2\sum x_i}{\theta^3}\right]\Bigg|_{\theta=\hat{\theta}}=-\frac{n\hat{\theta}-2n\hat{\theta}}{\hat{\theta}^3}=\frac{n}{\hat{\theta}^2}=\frac{n}{\bar{x}^2}
\end{align*}\]
Therefore, since \(\frac{1}{I(\hat{\theta})}=\frac{\bar{x}^2}{n}\), a 95% asymptotic confidence interval for \(\theta\) is: \[\hat{\theta}\pm 1.96\sqrt{\frac{1}{I(\hat{\theta})}}=\bar{x}\pm 1.96 \sqrt{\frac{\bar{x}^2}{n}}=\bar{x}\pm 1.96\,\frac{\bar{x}}{\sqrt{n}}\]
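Once data are in hand, this interval is a one-liner. The sketch below (simulated data with an arbitrary true mean and seed, not from the lesson) applies the result \(I(\hat{\theta})=n/\bar{x}^2\), so the standard error is \(\bar{x}/\sqrt{n}\):

```r
## Illustrative sketch: asymptotic CI for the mean of an exponential sample.
## Uses I(theta.hat) = n / xbar^2, so se = sqrt(1/I) = xbar / sqrt(n).
set.seed(3)
x <- rexp(80, rate = 1 / 5)        # simulated data; true theta (mean) = 5
theta.hat <- mean(x)               # the MLE of theta
se <- theta.hat / sqrt(length(x))
c(lower = theta.hat - 1.96 * se, upper = theta.hat + 1.96 * se)
```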
7.2.2 Multiple Parameter Case
In the multivariate case, where there are two or more parameters, \(\boldsymbol\theta=(\theta_1,\theta_2,\ldots,\theta_p)\) is the vector of \(p\) parameters to be estimated, and the asymptotic distribution is as follows: as \(n\rightarrow \infty\), \[ \hat{\boldsymbol\theta}_{ML}\sim N\left(\boldsymbol\theta_{\text{true}},\mathbf{I}^{-1}(\hat{\boldsymbol\theta}_{ML})\right)\] where \(\mathbf{I}(\boldsymbol\theta)\) is the “Fisher Information Matrix”, defined by \[\mathbf{I}(\boldsymbol\theta) = \left[\begin{array}{cccc}
-E_x\left[\frac{\partial^2}{\partial \theta_1^2}\ell(\theta)\right] & -E_x\left[\frac{\partial^2}{\partial \theta_1\partial\theta_2}\ell(\theta)\right] & \cdots & -E_x\left[\frac{\partial^2}{\partial \theta_1\partial\theta_p}\ell(\theta)\right]\\
-E_x\left[\frac{\partial^2}{\partial \theta_2\partial\theta_1}\ell(\theta)\right] & -E_x\left[\frac{\partial^2}{\partial\theta_2^2}\ell(\theta)\right] & \cdots & -E_x\left[\frac{\partial^2}{\partial \theta_2\partial\theta_p}\ell(\theta)\right] \\
\vdots & \vdots & & \vdots \\
-E_x\left[\frac{\partial^2}{\partial \theta_p\partial\theta_1}\ell(\theta)\right] & -E_x\left[\frac{\partial^2}{\partial \theta_p\partial\theta_2}\ell(\theta)\right] & \cdots & -E_x\left[\frac{\partial^2}{\partial \theta_p^2}\ell(\theta)\right] \end{array}\right]
\] where again \(\ell(\theta)=\log(L(\theta))\) and the expectation is taken over all \(x_1,x_2,\ldots,x_n\).
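As a concrete instance (a standard result, worked here for reference), for iid \(N(\mu,\sigma^2)\) data the mixed partials of \(\ell(\mu,\sigma^2)\) have expectation zero, so the Fisher Information Matrix and its inverse are diagonal: \[\begin{align*}
\mathbf{I}(\mu,\sigma^2)=\left[\begin{array}{cc} \dfrac{n}{\sigma^2} & 0\\ 0 & \dfrac{n}{2\sigma^4}\end{array}\right],
\qquad
\mathbf{I}^{-1}(\mu,\sigma^2)=\left[\begin{array}{cc} \dfrac{\sigma^2}{n} & 0\\ 0 & \dfrac{2\sigma^4}{n}\end{array}\right]
\end{align*}\]
so the asymptotic standard errors are \(\sigma/\sqrt{n}\) for \(\hat{\mu}\) and \(\sigma^2\sqrt{2/n}\) for \(\hat{\sigma}^2\).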
7.2.3 Summary: The Asymptotic 95% Confidence Interval
For any MLE, the asymptotic Gaussian distribution approximation can be used to construct a confidence interval. Let \(x_i\sim f_X(x|\theta)\), \(i=1,2,\ldots,n\) be \(n\) independent and identically distributed random variables from a distribution with only a single parameter \(\theta\), and let \(\hat{\theta}=\text{argmax}(L(\theta))\) be the MLE of the unknown parameter \(\theta\). Then from the asymptotic approximate Gaussian distribution of any MLE, an approximate asymptotic 95% confidence interval for \(\theta\) is \[\hat{\theta}\pm 1.96\sqrt{1/I(\hat{\theta}_{ML})}\]
Similarly, for a multiparameter model, an approximate asymptotic 95% confidence interval for the \(k\)-th parameter \(\theta_k\) is \[\hat{\theta}_k\pm 1.96\sqrt{\mathbf{I}^{-1}(\hat{\theta}_{ML})[k,k]}\] where \(\sqrt{\mathbf{I}^{-1}(\hat{\theta}_{ML})[k,k]}\) is the square root of the \(k\)-th diagonal entry of \(\mathbf{I}^{-1}(\hat{\theta}_{ML})\), the matrix inverse of \(\mathbf{I}(\hat{\theta}_{ML})\).
7.3 Numerical Approximations for CIs
For some relatively simple cases, we can analytically construct 95% confidence intervals for MLEs. This requires being able to analytically find the MLE (by taking the first derivative of the log-likelihood function and setting it equal to zero), and then also analytically find the Fisher Information (by taking two derivatives of the log-likelihood function and then taking the expectation over \(x_1,x_2,\ldots,x_n\)). As we have seen, there are many cases, even for simple statistical distributions, where it is impossible to analytically solve for the MLE. In these cases we numerically approximate the MLE using optim in R. Just like we can approximate the MLE using optim, we can also approximate the Fisher Information \(I(\hat{\theta})\) (in the single-parameter case) or \(\mathbf{I}(\hat{\boldsymbol\theta})\) (in the multiparameter case). We do this by including one additional input in our call to optim.
7.3.1 Single Parameter Case
We will walk through the numerical approximations using the case where we have data from a geometric distribution as follows:
## assume that x~geom(p)
x=c(4,11,2,5,1,0,11,9,0,0,1,2,17,9,6,0,0,12,3,1,3,0,0,1,27,0,2,0,24,9,4,0,0,8,13,8,6,1,4,9,12,11,2,5,3)
hist(x)
To find the MLE numerically, we would first write a function to evaluate the negative log likelihood as we have done before.
nll.geom=function(p,x){-sum(log(dgeom(x,p)))}
Then use optim to numerically approximate the MLE:
out=optim(0.5,nll.geom,x=x)
out
$par
[1] 0.1546875
$value
[1] 125.3257
$counts
function gradient
28 NA
$convergence
[1] 0
$message
NULL
If we run the same optim code, but include the option hessian=TRUE, then optim will approximate the Fisher Information, and return it in the optim object, with the name of hessian.
out=optim(0.5,nll.geom,x=x,hessian=TRUE)
out
$par
[1] 0.1546875
$value
[1] 125.3257
$counts
function gradient
28 NA
$convergence
[1] 0
$message
NULL
$hessian
[,1]
[1,] 2225.054
This hessian is the Fisher Information, so we can construct an approximate asymptotic 95% confidence interval for our MLE of the parameter \(p\) by following the above formulas:
## get the MLE
p.hat=out$par
p.hat
[1] 0.1546875
## get the Fisher Information
I=out$hessian
## get the 95% CI
p.hat-1.96*sqrt(1/I)
[,1]
[1,] 0.1131361
p.hat+1.96*sqrt(1/I)
[,1]
[1,] 0.1962389
So a 95% CI for the MLE of \(p\) would be (0.113,0.196).
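As a cross-check (a sketch, not from the lesson), the numeric hessian can be compared with the analytic Fisher Information for the geometric model. For dgeom's parameterization (number of failures before the first success), the information per observation is \(1/(p^2(1-p))\), so with \(n=45\) observations \(I(\hat{p})=45/(\hat{p}^2(1-\hat{p}))\):

```r
## Cross-check: numeric hessian vs. analytic Fisher Information for dgeom.
## (Repeats the optim call from above so this sketch is self-contained.)
x <- c(4,11,2,5,1,0,11,9,0,0,1,2,17,9,6,0,0,12,3,1,3,0,0,1,27,0,2,0,24,9,4,0,0,8,13,8,6,1,4,9,12,11,2,5,3)
nll.geom <- function(p, x) -sum(log(dgeom(x, p)))
out <- optim(0.5, nll.geom, x = x, hessian = TRUE)
p.hat <- out$par
## Analytic information: I(p) = n / (p^2 * (1 - p)) for dgeom's parameterization
I.analytic <- length(x) / (p.hat^2 * (1 - p.hat))
c(numeric = out$hessian[1, 1], analytic = I.analytic)   # both approx 2225
```

The agreement between the finite-difference hessian and the analytic value confirms the numeric approach.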
7.3.2 Multiple Parameter Case
When we have two or more parameters, numeric optimization in R, as well as numeric approximation of the Fisher Information Matrix \(\mathbf{I}(\hat{\boldsymbol\theta})\), can similarly be done using optim. To illustrate this, we will consider data generated from the Normal distribution. The following code simulates normally-distributed data from a distribution with a mean of -7 and variance of 16:
set.seed(1)
x=rnorm(100,mean=-7,sd=sqrt(16))
hist(x)
To find the MLE, we first write a function to evaluate the negative log likelihood:
nll.norm=function(theta,x){
  mn=theta[1]
  vr=theta[2]
  -sum(log(dnorm(x,mean=mn,sd=sqrt(vr))))
}
Then run optim to find the MLE. We include the hessian=TRUE option to tell optim to approximate the Fisher Information Matrix:
out=optim(c(-11,1),nll.norm,x=x,hessian=TRUE)
out
The MLEs of \(\boldsymbol\theta=(\mu,\sigma^2)\) are:
theta.hat=out$par
theta.hat
[1] -6.564774 12.773473
so the MLE of \(\mu\) is -6.564774 and the MLE of \(\sigma^2\) is 12.773473.
To find confidence intervals for each of these parameters, we need to find the matrix inverse of the Fisher Information \(\mathbf{I}\), and then get the square root of its diagonal elements. As these values are often called the “standard errors”, we will denote them as se in the R code:
## standard errors = sqrt(I.inv[k,k])
I=out$hessian
se=sqrt(diag(solve(I)))
se
[1] 0.357400 1.805645
This gives us a vector, where the \(k\)-th entry is \(\sqrt{\mathbf{I}^{-1}(\hat{\theta}_{ML})[k,k]}\). The standard error associated with \(\mu\) is the first of these (se[1]) and the standard error associated with \(\sigma^2\) is the second (se[2]). The order is the same as the order of the parameter estimates in out$par. We can thus construct 95% CIs for \(\mu\)
## CI for mu
c(theta.hat[1]-1.96*se[1] , theta.hat[1]+1.96*se[1])
[1] -7.265278 -5.864270
and for \(\sigma^2\):
## CI for sigma^2
c(theta.hat[2]-1.96*se[2] , theta.hat[2]+1.96*se[2])
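These numeric standard errors agree with the closed-form values implied by the Normal model's Fisher Information Matrix, \(se(\hat{\mu})=\hat{\sigma}/\sqrt{n}\) and \(se(\hat{\sigma}^2)=\hat{\sigma}^2\sqrt{2/n}\). The sketch below (regenerating the same simulated data) verifies this; the closed-form route is an illustrative cross-check, not part of the original lesson:

```r
## Sketch: closed-form MLEs and standard errors for the Normal example.
set.seed(1)
x <- rnorm(100, mean = -7, sd = sqrt(16))
n <- length(x)
mu.hat <- mean(x)                      # closed-form MLE of mu
sigma2.hat <- mean((x - mu.hat)^2)     # closed-form MLE of sigma^2
## Standard errors from the analytic Fisher Information Matrix:
se.mu <- sqrt(sigma2.hat / n)          # matches se[1] from the hessian
se.sigma2 <- sigma2.hat * sqrt(2 / n)  # matches se[2] from the hessian
c(se.mu, se.sigma2)
```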
This result is powerful because it allows us to use normal-based methods, like confidence intervals and hypothesis tests, even in complex models. The takeaway: MLEs are not only useful but also come with strong theoretical guarantees as sample size grows.
7.4 Summary
In this lesson we gained practical skills to analyze data using maximum likelihood estimation (MLE) and construct reliable confidence intervals. We can now calculate MLEs for parameters in distributions like Bernoulli, exponential, and normal, and use the Fisher Information to quantify uncertainty and build asymptotic confidence intervals. With R’s optim function, we can numerically estimate MLEs and intervals for complex models, such as geometric or normal distributions, even when analytical solutions are elusive.
Looking ahead, we will build on these results with parametric and nonparametric bootstrap methods for constructing confidence intervals, which are invaluable for tackling real-world datasets with non-standard distributions or small sample sizes, empowering you to make data-driven decisions in future statistical analyses.