4  Maximum Likelihood Estimation (MLE) (Part I)


In the last lesson, we introduced a way of finding an estimator using the Method of Moments. In this lesson, we will introduce another method to find an estimator, finding an estimator by maximizing the likelihood.

Objectives

Upon completion of this lesson, you should be able to:

  1. Write out the likelihood and log-likelihood functions for data from a specified probability model,
  2. Analytically calculate the maximum likelihood estimator (MLE) for a range of single-parameter models where the data are seen as unobservable random variables,
  3. Analytically calculate the maximum likelihood estimate for a range of single-parameter models where the data values are given (e.g., x=(1.6, 3.1, 7.8)),
  4. Analytically calculate the maximum likelihood estimator for multiple-parameter models where the data are seen as unobservable random variables, and
  5. Analytically calculate the maximum likelihood estimator for a range of single-parameter models where the unknown parameter appears in the support.

4.1 The Basic Idea

It seems reasonable that a good estimate of the unknown parameter \(\theta\) would be the value of \(\theta\) that maximizes the probability, errrr….that is, the **likelihood**….of getting the data we observed. (So, do you see from where the name ‘maximum likelihood’ comes?)

Def. 4.1 (Likelihood Function) Let \(X_1, X_2, \ldots, X_n\) be a random sample from a distribution with unknown parameters, \(\theta_1, \theta_2, \ldots, \theta_m\). Let the probability density (or mass) function be denoted \(f(x_i|\theta_1, \theta_2, \ldots, \theta_m)\). Suppose that \((\theta_1, \theta_2, \ldots, \theta_m)\) is restricted to a given parameter space \(\Omega\). The likelihood function, \(L(\theta_1, \theta_2, \ldots, \theta_m)\), is defined as the joint probability distribution of the sample. That is:

\[\begin{align*} L(\theta_1, \theta_2, \ldots, \theta_m)=\prod_{i=1}^n f(x_i|\theta_1, \theta_2, \ldots, \theta_m)=f(x_1|\theta_1, \theta_2, \ldots, \theta_m)f(x_2|\theta_1, \theta_2, \ldots, \theta_m)\cdots f(x_n|\theta_1, \theta_2, \ldots, \theta_m) \end{align*}\]

Note!
the symbol \(\prod\) means “the product”. Also note that we consider the likelihood function to be a function of \(\theta_1, \theta_2, \ldots, \theta_m\). We can multiply the individual probability distributions to find the joint distribution because the random variables are independent.

Example 4.1 Let \(X_i\overset{iid}{\sim} \text{Bin}(10, p)\) for \(i=1, 2, 3, 4, 5\). That is, let \(X_1, X_2, \ldots, X_5\) be an independent and identically distributed (iid) sample from a Binomial distribution with parameters \(n=10\) and \(p\). Find the likelihood function (often referred to as the likelihood).

Solution

The likelihood function is a product of the individual probability mass functions. Therefore, \[\begin{align*} L(p)=\prod_{i=1}^n f(x_i|p)=\prod_{i=1}^n {10\choose x_i}p^{x_i}(1-p)^{10-x_i}=\left[\prod_{i=1}^n {10\choose x_i}\right]p^{\sum_{i=1}^n x_i}(1-p)^{10n-\sum_{i=1}^n x_i} \end{align*}\]
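To make this concrete, here is a minimal sketch (in Python, with a hypothetical sample of five counts) that evaluates this Binomial likelihood at a few candidate values of \(p\):

```python
import math

def binom_likelihood(p, xs, size=10):
    """Product of Bin(size, p) pmfs over the observed counts xs."""
    L = 1.0
    for x in xs:
        L *= math.comb(size, x) * p**x * (1 - p)**(size - x)
    return L

xs = [3, 4, 2, 5, 4]          # hypothetical counts, one per X_i
for p in (0.2, 0.36, 0.5):
    print(f"L({p}) = {binom_likelihood(p, xs):.3e}")
```

Notice the likelihood is largest near \(p=\sum x_i/(10n)=0.36\), foreshadowing the estimator derived in the examples that follow.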

Now that we have a formal definition of the likelihood, the basic idea of maximum likelihood estimation is to treat the likelihood function \(L(\theta)\) as a function of \(\theta\) and find the value of \(\theta\) that maximizes it. For the moment, let’s consider the case with only one unknown parameter, \(\theta\). Recall from calculus that to find the maximum of \(L(\theta)\) (or \(\text{argmax}(L(\theta))\)) we first take the derivative of the function with respect to \(\theta\). Next, we set the derivative equal to 0 and solve for \(\theta\). At this point, we know we have a critical point. Finally, we need to check the second derivative to show that it is a maximum.

Example 4.2 Consider the previous example where we have \(X_i\overset{iid}{\sim} \text{Bin}(10, p)\) for \(i=1, 2, 3, 4, 5\). We found the likelihood function to be: \[\begin{align*} L(p)=\left[\prod_{i=1}^n {10\choose x_i}\right]p^{\sum_{i=1}^n x_i}(1-p)^{10n-\sum_{i=1}^n x_i} \end{align*}\] Take the first step toward maximizing \(L(p)\) by finding its derivative with respect to \(p\).

Solution

The first step is to find the likelihood, which we already have. The next step is to find the first derivative of the likelihood function with respect to \(p\). Now it is time to put on our calculus hats.

\[\begin{align*} \frac{d}{dp}L(p)=\left[\prod_{i=1}^n {10\choose x_i}\right]\left(\sum x_i\right)p^{\sum x_i-1}(1-p)^{10n-\sum x_i}-\left[\prod_{i=1}^n {10\choose x_i}\right]p^{\sum x_i}\left(10n-\sum x_i\right)(1-p)^{10n-\sum x_i-1} \end{align*}\]

The next step is to set the derivative equal to 0 and solve for \(p\). WOW….wait! Is there an easier way to approach this problem?

We will use a “trick” that often makes the differentiation a bit easier. Note that the natural logarithm is an increasing function of \(x\).

That is, if \(0<x_1<x_2\), then \(\ln x_1<\ln x_2\). This means that the value of \(\theta\) that maximizes \(L(\theta)\) will also maximize \(\ln L(\theta)\). Now, let’s take this back to our previous example.
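A quick numerical illustration of this fact, sketched in Python with a hypothetical Binomial sample: a grid search over \(p\) returns the same maximizer for \(L(p)\) and \(\ln L(p)\).

```python
import math

def L(p, xs=(3, 4, 2, 5, 4), size=10):
    # Bin(size, p) likelihood for a hypothetical sample
    out = 1.0
    for x in xs:
        out *= math.comb(size, x) * p**x * (1 - p)**(size - x)
    return out

grid = [i / 1000 for i in range(1, 1000)]             # p in (0, 1)
p_max_L = max(grid, key=L)                            # maximizes L(p)
p_max_lnL = max(grid, key=lambda p: math.log(L(p)))   # maximizes ln L(p)
print(p_max_L, p_max_lnL)  # the two maximizers agree
```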

Example 4.3 Again, we have \(X_i\overset{iid}{\sim} \text{Bin}(10, p)\) for \(i=1, 2, 3, 4, 5\) and the likelihood function:

\[\begin{align*} L(p)=\left[\prod_{i=1}^n {10\choose x_i}\right]p^{\sum_{i=1}^n x_i}(1-p)^{10n-\sum_{i=1}^n x_i} \end{align*}\]

Find the value of \(p\) that maximizes the likelihood.

Solution

Let’s consider the natural logarithm of \(L(p)\), called the log-likelihood function.

\[\begin{align*} \ell(p)=\ln L(p)=\ln \prod_{i=1}^n f(x_i|p) \end{align*}\]

By the rules of logarithms, we get

\[\begin{align*} \ell(p)=\ln \prod_{i=1}^n f(x_i|p)=\sum_{i=1}^n \ln f(x_i|p)=\sum_{i=1}^n\left[\ln {10\choose x_i} +x_i\ln p+(10-x_i)\ln (1-p)\right] \end{align*}\]

Now, take the derivative of \(\ell(p)\) with respect to \(p\).

\[\begin{align*} \frac{d}{dp}\ell(p)=\sum_{i=1}^n \left[0+\frac{x_i}{p}-\frac{(10-x_i)}{1-p}\right]=\frac{\sum x_i}{p}-\frac{\sum (10-x_i)}{1-p} \end{align*}\]

The next step is to set the derivative equal to 0 and solve for \(p\).

\[\begin{align*} & 0=\frac{\sum x_i}{p}-\frac{\sum (10-x_i)}{1-p}, \qquad \Rightarrow \frac{\sum (10-x_i)}{1-p}=\frac{\sum x_i}{p}\\ & \Rightarrow p\sum (10-x_i)=(1-p)\sum x_i\\ & \Rightarrow 10np-\sum x_ip=\sum x_i -\sum x_ip\\ & \Rightarrow 10np=\sum x_i\\ & \Rightarrow p=\frac{\sum x_i}{10n} \end{align*}\]

Finally, we check that the value found above is a maximum. We take the second derivative with respect to \(p\).

\[\begin{align*} \frac{d^2}{dp^2}\ell(p)=-\frac{\sum x_i}{p^2}-\frac{\sum (10-x_i)}{(1-p)^2} \end{align*}\]

For \(p=\frac{\sum x_i}{10n}\) to be a maximum, the value of the second derivative at this point needs to be less than 0.

\[\begin{align*} -\frac{\sum x_i}{\left(\frac{\sum x_i}{10n}\right)^2}-\frac{\sum (10-x_i)}{\left(1-\frac{\sum x_i}{10n}\right)^2} \end{align*}\]

Since all the \(x_i\)’s range between 0 and 10, the value of the second derivative is less than 0, therefore, it is a maximum. When we find the value that maximizes the likelihood, it is the maximum likelihood estimator (MLE) and is denoted with a ‘hat’ symbol. Therefore, the MLE is \(\hat{p}=\frac{\sum x_i}{10n}\).
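As a sanity check, the following Python sketch (with hypothetical counts) plugs \(\hat{p}=\frac{\sum x_i}{10n}\) into the first- and second-derivative formulas derived above:

```python
xs = [3, 4, 2, 5, 4]          # hypothetical Bin(10, p) counts
n, s = len(xs), sum(xs)       # n = 5, sum of x_i = 18
p_hat = s / (10 * n)          # MLE formula: 18/50 = 0.36

def score(p):
    # First derivative of the log-likelihood: sum(x)/p - sum(10 - x)/(1 - p)
    return s / p - (10 * n - s) / (1 - p)

def second_deriv(p):
    # Second derivative: -sum(x)/p^2 - sum(10 - x)/(1 - p)^2
    return -s / p**2 - (10 * n - s) / (1 - p)**2

# The score is (numerically) zero at p_hat and the second derivative is negative
print(p_hat, score(p_hat), second_deriv(p_hat))
```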

Note!
For almost all reasonable inference problems, \(\ell(\theta)\) has only one critical point, and that point is a maximum. Therefore, in this class, you do NOT have to use the second derivative test to check if it is a maximum. Finding the critical point is enough.

Now, we state the formal definition of the Maximum Likelihood Estimator (MLE).

Def. 4.2 (Maximum Likelihood Estimator) Let \(X_1, X_2, \ldots, X_n\) be a random sample from a distribution with unknown parameters, \(\theta_1, \theta_2, \ldots, \theta_m\). Let the probability density (or mass) function be denoted \(f(x_i|\theta_1, \theta_2, \ldots, \theta_m)\). Suppose that \((\theta_1, \theta_2, \ldots, \theta_m)\) is restricted to a given parameter space \(\Omega\). The likelihood function is defined as the joint probability distribution of the sample. That is: \[\begin{align*} L(\theta_1, \theta_2, \ldots, \theta_m)=\prod_{i=1}^n f(x_i|\theta_1, \theta_2, \ldots, \theta_m) \end{align*}\] If \(\left[u_1(x_1, x_2, \ldots, x_n), u_2(x_1, x_2, \ldots, x_n), \ldots, u_m(x_1, x_2, \ldots, x_n)\right]\) is the \(m\)-tuple that maximizes the likelihood function, then: \[\begin{align*} \hat{\theta}_i=u_i(X_1, X_2, \ldots, X_n), \qquad i=1, 2, \ldots, m \end{align*}\] is the maximum likelihood estimator of \(\theta_i\). The corresponding observed values of these statistics, namely, \[\begin{align*} \left[u_1(x_1, x_2, \ldots, x_n), u_2(x_1, x_2, \ldots, x_n), \ldots, u_m(x_1, x_2, \ldots, x_n)\right] \end{align*}\] are called the maximum likelihood estimates of \(\theta_i\), for \(i=1, 2, \ldots, m\).

4.2 One Parameter Case

In this section, we summarize the steps to find a maximum likelihood estimator in the one parameter case.

Steps to Find the MLE (One Parameter Case)

  1. Given a random sample \(X_1, X_2, \ldots, X_n\) from a distribution with unknown parameter, \(\theta\), find the likelihood function: \[\begin{align*} L(\theta)=\prod_{i=1}^n f(x_i|\theta) \end{align*}\]
  2. Find the log-likelihood function: \[\begin{align*} \ell(\theta)=\ln L(\theta)=\ln \prod_{i=1}^n f(x_i|\theta)=\sum_{i=1}^n \ln f(x_i|\theta) \end{align*}\]
  3. Take the first derivative of the log-likelihood function: \[\begin{align*} \ell^\prime(\theta)=\frac{d}{d\theta}\ell(\theta) \end{align*}\]
  4. Set \(\ell^\prime(\theta)=0\) and solve for \(\theta\). This will give us a critical point.
  5. The resulting critical value is the MLE of \(\theta\), denoted \(\hat{\theta}\).

Now that we have the formal definitions out of the way and our steps clearly defined, let us work on some examples.

Example 4.4 Let \(X_1, X_2, \ldots, X_n\) be a random sample from a Geometric distribution, with unknown parameter \(p\). Find the MLE.

Solution

The pmf of a Geometric random variable is \(f(x)=(1-p)^{x-1}p\) for \(x=1, 2, 3, \ldots\). Follow the steps above to find the MLE.

  1. The likelihood function is: \[\begin{align*} L(p)=\prod_{i=1}^n (1-p)^{x_i-1}p=(1-p)^{\sum_{i=1}^n (x_i-1)}p^n \end{align*}\]

  2. Find the log-likelihood function: \[\begin{align*} \ell(p)=\sum (x_i-1)\ln (1-p)+n\ln p \end{align*}\]

  3. Take the first derivative of \(\ell(p)\) with respect to \(p\). \[\begin{align*} \frac{d}{dp}\ell(p)=-\frac{\sum (x_i-1)}{1-p}+\frac{n}{p} \end{align*}\]

  4. Set the derivative equal to 0 and solve for \(p\).

    \[\begin{align*} & 0=-\frac{\sum (x_i-1)}{1-p}+\frac{n}{p}, \qquad \Rightarrow 0=\frac{-p\sum (x_i-1)+n(1-p)}{p(1-p)}=\frac{-p\sum x_i+np+n-np}{p(1-p)}\\ & \Rightarrow 0=-p\sum x_i+n, \qquad \Rightarrow -n=-p\sum x_i\\ & \Rightarrow p=\frac{n}{\sum x_i}\end{align*}\]

  5. The MLE for \(p\) is \(\hat{p}=\frac{n}{\sum x_i}\)
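The closed form \(\hat{p}=n/\sum x_i\) can be checked numerically. This Python sketch (with hypothetical observations) compares it against a grid search over the Geometric log-likelihood:

```python
import math

def loglik(p, xs):
    # Geometric log-likelihood: sum(x_i - 1)*ln(1 - p) + n*ln(p)
    return sum(x - 1 for x in xs) * math.log(1 - p) + len(xs) * math.log(p)

xs = [5, 2, 9, 1]                 # hypothetical observations
p_hat = len(xs) / sum(xs)         # closed form: 4/17
grid = [i / 10000 for i in range(1, 10000)]
p_grid = max(grid, key=lambda p: loglik(p, xs))
print(p_hat, p_grid)  # the grid maximizer matches the closed form
```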

Example 4.5 Let \(X_1, X_2, \ldots, X_n\) be a random sample from a Gamma distribution, with unknown parameter \(\theta\) and known parameter \(\alpha\). Find the MLE of \(\theta\).

Solution

The pdf of a Gamma random variable is \(f(x)=\frac{1}{\Gamma(\alpha)\theta^{\alpha}}x^{\alpha-1}e^{-x/\theta}\) for \(x>0\). Follow the steps to find the MLE.

  1. The likelihood function is:

    \[\begin{align*} L(\theta)=\prod_{i=1}^n \frac{1}{\Gamma(\alpha)\theta^\alpha}x_i^{\alpha-1}e^{-x_i/\theta}=\left(\frac{1}{\Gamma(\alpha)\theta^\alpha}\right)^n\left(\prod_{i=1}^n x_i\right)^{\alpha-1}e^{-\frac{\sum x_i}{\theta}} \end{align*}\]

  2. Find the log-likelihood function:

    \[\begin{align*} \ell(\theta)=-n\ln (\Gamma(\alpha))-n\alpha\ln (\theta)+(\alpha-1)\ln \left(\prod x_i\right)-\frac{\sum x_i}{\theta} \end{align*}\]

  3. Take the first derivative of \(\ell(\theta)\) with respect to \(\theta\). \[\begin{align*} \frac{d}{d\theta}\ell(\theta)=-\frac{n\alpha}{\theta}+\frac{\sum x_i}{\theta^2} \end{align*}\]

  4. Set the derivative equal to 0 and solve for \(\theta\). \[\begin{align*} & 0=-\frac{n\alpha}{\theta}+\frac{\sum x_i}{\theta^2}=\frac{-n\alpha\theta+\sum x_i}{\theta^2}\\ & \Rightarrow 0=-n\alpha\theta+\sum x_i, \qquad \Rightarrow n\alpha\theta=\sum x_i\\ & \Rightarrow \theta=\frac{\sum x_i}{n\alpha} \end{align*}\]

  5. The MLE for \(\theta\) is \(\hat{\theta}=\frac{\sum x_i}{n\alpha}\)
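The same grid-search check works here. A Python sketch with hypothetical data and known \(\alpha=3\), verifying \(\hat{\theta}=\frac{\sum x_i}{n\alpha}\):

```python
import math

def loglik(theta, xs, alpha):
    # Gamma log-likelihood with alpha known, matching the derivation above
    n = len(xs)
    return (-n * math.lgamma(alpha) - n * alpha * math.log(theta)
            + (alpha - 1) * sum(math.log(x) for x in xs)
            - sum(xs) / theta)

xs, alpha = [2.2, 6.5, 4.3], 3            # hypothetical data, known alpha
theta_hat = sum(xs) / (len(xs) * alpha)   # closed form: 13/9
grid = [i / 1000 for i in range(1, 10000)]  # theta in (0, 10)
theta_grid = max(grid, key=lambda t: loglik(t, xs, alpha))
print(theta_hat, theta_grid)
```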

In the last two examples, we were given a random sample and we found maximum likelihood estimators for our unknown parameter. Let’s now consider some examples where we find the maximum likelihood estimate when we are given a set of observations.

Example 4.6 Suppose we have three observations from a Geometric distribution with unknown parameter \(p\). The observations are \(x_1=12, x_2=19, x_3=10\). Find the maximum likelihood estimate of \(p\).

Solution

In Example 4.4, we found the MLE to be \(\hat{p}=\frac{n}{\sum x_i}\). Therefore, the maximum likelihood estimate is:

\[\begin{align*} \hat{p}=\frac{n}{\sum x_i}=\frac{3}{12+19+10}=\frac{3}{41} \end{align*}\]

Example 4.7 Suppose we have three observations from a Gamma distribution with \(\alpha=3\) and unknown parameter \(\theta\). The observations are \(x_1=3.4, x_2=8.1, x_3=5.5\). Find the maximum likelihood estimate of \(\theta\).

Solution

In Example 4.5, we found the MLE to be \(\hat{\theta}=\frac{\sum x_i}{n\alpha}\). Therefore, the maximum likelihood estimate is:

\[\begin{align*} \hat{\theta}=\frac{\sum x_i}{n\alpha}=\frac{3.4+8.1+5.5}{3(3)}=\frac{17}{9}\approx 1.89 \end{align*}\]

Example 4.8 Suppose we have the following observations from a Poisson distribution with unknown parameter \(\lambda\). \[\begin{align*} & 14, \qquad 21, \qquad 17\\ & 11, \qquad 22 \end{align*}\] Find the maximum likelihood estimate of \(\lambda\).

Solution

First, we need to have the MLE, using our usual steps. The likelihood function is:

\[\begin{align*} L(\lambda)=\prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}=\frac{e^{-n\lambda}\lambda^{\sum x_i}}{\prod x_i!} \end{align*}\]

The log-likelihood is:

\[\begin{align*} \ell(\lambda)=-n\lambda +\left(\sum x_i\right)\ln \lambda -\ln \left(\prod x_i!\right) \end{align*}\]

The first derivative is:

\[\begin{align*} \frac{d}{d\lambda}\ell(\lambda)=-n+\frac{\sum x_i}{\lambda} \end{align*}\]

Setting the derivative equal to 0 and solving for \(\lambda\), we get the following MLE:

\[\begin{align*} & 0=-n+\frac{\sum x_i}{\lambda}, \qquad \Rightarrow n=\frac{\sum x_i}{\lambda}\\ & \Rightarrow \hat{\lambda}=\frac{\sum x_i}{n}=\bar{x} \end{align*}\]

Therefore, the maximum likelihood estimate of \(\lambda\) is \(\hat{\lambda}=\frac{14+21+17+11+22}{5}=\frac{85}{5}=17\).
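A numerical check of \(\hat{\lambda}=\bar{x}\) on the observations above, sketched in Python:

```python
import math

def loglik(lam, xs):
    # Poisson log-likelihood: -n*lam + (sum x)*ln(lam) - sum ln(x_i!)
    return (-len(xs) * lam + sum(xs) * math.log(lam)
            - sum(math.lgamma(x + 1) for x in xs))

xs = [14, 21, 17, 11, 22]                 # observations from Example 4.8
lam_hat = sum(xs) / len(xs)               # sample mean: 85/5 = 17
grid = [i / 100 for i in range(1, 5000)]  # lambda in (0, 50)
lam_grid = max(grid, key=lambda l: loglik(l, xs))
print(lam_hat, lam_grid)
```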

4.3 Maximize the Likelihood Directly: Support Parameters

In this section, we will continue to explore maximum likelihood estimation. Consider the following example.

Example 4.9 Suppose we have a random sample of size \(n\) from a Uniform distribution over the interval \((0, a)\), where \(a\) is an unknown real number. What is the MLE of \(a\)?

Solution

Let’s approach this problem using our usual steps. The pdf of this Uniform is \(f(x)=\frac{1}{a}\), for \(0<x<a\). The likelihood is:

\[\begin{align*} L(a)=\prod_{i=1}^n \frac{1}{a}=\frac{1}{a^n}=a^{-n} \end{align*}\]

The log-likelihood function is:

\[\begin{align*} \ell(a)=-n\ln a \end{align*}\]

The derivative is:

\[\begin{align*} \frac{d}{da}\ell(a)=-\frac{n}{a} \end{align*}\]

But what happens when we set the derivative equal to 0 and solve for \(a\)? The equation \(-\frac{n}{a}=0\) has no solution for any finite \(a\)!

What is different about this example compared to the other MLE examples so far? The likelihood function \(a^{-n}\) is monotone (decreasing in \(a\)). Therefore, it will not have a critical point.

In the above example, we did not account for the constraint that \(a\) places on the data. In our previous examples from the last lesson, the unknown parameter is not in the support. For example, when \(X\) is a Normal random variable, the support of \(X\) is \(-\infty<x<\infty\). In the Gamma example, the support of \(X\) is \(x>0\), and so on.

In the case of the Uniform distribution, however, the unknown parameter \(a\) is in the support and we did not account for the constraint that \(a\) has on the data. So what do we do?

Let’s first define the indicator function.

Def. 4.3 (Indicator function) Suppose we have an event \(A\). The indicator function of \(A\) is defined as:

\[\begin{align*} \mathbf{1}_{A}=\begin{cases} 1, & \text{ if $A$ occurs}\\ 0, & \text{ if $A$ does not occur} \end{cases} \end{align*}\]

Let’s look at an example where we use the indicator function and find the maximum likelihood estimator.

Example 4.10 Let \(Y_1, \ldots, Y_n\) be a random sample from a Bernoulli distribution with parameter \(p\). The pdf of \(Y\) is:

\[\begin{align*} f(y)=\begin{cases} p^y(1-p)^{1-y} & y=0, 1\\ 0 & \text{otherwise} \end{cases} \end{align*}\]

There are other ways we could rewrite the pdf. Consider it as

\[\begin{align*} f(y)=\begin{cases} p & y=1\\ 1-p & y=0 \end{cases} \end{align*}\]

Still, another way we could rewrite the pdf is with the indicator function:

\[\begin{align*} f(y)=p^y(1-p)^{1-y}\mathbf{1}_{\{y\in \{0, 1\}\}} \end{align*}\]

where

\[\begin{align*} \mathbf{1}_{\{y \in \{0,1\}\}}=\begin{cases} 1 & \text{ if $y=0$ or $y=1$}\\ 0 & \text{otherwise} \end{cases} \end{align*}\]

Using this version of the pdf, find the maximum likelihood estimator of \(p\).

Solution

The likelihood is:

\[\begin{align*} L(p)=\prod_{i=1}^n p^{y_i}(1-p)^{1-y_i}\mathbf{1}_{\{y_i\in \{0, 1\}\}}=p^{\sum y_i}(1-p)^{n-\sum y_i}\prod \mathbf{1}_{\{y_i\in \{0, 1\}\}} \end{align*}\]

The log-likelihood function is:

\[\begin{align*} \ell(p)=\sum y_i\ln p+(n-\sum y_i)\ln (1-p)+\sum \ln \left(\mathbf{1}_{\{y_i\in \{0, 1\}\}}\right) \end{align*}\]

Next, take the derivative with respect to \(p\).

\[\begin{align*} \frac{d}{dp}\ell(p)=\frac{\sum y_i}{p}-\frac{n-\sum y_i}{1-p} \end{align*}\]

At this point, notice that the indicator function dropped out as it is a constant with respect to \(p\). Now, continue to find the MLE.

\[\begin{align*} & \frac{\sum y_i}{p}=\frac{n-\sum y_i}{1-p}, \qquad \Rightarrow \sum y_i-p\sum y_i=np-p\sum y_i\\ & \Rightarrow \sum y_i=np, \qquad \Rightarrow \hat{p}=\frac{\sum y_i}{n} \end{align*}\]

How does this help us in the Uniform \((0, a)\) case? In the example above, we know \(X\) is a Uniform random variable over the interval \((0, a)\). Therefore, instead of writing the pdf as

\[\begin{align*} f(x)=\begin{cases} \frac{1}{a}, & 0<x<a\\ 0, & \text{otherwise} \end{cases} \end{align*}\]

we can rewrite it using the indicator function.

\[\begin{align*} f(x)=\frac{1}{a}\times\mathbf{1}_{x\in (0, a)} \end{align*}\]

where

\[\begin{align*} \mathbf{1}_{x\in (0, a)}=\begin{cases} 1, & \text{ if $x\in (0, a)$}\\ 0, & \text{ if $x\notin (0, a)$} \end{cases} \end{align*}\]

Example 4.11 Find the maximum likelihood estimator for \(a\) for the Uniform example.

Solution

Again, we start with the likelihood

\[\begin{align*} L(a)=\prod_{i=1}^n \frac{1}{a}\mathbf{1}_{\{x_i\in (0, a)\}}=a^{-n}\prod_{i=1}^n \mathbf{1}_{\{x_i\in (0, a)\}} \end{align*}\]

Note that if \(x_i>a\), then the likelihood is 0. In other words, \(a\) cannot be less than any \(x_i\); that is, we need \(a\ge\max(x_i)\). Since \(a^{-n}\) is decreasing in \(a\), to maximize \(L(a)\) we must make \(a\) as small as this constraint allows. Therefore, the maximum likelihood estimator is

\[\begin{align*} \hat{a}=\max(X_i)=Y_n \end{align*}\]

Thus, the maximum likelihood estimator for \(a\) is the maximum order statistic, \(Y_n\), as the likelihood is maximized when \(a=\max(x_i)\).
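The support constraint is easy to see numerically. In this Python sketch (with a hypothetical sample), the likelihood is 0 whenever \(a<\max(x_i)\) and decreasing afterward, so a grid search lands on \(\max(x_i)\):

```python
def L(a, xs):
    # Uniform(0, a) likelihood with the support constraint as an indicator
    if max(xs) > a:          # some x_i falls outside (0, a): likelihood is 0
        return 0.0
    return a ** (-len(xs))

xs = [1.2, 0.7, 2.9, 1.5]                 # hypothetical sample; max = 2.9
grid = [i / 100 for i in range(1, 600)]   # a in (0, 6)
a_grid = max(grid, key=lambda a: L(a, xs))
print(a_grid)                             # the grid maximizer sits at max(x_i)
```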

Example 4.12 Suppose we have \(X_i\sim \text{Pareto}(m, \alpha=2)\) be an iid sample of size \(n\), and let the parameter \(\alpha=2\). The pdf of a Pareto is:

\[\begin{align*} f(x)=\frac{\alpha m^\alpha}{x^{\alpha+1}}=\frac{2m^2}{x^3}, \qquad x\ge m \end{align*}\]

Find the maximum likelihood estimator of \(m\).

Solution

The first step as always is to find the likelihood function:

\[\begin{align*} L(m)=\prod_{i=1}^n \frac{2m^2}{x_i^3}\mathbf{1}_{\{x_i\ge m\}}=\frac{2^nm^{2n}}{\prod x_i^3}\prod \mathbf{1}_{\{x_i\ge m\}} \end{align*}\]

To maximize the likelihood we consider the terms that include \(m\). Namely,

\[\begin{align*} m^{2n}\prod \mathbf{1}_{\{x_i\ge m\}} \end{align*}\]

The product of the indicator functions equals 1 only if \(x_1\ge m\), \(x_2\ge m\), …, \(x_n\ge m\). This is the same as stating that the minimum order statistic is at least \(m\): if \(Y_1\) (the first order statistic, or minimum) is at least \(m\), then all of the observations are at least \(m\). Since \(m^{2n}\) is increasing in \(m\), we make \(m\) as large as the constraint allows. Therefore, the maximum likelihood estimator of \(m\) is the first order statistic, \(\hat{m}=Y_1=\min(X_i)\).
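A numerical sketch in Python (with a hypothetical sample), showing the likelihood is increasing in \(m\) until it hits \(\min(x_i)\) and is 0 beyond it:

```python
def L(m, xs):
    # Pareto(m, alpha = 2) likelihood; the indicator forces m <= every x_i
    if min(xs) < m:
        return 0.0
    prod = 1.0
    for x in xs:
        prod *= 2 * m**2 / x**3
    return prod

xs = [4.1, 2.5, 7.0]                      # hypothetical sample; min = 2.5
grid = [i / 100 for i in range(1, 1000)]  # m in (0, 10)
m_grid = max(grid, key=lambda m: L(m, xs))
print(m_grid)                             # the grid maximizer sits at min(x_i)
```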

Note!
This approach is only needed when, ignoring the support, the likelihood is monotone (increasing or decreasing) in the parameter. When \(L(\theta)\) is monotone increasing, the MLE is often \(\hat{\theta}=\min(x_i)\). Likewise, when \(L(\theta)\) is monotone decreasing, the MLE is often \(\hat{\theta}=\max(x_i)\).

Example 4.13 Suppose we have a random sample \(X_1, \ldots, X_4\) from a Uniform distribution over \((3c, 10)\). Find the maximum likelihood estimate of \(c\) using the following observations.

\[\begin{align*} & x_1=-5, \qquad x_2=6\\ & x_3=-1, \qquad x_4=2 \end{align*}\]

Solution

The first step is to find the pdf of \(X_i\).

\[\begin{align*} f(x_i|c)=\frac{1}{10-3c}\mathbf{1}_{\{3c\le x_i\le10\}}, \qquad i=1, 2, 3, 4 \end{align*}\]

Next, we find the likelihood function.

\[\begin{align*} L(c)&=\prod_{i=1}^4 \frac{1}{10-3c}\mathbf{1}_{\{3c\le x_i\le10\}}\\ & =\left(\frac{1}{10-3c}\right)^4\mathbf{1}_{\{3c\le x_1\le 10, 3c\le x_2\le 10, 3c\le x_3\le 10, 3c\le x_4\le 10\}} \end{align*}\]

We can show that \(\left(\frac{1}{10-3c}\right)^4\) is a monotone increasing function of \(c\) (for \(3c<10\)). Now, let’s take a closer look at the indicator function.

\[\begin{align*} \mathbf{1}_{\{3c\le x_1\le 10, 3c\le x_2\le 10, 3c\le x_3\le 10, 3c\le x_4\le 10\}} \end{align*}\]

See if you can convince yourself that the indicator function can also be written as:

\[\begin{align*} \mathbf{1}_{\{3c\le x_1, 3c\le x_2, 3c\le x_3, 3c\le x_4\}}\times\mathbf{1}_{\{x_1\le 10, x_2\le 10, x_3\le 10, x_4\le 10\}} \end{align*}\]

The second term is not a function of \(c\). Therefore, we can ignore it. Let’s focus on the first term which is a function of \(c\).

\[\begin{align*} \mathbf{1}_{\{3c\le x_1, 3c\le x_2, 3c\le x_3, 3c\le x_4\}} \end{align*}\]

Similar to before, we can rewrite the indicator function as:

\[\begin{align*} \mathbf{1}_{\{3c\le \min(x_1, \ldots, x_4)\}}=\mathbf{1}_{\{c\le \frac{\min(x_1, \ldots, x_4)}{3}\}} \end{align*}\]

Since the likelihood is increasing in \(c\), we make \(c\) as large as the indicator allows. Therefore, the maximum likelihood estimator of \(c\) is

\[\begin{align*} \hat{c}=\frac{\min(x_i)}{3} \end{align*}\]

Using our data, the maximum likelihood estimate is \(\hat{c}=\frac{-5}{3}\).
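We can verify this with a small Python sketch over the data from Example 4.13: the likelihood is 0 once \(3c>\min(x_i)\), and increasing in \(c\) before that, so a fine grid search lands near \(\hat{c}=-5/3\).

```python
def L(c, xs):
    # Uniform(3c, 10) likelihood; zero unless 3c <= min(x_i) and max(x_i) <= 10
    if 3 * c > min(xs) or max(xs) > 10:
        return 0.0
    return (10 - 3 * c) ** (-len(xs))

xs = [-5, 6, -1, 2]                            # data from Example 4.13
grid = [i / 1000 for i in range(-3000, 3000)]  # c in (-3, 3)
c_grid = max(grid, key=lambda c: L(c, xs))
print(c_grid)                                  # close to -5/3
```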

4.4 Multiparameter Case

So far, we have learned the maximum likelihood estimator for distributions with one parameter. For example, \(\text{Unif}(0, \theta)\) and \(\text{Bin}(10, p)\). We also considered the Normal distribution where only one of the parameters was unknown. In this section, we will learn how to find the estimators in situations where there is more than one unknown parameter. For example, \(N(\mu, \sigma^2)\), \(\text{Gamma}(\alpha, \beta)\), etc…

In general, we can follow a similar approach to the multiparameter models as we do for the single parameter case.

Notation

We will be using vector notation in this section. Scalars are not underlined and vectors will be underlined. For example, scalars will look like

\[\begin{align*} & a\in \mathbb{R}\\ & \theta\in \mathbb{R} \end{align*}\]

When we reference vectors, the notation uses an underline. For example,

\[\begin{align*} & \underline{x}\in \mathbb{R}^n, \qquad \underline{x}=(x_1, x_2, \ldots, x_n)\\ & \underline{\theta}\in \mathbb{R}^p, \qquad \underline{\theta}=(\theta_1, \theta_2, \ldots, \theta_p) \end{align*}\]

When we have more than one parameter, we can write the pdf (or pmf) as:

\[\begin{align*} f(x|\underline{\theta}) \end{align*}\]

the likelihood as:

\[\begin{align*} L(\underline{\theta})=\prod_{i=1}^n f(x_i|\underline{\theta}) \end{align*}\]

The log-likelihood is:

\[\begin{align*} & \ell(\underline{\theta})=\sum_{i=1}^n \log f(x_i|\underline{\theta})\\ &=\ell(\theta_1, \ldots, \theta_p)=\sum_{i=1}^n \log f(x_i|\theta_1, \ldots, \theta_p) \end{align*}\]

The critical points occur when

\[\begin{align*} & \frac{d}{d\theta_1}\ell(\underline{\theta})=0\\ & \frac{d}{d\theta_2}\ell(\underline{\theta})=0\\ & \vdots\\ & \frac{d}{d\theta_p}\ell(\underline{\theta})=0 \end{align*}\]

Therefore, we have \(p\) equations and \(p\) unknowns. Solving this system of equations gives us \(\hat{\underline{\theta}}=(\hat{\theta}_1, \ldots, \hat{\theta}_p)\).

Let’s take a look at an example for finding the MLE of multiple parameters.

Example 4.14 Suppose we have a random sample of size \(n\) from a Normal distribution with unknown parameters \(\mu\) and \(\sigma^2\). Find the MLE of \(\underline{\theta}=(\mu, \sigma^2)\).

Solution

The likelihood function is:

\[\begin{align*} L(\underline{\theta})=\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2\sigma^2}(x_i-\mu)^2}=\left(2\pi\sigma^2\right)^{-n/2}e^{\sum -\frac{1}{2\sigma^2}(x_i-\mu)^2} \end{align*}\]

The log-likelihood is:

\[\begin{align*} \ell(\underline{\theta})=-\frac{n}{2}\ln (2\pi)-\frac{n}{2}\ln (\sigma^2)-\sum_{i=1}^n \frac{1}{2\sigma^2}(x_i-\mu)^2 \end{align*}\]

Let \(\nu=\sigma^2\). We have:

\[\begin{align*} \ell(\mu, \nu)=-\frac{n}{2}\ln (2\pi)-\frac{n}{2}\ln (\nu)-\sum_{i=1}^n \frac{1}{2\nu}(x_i-\mu)^2 \end{align*}\]

Now, take the derivatives with respect to \(\mu\) and \(\nu\).

\[\begin{align*} & \frac{d\ell(\mu, \nu)}{d\mu}=-2\frac{1}{2\nu}\sum (x_i-\mu)(-1)=\frac{\sum x_i-n\mu}{\nu} \\ & \frac{d\ell(\mu, \nu)}{d\nu}=-\frac{n}{2\nu}+\frac{\sum (x_i-\mu)^2}{2\nu^2}=\frac{-n\nu+\sum(x_i-\mu)^2}{2\nu^2} \end{align*}\]

Setting the derivatives equal to 0 and solving gives us:

\[\begin{align*} & 0=\frac{\sum x_i-n\mu}{\nu}, \qquad \Rightarrow \mu=\frac{\sum x_i}{n}=\bar{x}\\ & 0=\frac{-n\nu+\sum(x_i-\mu)^2}{2\nu^2}, \qquad \Rightarrow \nu=\frac{\sum (x_i-\mu)^2}{n} \end{align*}\]

We have two equations and two unknowns. Therefore, we can find the MLEs to be:

\[\begin{align*} & \hat{\mu}=\bar{x}\\ & \hat{\nu}=\hat{\sigma}^2=\frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n} \end{align*}\]
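A quick Python sketch (with a hypothetical sample) confirming that \((\hat{\mu}, \hat{\nu})=\left(\bar{x}, \frac{1}{n}\sum(x_i-\bar{x})^2\right)\) sits at a peak of the Normal log-likelihood:

```python
import math

def loglik(mu, v, xs):
    # Normal log-likelihood, writing v = sigma^2 as in the text
    n = len(xs)
    return (-n / 2 * math.log(2 * math.pi) - n / 2 * math.log(v)
            - sum((x - mu) ** 2 for x in xs) / (2 * v))

xs = [2.1, 3.4, 1.8, 2.9, 3.0]                        # hypothetical sample
mu_hat = sum(xs) / len(xs)                            # sample mean
v_hat = sum((x - mu_hat) ** 2 for x in xs) / len(xs)  # mean squared deviation
best = loglik(mu_hat, v_hat, xs)

# Nudging either parameter away from the MLE lowers the log-likelihood
print(best,
      loglik(mu_hat + 0.1, v_hat, xs),
      loglik(mu_hat, v_hat + 0.1, xs))
```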

Example 4.15 Let \(X_i\overset{iid}{\sim} \text{Laplace}(\mu, b)\), for \(i=1, \ldots, n\). Find the MLE of \(\underline{\theta}=(\mu, b)\).

Solution

The pdf of \(X\) is:

\[\begin{align*} f(x|\underline{\theta})=\frac{1}{2b}e^{-\frac{|x-\mu|}{b}}, \qquad -\infty<x<\infty \end{align*}\] The likelihood function is:

\[\begin{align*} L(\underline{\theta})=\prod_{i=1}^n \frac{1}{2b}e^{-\frac{|x_i-\mu|}{b}}=(2b)^{-n}e^{-\frac{\sum |x_i-\mu|}{b}} \end{align*}\]

The log-likelihood function is:

\[\begin{align*} \ell(\underline{\theta})=-n\ln 2-n\ln b-\frac{\sum |x_i-\mu|}{b} \end{align*}\]

Let’s start with taking the derivative with respect to \(b\).

\[\begin{align*} \frac{d\ell(\underline{\theta})}{db}=-\frac{n}{b}+\frac{\sum |x_i-\mu|}{b^2}=\frac{-nb+\sum |x_i-\mu|}{b^2} \end{align*}\]

Setting the derivative equal to 0, we get:

\[\begin{align*} 0=\frac{-nb+\sum |x_i-\mu|}{b^2}, \qquad \Rightarrow \hat{b}=\frac{\sum |x_i-\mu|}{n} \end{align*}\]

Next, let’s find the derivative of \(\ell(\underline{\theta})\) with respect to \(\mu\).

\[\begin{align*} & \frac{d\ell(\underline{\theta})}{d\mu}=-\sum_{i=1}^n \left(\frac{1}{b}\right)\frac{d}{d\mu} |x_i-\mu| \end{align*}\]

But what is \(\frac{d}{d\mu} |x_i-\mu|\)? Well, we know

\[\begin{align*} |x_i-\mu|=\begin{cases} x_i-\mu & \text{if }x_i\ge \mu\\ -x_i+\mu & \text{if }x_i\le \mu \end{cases} \end{align*}\]

Therefore, the derivative would be

\[\begin{align*} \frac{d}{d\mu}|x_i-\mu|=\begin{cases} -1 & \text{if }x_i> \mu\\ 1 & \text{if }x_i< \mu \end{cases} \end{align*}\] (The derivative does not exist at \(x_i=\mu\).)

We could rewrite the derivative as an indicator function in the following way:

\[\begin{align*} \frac{d}{d\mu}|x_i-\mu|=\mathbf{1}_{\{x_i<\mu\}}-\mathbf{1}_{\{x_i>\mu\}} \end{align*}\]

Putting this in the log-likelihood, we get:

\[\begin{align*} \frac{d}{d\mu}\ell(\underline{\theta})=-\frac{1}{b}\sum_{i=1}^n\left(\mathbf{1}_{\{x_i<\mu\}}-\mathbf{1}_{\{x_i>\mu\}}\right) \end{align*}\]

Setting it equal to 0, we get:

\[\begin{align*} & 0=\sum_{i=1}^n\left(\mathbf{1}_{\{x_i<\mu\}}-\mathbf{1}_{\{x_i>\mu\}}\right)=\sum \mathbf{1}_{\{x_i<\mu\}}-\sum \mathbf{1}_{\{x_i>\mu\}}\\ & \Rightarrow \sum \mathbf{1}_{\{x_i<\mu\}}=\sum \mathbf{1}_{\{x_i>\mu\}}\\ \end{align*}\]

Consider each of the terms in the last equation. \(\sum \mathbf{1}_{\{x_i<\mu\}}\) is the number of \(x_i\)’s that are less than \(\mu\). Similarly, \(\sum \mathbf{1}_{\{x_i>\mu\}}\) is the number of \(x_i\)’s that are greater than \(\mu\). The only way these two sums are equal is if \(\mu\) is the median. Thus, the MLE for \(\mu\) is \(\hat{\mu}=\text{median}(x_1, \ldots, x_n)\).

Therefore, with the two equations and the two unknowns, we get the MLE of \(\underline{\theta}\) to be:

\[\begin{align*} \hat{\underline{\theta}}=\left(\hat{\mu}, \hat{b}\right)=\left(\text{median}(x_1, \ldots, x_n),\ \frac{1}{n}\sum_{i=1}^n \left|x_i-\text{median}(x_1, \ldots, x_n)\right|\right) \end{align*}\]
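This result is easy to check numerically. A Python sketch with a hypothetical sample, using the sample median and the mean absolute deviation about it:

```python
import math
import statistics

def loglik(mu, b, xs):
    # Laplace log-likelihood: -n*ln(2) - n*ln(b) - sum|x_i - mu|/b
    n = len(xs)
    return -n * math.log(2) - n * math.log(b) - sum(abs(x - mu) for x in xs) / b

xs = [1.0, 4.0, 2.5, 3.1, 0.4]                      # hypothetical sample
mu_hat = statistics.median(xs)                      # MLE of mu
b_hat = sum(abs(x - mu_hat) for x in xs) / len(xs)  # MLE of b
print(mu_hat, b_hat)
```

Perturbing either \(\hat{\mu}\) or \(\hat{b}\) away from these values lowers the log-likelihood, consistent with the derivation above.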

One last example…

Example 4.16 Suppose \(X_1, X_2, \ldots, X_n\) is a random sample from a Gamma distribution with parameters \(\alpha\) and \(\theta\), both unknown. Suppose we want to find the MLE of \(\underline{\theta}=(\alpha, \theta)\). We start, as usual, with the likelihood function.

\[\begin{align*} L(\underline{\theta})=\prod_{i=1}^n \frac{1}{\Gamma(\alpha)\theta^\alpha}x_i^{\alpha-1}e^{-\frac{x_i}{\theta}}=\left(\Gamma(\alpha)\right)^{-n}\left(\theta\right)^{-n\alpha}\left(\prod_{i=1}^n x_i^{\alpha-1}\right)e^{-\frac{\sum x_i}{\theta}} \end{align*}\]

The log-likelihood function is:

\[\begin{align*} \ell(\underline{\theta})=-n\ln \Gamma(\alpha)-n\alpha \ln \theta+(\alpha-1)\sum \ln x_i-\frac{\sum x_i}{\theta} \end{align*}\]

The next step is to take the derivatives with respect to \(\alpha\) and \(\theta\).

\[\begin{align*} & \frac{d\ell(\underline{\theta})}{d\theta}=-\frac{n\alpha}{\theta}+\frac{\sum x_i}{\theta^2}=\frac{-n\alpha\theta+\sum x_i}{\theta^2}\\ & \frac{d\ell(\underline{\theta})}{d\alpha}=-\frac{n}{\Gamma(\alpha)}\left(\frac{d}{d\alpha}\Gamma(\alpha)\right)+\sum \ln x_i \end{align*}\]

The next step, of course, is to set the derivatives equal to zero and solve for the parameter, where we have two equations and two unknowns. Before we do that, however, let’s look at \(\frac{d}{d\alpha}\Gamma(\alpha)\). Recall that the Gamma function at \(a\) is

\[\begin{align*} \Gamma(a)=\int_0^\infty t^{a-1}e^{-t}dt \end{align*}\]

The derivative of the Gamma function is not analytically tractable! Therefore, there is no analytic solution to this system of equations. So what do we do? We find the MLEs numerically.

We will explore numerical maximum likelihood estimation later; to find the MLEs numerically, we need to use software. We now take a brief break from finding MLEs to introduce R.
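As a rough preview of what the software will do for us, a brute-force grid search can maximize the Gamma log-likelihood numerically. This Python sketch uses hypothetical data; the course will do this properly in R.

```python
import math

def loglik(alpha, theta, xs):
    # Gamma(alpha, theta) log-likelihood from the derivation above
    n = len(xs)
    return (-n * math.lgamma(alpha) - n * alpha * math.log(theta)
            + (alpha - 1) * sum(math.log(x) for x in xs)
            - sum(xs) / theta)

xs = [1.2, 8.1, 3.4, 0.7, 5.5]   # hypothetical data
# Coarse grid over alpha and theta in (0, 10], step 0.1
best = max(
    ((a / 10, t / 10) for a in range(1, 101) for t in range(1, 101)),
    key=lambda at: loglik(at[0], at[1], xs),
)
print(best)   # (alpha_hat, theta_hat) on the grid
```

In practice one would use an optimizer (for example, Newton's method, or R's `optim`) rather than a grid, but the idea is the same: evaluate \(\ell(\alpha,\theta)\) and search for its maximum.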

4.5 Summary

In this lesson, we were introduced to the method of maximum likelihood estimation (MLE) as a foundational tool in statistical inference. We learned how to construct the likelihood and log-likelihood functions based on a specified probability model, and how to analytically solve for the parameter value that maximizes the likelihood. We practiced deriving MLEs for both discrete and continuous distributions and explored cases where support restrictions require careful handling through indicator functions.

Key Takeaways:

  • The likelihood function \(L(\theta_1, \theta_2, \ldots, \theta_m)\) is the joint probability distribution of the random sample.
  • The maximum likelihood estimator (MLE) is the value of \(\theta\) that maximizes the likelihood \(L(\theta)\).
  • It is often easier to maximize \(\ell(\theta)=\ln L(\theta)\).