3.4 Maximum Likelihood Estimation
Definition 3.7 Suppose \(X_1, \dots, X_n \sim f_\theta\). The likelihood function is defined by \[ \mathcal{L}_n(\theta) = \prod_{i = 1}^n f (X_i; \theta) \,. \] The log-likelihood function is defined by \[ \ell_n (\theta) = \ln \mathcal{L}_n (\theta) \,. \]
The maximum likelihood estimator (MLE), denoted by \(\hat \theta_n\), is the value of \(\theta\) that maximizes \(\mathcal{L}_n(\theta)\).
Notation. Another common notation for the likelihood function is \[ L(X|\theta) = \mathcal{L}_n(\theta).\]
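To make the definition concrete, here is a small numerical sketch (the \(\mathrm{Bernoulli}(0.3)\) sample, the sample size, and the grid search are all illustrative choices, not part of the notes): maximizing \(\ell_n(p)\) over a grid recovers the sample mean, previewing Example 3.4 below.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=200)  # Bernoulli(p = 0.3) sample

def log_likelihood(p, x):
    # ell_n(p) = sum_i log f(X_i; p) for the Bernoulli model
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Maximize ell_n over a grid of candidate p values
grid = np.linspace(0.01, 0.99, 981)
p_hat = grid[np.argmax([log_likelihood(p, x) for p in grid])]
print(abs(p_hat - x.mean()) < 0.01)  # the grid maximizer agrees with the sample mean
```

Since \(\ln\) is increasing, \(\ell_n\) and \(\mathcal{L}_n\) have the same maximizer, which is why one works with the log-likelihood throughout.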
Example 3.4 Let \(X_1, \dots, X_n\) be a sample from \(\mathrm{Bernoulli}(p)\). Use MLE to find an estimator for \(p\).
Example 3.5 Let \(X_1, \dots, X_n\) be a sample from \(N(\theta, 1)\). Use MLE to find an estimator for \(\theta\).
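A numerical check of this example (a sketch only; the true value \(\theta = 2\) and the sample size are arbitrary choices): the score \(\ell_n'(\theta) = \sum_i (X_i - \theta)\) vanishes exactly at the sample mean, so \(\hat\theta_n = \bar X_n\).

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=500)  # sample from N(theta = 2, 1)

def ell(theta, x):
    # log-likelihood of N(theta, 1), dropping the constant -n/2 * log(2*pi)
    return -0.5 * np.sum((x - theta) ** 2)

# ell'(theta) = sum_i (x_i - theta) = 0 is solved by the sample mean,
# so theta_hat = x.mean(); verify it beats nearby candidates.
theta_hat = x.mean()
for t in (theta_hat - 0.1, theta_hat + 0.1):
    assert ell(theta_hat, x) > ell(t, x)
print(theta_hat)  # close to the true theta = 2
```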
Exercise 3.6 Let \(X_1, \dots, X_n\) be a sample from \(\mathrm{Uniform}([0,\theta])\), where \(\theta > 0\).
Find the MLE for \(\theta\).
Find an estimator by the method of moments.
Compute the mean and the variance of the two estimators above.
Can you find the MLE if we instead consider \(\mathrm{Uniform}((0,\theta))\)?
Theorem 3.4 Let \(\tau = g(\theta)\) be a bijective function of \(\theta\). Suppose that \(\hat \theta_n\) is the MLE of \(\theta\). Then \(\hat \tau_n = g(\hat \theta_n)\) is the MLE of \(\tau\).
3.4.1 Consistency
Example 3.6 (Inconsistency of MLE) For \(i = 1, \dots, n\), let \(Y_{i,1}, Y_{i,2} \sim N(\mu_i, \sigma^2)\), where each pair has its own mean \(\mu_i\) but all pairs share the variance \(\sigma^2\). Our goal is to find the MLE for \(\sigma^2\), which turns out to be \[\hat \sigma^2 = \frac{1}{4n} \sum_{i=1}^n (Y_{i,1} - Y_{i,2})^2.\] By the law of large numbers, this converges to \[\mathbb{E}(\hat \sigma^2) = \sigma^2/2,\] which means that the MLE is not consistent. (The trouble is that the number of nuisance parameters \(\mu_1, \dots, \mu_n\) grows with the sample size.)
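This failure is easy to see in simulation (a sketch; the parameter values and the spread of the nuisance means are arbitrary choices):

```python
import numpy as np

# Paired-normal model: each pair (Y_{i,1}, Y_{i,2}) shares its own mean mu_i,
# and all pairs share the variance sigma^2 (a Neyman-Scott-type setup).
rng = np.random.default_rng(2)
n, sigma2 = 100_000, 4.0
mu = rng.normal(0.0, 10.0, size=n)          # nuisance means, one per pair
y1 = rng.normal(mu, np.sqrt(sigma2))
y2 = rng.normal(mu, np.sqrt(sigma2))

# MLE of sigma^2 from the example above
sigma2_mle = np.sum((y1 - y2) ** 2) / (4 * n)
print(sigma2_mle)  # concentrates near sigma^2 / 2 = 2, not sigma^2 = 4
```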
To discuss the consistency of the MLE, we define the Kullback-Leibler distance between two pdfs \(f\) and \(g\):
\[ D(f,g) = \int f(x) \ln \left( \frac{f(x)}{g(x)} \right) \, dx.\]
Abusing notation, we will write \(D(\theta, \varphi)\) to mean \(D(f(x;\theta), f(x;\varphi))\).
We say that a model \(\mathcal{F}\) is identifiable if \(\theta \not= \varphi\) implies \(D(\theta, \varphi) > 0\).
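For unit-variance normals the distance has the closed form \(D(N(m_1,1), N(m_2,1)) = (m_1 - m_2)^2/2\), so \(D(\theta, \varphi) > 0\) whenever \(\theta \ne \varphi\) and the model is identifiable. This can be verified by numerical integration (a sketch; truncating the integral to a window around \(m_1\) is a practical approximation of the real line):

```python
import numpy as np
from scipy.integrate import quad

def kl_normal(m1, m2):
    # D(f, g) for f = N(m1, 1), g = N(m2, 1), by numerical integration
    f = lambda x: np.exp(-0.5 * (x - m1) ** 2) / np.sqrt(2 * np.pi)
    g = lambda x: np.exp(-0.5 * (x - m2) ** 2) / np.sqrt(2 * np.pi)
    integrand = lambda x: f(x) * np.log(f(x) / g(x))
    val, _ = quad(integrand, m1 - 10, m1 + 10)  # tails are negligible beyond this
    return val

print(kl_normal(0.0, 1.0))  # matches the closed form (0 - 1)^2 / 2 = 0.5
```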
Theorem 3.5 Let \(\theta_{\star}\) denote the true value of \(\theta\). Define \[ M_n(\theta)=\frac{1}{n} \sum_i \log \frac{f\left(X_i ; \theta\right)}{f\left(X_i ; \theta_{\star}\right)} \] and \(M(\theta)=-D\left(\theta_{\star}, \theta\right)\). Suppose that \[ \sup _{\theta \in \Theta}\left|M_n(\theta)-M(\theta)\right| \to 0 \] in probability and that, for every \(\epsilon>0\), \[ \sup _{\theta:\left|\theta-\theta_{\star}\right| \geq \epsilon} M(\theta)<M\left(\theta_{\star}\right) . \]
Let \(\widehat{\theta}_n\) denote the MLE. Then \(\widehat{\theta}_n \to \theta_{\star}\) in probability.
Exercise 3.7 Let \(X_1, \ldots, X_n\) be a random sample from a distribution with density: \[ p(x; \theta) = \theta x^{-2}, \quad 0 < \theta \leq x < \infty. \]
Find the MLE for \(\theta\).
Find the Method of Moments estimator for \(\theta\).
Exercise 3.8 Let \(X_1, \ldots, X_n \sim \text{Poisson}(\lambda)\).
Find the method of moments estimator, the maximum likelihood estimator, and the Fisher information \(I(\lambda)\).
Use the fact that the mean and variance of the Poisson distribution are both \(\lambda\) to propose two unbiased estimators of \(\lambda\).
Show that one of these estimators has a larger variance than the other.
The conditions listed in the above theorem are not easy to check. Hogg, McKean, and Craig have a better theorem (this is a good theorem to read).
Theorem 3.6 Assume that
\(\theta \not = \theta' \implies f_\theta \not = f_{\theta'}\)
\(f_\theta\) has common support for all \(\theta\)
\(\theta^*\) is an interior point in \(\Omega\)
If \(f_\theta(x)\) is differentiable with respect to \(\theta\), then the likelihood equation \[ \frac{\partial}{\partial \theta} \ell_n(\theta) = 0 \] has a solution \(\hat \theta_n\) such that \[\hat \theta_n \to \theta^*\] in probability.
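In practice the likelihood equation usually has no closed-form solution, and the root is found numerically. Here is a sketch for the Cauchy location model \(f(x;\theta) = \frac{1}{\pi(1 + (x-\theta)^2)}\) (my choice of example; bracketing the root around the sample median is a heuristic, not part of the theorem):

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(6)
theta_star = 1.0
x = rng.standard_cauchy(2000) + theta_star  # Cauchy sample centered at theta_star

def score(theta):
    # d/dtheta of ell_n(theta) for the Cauchy location model:
    # sum_i 2 (x_i - theta) / (1 + (x_i - theta)^2)
    d = x - theta
    return np.sum(2 * d / (1 + d ** 2))

# Solve the likelihood equation score(theta) = 0 near the sample median
theta_hat = brentq(score, np.median(x) - 1, np.median(x) + 1)
print(theta_hat)  # close to theta_star = 1.0
```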
3.4.2 Asymptotic normality
Definition 3.8 Given a random variable \(X\) with density \(f(x;\theta)\), the score function is defined to be \[ s (X;\theta) = \frac{\partial \log f(X; \theta)}{\partial \theta} .\]
The Fisher information is defined to be \[ I_n(\theta) = \mathbb{V}_\theta \left( \sum_{i=1}^n s(X_i; \theta) \right) = \sum_{i=1}^n \mathbb{V}_\theta \left( s(X_i; \theta) \right).\]
Theorem 3.7 \(I_n (\theta) = n I(\theta)\). Furthermore, \[ I(\theta) = -\mathbb{E}_\theta \left( \frac{\partial^2 \log(f(X;\theta))}{\partial \theta^2} \right).\]
The significance of this is that you can think of the Fisher information as the curvature (second derivative) on the “manifold” of parameters; thus the variance of the score has a geometric interpretation.
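Both expressions for \(I(\theta)\) — the variance of the score and \(-\mathbb{E}_\theta\!\left[\partial_\theta^2 \log f\right]\) (note the minus sign) — can be checked by Monte Carlo on a Bernoulli model, where the closed form is \(I(p) = \frac{1}{p(1-p)}\) (a sketch; the value \(p = 0.3\) and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 0.3
x = rng.binomial(1, p, size=1_000_000).astype(float)

score = x / p - (1 - x) / (1 - p)           # d/dp log f(X; p)
second = -x / p**2 - (1 - x) / (1 - p)**2   # d^2/dp^2 log f(X; p)

print(score.var())     # ~ 1 / (p (1 - p)) = 4.76..., the variance-of-score form
print(-second.mean())  # same value, via the negative expected second derivative
```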
Theorem 3.8 Let \(\mathrm{se} = \sqrt{\mathbb{V}(\hat \theta_n)}\). Under some regularity conditions, \[\frac{\hat\theta_n - \theta}{\mathrm{se}} \to Z\] in distribution, where \(Z \sim N(0,1)\).
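Asymptotic normality can be watched in action. For \(\mathrm{Poisson}(\lambda)\) the MLE is the sample mean and \(I(\lambda) = 1/\lambda\), so \(\sqrt{n I(\lambda)}\,(\hat\lambda_n - \lambda)\) should be approximately standard normal (a simulation sketch; all parameter values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
lam, n, reps = 3.0, 400, 20_000

# For each of `reps` repetitions, draw a Poisson(lam) sample of size n
# and standardize the MLE (the sample mean) by sqrt(n I(lam)) = sqrt(n / lam).
samples = rng.poisson(lam, size=(reps, n))
lam_hat = samples.mean(axis=1)
z = np.sqrt(n / lam) * (lam_hat - lam)

print(z.mean(), z.std())  # approximately 0 and 1, as Theorem 3.8 predicts
```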
3.4.3 Efficiency
As \(n\) gets large, the MLE is the most efficient estimator, in an asymptotic sense made precise below.
Theorem 3.9 (Cramer-Rao Inequality) Let \(X_1, \dots, X_n\) be a sample with density \(f(x;\theta)\). Suppose \(\theta'_n\) is an unbiased estimator of \(\theta\). Then, under regularity conditions similar to those for asymptotic normality, \[ \mathbb{V}(\theta'_n) \geq \frac{1}{n I(\theta)}.\]
Note that, in the proof of asymptotic normality, we have that as \(n\) gets large, the MLE \(\hat \theta_n\), if it obeys the required regularity conditions, satisfies \(\mathbb{V}( \hat \theta_n )\sim \frac{1}{n I(\theta)}\).
Since \(\theta\) is a constant, \[\mathbb{V}( \hat \theta_n - \theta ) = \mathbb{V}( \hat \theta_n )\sim \frac{1}{n I(\theta)}.\]
Corollary 3.1 Let \(X_1, \dots, X_n\) be a sample with density \(f(x;\theta)\). Suppose \(\theta'_n\) is an unbiased estimator of \(\theta\) and \(\hat \theta_n\) is the MLE of \(\theta\). Then, under the regularity conditions for asymptotic normality, we have \[ \lim_{n\to \infty} n \mathbb{V}(\theta'_n) \geq \lim_{n\to \infty} n\mathbb{V}(\hat \theta_n - \theta) .\]
Note that this doesn’t say that the MLE (if consistent) is the most efficient for any finite \(n\). In fact, this is a difficult question, and one can only verify it for some specific estimators.
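One can still see the asymptotic comparison numerically. For \(N(\theta, 1)\) we have \(I(\theta) = 1\), so the Cramer-Rao bound is \(1/n\); the sample mean (the MLE) attains it, while the sample median, also unbiased for \(\theta\), has asymptotic variance \(\pi/(2n)\) and is therefore less efficient (a simulation sketch; comparing against the median is my illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 0.0, 200, 50_000

# Draw many N(theta, 1) samples and compare two unbiased estimators of theta:
# the sample mean (the MLE) and the sample median.
samples = rng.normal(theta, 1.0, size=(reps, n))
var_mean = samples.mean(axis=1).var()
var_median = np.median(samples, axis=1).var()

print(n * var_mean)    # ~1, i.e. n times the Cramer-Rao bound 1/n
print(n * var_median)  # ~pi/2 = 1.57..., strictly larger
```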
Exercise 3.9 Show that for the Poisson distribution, the MLE \(\hat \theta_n\) is the most efficient for every \(n\), compared to any other unbiased estimator \(\theta'_n\), i.e., \[\mathbb{V}(\theta'_n) \geq \mathbb{V}(\hat\theta_n - \theta)\] for every \(n\in \mathbb{N}\).
Exercise 3.10 (Rice, 8.10.6) Let \(X \sim \mathrm{Binomial}(n,p)\).
Find the MLE of \(p\).
Show that the MLE from part (a) attains the Cramer-Rao lower bound.