Definition

Maximum Likelihood Estimation (MLE) is a method of estimating the parameters of a statistical model given observations. In MLE, the parameters are chosen to maximize the likelihood that the assumed model produced the observed data.

MLE is a special case of maximum a posteriori estimation (MAP) that assumes a uniform prior distribution over the parameters.

To implement MLE:

  1. assume a model, also known as the data generating process
  2. derive the likelihood function for our data under the assumed model

Once the likelihood function is derived, MLE is nothing but a simple optimization problem.
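As a minimal sketch of that optimization view (assuming a Python/SciPy workflow, which these notes do not prescribe, and a simple normal model as the assumed data generating process):

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    # hypothetical observed data, assumed to come from N(mu, sigma^2)
    y = np.array([1.2, 0.8, 1.9, 1.4, 1.1])

    # step 2: the (negative) log-likelihood implied by the assumed model
    def neg_log_likelihood(params):
        mu, log_sigma = params            # optimize log(sigma) so sigma stays positive
        return -np.sum(norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)))

    # the optimization problem: minimize the negative log-likelihood
    result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
    mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])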

MLE advantages and disadvantages:
(+) if the model is correctly specified, the maximum likelihood estimator is asymptotically the most efficient estimator
(+) it provides a consistent and flexible approach, which makes it suitable for a wide variety of applications, including cases where the assumptions of other methods are violated
(+) its estimates are asymptotically unbiased, i.e. the bias shrinks as the sample grows
(-) it relies on assuming a model and deriving the likelihood function, which is not always easy
(-) like other optimization problems, MLE can be sensitive to the choice of starting values
(-) depending on the complexity of the likelihood function, the numerical optimization can be computationally expensive
(-) estimates can be biased in small samples

Efficiency - an efficient estimator is one that has a small variance or mean squared error; among unbiased estimators, the most efficient one attains the Cramér-Rao lower bound.

Likelihood

the distinction between probability and likelihood is:
probability attaches to possible results;
likelihood attaches to hypotheses;
a probability density function expresses the probability of observing our data given the underlying distribution parameters; it assumes the parameters are known;
the likelihood function expresses how plausible different parameter values are given the observed data; it assumes the parameters are unknown

probability: p(y|θ), p(y1,y2,y3,…|θ)
p(y1,y2,y3,…|θ)=∏p(yi|θ)=p(y1|θ)p(y2|θ)p(y3|θ)p(y4|θ)…
likelihood: L(θ|y), L(θ|y1,y2,y3,…)
L(θ|y1,y2,y3,…)=∏L(θ|yi)=L(θ|y1)L(θ|y2)L(θ|y3)L(θ|y4)…
log likelihood: logL(θ|y1,y2,y3,…)=log∏L(θ|yi)=ΣlogL(θ|yi)

mathematically:

L(θ|y1,y2,y3,...) = p(y1,y2,y3,...|θ) = p(y1|θ)p(y2|θ)p(y3|θ)... = ∏ p(yi|θ)   
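A quick numeric check of this identity (a sketch assuming SciPy, using the Poisson model and data of the example that follows):

    import numpy as np
    from scipy.stats import poisson

    y = np.array([5, 0, 1, 1, 0, 3, 2, 3, 4, 1])
    theta = 2.0

    joint = np.prod(poisson.pmf(y, mu=theta))        # ∏ p(yi|θ)
    log_joint = np.sum(poisson.logpmf(y, mu=theta))  # Σ log p(yi|θ)

    assert np.isclose(np.log(joint), log_joint)      # log turns the product into a sum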

e.g. we have one series y with 10 independent observations: 5, 0, 1, 1, 0, 3, 2, 3, 4, 1

step 1 in MLE: assume a probability distribution for the data

a probability density (or, for discrete data like ours, probability mass) function measures the probability of observing the data given a set of underlying model parameters

so assume the data has an underlying Poisson distribution

          e^(-θ)θ^(yi)
f(yi|θ) = ------------
              yi!

because the observations in our sample are independent, the probability density of our observed sample can be found by taking the product of the probabilities of the individual observations

                    10  e^(-θ)θ^(yi)                       e^(-10θ)θ^(20)
f(y1,y2,...y10|θ) =  ∏  -----------  = L(θ|y1,y2,...y10) = --------------
                    i=1     yi!                              207360

which is the likelihood function with the observed data plugged in. maximizing this function gives the estimate of θ

in practice the joint product can be difficult to work with; since log is a monotonic transformation, maximizing the log-likelihood gives the same answer, and the function simplifies to

log L(θ|y1,y2,...y10) = -nθ + (Σyi)logθ - Σlog(yi!) = -10θ + 20logθ - log207360

setting the derivative to zero, -10 + 20/θ = 0, gives the maximizer θ = 2, which is just the sample mean Σyi/n

In short, we have an objective

max ∏ p(yi|θ)  -->  max log ∏ p(yi|θ) --> max Σ log p(yi|θ)
 θ                   θ                     θ
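A minimal sketch of this objective for the Poisson example (assuming SciPy, which these notes do not mention) confirms the analytic answer θ = 2:

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.stats import poisson

    y = np.array([5, 0, 1, 1, 0, 3, 2, 3, 4, 1])

    # negative log-likelihood: minimizing this maximizes Σ log p(yi|θ)
    def nll(theta):
        return -np.sum(poisson.logpmf(y, mu=theta))

    result = minimize_scalar(nll, bounds=(1e-6, 10), method="bounded")
    print(result.x)   # ≈ 2.0, the sample mean Σyi/n = 20/10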

MLE and Linear Regression

y = βx + ε

Assume that the model residuals are independently and identically normally distributed:

ε = y - βx ~ N(0, σ^2)

Based on this assumption, the log-likelihood function for the unknown parameter vector θ = {β, σ^2} can be derived.

The probability density of a normal distribution N(μ, σ^2) is

            1        -(x-μ)^2
f(x) = ---------- e^[--------]
        √(2πσ^2)      2σ^2

substituting the residual εi = yi - βxi (with μ = 0) and using the independence of the n observations:

             n      1        -(yi-βxi)^2
L(θ|y,x) =   ∏  ---------- e^[----------]
            i=1  √(2πσ^2)       2σ^2

                    n      1        -(yi-βxi)^2
log L(θ|y,x) = log  ∏  ---------- e^[----------]
                   i=1  √(2πσ^2)       2σ^2

                n         1        -(yi-βxi)^2
             =  Σ  log[---------- e^[----------]]
               i=1      √(2πσ^2)       2σ^2

                          1       n  (yi-βxi)^2
             = n log ---------- - Σ  ----------
                      √(2πσ^2)   i=1    2σ^2

the estimates of θ = {β, σ^2} are the values that maximize this log-likelihood
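As a minimal sketch (assuming SciPy and simulated data, since the notes provide none), maximizing this log-likelihood numerically recovers β; with normal errors the result coincides with the no-intercept OLS solution Σxiyi/Σxi^2:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    # simulated data (an assumption for illustration): true β = 3, σ = 0.5
    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 3.0 * x + rng.normal(scale=0.5, size=100)

    def neg_log_likelihood(params):
        beta, log_sigma = params          # optimize log(sigma) so sigma stays positive
        sigma = np.exp(log_sigma)
        return -np.sum(norm.logpdf(y - beta * x, scale=sigma))

    result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
    beta_hat, sigma_hat = result.x[0], np.exp(result.x[1])

    # sanity check: with normal errors, the MLE of β equals OLS (no intercept)
    beta_ols = np.sum(x * y) / np.sum(x * x)

Note that the MLE of σ^2 divides the sum of squared residuals by n rather than n-1, an instance of the small-sample bias listed among the disadvantages above.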
