Definition

Maximum Likelihood Estimation (MLE) is a method of estimating the parameters of a statistical model given observations. In MLE, the parameters are chosen to maximize the likelihood that the assumed model produced the observed data.

MLE is a special case of maximum a posteriori estimation (MAP) that assumes a uniform prior distribution over the parameters.

To implement MLE:

  1. assume a model, also known as the data generating process
  2. derive the likelihood function for our data under the assumed model

Once the likelihood function is derived, MLE is nothing but a simple optimization problem.
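As a minimal sketch of that optimization view (assuming a Python/SciPy workflow, which these notes do not prescribe, and a simple normal model as the assumed data generating process):

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    # hypothetical observed data, assumed to come from N(mu, sigma^2)
    y = np.array([1.2, 0.8, 1.9, 1.4, 1.1])

    # step 2: the (negative) log-likelihood implied by the assumed model
    def neg_log_likelihood(params):
        mu, log_sigma = params            # optimize log(sigma) so sigma stays positive
        return -np.sum(norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)))

    # the optimization problem: minimize the negative log-likelihood
    result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
    mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])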

MLE advantages and disadvantages:
(+) if the model is correctly specified, the maximum likelihood estimator is asymptotically the most efficient estimator
(+) it provides a consistent and flexible approach, which makes it suitable for a wide variety of applications, including cases where the assumptions of other methods are violated
(+) its estimates are asymptotically unbiased, i.e. the bias shrinks as the sample grows
(-) it relies on assuming a model and deriving the likelihood function, which is not always easy
(-) like other optimization problems, MLE can be sensitive to the choice of starting values
(-) depending on the complexity of the likelihood function, the numerical optimization can be computationally expensive
(-) estimates can be biased in small samples

Efficiency - an efficient estimator is one that has a small variance or mean squared error; among unbiased estimators, the most efficient one attains the Cramér-Rao lower bound.

Likelihood

the distinction between probability and likelihood is:
probability attaches to possible results;
likelihood attaches to hypotheses;
a probability density function expresses the probability of observing our data given the underlying distribution parameters; it assumes the parameters are known;
the likelihood function expresses how plausible different parameter values are given the observed data; it assumes the parameters are unknown

probability: p(y|θ), p(y1,y2,y3,…|θ)
p(y1,y2,y3,…|θ)=∏p(yi|θ)=p(y1|θ)p(y2|θ)p(y3|θ)p(y4|θ)…
likelihood: L(θ|y), L(θ|y1,y2,y3,…)
L(θ|y1,y2,y3,…)=∏L(θ|yi)=L(θ|y1)L(θ|y2)L(θ|y3)L(θ|y4)…
log likelihood: logL(θ|y1,y2,y3,…)=log∏L(θ|yi)=ΣlogL(θ|yi)

mathematically:

L(θ|y1,y2,y3,...) = p(y1,y2,y3,...|θ) = p(y1|θ)p(y2|θ)p(y3|θ)... = ∏ p(yi|θ)   
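A quick numeric check of this identity (a sketch assuming SciPy, using the Poisson model and data of the example that follows):

    import numpy as np
    from scipy.stats import poisson

    y = np.array([5, 0, 1, 1, 0, 3, 2, 3, 4, 1])
    theta = 2.0

    joint = np.prod(poisson.pmf(y, mu=theta))        # ∏ p(yi|θ)
    log_joint = np.sum(poisson.logpmf(y, mu=theta))  # Σ log p(yi|θ)

    assert np.isclose(np.log(joint), log_joint)      # log turns the product into a sum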

e.g. we have one series y with 10 independent observations: 5, 0, 1, 1, 0, 3, 2, 3, 4, 1

step 1 in MLE: assume a probability distribution for the data

a probability density (or, for discrete data like ours, probability mass) function measures the probability of observing the data given a set of underlying model parameters

so assume the data has an underlying Poisson distribution

          e^(-θ)θ^(yi)
f(yi|θ) = ------------
              yi!

because the observations in our sample are independent, the probability density of our observed sample can be found by taking the product of the probabilities of the individual observations

                    10  e^(-θ)θ^(yi)                       e^(-10θ)θ^(20)
f(y1,y2,...y10|θ) =  ∏  -----------  = L(θ|y1,y2,...y10) = --------------
                    i=1     yi!                              207360

which is the likelihood function with the observed data plugged in. maximizing this function gives the estimate of θ

in practice the joint product can be difficult to work with; since log is a monotonic transformation, maximizing the log-likelihood gives the same answer, and the function simplifies to

log L(θ|y1,y2,...y10) = -nθ + (Σyi)logθ - Σlog(yi!) = -10θ + 20logθ - log207360

setting the derivative to zero, -10 + 20/θ = 0, gives the maximizer θ = 2, which is just the sample mean Σyi/n

In short, we have an objective

max ∏ p(yi|θ)  -->  max log ∏ p(yi|θ) --> max Σ log p(yi|θ)
 θ                   θ                     θ
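A minimal sketch of this objective for the Poisson example (assuming SciPy, which these notes do not mention) confirms the analytic answer θ = 2:

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.stats import poisson

    y = np.array([5, 0, 1, 1, 0, 3, 2, 3, 4, 1])

    # negative log-likelihood: minimizing this maximizes Σ log p(yi|θ)
    def nll(theta):
        return -np.sum(poisson.logpmf(y, mu=theta))

    result = minimize_scalar(nll, bounds=(1e-6, 10), method="bounded")
    print(result.x)   # ≈ 2.0, the sample mean Σyi/n = 20/10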

MLE and Linear Regression

y = βx + ε

Assume that the model residuals are independently and identically normally distributed:

ε = y - βx ~ N(0, σ^2)

Based on this assumption, the log-likelihood function for the unknown parameter vector θ = {β, σ^2} can be derived.

The probability density of a normal distribution N(μ, σ^2) is

            1        -(x-μ)^2
f(x) = ---------- e^[--------]
        √(2πσ^2)      2σ^2

substituting the residual εi = yi - βxi (with μ = 0) and using the independence of the n observations:

             n      1        -(yi-βxi)^2
L(θ|y,x) =   ∏  ---------- e^[----------]
            i=1  √(2πσ^2)       2σ^2

                    n      1        -(yi-βxi)^2
log L(θ|y,x) = log  ∏  ---------- e^[----------]
                   i=1  √(2πσ^2)       2σ^2

                n         1        -(yi-βxi)^2
             =  Σ  log[---------- e^[----------]]
               i=1      √(2πσ^2)       2σ^2

                          1       n  (yi-βxi)^2
             = n log ---------- - Σ  ----------
                      √(2πσ^2)   i=1    2σ^2

the estimates of θ = {β, σ^2} are the values that maximize this log-likelihood
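As a minimal sketch (assuming SciPy and simulated data, since the notes provide none), maximizing this log-likelihood numerically recovers β; with normal errors the result coincides with the no-intercept OLS solution Σxiyi/Σxi^2:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    # simulated data (an assumption for illustration): true β = 3, σ = 0.5
    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 3.0 * x + rng.normal(scale=0.5, size=100)

    def neg_log_likelihood(params):
        beta, log_sigma = params          # optimize log(sigma) so sigma stays positive
        sigma = np.exp(log_sigma)
        return -np.sum(norm.logpdf(y - beta * x, scale=sigma))

    result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
    beta_hat, sigma_hat = result.x[0], np.exp(result.x[1])

    # sanity check: with normal errors, the MLE of β equals OLS (no intercept)
    beta_ols = np.sum(x * y) / np.sum(x * x)

Note that the MLE of σ^2 divides the sum of squared residuals by n rather than n-1, an instance of the small-sample bias listed among the disadvantages above.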
