Distributions II
Multinomial Variables
A K-dimensional vector x in which exactly one element x_k equals 1 and all others equal 0, e.g. for K=5, x=(0,1,0,0,0)'
Note: Σ_K x_k=1
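If we denote the probability of x_k=1 by the parameter μ_k, the distribution of x is
$$p(x \mid \underline{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}$$
e.g. for K=5 and x_2=1 with μ_2=0.3:
$$p(x \mid \underline{\mu}) = \mu_1^{x_1}\,\mu_2^{x_2}\,\mu_3^{x_3}\,\mu_4^{x_4}\,\mu_5^{x_5} = \mu_1^{0} \cdot 0.3^{1} \cdot \mu_3^{0} \cdot \mu_4^{0} \cdot \mu_5^{0} = 0.3$$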
μ_ = (μ_1,μ_2,...,μ_K)'
Constraint:
μ_k ≥ 0 and Σ_K μ_k=1
The distribution p(x|μ_) can be regarded as a generalization of the Bernoulli distribution to more than two outcomes
p(x|μ_) is normalized:
Σ_x p(x|μ_) = Σ_K μ_k = 1
mean: E[x|μ_] = Σ_x p(x|μ_)x = (μ_1,μ_2,...,μ_K)' = μ_
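The definitions above can be checked numerically. The following is a minimal sketch in plain Python; the helper names `one_hot` and `prob` are hypothetical, and all values of μ other than μ_2=0.3 are made up for illustration.

```python
# Minimal sketch of the 1-of-K distribution p(x|mu) = prod_k mu_k^(x_k).
# Only mu_2 = 0.3 comes from the notes; the other entries are assumed.

def one_hot(k, K):
    """Return the K-dimensional vector with a 1 at position k (0-based)."""
    return [1 if i == k else 0 for i in range(K)]

def prob(x, mu):
    """Evaluate p(x|mu) = prod_k mu_k^(x_k) for a one-hot vector x."""
    p = 1.0
    for x_k, mu_k in zip(x, mu):
        p *= mu_k ** x_k
    return p

mu = [0.1, 0.3, 0.2, 0.25, 0.15]   # assumed parameters, non-negative, sum to 1
K = len(mu)

x = one_hot(1, K)                   # x = (0,1,0,0,0)', i.e. x_2 = 1
p_x = prob(x, mu)                   # the product picks out mu_2

# Normalization: sum p(x|mu) over the K possible one-hot vectors
total = sum(prob(one_hot(k, K), mu) for k in range(K))

# Mean: E[x|mu] = sum_x p(x|mu) x, computed element by element
mean = [
    sum(prob(one_hot(k, K), mu) * one_hot(k, K)[i] for k in range(K))
    for i in range(K)
]

print(p_x, total, mean)
```

Here `p_x` reproduces the worked value 0.3, `total` is 1 up to rounding, and `mean` matches `mu` element by element.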
Suppose we have a data set D={x_1,…,x_N} of N independent observations. The likelihood function is
$$p(D \mid \underline{\mu}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mu_k^{x_{nk}} = \prod_{k=1}^{K} \mu_k^{\sum_{n=1}^{N} x_{nk}} = \prod_{k=1}^{K} \mu_k^{m_k}$$
m_k = Σ_N x_nk is the number of observations with x_k=1; the m_k are the sufficient statistics for this distribution
To find the maximum likelihood solution for μ_, we need to maximize ln p(D|μ_) w.r.t. μ_k subject to the constraint that the μ_k sum to 1
This can be done by using a Lagrange multiplier λ
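The intermediate steps are not written out in these notes. One way to fill them in, using ln p(D|μ_) = Σ_K m_k ln μ_k from the likelihood above, is sketched here; the symbol \mathcal{L} for the Lagrangian is introduced only for this sketch.

```latex
\begin{align}
\mathcal{L}(\underline{\mu}, \lambda)
  &= \sum_{k=1}^{K} m_k \ln \mu_k + \lambda \left( \sum_{k=1}^{K} \mu_k - 1 \right) \\
\frac{\partial \mathcal{L}}{\partial \mu_k}
  &= \frac{m_k}{\mu_k} + \lambda = 0
  \quad\Rightarrow\quad \mu_k = -\frac{m_k}{\lambda} \\
\sum_{k=1}^{K} \mu_k = 1
  &\quad\Rightarrow\quad -\frac{1}{\lambda} \sum_{k=1}^{K} m_k = 1
  \quad\Rightarrow\quad \lambda = -N
\end{align}
```

Substituting λ = -N back into μ_k = -m_k/λ eliminates the multiplier.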
The result is
$$\mu_k^{ML} = \frac{m_k}{N}$$
i.e. the fraction of the N observations for which x_k=1
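As an illustration, the estimate can be computed from simulated data. This is a sketch only: the "true" parameters `mu_true`, the sample size and the random seed are assumptions chosen for the example, and each observation is stored as the index k of its non-zero element rather than as a full one-hot vector.

```python
import random

random.seed(0)                          # fixed seed so the run is repeatable

mu_true = [0.1, 0.3, 0.2, 0.25, 0.15]   # assumed parameters used to simulate D
K = len(mu_true)
N = 10000

# Draw N independent observations; each entry is the index k with x_nk = 1
draws = random.choices(range(K), weights=mu_true, k=N)

# Sufficient statistics: m_k = sum_n x_nk (how often outcome k was observed)
m = [0] * K
for k in draws:
    m[k] += 1

# Maximum likelihood estimate: mu_k = m_k / N
mu_ml = [m_k / N for m_k in m]

print(m)
print(mu_ml)
```

With N this large the estimates should land close to `mu_true`; for small N they can be far off, and any outcome that was never observed gets an estimate of exactly 0.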
Multinomial
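The joint distribution of the counts m_1,...,m_K, conditioned on μ_ and on the total number N of observations:
$$\mathrm{Mult}(m_1, m_2, \ldots, m_K \mid \underline{\mu}, N) = \binom{N}{m_1\, m_2 \ldots m_K} \prod_{k=1}^{K} \mu_k^{m_k}$$
$$\binom{N}{m_1\, m_2 \ldots m_K} = \frac{N!}{m_1!\, m_2! \cdots m_K!}$$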
Σ_K m_k = N
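The multinomial formula can be evaluated directly for small N. The sketch below uses only the Python standard library; the function names and the example values of μ_ and N are assumptions made for illustration.

```python
from math import factorial, prod

def multinomial_coefficient(m):
    """N! / (m_1! m_2! ... m_K!) with N = sum(m), using exact integers."""
    coeff = factorial(sum(m))
    for m_k in m:
        coeff //= factorial(m_k)    # every intermediate quotient is an integer
    return coeff

def mult_pmf(m, mu):
    """Mult(m_1,...,m_K | mu, N) with N = sum(m)."""
    return multinomial_coefficient(m) * prod(
        mu_k ** m_k for mu_k, m_k in zip(mu, m)
    )

mu = [0.2, 0.3, 0.5]    # assumed parameters, K = 3
N = 4

# One count vector: 4!/(1! 1! 2!) = 12 orderings, each with probability
# 0.2 * 0.3 * 0.5^2 = 0.015, so the result is approximately 0.18
p_112 = mult_pmf([1, 1, 2], mu)

# Sum over every count vector (a, b, N-a-b) satisfying sum_k m_k = N
total = sum(
    mult_pmf([a, b, N - a - b], mu)
    for a in range(N + 1)
    for b in range(N + 1 - a)
)

print(p_112, total)
```

The sum over all admissible count vectors should come out as 1 up to rounding, which is a quick check that the coefficient and the product are combined correctly.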
Dirichlet
The prior for {μ_k}
By inspection of the form of the multinomial distribution, we see that the conjugate prior is given by
$$p(\underline{\mu} \mid \underline{\alpha}) \propto \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}$$
Constraint:
0≤μ_k≤1, and Σ_K μ_k =1
Parameter:
α_= (α_1,α_2,...,α_K)'
Note: because of the summation constraint, the distribution over the space of the {μ_k} is confined to a simplex of dimensionality K-1
The normalized form of the Dirichlet distribution is:
$$\mathrm{Dir}(\underline{\mu} \mid \underline{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}$$
α_0 = Σ_K α_k
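A short sketch of how the normalized density might be evaluated, working in log space with `math.lgamma` to avoid overflow in the Gamma functions. The sampling routine is not part of these notes: it relies on the standard construction of normalizing independent Gamma(α_k, 1) draws. All function names and parameter values are assumptions.

```python
from math import lgamma, log, exp
import random

def dirichlet_log_pdf(mu, alpha):
    """ln Dir(mu|alpha); assumes every mu_k > 0 and sum(mu) = 1."""
    alpha_0 = sum(alpha)
    # ln of Gamma(alpha_0) / (Gamma(alpha_1) ... Gamma(alpha_K))
    log_norm = lgamma(alpha_0) - sum(lgamma(a) for a in alpha)
    return log_norm + sum((a - 1) * log(m) for a, m in zip(alpha, mu))

def sample_dirichlet(alpha, rng):
    """Draw mu ~ Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [g_k / s for g_k in g]

# With alpha = (1,1,1) every exponent alpha_k - 1 is zero, so the density is
# the constant Gamma(3) / Gamma(1)^3 = 2 everywhere on the simplex
uniform_density = exp(dirichlet_log_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))

rng = random.Random(0)                  # seeded generator for repeatability
sample = sample_dirichlet([2.0, 3.0, 5.0], rng)

print(uniform_density)
print(sample, sum(sample))
```

Each sample lies on the simplex: its entries are between 0 and 1 and sum to 1, in line with the constraint stated above.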
Prior: Dir(μ_|α_)
Likelihood: Mult(m1,m2,…mK|μ_,N)
Posterior:
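$$p(\underline{\mu} \mid D, \underline{\alpha}) \propto p(D \mid \underline{\mu})\, p(\underline{\mu} \mid \underline{\alpha}) \propto \prod_{k=1}^{K} \mu_k^{\alpha_k + m_k - 1}$$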
The posterior again takes the form of a Dirichlet distribution
=> the Dirichlet is indeed a conjugate prior for the multinomial
Then the normalization coefficient is
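$$p(\underline{\mu} \mid D, \underline{\alpha}) = \mathrm{Dir}(\underline{\mu} \mid \underline{\alpha} + \underline{m}) = \frac{\Gamma(\alpha_0 + N)}{\Gamma(\alpha_1 + m_1) \cdots \Gamma(\alpha_K + m_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k + m_k - 1}$$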
m_ = (m_1,...,m_K)'
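The update from prior to posterior amounts to adding the counts to the prior parameters. The sketch below illustrates this and checks conjugacy numerically: prior times likelihood divided by Dir(μ_|α_+m_) should be the same constant for any μ_. The prior `alpha`, the counts `m` and the two test points are made-up values, and the posterior mean (α_k+m_k)/(α_0+N) is a standard Dirichlet result that is not stated in these notes.

```python
from math import lgamma, log

def dirichlet_log_pdf(mu, alpha):
    """ln Dir(mu|alpha); assumes every mu_k > 0 and sum(mu) = 1."""
    log_norm = lgamma(sum(alpha)) - sum(lgamma(a) for a in alpha)
    return log_norm + sum((a - 1) * log(m) for a, m in zip(alpha, mu))

alpha = [2.0, 2.0, 2.0]     # assumed prior parameters
m = [1, 3, 6]               # assumed observed counts, N = 10
N = sum(m)

# Posterior parameters: alpha_k + m_k
alpha_post = [a + m_k for a, m_k in zip(alpha, m)]

# Posterior mean (standard result, not from the notes) versus the ML estimate
post_mean = [a / sum(alpha_post) for a in alpha_post]
mu_ml = [m_k / N for m_k in m]

def log_ratio(mu):
    """ln[ Dir(mu|alpha) * prod_k mu_k^m_k / Dir(mu|alpha + m) ]."""
    log_lik = sum(m_k * log(mu_k) for m_k, mu_k in zip(m, mu))
    return dirichlet_log_pdf(mu, alpha) + log_lik - dirichlet_log_pdf(mu, alpha_post)

# Conjugacy check: the ratio must not depend on mu
r1 = log_ratio([0.2, 0.3, 0.5])
r2 = log_ratio([0.6, 0.3, 0.1])

print(alpha_post, post_mean, mu_ml)
print(r1, r2)
```

In this example the posterior mean sits between the prior mean (1/3 for each component) and the ML estimate, and `r1` and `r2` agree up to rounding.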
References
Bishop Chapter 2 Probability Distributions
Visualizing Dirichlet Distributions with Matplotlib
Categorical data / Multinomial distribution