Distributions I
Keywords
Binomial distributions
Multinomial distributions
for discrete random variables
Continuous random variables
Density Estimation
Model p(x) of a random variable x, given a finite set {x_1,x_2,…,x_N} of observations
Assumption: Data points are Independent and Identically Distributed (I.I.D)
Density estimation is fundamentally ill-posed, because there are infinitely many probability distributions that could have given rise to the observed finite data set
Model Selection
The issue of choosing an appropriate distribution for the model
Parametric Distributions
Distributions that are governed by a small number of adaptive parameters, e.g. the mean and variance of a Gaussian
To determine the parameters of a target distribution:
In a frequentist treatment, we choose specific values for the parameters by optimizing some criterion, such as the likelihood function
In a Bayesian treatment, we introduce prior distributions over the parameters and use Bayes’ theorem to compute the corresponding posterior distribution given the observed data
Conjugate priors
leads to posterior distribution having the same functional form as the prior, therefore leads to a greatly simplified Bayesian analysis
e.g. the conjugate prior for the parameters of the multinomial distribution is called the Dirichlet distribution
the conjugate prior for the mean of a Gaussian is another Gaussian
Exponential Family
Parametric Density Estimation
Non-Parametric Density Estimation
The distribution typically depends on the size of the data set
Such models still contain parameters, but they control the model complexity rather than the form of the distribution
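A minimal sketch of this idea (my own illustration, not from the text): a histogram is a non-parametric density estimate, and its bin width h is a parameter that controls the smoothness (complexity) of the estimate rather than the functional form of the distribution
import numpy as np
import matplotlib.pyplot as plt
# hypothetical data: 500 points from a standard normal, just for illustration
data = np.random.default_rng(0).normal(size=500)
for h in [0.1, 0.5, 2.0]:   # the bin width h plays the role of a complexity parameter
    bins = np.arange(data.min(), data.max() + h, h)
    plt.hist(data, bins=bins, density=True, histtype='step', label='h=%.1f' % h)
plt.legend()
plt.show()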
Sufficient Statistic
A statistic is sufficient for a family of probability distributions if the sample from which it is calculated gives no more information than the statistic itself about which of those distributions generated the sample
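A small sketch of this for the Bernoulli case below (my own illustration): the count of ones m = Σ_n x_n is sufficient for μ, so two data sets with the same m give exactly the same likelihood function
import numpy as np
def bernoulli_likelihood(data, mu):
    data = np.asarray(data)
    return np.prod(mu ** data * (1 - mu) ** (1 - data))   # p(D|mu) = mu^m (1-mu)^(N-m)
D1 = [1, 1, 0, 0, 1]   # m = 3 ones out of N = 5
D2 = [0, 1, 1, 1, 0]   # different ordering, same m = 3 and N = 5
for mu in [0.2, 0.5, 0.8]:
    print(mu, bernoulli_likelihood(D1, mu), bernoulli_likelihood(D2, mu))   # the two values match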
Binary Variables
a single binary random variable x∈{0,1}, e.g. coin flipping
x=1 <- heads
x=0 <- tails
Imagine a damaged coin for which p(heads) ≠ p(tails). Say
p(x=1|μ) = μ
p(x=0|μ) = 1-μ
μ - parameter μ∈[0,1]
Bernoulli
Bern(x|μ) = μ^x * (1-μ)^(1-x)
is normalized
mean: E[x]=μ
variance: var[x]=μ(1-μ)
import numpy as np
import matplotlib.pyplot as plt
# hand crafted: bar plot of the two Bernoulli probabilities p(x=0)=1-mu, p(x=1)=mu
mu = 0.3
plt.bar([0, 1], [1 - mu, mu], width=0.2)
plt.show()
# scipy: draw samples from Bern(x|mu=0.3) and plot their frequencies
from scipy.stats import bernoulli
import seaborn as sns
data = bernoulli.rvs(size=1000, p=0.3)
ax = sns.distplot(data, kde=False, hist_kws={'linewidth': 15})
ax.set(xlabel='Bernoulli', ylabel='Frequency')
plt.show()
Suppose we have data set D={x_1,…x_N} of observed values of x,
p(D|μ) <- likelihood, a function of μ on the assumption that the observations are drawn independently from p(x|μ)
p(D|μ) = ∏_{n=1}^{N} p(x_n|μ) = ∏_{n=1}^{N} μ^(x_n) * (1-μ)^(1-x_n)
In a frequentist setting, we can estimate a value for μ by maximizing the likelihood or the log-likelihood
ln p(D|μ) = Σ_{n=1}^{N} ln p(x_n|μ) = Σ_{n=1}^{N} [x_n ln μ + (1-x_n) ln(1-μ)]
Setting the derivative of ln p(D|μ) w.r.t. μ to zero gives the maximum likelihood estimator:
d ln p(D|μ)/dμ = Σ_{n=1}^{N} [x_n/μ - (1-x_n)/(1-μ)] = 0
μ_ML = (1/N) Σ_{n=1}^{N} x_n <- sample mean
If we observe m 'heads' in the data set,
μ_ML = m/N
Problem:
Suppose we flip a coin 3 times and happen to observe 3 heads; then
μ_ML = m/N = 3/3 = 1
In this case, the maximum likelihood result predicts that all future flips will give heads
This is unreasonable, and in fact it is an extreme example of the over-fitting associated with maximum likelihood
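A quick numerical sketch of this (my own illustration; the large sample uses an assumed true μ = 0.5):
import numpy as np
from scipy.stats import bernoulli
# 3 flips that all happen to come up heads: mu_ML = m/N = 1,
# so the fitted model assigns zero probability to ever seeing tails
D_small = np.array([1, 1, 1])
print(D_small.mean())                      # 1.0
# with many flips (simulated here with true mu = 0.5), mu_ML is close to the true value
D_large = bernoulli.rvs(p=0.5, size=10000)
print(D_large.mean())                      # ~0.5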
Solution:
Introduction of a prior distribution over μ
We can also work out the distribution of the number m of observations of x=1, given the data set size N
Binomial
In order to obtain the normalization coefficient, we note that out of N coin flips we have to add up all of the possible ways of obtaining m heads,
Bin(m|N,μ) = (N choose m) μ^m * (1-μ)^(N-m)
(N choose m) = N! / ((N-m)! m!) <- the number of ways of choosing m objects out of a total of N identical objects
E[m] ≜ Σ_{m=0}^{N} m Bin(m|N,μ) = Nμ
var[m] ≜ Σ_{m=0}^{N} (m-E[m])^2 * Bin(m|N,μ) = Nμ(1-μ)
Say N=10, μ=0.25 (flip a coin with μ=0.25 ten times; what is the probability that m heads appear?)
import numpy as np
import matplotlib.pyplot as plt
from math import factorial
# hand crafted: Bin(m|N,mu) = (N choose m) * mu^m * (1-mu)^(N-m)
def binomial(n, m, mu):
    return (factorial(n) / (factorial(n - m) * factorial(m))) * (mu ** m) * (1 - mu) ** (n - m)
n = 10
mu = 0.25
dist = [binomial(n, m, mu) for m in range(0, n + 1)]  # m ranges over 0..N, so include m = N
plt.bar(range(0, n + 1), dist)
plt.show()
# scipy: sample m ~ Bin(m|N=10, mu=0.25) many times and plot the frequencies
from scipy.stats import binom
import seaborn as sns
data = binom.rvs(n=10, p=0.25, size=10000)
ax = sns.distplot(data, kde=False)  # norm_hist=True for a normalized histogram
ax.set(xlabel='Binomial', ylabel='Frequency')
plt.show()
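As a quick empirical check of E[m] = Nμ = 2.5 and var[m] = Nμ(1-μ) = 1.875 above (my own addition, reusing N=10, μ=0.25):
import numpy as np
from scipy.stats import binom
N, mu = 10, 0.25
samples = binom.rvs(n=N, p=mu, size=100000)
print(samples.mean(), N * mu)              # ~2.5  vs  2.5
print(samples.var(), N * mu * (1 - mu))    # ~1.875  vs  1.875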
Beta
used as the prior over μ
Beta(μ|a,b) = [Γ(a+b) / (Γ(a)Γ(b))] μ^(a-1) * (1-μ)^(b-1)
Γ(x) - the gamma function
Γ(x) ≜ ∫_0^∞ u^(x-1) * e^(-u) du
Γ(x+1) = xΓ(x)
the coefficient Γ(a+b)/(Γ(a)Γ(b)) ensures the beta distribution is normalized
∫_0^1 Beta(μ|a,b) dμ = 1
E[μ] = a/(a+b)
var[μ] = ab / ((a+b)^2 (a+b+1))
a, b - hyperparameters that control the distribution of the parameter μ
Generate data
from scipy.stats import beta
import seaborn as sns
import matplotlib.pyplot as plt
# samples from Beta(mu|a,b) for several hyperparameter settings
data = beta.rvs(0.1, 0.1, size=10000)   # a=b=0.1: mass concentrates near mu=0 and mu=1
sns.distplot(data, kde=False, bins=100)
data = beta.rvs(1, 1, size=10000)       # a=b=1: uniform on [0,1]
sns.distplot(data, kde=False, bins=100)
data = beta.rvs(2, 3, size=10000)       # a=2, b=3: peaked below mu=0.5
sns.distplot(data, kde=False, bins=100)
data = beta.rvs(8, 4, size=10000)       # a=8, b=4: peaked around mu ≈ 0.7
sns.distplot(data, kde=False, bins=100)
plt.show()
Draw pdf
# pdf curves of Beta(mu|a,b) for the same (a,b) settings as above
x = np.linspace(0, 1, num=1000)
plt.plot(x, beta.pdf(x, 0.1, 0.1), label='a=0.1, b=0.1')
plt.plot(x, beta.pdf(x, 1, 1), label='a=1, b=1')
plt.plot(x, beta.pdf(x, 2, 3), label='a=2, b=3')
plt.plot(x, beta.pdf(x, 8, 4), label='a=8, b=4')
plt.legend()
plt.show()
A Whole Bayesian View
The posterior of μ is obtained by multiplying the beta prior and the binomial likelihood
prior:
Beta(μ|a,b) = [Γ(a+b) / (Γ(a)Γ(b))] μ^(a-1) * (1-μ)^(b-1)
likelihood:
Bin(m|N,μ) = (N choose m) μ^m * (1-μ)^(N-m)
posterior: (keeping only the factors dependent on μ)
p(μ|m,N,a,b) ∝ μ^(m+a-1) * (1-μ)^(N-m+b-1)
The posterior has the same functional form as the prior, reflecting the conjugacy of the prior w.r.t. the likelihood function
Then, the normalization coefficient is
p(μ|m,N,a,b) = [Γ(N+a+b) / (Γ(m+a)Γ(N-m+b))] μ^(m+a-1) * (1-μ)^(N-m+b-1)
We can simply interpret a and b as effective numbers of observations of x=1 and x=0, respectively
e.g. prior ~ Beta(a=2,b=2), likelihood ~ Bin(N=m=1) => posterior ~ Beta(a=3,b=2)
# prior Beta(mu|2,2) (red), likelihood Bin(m=1|N=1,mu) = mu (blue), posterior Beta(mu|3,2) (green)
x = np.linspace(0, 1, num=1000)
a = 2
b = 2
plt.plot(x, beta.pdf(x, a, b), 'r', label='prior Beta(2,2)')
N = 1
m = 1
plt.plot(x, x, 'b', label='likelihood = mu (for N=m=1)')
a = 3
b = 2
plt.plot(x, beta.pdf(x, a, b), 'g', label='posterior Beta(3,2)')
plt.legend()
plt.show()
At each stage, the posterior is a beta distribution whose a and b reflect the numbers of 1s and 0s observed so far
Observing one more x=1 simply corresponds to incrementing a by 1, and observing one more x=0 to incrementing b by 1
Sequential Approach
Sequential approaches make use of observations one at a time, or in small batches, then discard them before the next observations are used
(+) real time streaming
(+) do not require the whole data set
(+) can be useful for large data sets
If we evaluate the predictive distribution of x given the observed data set D
p(x=1|D) = ∫_0^1 p(x=1|μ) p(μ|D) dμ = ∫_0^1 μ p(μ|D) dμ = E[μ|D]
Using E[μ] = a/(a+b) with the posterior p(μ|m,N,a,b) = [Γ(N+a+b) / (Γ(m+a)Γ(N-m+b))] μ^(m+a-1) * (1-μ)^(N-m+b-1) as p(μ|D),
p(x=1|D) = (m+a)/(N+a+b)   <- m+a corresponds to the observations of x=1
If N → ∞ (an infinitely large data set), a and b become negligible and p(x=1|D) → m/N = μ_ML, exactly the maximum likelihood result
If N is finite, p(x=1|D) lies between the prior mean E[μ] = a/(a+b) and the maximum likelihood estimate μ_ML = (1/N) Σ_{n=1}^{N} x_n
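A short sketch of the sequential update and the predictive probability (my own illustration; the prior a=b=2 and the true μ=0.3 are assumed values): each new observation just increments a or b, and p(x=1|D) = (m+a)/(N+a+b) moves from the prior mean towards the maximum likelihood estimate
from scipy.stats import bernoulli
a, b = 2.0, 2.0                      # prior Beta(mu|a,b)
stream = bernoulli.rvs(p=0.3, size=1000)
for n, x in enumerate(stream, start=1):
    a += x                           # one more observation of x=1
    b += 1 - x                       # one more observation of x=0
    if n in (1, 10, 100, 1000):
        print(n, a / (a + b))        # predictive p(x=1|D), tends towards the true mu = 0.3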
From the beta plots for (a,b) going from (0.1,0.1) to (8,4), we see that as the number of observations increases, the posterior distribution becomes more sharply peaked
This can also be seen from var[μ] = ab / ((a+b)^2 (a+b+1)), which goes to zero as a → ∞ or b → ∞
This is a general property of Bayesian learning: as we observe more and more data, the uncertainty represented by the posterior distribution steadily decreases (on average)
Proof of this property
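A sketch of the standard argument (following Bishop, Chapter 2; written from memory, so treat it as an outline rather than the book's exact derivation): take expectations of the posterior mean and variance of a parameter θ over the distribution of data sets D
% Averaging over data sets D drawn from the joint p(\theta, D):
\mathbb{E}_{\theta}[\theta] = \mathbb{E}_{D}\!\left[ \mathbb{E}_{\theta}[\theta \mid D] \right]
\quad\text{(the posterior mean, averaged over data sets, equals the prior mean)}
% Law of total variance:
\operatorname{var}_{\theta}[\theta]
  = \mathbb{E}_{D}\!\left[ \operatorname{var}_{\theta}[\theta \mid D] \right]
  + \operatorname{var}_{D}\!\left[ \mathbb{E}_{\theta}[\theta \mid D] \right]
Since the last term is non-negative, the expected posterior variance E_D[var_θ[θ|D]] is no larger than the prior variance var_θ[θ]; the posterior uncertainty therefore decreases on average, although a particular data set can still increase it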
Reference
Bishop, Pattern Recognition and Machine Learning, Chapter 2: Probability Distributions
Probability Distributions in Python with SciPy and Seaborn
Binomial Distribution: Formula, What it is and How to use it