Entropy

Entropy gives us a way to quantify the average information content of a probability distribution

         n
H(X) = - Σ p(x_i) log p(x_i)
        i=1

Suppose we have a simple probability distribution over the outcome of a coin flip, heads or tails: [p, 1-p]

Plugging [p,1-p] into H(X), we have

H(X) = -[p log p + (1-p) log(1-p)]

p=0.5   H(X)=0.69   entropy at its maximum   most info
p=0.9   H(X)=0.32
p->1    H(X)->0     entropy approaches 0     no info - an event that always occurs carries no information
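
A minimal check of these numbers (the values above use the natural log, so entropy is in nats):

import numpy as np

def coin_entropy(p):
    # H(X) = -[p log p + (1-p) log(1-p)], natural log
    return -(p*np.log(p) + (1-p)*np.log(1-p))

coin_entropy(0.5)   # ≈ 0.693 (= ln 2, the maximum)
coin_entropy(0.9)   # ≈ 0.325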

Return to definition of H(X):

         n                     n              1
H(X) = - Σ p(x_i) log p(x_i) = Σ p(x_i) log -----
        i=1                   i=1           p(x_i)

Cross Entropy

Cross entropy quantifies the average information needed to encode events drawn from a distribution p when using a code optimized for another distribution q

                                             1
H(p,q) = - Σ p(x_i) log q(x_i) = Σ p(x) log ----
          x_i                    x          q(x)
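
A small numerical sketch (p and q here are made-up distributions, natural log): the cross entropy equals the plain entropy H(p) when q = p, and is larger otherwise.

import numpy as np

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])

def cross_entropy(p, q):
    # H(p, q) = -sum p(x) log q(x)
    return -np.sum(p * np.log(q))

cross_entropy(p, p)   # = H(p) ≈ 0.693
cross_entropy(p, q)   # ≈ 1.204, larger than H(p)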

KL Divergence

KL divergence (aka relative entropy) quantifies how much one probability distribution differs from another; it behaves like a distance but is not a true metric (see the notes below)

Useful as a loss in machine learning, closely related to cross entropy
Useful when dealing with a complex distribution:
Rather than working with the distribution directly, we can use another distribution with well-known properties (e.g. a Gaussian) that does a decent job of describing the data

Simply put, we can use the KL divergence to tell whether a Poisson distribution or a normal distribution better approximates the data (a rough sketch follows)
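
A rough sketch of that comparison (the data, the fits, and the binning here are hypothetical, for illustration only): build an empirical distribution over counts, then compare its KL divergence to a fitted Poisson and a fitted normal.

import numpy as np
from scipy.stats import poisson, norm, entropy

np.random.seed(0)
data = np.random.poisson(lam=5, size=10000)           # hypothetical count data

values = np.arange(data.max() + 1)
p_emp = np.bincount(data) / len(data)                  # empirical distribution over counts

q_poisson = poisson.pmf(values, mu=data.mean())        # Poisson fit
q_normal = norm.pdf(values, data.mean(), data.std())   # normal fit, evaluated at the integers
q_normal /= q_normal.sum()                             # renormalize over this support

entropy(p_emp, q_poisson)   # smaller KL -> better fit
entropy(p_emp, q_normal)    # larger KL -> worse fit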

KL divergence is also a key component of Gaussian Mixture Models and t-SNE

Recall entropy and cross entropy:

H(p,q) - the average amount of information needed to encode events from p using a code optimized for q
H(p) - the average amount of information needed to encode events from p using its own optimal code
KL(p||q) - the extra information needed when encoding p with a code optimized for q instead of p's own optimal code

                                                                       p(x)
KL(p||q) = H(p,q) - H(p) = - Σ p(x)logq(x) + Σ p(x)logp(x) = Σ p(x)log ----
                             x               x               x         q(x)
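
A quick numerical check of the identity KL(p||q) = H(p,q) - H(p) (hypothetical p and q, natural log):

import numpy as np

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])

cross_entropy = -np.sum(p * np.log(q))   # H(p, q)
entropy_p = -np.sum(p * np.log(p))       # H(p)
kl = np.sum(p * np.log(p / q))           # KL(p||q)

# cross_entropy - entropy_p ≈ kl ≈ 0.511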

For distributions p and q of a discrete random variable,

                       p(x)
 KL(p||q) = Σ p(x) log ----
            x          q(x)

For distributions p and q of a continuous random variable,

                      p(x)
KL(p||q) = ∫ p(x) log ---- dx
                      q(x)
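
For two univariate Gaussians the integral has a well-known closed form, KL(N(mu1, sig1²) || N(mu2, sig2²)) = log(sig2/sig1) + (sig1² + (mu1-mu2)²)/(2 sig2²) - 1/2. A small helper (my own naming), handy for sanity-checking the numerical examples below:

import numpy as np

def kl_gauss(mu1, sig1, mu2, sig2):
    # closed-form KL(N(mu1, sig1^2) || N(mu2, sig2^2)); sig1, sig2 are standard deviations
    return np.log(sig2/sig1) + (sig1**2 + (mu1 - mu2)**2) / (2*sig2**2) - 0.5

kl_gauss(0, 2, 2, 2)   # = 0.5
kl_gauss(0, 2, 5, 4)   # ≈ 1.099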

Notes:

  • KL divergence is not symmetrical: KL(p||q) != KL(q||p) (see the quick check below)
  • the lower the KL divergence, the closer the two distributions are to one another
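
A quick check of the asymmetry using scipy (entropy(pk, qk) returns the KL divergence when a second argument is given; p and q here are made up):

from scipy.stats import entropy

p = [0.5, 0.5]
q = [0.9, 0.1]

entropy(p, q)   # KL(p||q) ≈ 0.511
entropy(q, p)   # KL(q||p) ≈ 0.368, a different value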

Python Examples

For the discrete case, generate red balls with probability 40% and blue balls with probability 60%

import numpy as np
from collections import Counter

def generate_balls(n):
    # generate uniform random numbers in [0.0, 1.0); x <= 0.4 -> red (40%), else blue (60%)
    bag=np.random.random(n)
    return ['red' if x<=0.4 else 'blue' for x in bag.tolist()]

def percentage(cnt):
    # convert counts to relative frequencies
    total = sum(cnt.values())
    return [v/total for v in cnt.values()]

# fix the random seed for reproducible output
np.random.seed(8)
cnt10=Counter(generate_balls(10))
>>Counter({'blue': 5, 'red': 5})
cnt100=Counter(generate_balls(100))
>>Counter({'blue': 57, 'red': 43})
cnt1000=Counter(generate_balls(1000))
>>Counter({'blue': 585, 'red': 415})

p10=percentage(cnt10)
>>[0.5,0.5]
p100=percentage(cnt100)
>>[0.57,0.43]
p1000=percentage(cnt1000)
>>[0.585,0.415]

q=[0.6,0.4]  # true distribution, ordered [blue, red] to match the empirical lists

# handcrafted KL(p||q) = Σ p log(p/q)
kl10=np.sum(np.array(p10)*np.log(p10/np.array(q)))
>>0.020410997260127586
kl100=np.sum(np.array(p100)*np.log(p100/np.array(q)))
>>0.001860706678335388
kl1000=np.sum(np.array(p1000)*np.log(p1000/np.array(q)))
>>0.0004668811751176276

# scipy: entropy(pk, qk) computes the KL divergence D(pk||qk); inputs are normalized to sum to 1
from scipy.stats import entropy
entropy(p10,q)
>>0.020410997260127586
entropy(p100,q)
>>0.001860706678335388
entropy(p1000,q)
>>0.0004668811751176276

For the continuous case, take two Gaussians (written here as N(mean, std)):
with p=N(0,2) and q=N(2,2), the sum below gives KL(p||q)=500;
with p=N(0,2) and q=N(5,4), it gives KL(p||q)=1099.
Note that the kl() function sums p*log(p/q) over the grid without multiplying by the step dx=0.001, so these numbers are 1000x the true KL values of 0.5 and 1.099 (cf. the closed form above):

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import seaborn as sns
#sns.set()

from scipy.stats import norm

# avoid log(0) = -inf: where p != 0, compute p*np.log(p/q), otherwise contribute 0
# note: this sums over the grid without multiplying by dx, so it is ~1000x the true KL
def kl(p,q):
    return np.sum(np.where(p!=0,p*np.log(p/q),0))

x=np.arange(-10,10,0.001)
p=norm.pdf(x,0,2)
q=norm.pdf(x,2,2)

plt.figure()
plt.title('KL(p||q)=%1.3f' % kl(p,q))
plt.plot(x,p,'b',label='p with m=0,std=2')
plt.plot(x,q,'r',label='q with m=2,std=2')
plt.legend()
plt.savefig('kl_norm2.png',dpi=350)
plt.clf()

q=norm.pdf(x,5,4)
plt.figure()
plt.title('KL(p||q)=%1.3f' % kl(p,q))
plt.plot(x,p,'b',label='p with m=0,std=2')
plt.plot(x,q,'r',label='q with m=5,std=4')
plt.legend()
plt.savefig('kl_norm2_.png',dpi=350)
plt.clf()
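
As a sanity check against the closed form given earlier, multiplying the grid sum by the step dx recovers the true KL value:

import numpy as np
from scipy.stats import norm

dx=0.001
x=np.arange(-10,10,dx)
p=norm.pdf(x,0,2)
q=norm.pdf(x,2,2)

np.sum(p*np.log(p/q))*dx   # ≈ 0.5, the closed-form KL between N(0,2) and N(2,2)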

Comparison of KL values when approximating a bimodal Gaussian with a unimodal Gaussian

Figure 1: q1=N(0,4), which covers both modes, has the lowest value of KL(p||q1)

Minimizing KL Divergence

Since a lower KL divergence means the two distributions are closer, we can estimate the parameters of an approximating Gaussian by minimizing its KL divergence with respect to the target distribution

Using Gradient Descent
Create p=N(0,2) and q=N(mu,sig), where mu and sig (the variance) are learnable parameters initialized to 0 and 1

x=np.arange(-10,10,0.001)
p_pdf=norm.pdf(x,0,2).reshape(1,-1) #(20000,) reshape to (1,20000)

alpha=0.001
episodes=100

# TensorFlow 1.x graph: placeholder for p, learnable mu and sig (the variance) for q
p=tf.placeholder(tf.float64,shape=p_pdf.shape)
mu=tf.Variable(np.zeros(1))
sig=tf.Variable(np.eye(1))
# unnormalized Gaussian on the grid, normalized to sum to 1
normal=tf.exp(-tf.square(x-mu)/(2*sig))
q=normal/tf.reduce_sum(normal)

# KL(p||q), guarding against log(0) where p == 0
kl=tf.reduce_sum(tf.where(tf.equal(p,0),tf.zeros(p_pdf.shape,tf.float64),p*tf.log(p/q)))
optimizer=tf.train.GradientDescentOptimizer(alpha).minimize(kl)
init=tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    traj=[]
    mus=[]
    sig2s=[]

    for ep in range(episodes):
        sess.run(optimizer,{p:p_pdf})
        traj.append(sess.run(kl,{p:p_pdf}))
        mus.append(sess.run(mu)[0])
        sig2s.append(sess.run(sig)[0][0])

    for m,s in zip(mus,sig2s):
        q_pdf=norm.pdf(x,m,np.sqrt(s))
        plt.plot(x,q_pdf.reshape(-1,1),'r')

plt.plot(x,q_pdf.reshape(-1,1),'r',label='q approaching p')
plt.title('final KL(p||q)=%1.3f' % traj[-1])
plt.plot(x,p_pdf.reshape(-1,1),'b',label='p')
plt.legend()
plt.savefig('kl_gd.png',dpi=350)
plt.clf()
plt.plot(traj)
plt.xlabel('episodes')
plt.ylabel('KL(p||q) values')
plt.savefig('kl_klvalue.png',dpi=350)
plt.clf()
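
Because the Gaussian case has a closed-form KL (see above), the same fit can be checked without TensorFlow. A minimal sketch using scipy.optimize with an arbitrary starting point (in the gradient-descent run above, mu should similarly approach 0 and sig, the variance, should approach 4):

import numpy as np
from scipy.optimize import minimize

def kl_to_p(params):
    mu, log_sig = params
    sig = np.exp(log_sig)   # optimize log(std) so the std stays positive
    # closed-form KL( N(0, 2^2) || N(mu, sig^2) )
    return np.log(sig/2) + (2**2 + mu**2)/(2*sig**2) - 0.5

res = minimize(kl_to_p, x0=np.array([3.0, 0.0]))
res.x[0], np.exp(res.x[1])   # ≈ (0, 2): q converges to p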

Extensions

  • MaxEnt RL
  • Variational inference
