My Machine Learning Learning Experience (Part 6): Gaussian Discriminant Analysis And Maximum Likelihood Estimation

Gaussian Discriminant Analysis

So we see the word 'discriminant' again, which means we're still dealing with the conditional probability P(Y|X). To be more specific, our target is to maximize the posterior P(Y=y|X=x). By Bayes' Theorem,

P(Y=y|X=x) = P(X=x|Y=y) P(Y=y) / P(X=x),

so we can compare P(Y=C|X=x) and P(Y=D|X=x), where C and D represent different classes.


For the rest of this blag post, everything is based upon this fundamental assumption:
each class comes from a normal distribution (a Gaussian), like this:

f(x) = 1/(√(2π) σ) · exp(-(x - μ)² / (2σ²))
So let's say for each class we collect some data and construct its normal distribution. For a class C, which has its own mean μ_C and standard deviation σ_C, we write

P(X=x|Y=C) = f_C(x) = 1/(√(2π) σ_C) · exp(-(x - μ_C)² / (2σ_C²)),

and call

π_C = P(Y=C)

the prior probability of class C.
Compared to P(X=x|Y=C), P(Y=C) is much easier to find. Consider a 2-class classification problem where we collect 600 samples for class C and 400 samples for class D. Then
P(Y=C) = 600/(600 + 400) = 0.6 and P(Y=D) = 400/(600 + 400) = 0.4.

Now all we need is P(X) and we'll be done filling in every piece of the equation. Actually, we don't even need that. Our mission is to compare P(Y=C|X=x) and P(Y=D|X=x), which share the same denominator P(X) according to Bayes' Theorem. Therefore, we're really comparing P(X=x|Y=C)P(Y=C) and P(X=x|Y=D)P(Y=D).

Given x, the Bayes decision rule r*(x) returns the class that maximizes P(X=x|Y=C)P(Y=C).

And to make the computation easier, we're gonna do something about P(X=x|Y=C)P(Y=C) and P(X=x|Y=D)P(Y=D). Since ln(z) is monotonically increasing for z > 0,

maximizing f_C(x) π_C is equivalent to maximizing ln(f_C(x) π_C).

Therefore, define

Q_C(x) = ln(f_C(x) π_C)

(and likewise Q_D(x) for class D). And so, we have

Q_C(x) = -(x - μ_C)² / (2σ_C²) - ln(√(2π) σ_C) + ln π_C
Note that Q_C(x) is a quadratic function of x. So we can move on to the next part.
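
To make this concrete, here's a minimal Python sketch of that quadratic, assuming a 1-D feature and that we already know the class mean, standard deviation, and prior (all the names and numbers below are made up for illustration, not from any library):

```python
import math

def q(x, mu, sigma, prior):
    """Q_C(x) = -(x - mu)^2 / (2 sigma^2) - ln(sqrt(2 pi) * sigma) + ln(prior),
    i.e. the log of (Gaussian density at x) times (class prior)."""
    return (-(x - mu) ** 2 / (2 * sigma ** 2)
            - math.log(math.sqrt(2 * math.pi) * sigma)
            + math.log(prior))

# Example: class C ~ N(0, 1) with prior 0.6, class D ~ N(3, 2^2) with prior 0.4
print(q(1.0, mu=0.0, sigma=1.0, prior=0.6))  # bigger score => x looks more like C
print(q(1.0, mu=3.0, sigma=2.0, prior=0.4))
```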


Quadratic Discriminant Analysis (QDA)

Suppose we have only two classes C and D. Then we have the Bayes Decision Rule

r*(x) = C if Q_C(x) - Q_D(x) > 0, and D otherwise.

The prediction function Q_C(x) - Q_D(x) is quadratic in x, which we've just calculated.
The Bayes Decision Boundary is where

Q_C(x) - Q_D(x) = 0
In 1-D, the Bayes Decision Boundary may consist of one or two points (the roots of a quadratic equation).
In d dimensions, the Bayes Decision Boundary is a quadric (a surface defined by a quadratic equation).

Note that we're dealing with those two quadratic functions instead of the original posteriors P(Y=C|X=x) and P(Y=D|X=x). Therefore, to recover the posterior probabilities in the 2-class case, we write them in terms of Q_C(x) and Q_D(x) using Bayes' Theorem:

P(Y=C|X=x) = f_C(x) π_C / (f_C(x) π_C + f_D(x) π_D) = 1 / (1 + e^-(Q_C(x) - Q_D(x))) = s(Q_C(x) - Q_D(x))
s(·) is the logistic function, aka the sigmoid function:

s(γ) = 1 / (1 + e^-γ)

This is what it looks like: an S-shaped curve that rises from 0 to 1.
  • s(0) = 0.5 => Bayes Decision Boundary
  • s(infinity) = 1 => Class C
  • s(-infinity) = 0 => Class D
Since s(·) always lies between 0 and 1, it's perfect for modeling probabilities.

So again, we can think about this as sort of a 2-step process. First, we find these quadratic functions. Then we take the difference between them and run that difference through the sigmoid function to get a probability.
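
As a rough sketch of that 2-step process in Python (reusing the hypothetical q() function from the earlier sketch, with made-up parameter values):

```python
import math

def sigmoid(g):
    # s(gamma) = 1 / (1 + e^(-gamma))
    return 1.0 / (1.0 + math.exp(-g))

def qda_posterior(x, mu_c, sigma_c, prior_c, mu_d, sigma_d, prior_d):
    """P(Y=C | X=x) = s(Q_C(x) - Q_D(x)) in the 2-class case."""
    qc = q(x, mu_c, sigma_c, prior_c)  # step 1: quadratic score for class C
    qd = q(x, mu_d, sigma_d, prior_d)  # step 1: quadratic score for class D
    return sigmoid(qc - qd)            # step 2: squash the difference into (0, 1)

print(qda_posterior(1.0, mu_c=0.0, sigma_c=1.0, prior_c=0.6,
                    mu_d=3.0, sigma_d=2.0, prior_d=0.4))  # around 0.75 here
```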

Linear Discriminant Analysis (LDA)

If we make this fundamental assumption: all the Gaussians have the same variance σ²,

we'll have

Q_C(x) - Q_D(x) = (μ_C - μ_D) x / σ² - (μ_C² - μ_D²) / (2σ²) + ln π_C - ln π_D

So what happened here is we take the expressions Q_C(x) and Q_D(x), subtract one from the other, and use the fact that σ_C and σ_D are now equal. That makes certain terms drop out of the equation, one of them being the x² term. So this expression is now linear in x instead of quadratic.

This is nice for two reasons. One is simplicity: it's a little faster to compute than the quadratic functions. The other, more important reason is that if QDA is overfitting, switching to a simpler model makes it less likely to overfit, because a linear model has fewer parameters (but is also less expressive and less flexible).

So what we now have is a linear classifier! Choose the class C that maximizes the linear discriminant function

μ_C x / σ² - μ_C² / (2σ²) + ln π_C

(the terms that are the same for every class have been dropped).
Here we are just comparing the expression for C with the expression for D. Therefore, if we wanna compare a million different classes, we just need to compute one scalar value for each class and figure out which one is biggest.
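
Here's a hedged sketch of that in Python, using the 1-D linear score μ_C x / σ² - μ_C² / (2σ²) + ln π_C from above (the class parameters are invented placeholders):

```python
import math

def lda_score(x, mu, sigma, prior):
    # Linear discriminant for one class in 1-D
    return mu * x / sigma ** 2 - mu ** 2 / (2 * sigma ** 2) + math.log(prior)

def lda_predict(x, classes, sigma):
    """classes maps label -> (mu, prior); every class shares the same sigma."""
    return max(classes, key=lambda c: lda_score(x, classes[c][0], sigma, classes[c][1]))

# One scalar per class, then take the biggest
print(lda_predict(1.0, {"C": (0.0, 0.6), "D": (3.0, 0.4)}, sigma=1.5))
```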

In the 2-class case, the decision boundary is where ωx + α = 0 (with ω and α read off from the linear expression above).

Another nice simplification happens if the two priors are equal. So suppose you have two classes, 50% of your samples are in one class, and the other 50% belong to the other class; then the equation simplifies as follows:

If P(Y=C) = P(Y=D) = 0.5, the ln π terms cancel, and we'll have

choose C when (μ_C - μ_D) x / σ² - (μ_C² - μ_D²) / (2σ²) > 0,

which is the same as choosing whichever class has its mean closer to x (i.e. |x - μ_C| < |x - μ_D|).
This is the centroid method.
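
A tiny sketch of the centroid method under those assumptions (equal priors, shared variance): just pick the class whose mean is closest to x.

```python
def centroid_predict(x, means):
    """means maps class label -> class mean (1-D); pick the nearest mean."""
    return min(means, key=lambda c: abs(x - means[c]))

print(centroid_predict(1.0, {"C": 0.0, "D": 3.0}))  # 1.0 is closer to 0.0, so "C"
```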

It's also worth pointing out that, just like in QDA, we can recover the Bayes posterior. Why do we want that? Because sometimes we don't just want the predicted class, but also the probability of being in that class. So we take that logistic function I showed you before and apply it to our predictor function ωx + α, and now we have a probability!

Maximum Likelihood Estimation of Parameters

Now the remaining question is: how do we find the mean and standard deviation of these distributions based on the data we collect? We're gonna use something called maximum likelihood estimation.

Suppose we flip a biased coin 10 times, where heads comes up with probability p and tails with probability 1-p, and we get 8 heads and 2 tails. What is the most likely value of p?

Using the Binomial Distribution X ~ B(n, p), the probability of getting exactly x heads in n flips is

P(X = x) = C(n, x) p^x (1-p)^(n-x)
Back to our example, with n = 10 and x = 8:

P(X = 8) = C(10, 8) p^8 (1-p)^2 = 45 p^8 (1-p)^2
We define this as L(p), and call it the likelihood function.

So what we basically just did is we wrote the probability of 8 heads in 10 flips as a function L(p) of distribution parameter(s); this is the likelihood function.

Maximum Likelihood Estimation (MLE):
A method of estimating the parameters of a statistical model by picking the params that maximize the likelihood function. (The idea is: which parameter value would give us the maximum possible chance of getting 8 heads?)

So back to our example, we're trying to find the p that maximizes L(p).
By setting the derivative to 0,

dL/dp = 45 (8 p^7 (1-p)^2 - 2 p^8 (1-p)) = 45 p^7 (1-p)(8 - 10p) = 0,

we get p = 0, p = 1, or p = 0.8, and the maximum is at p̂ = 8/10 = 0.8.
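If you'd rather not do the calculus, a quick numerical sanity check (just a brute-force grid search over p) lands on the same answer:

```python
# L(p) = 45 * p^8 * (1-p)^2; try a fine grid of p values and keep the maximizer
best_p = max((i / 1000 for i in range(1001)),
             key=lambda p: 45 * p ** 8 * (1 - p) ** 2)
print(best_p)  # 0.8
```
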
Likelihood of a Gaussian

Moving on to Gaussian distributions, the likelihood of drawing n independent sample points x_1, ..., x_n from N(μ, σ²) is

L(μ, σ; x_1, ..., x_n) = f(x_1) f(x_2) ... f(x_n),

where f is the Gaussian PDF with mean μ and standard deviation σ.
The log-likelihood ℓ(·) is the natural logarithm of the likelihood L(·), and maximizing the likelihood is equivalent to maximizing the log-likelihood, and vice versa. Therefore, to make our life easier, we're gonna take the log-likelihood here:

ℓ(μ, σ) = ln L(μ, σ) = Σ_i ln f(x_i) = -Σ_i (x_i - μ)² / (2σ²) - n ln σ - (n/2) ln(2π)
And of course, what we're gonna do next is set the derivatives to 0 to find our precious, precious mean and standard deviation.

∇_μ ℓ = Σ_i (x_i - μ) / σ² = 0  =>  μ̂ = (1/n) Σ_i x_i
Therefore, the 'mean' we have is the mean of all the samples we got.
(Note that we use the gradient ∇, that triangle thingy, because the mean may be a vector.)

∂ℓ/∂σ = Σ_i (x_i - μ)² / σ³ - n/σ = 0  =>  σ² = (1/n) Σ_i (x_i - μ)²
And since we don't exactly know what the true mean is, we're just gonna substitute it with the μ̂ we just calculated:

σ̂² = (1/n) Σ_i (x_i - μ̂)²
Therefore, the 'variance' we have (just take its square root and we get the standard deviation) is the variance of all the sample points.

In short, we use mean & variance of samples in class C to estimate mean & variance of Gaussian for class C.
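
Here's a minimal sketch of those two formulas in Python, assuming the samples for one class sit in a plain list (note the division by n rather than n-1, since that's what the MLE gives):

```python
def gaussian_mle(samples):
    """MLE for a 1-D Gaussian: sample mean and (biased) sample variance."""
    n = len(samples)
    mu_hat = sum(samples) / n
    var_hat = sum((x - mu_hat) ** 2 for x in samples) / n
    return mu_hat, var_hat

print(gaussian_mle([2.1, 1.9, 2.4, 1.6, 2.0]))
```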

Now that we have every piece of the puzzle, we can go back to LDA and QDA.

QDA

In the case of QDA, we can simply estimate the mean μ̂_C and variance σ̂_C² of each class using the previous results, and estimate the prior of each class C as the fraction of samples that belong to it:

π̂_C = n_C / n, where n_C is the number of sample points in class C and n is the total number of sample points.
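Putting the pieces together for QDA, a sketch of the estimation step, assuming the training data is a list of (x, label) pairs and reusing the gaussian_mle() sketch from above:

```python
from collections import defaultdict

def fit_qda(data):
    """data: list of (x, label) pairs -> {label: (mu_hat, var_hat, prior_hat)}."""
    by_class = defaultdict(list)
    for x, label in data:
        by_class[label].append(x)
    n = len(data)
    params = {}
    for label, xs in by_class.items():
        mu_hat, var_hat = gaussian_mle(xs)              # per-class mean and variance
        params[label] = (mu_hat, var_hat, len(xs) / n)  # prior = fraction of samples
    return params

print(fit_qda([(1.0, "C"), (0.8, "C"), (1.3, "C"), (3.1, "D"), (2.9, "D")]))
```
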
LDA

However, in the case of LDA, things are a little more complicated, since we assume every class shares the same standard deviation (and variance, of course). The priors and means are estimated exactly as in QDA, but the variance is pooled over all classes:

σ̂² = (1/n) Σ_C Σ_{i in class C} (x_i - μ̂_C)²
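And for LDA, a sketch of the pooled estimate under the same assumed data format: per-class means and priors as before, but one shared variance built from each point's deviation around its own class mean.

```python
from collections import defaultdict

def fit_lda(data):
    """data: list of (x, label) pairs -> ({label: (mu_hat, prior_hat)}, pooled_variance)."""
    by_class = defaultdict(list)
    for x, label in data:
        by_class[label].append(x)
    n = len(data)
    params, pooled = {}, 0.0
    for label, xs in by_class.items():
        mu_hat = sum(xs) / len(xs)
        params[label] = (mu_hat, len(xs) / n)
        pooled += sum((x - mu_hat) ** 2 for x in xs)  # deviations around this class's mean
    return params, pooled / n

print(fit_lda([(1.0, "C"), (0.8, "C"), (1.3, "C"), (3.1, "D"), (2.9, "D")]))
```
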
Small Summary

Let's end this blag by wrapping up everything:

  • We assume every class comes from a Gaussian distribution
  • Our target is to compare posteriors for different classes
  • We do this by using QDA and LDA, and the sigmoid function
  • In the meantime, a Gaussian distribution needs a mean and standard deviation, and we find them using maximum likelihood estimation

What's next? Anisotropic normal distributions, which means Gaussians with full covariance matrices. Crap...

Kev