- Basics of probability
- Bernoulli distribution

I will explain what likelihood is with a simple example. Suppose you have a coin that is not necessarily fair, i.e., it is not certain if it flips equally heads and tails. The result of $n$ flips is defined by a vector $\mathbf{x} = [x_1,...,x_n]$, where $x_i$ can be heads ($x_i = 1$) or tails ($x_i = 0$). We would like to find the probability of seeing this result, given that the coin flips heads (the remaining probability is to flip tails) a fraction $\theta$ of the times. The Bernoulli distribution represents each flip of the coin:
$$f(x_i | \theta) = \theta^{x_i} (1 - \theta)^{1-x_i}$$
As you can see, this function depends on our unknown parameter $\theta$ and our result. The join result of several flips is defined by the join distribution:
$$f(\mathbf{x} | \theta) = f(x_1,...,x_n | \theta). $$
Since we know that a flip of a coin does not depend on any other flip, we say that **each observation $x_i$ is independent and identically distributed (i.i.d.) and the joint distribution simplifies to a product of distributions**:
$$f(x_1,...,x_n | \theta) = \prod_{i=1}^n f(x_i | \theta) = \theta^{x_1} (1 - \theta)^{1-x_1}\cdot...\cdot\theta^{x_n} (1 - \theta)^{1-x_n} = \theta^{\#H} (1 - \theta)^{n-\#H},$$
where $\#H$ is the number of flips that resulted in heads. **The likelihood
$$L(\theta|\mathbf{x}) = \prod_{i=1}^n f(x_i | \theta)$$
is just the probability of seeing our observations given a certain value for the unknown parameters**. Other common ways to represent the likelihood is
$$L(\theta|\mathbf{x}) = f(\mathbf{x} | \theta) = f_{\theta}(\mathbf{x}) = p_{\theta}(\mathbf{x}).$$
But **the likelihood function is not a probability**. To show you this more clearly, let's calculate the likelihood for some coins:
$$L(\theta = 0.5 | HH ) = f(HH | \theta = 0.5) = 0.5 \cdot 0.5 = 0.25$$
$$L(\theta = 0 | HH ) = f(HH | \theta = 0) = 0$$
$$L(\theta = 1 | HH ) = f(HH | \theta = 1) = 1$$
By this simple example, you can imagine that the integral of the likelihoods will not sum to 1, which is impossible for a distribution.

Just to make it clear:

- Probability allows to predict unknown outcomes based on known parameters.
- Likelihood allows to predict unknown parameters based on known outcomes.

Because many probability functions have exponentials, it is sometimes easier to use the log of the likelihood. Furthermore, imagine the example above $$L(\theta | \mathbf{x} ) = \prod_{i=1}^n f(x_i | \theta).$$ Then $$\log L(\theta | \mathbf{x} ) = \log \prod_{i=1}^n f(x_i | \theta) = \sum_{i=1}^n \log f(x_i | \theta)$$ because the log of products is equal to the sum of logs. If the probability function is from the exponential family, the log cancels with the exponential and the expression becomes simpler.

The name makes it really obvious: **Maximum Likelihood Estimation is about finding the value for the parameters that maximizes the likelihood function**. If we have to choose some value for the parameter, our best guess is the one that best describes our results. Since log is a monotonically increasing function, **maximizing the log-likelihood is the same as maximizing the likelihood**.

$$\log L(\theta|\mathbf{x}) = \sum_{i=1}^n \log f(x_i | \theta)$$ Dividing the log-likelihood by $n$ does not change our maximization problem, so we have $$\hat{l(\theta|\mathbf{x})} = \frac{\sum_{i=1}^n \log L(\theta | x_i)}{n} = E[\log f(x_i | \theta)]$$ where $E[\cdot]$ is the expected value over all observations. Then, the maximum likelihood problem can be defined as: $$\theta_{MLE} = \underset{\theta}{\textrm{argmax }} \hat{l}(\theta|\mathbf{x}) = \underset{\theta}{\textrm{argmax }} \log L(\theta|\mathbf{x}) = \underset{\theta}{\textrm{argmax }} L(\theta|\mathbf{x})$$

Let's apply the Maximum Likelihood Estimation to our previous example: $$\log L(\theta|\mathbf{x}) = \log \prod_{i=1}^n f(x_i | \theta) = \log \theta^{\#H} (1 - \theta)^{n-\#H}.$$ To find the maximum likelihood, we derive it and find its root: $$ \frac{\partial \log (\theta^{\#H} (1 - \theta)^{n-\#H}) }{\partial \theta} = \frac{ \partial \log (\theta^{\#H}) + \partial \log ((1 - \theta)^{n-\#H}) }{\partial \theta} = \frac{ \#H\partial\log \theta + (n-\#H)\partial \log(1 - \theta) }{\partial \theta} = 0 $$ Recalling that $\partial log (x) / \partial x = 1/x$ $$ \#H\frac{1}{\theta} + (n-\#H)\frac{ \partial \log (1-\theta)}{1-\theta}\frac{1 - \theta}{\partial \theta} = 0$$ $$ \#H\frac{1}{\theta} + (n-\#H)\frac{1}{1-\theta}(-1) = 0$$ $n-\#H = \#T$ is equal to the number of tails $$ \#H\frac{1}{\theta} - \#T\frac{1}{1-\theta} = 0$$ $$ \theta = \frac{\#H}{\#T + \#H} = \frac{\#H}{n}$$ Conclusion: the best value for $\theta$ (according to MLE) is defined by the fraction of flips that resulted in heads. That makes sense!

As you might have noticed, there is no a priori knowledge about the coin. If we have a tendency to believe that the coin is fair, we cannot put that belief into the model. That is when Maximum a Posteriory Probability (MAP) comes in.