---
title: "Maximum Likelihood"
output:
  html_document: default
  html_notebook: default
---
Suppose that we want to predict the US per-capita *cheese consumption* in the US. We have a data set with cheese consumption $Y$ as the response. We also have two features, or predictors, $X_1$ and $X_2$. The goal is to compute a *linear regression* model with parameters $\beta_0, \beta_1, \beta_2$ such that
\[ Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2~.  \]
The data set is as follows:
```{r}
cheese <- data.frame(Y=c(29.2,30.2,30.5,30.55,31.1,31.6,32.2,33,32.8,32.9), 
                     X1=c(390,450,500,490,620,600,690,790,820,780),
                     X2=c(9.5,9.7,9.7,9.6,10,10.3,10.4,11,10.5,10.2))
summary(cheese)
```
Feature $X_1$ is is much larger than $X_2$. The scale does not affect the predictions of linear regression. However, to get a better sense of the contribution of each feature to the prediction, it may be better to have them be the same scale. So lets rescale $X_1$.
```{r}
cheese$X1 <- cheese$X1 / 50
```
Lets check the correlations between the labels and features first. First, the basic R command.
```{r}
pairs(cheese)
```
But we can do better using the corrplot package.
```{r}
library(corrplot) #install.packages("corrplot")
corrplot.mixed(cor(cheese),upper="ellipse")
```
# Linear regression fit
Lets just fit a linear regression model.
```{r}
cheese.lm <- lm(Y~X1+X2, data=cheese)
summary(cheese.lm)
```
The prediction equation is:
\[ Y \approx 23.08 + 0.36 X_1 + 0.38 X_2 \]
That means that both $X_1$ and $X_2$ contribute about equally to the prediction. That may be OK in general, but is this really a sensible model? It depends on the source of feature values.
# The Data Source
The data is for years 2000-2010 and comes from the site [http://www.tylervigen.com/spurious-correlations](http://www.tylervigen.com/spurious-correlations) which mines for spurious correlations in US government data. The features are:
- $Y$ is the per-capita cheese consumption
- $X_1$ is the total number of people (in the US) who died by becoming tangled in their bedsheets
- $X_2$ is the per-capita *mozarella* cheese consumption
Clearly, one would expect the feature $X_2$ to be a sensible predictor for $Y$. Using $X_1$ however seem to make little sense. While $X_1$ seems to be a good predictor of $Y$ for 2000-2010, this is purely by chance. If we build a model like this, it is unlikely to perform well in the future when there two variables become uncorrelated.
# Bayesian Approach
The problem with linear regression is that it purely maximizes the likelihood. Likelihood is oblivious to the parameters of the model. It ignores that some of the values (such as deaths in bedsheets predicting cheese consumption) are physically implausible. We could, of course, simply remove unlikley predictor but reality is rarely as clear-cut as this small contrived example. There are often features that seem unlikely but are not impossible.
The alternative to maximum likelihood is to compute that maximum a posteriori probability. That is the most likely value of $\beta$ given data. Recall that likelihood measures the probability of data given $\beta$. The posterior distribution over $\beta$ can be easily derived using the prior over $\beta$ and the Bayes theorem.
It turns out that when we assume a Normal (zero mean) prior on the parameter $\beta_0,\beta_1,\beta_2$, the maximum a posteriori problem simply reduces to *ridge regression*. We can now tune our belief by changing the prior. A prior distribution with *smaller* variance indicate *smaller probability* of being an important predictor. One way to control the variance is to scale the features (glmnet also offers a separate functionality). A scale of the feature influences the penalty but not the RSS term. A smaller feature is less likely to be used in the prediction. 
\[ Z_1 = \phi \, X_1  \]
Denote the coefficient for $Z_1$ as $\theta_1$. The coefficient for $X_1$ is recovered as $\beta_1 = \theta_1\cdot\phi$. Note that lm.ridge will simply standardize features before running. 
First, run ridge regression:
```{r}
library(glmnet) #install.packages("glmnet")
phi <- 0.1
cheese$Z1 <- phi * cheese$X1
cheese.lm.r <- glmnet(as.matrix(cheese[c(3,4)]), cheese$Y,  family="gaussian", standardize = FALSE, alpha=0, lambda = 0.3)
c <- coef(cheese.lm.r); c[3] <- c[3] * phi
print(c)
```
Now, the same thing with Lasso
```{r}
library(glmnet) #install.packages("glmnet")
phi <- 0.1
cheese$Z1 <- phi * cheese$X1
cheese.lm.r <- glmnet(as.matrix(cheese[c(3,4)]), cheese$Y,  family="gaussian", standardize = FALSE, alpha=1, lambda = 0.3)
c <- coef(cheese.lm.r); c[3] <- c[3] * phi
print(c)
```
Notice that the lasso coefficient for $X_1$ is 0 while it is postive for ridge regression.