--- title: "Matrix algebra and linear projections" output: pdf_document: default html_notebook: default html_document: default --- # Linear Regression In the first part, we look at how multiple linear regression can be performed directly using linear algebra using a *single* line of code. Lets start with something very simple: linear regression with one feature. Recall that the goal of linear regression is to find a function: $$ f(x) = \beta_0 + \beta_1 \cdot x_1 $$ The notation can be simplified if we pretend that the intercept is simply a parameter for a feature $x_0$ which is always equal to $1$. $$ f(x) = \beta_0 \cdot 1 + \beta_1 \cdot x_1 = \beta_0 \cdot x_0 + \beta_1 \cdot x_1 $$ Linear regression at its simplest is when there is one feature and with two data points: $x_1, x_2$ with targets $y_1,y_2$. Finding the line that goes through these two values simply reduces to solving a system of linear equations: $$ \begin{aligned} y_1 &= f(x_1) = \beta_0 \cdot x_{1,0} + \beta_1 \cdot x_{1,1} \\ y_2 &= f(x_2) = \beta_0 \cdot x_{2,0} + \beta_1 \cdot x_{2,1} \end{aligned} $$ We also can write this system of linear equations as matrix multiplication: $$ \underbrace{\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}}_y = \underbrace{\begin{bmatrix} x_{1,0} & x_{1,1} \\ x_{2,0} & x_{2,1} \end{bmatrix}}_X \underbrace{\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}}_\beta $$ The matrix $X$ here is called the *design matrix* and the general linear system is: $$ y = X \beta $$ If there are $K$ features and $N$ data points then the dimensions of $X$ are $N \times (K+1)$. Computing the solution to our linear system with two data points is as easy as computing the inverse matrix $X^{-1}$ to $X$ and multiplying both sides by this inverse matrix: $$ X^{-1} y = X^{-1} X \beta = \beta $$ Consider now a concrete example with two data points. The value of the feature is: $x_{1,1} = 2$ and $x_{2,1} = 5$. The target is $y_1 = 7$ and $y_2 = 3$. The design matrix $X$ is then (don't forget the intercept feature): ```{r} X <- rbind(c(1,2), c(1,5)) print(X) ``` The target vector $y$ is: ```{r} y <- c(7,3) print(y) ``` The two data points plotted look as follows: ```{r} plot(X[,2],y); grid() ``` We can now invert the matrix and get our solution to the linear regression problem! ```{r} Xinv <- solve(X) print(Xinv) ``` Lets double check that this is indeed a proper matrix inverse. ```{r} Xinv %*% X ``` ```{r} X %*% Xinv ``` Yes the inverse works! Computing the coefficients is now super easy: ```{r} beta <- Xinv %*% y print(beta) ``` Lets make sure that this line is in fact correct and goes through both of our data points. ```{r} plot(X[,2],y); grid() abline(beta[1], beta[2]) ``` Does the built-in linear regression give us the same result? ```{r} lm(y ~ X[,2])$coeff c(beta) ``` Yes! The results are the same. **Questions**: 1. How can we compute the the parameters $\beta$ when there is only a single data point? 2. How about when there are more data points than features? OK, lets try 4 data points. ```{r} X <- rbind(c(1,2), c(1,5), c(1,3), c(1,2)) y <- c(7,3,5,6) plot(X[,2],y); grid() ``` It does not look like one can find a line through these points. The system of linear equations will not have a solution. The error will happen when we try to invert the new design matrix. ```{r} # > solve(X) # Fails with: Error in solve.default(X) : 'a' (4 x 2) must be square ``` The answer is to find the line that will minimize the RSS. 
## Minimizing RSS

Recall that RSS is the residual sum of squares:
$$ \operatorname{RSS} = \sum_{i=1}^n (y_i - f(x_i))^2$$
It would be nice to be able to write it in the language of linear algebra. There is a tool for this: the squared $L_2$-norm, also known as the squared *Euclidean distance*:
$$ \| z \|_2^2 = \sum_{i=1}^n z_i^2 = z^T z$$
Using linear algebra, the RSS can be written much more compactly.
$$ \operatorname{RSS} = \| y - X \beta \|_2^2 = (y-X\beta)^T (y - X \beta) = y^T y - 2 y^T X \beta + \beta^T X^T X \beta$$
Linear regression chooses $\beta$ to minimize the RSS, so we have to solve the following optimization problem.
$$ \min_\beta \| y - X \beta \|_2^2 $$
Luckily, this is a convex minimization problem. All we have to do is look for a value of $\beta$ at which the gradient is zero.
$$ \begin{aligned} \nabla_\beta \;\| y - X \beta \|_2^2 &= 0 \\ \nabla_\beta \; \Bigl( y^T y - 2 y^T X \beta + \beta^T X^T X \beta \Bigr) &= 0 \\ \nabla_\beta \; \Bigl( - 2 y^T X \beta + \beta^T X^T X \beta \Bigr) &= 0 \\ - 2 X^T y + 2 X^T X \beta &= 0 \\ X^T X \beta &= X^T y \\ \beta &= (X^T X)^{-1} X^T y \end{aligned} $$
So what if $X$ is not square? It is not a problem. If the dimensions of $X$ are $N \times (K+1)$, then the dimensions of $X^T X$ are $(K+1)\times(K+1)$, which is always square.

**Question**: Can any square matrix be inverted?

The implementation of linear regression is now really just a *single line*!
```{r}
beta <- solve(t(X) %*% X) %*% t(X) %*% y
print(beta)
```
To make sure that everything is OK, we should compare our implementation with the built-in linear regression.
```{r}
beta_in <- lm(y ~ X[,2])$coeff
print(beta_in)
```
And finally, the plot.
```{r}
plot(X[,2], y); grid()
abline(beta[1], beta[2])
```

## Computational Issues

The first rule of numerical linear algebra is: **never compute a matrix inverse**. Computing a matrix inverse is:

1. *Slow*: There are faster ways of solving systems of linear equations.
2. *Unstable*: Linear algebra implementations in finite precision can lead to disastrously large errors for ill-conditioned matrices.

Luckily, there are many other ways of computing
$$ \beta = (X^T X)^{-1} X^T y $$
```{r}
solve(t(X) %*% X) %*% t(X) %*% y
```
The most common alternatives are:

**1. Gaussian elimination**: Directly solve the system of linear equations. This is related to how a matrix inverse is often computed, but is faster by about a factor of $K$ and much more numerically stable.
```{r}
solve(t(X) %*% X, t(X) %*% y)
```

**2. Cholesky decomposition (LDL)**: The idea is that any *positive-definite* symmetric matrix can be factored as $A = U^T U$, where $U$ is an upper triangular matrix. Triangular matrices are very easy to invert and the procedure is numerically stable. The matrix $X^T X$ is symmetric and, as long as the columns of $X$ are linearly independent, positive definite. Compute the Cholesky decomposition of $X^T X$. The symmetric matrix is:
```{r}
t(X) %*% X
```
The Cholesky decomposition is:
```{r}
U <- chol(t(X) %*% X)
U
```
```{r}
t(U) %*% U
```
The linear regression can now be expressed as:
```{r}
chol2inv(U) %*% t(X) %*% y
```

**3. QR decomposition**: Any matrix can be decomposed as $A = Q R$ where $Q$ is an *orthogonal matrix* and $R$ is upper triangular. An orthogonal matrix satisfies $Q^T Q = I$, so the transpose of $Q$ is also its inverse. QR is also stable and fast. We do not even need to compute $X^T X$; instead we compute the QR decomposition of $X$. When $X = QR$ then
$$ X^T X = R^T Q^T Q R = R^T R$$
```{r}
qr.R(qr(X))
```
Notice that $R$ is the same as $U$ from the Cholesky decomposition (up to the signs of its rows) and is easier to compute.
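As a quick sanity check (a small sketch added here, not in the original notebook), we can confirm that $R^T R$ reproduces $X^T X$ exactly as $U^T U$ does, regardless of the row signs:
```{r}
# R^T R equals X^T X just like U^T U does (any sign flips cancel out)
R <- qr.R(qr(X))
t(R) %*% R
t(X) %*% X
```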
```{r}
R <- qr.R(qr(X))
chol2inv(R) %*% t(X) %*% y
```

## Column view of linear regression

Another way to view linear regression is as computing a linear combination of the columns of $X$. Let $X_i$ be the vector that represents feature $i$ across all samples. We are then looking for the linear combination of the feature vectors that minimizes the RSS with respect to the target:
$$ \min_\beta \| y - X_1 \beta_1 - \ldots - X_K \beta_K \|_2^2$$
Before discussing some benefits of this view, let's visualize the simple example from before.
```{r}
Xs <- rbind(c(1,2),
            c(1,5))
ys <- c(7,3)
```
The standard row view looks at each row as a data point. This is the plot (ignoring the intercept feature). The goal is again to connect the two points using a line.
```{r}
plot(Xs[,2], ys); grid()
```
The column view looks at each feature as a *vector*. Now we include the intercept feature and get three vectors, including $y$. The goal is to linearly combine the dashed vectors to get the solid one.
```{r}
plot(NULL, xlab="", ylab="", xlim=c(0,9), ylim=c(0,9)); grid()
arrows(0, 0, Xs[1,1], Xs[2,1], lty=2)
arrows(0, 0, Xs[1,2], Xs[2,2], lty=2)
arrows(0, 0, ys[1], ys[2])
```
This view can help to see, for example, that adding a feature that is linearly dependent on the others will not reduce the RSS. As an example, consider our previous design matrix $X$ and add another feature.
```{r}
Y <- cbind(X, c(3,6,5,8))
Y
```
Compute the coefficients of linear regression (make sure to remove the intercept, since the first column of `Y` already serves as one):
```{r}
beta <- lm(y ~ Y - 1)$coeff
beta
```
Let's verify that the numbers really do add up. First, the matrix of columns scaled by their coefficients is:
```{r}
sapply(1:3, function(i) {beta[i] * Y[,i]})
```
```{r}
rowSums(sapply(1:3, function(i) {beta[i] * Y[,i]}))
y
```
Does the RSS decrease when we add a feature that is a linear combination of the others?
```{r}
Z <- cbind(Y, Y[,2] + Y[,3])
Z
```
Nope, it does not decrease the error at all.
```{r}
summary(lm(y ~ Z - 1))$r.squared
summary(lm(y ~ Y - 1))$r.squared
```

# PCA

PCA is all about Normal distributions and *covariance matrices*. First, let's look at some examples of points generated from a normal distribution with different covariance matrices in 2 dimensions. That means that there are two features. To keep things simple, we will just assume that the mean is 0.
```{r}
mu <- c(0,0)
```
The simplest covariance matrix is just an identity matrix.
```{r}
Sigma <- rbind(c(1,0),
               c(0,1))
Sigma
```
Let's sample from this distribution. The result will look very much like a design matrix, with rows corresponding to data points and columns corresponding to features.
```{r}
library(MASS)
samp <- mvrnorm(10, mu = mu, Sigma = Sigma)
samp
```
```{r}
samp <- mvrnorm(3000, mu = mu, Sigma = Sigma)
plot(samp, type="p", asp=1); grid()
```
What happens when we choose a different matrix?
```{r}
Sigma <- rbind(c(10,0),
               c(0,1))
Sigma
```
```{r}
Sigma <- rbind(c(10,0),
               c(0,1))
samp <- mvrnorm(3000, mu = mu, Sigma = Sigma)
plot(samp, type="p", asp=1); grid()
```
```{r}
Sigma <- rbind(c(1,0),
               c(0,10))
Sigma
```
```{r}
samp <- mvrnorm(3000, mu = mu, Sigma = Sigma)
plot(samp, type="p", asp=1); grid()
```
Computing PCA on this data is very simple -- the first principal component is the axis with the highest variance, and there are only two axes to choose from.
```{r}
prcomp(samp)
```
But what if the data is rotated?
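One simple way to produce such rotated data (a sketch with an arbitrary 45-degree angle, not part of the original notebook) is to multiply the axis-aligned sample by a rotation matrix; the rest of the section instead constructs the rotated covariance matrix directly.
```{r}
# Rotate the axis-aligned sample by 45 degrees
theta <- pi / 4
Rot <- rbind(c(cos(theta), -sin(theta)),
             c(sin(theta),  cos(theta)))
rotated <- samp %*% t(Rot)
plot(rotated, type="p", asp=1); grid()
```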
```{r}
Sigma <- rbind(c(3,2),
               c(2,10))
Sigma
```
```{r}
samp <- mvrnorm(3000, mu = mu, Sigma = Sigma)
plot(samp, type="p", asp=1); grid()
```
PCA can recover this rotation:
```{r}
prcomp(samp)
```
Let's check this visually:
```{r}
prc <- prcomp(samp)
plot(samp, type="p", asp=1); grid()
abline(0, prc$rotation[2,1]/prc$rotation[1,1])
abline(0, prc$rotation[2,2]/prc$rotation[1,2])
```
How would we construct such a rotated covariance matrix ourselves? Let's say we want it rotated by an angle of 45 degrees, so let's make the first principal component be:
$$ v_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix} $$

**Question**: What is the second principal component then?

Let's put the (normalized) components in a single matrix:
$$ V = \frac{1}{\sqrt{2}} \begin{bmatrix} | & | \\ v_1 & v_2 \\ | & | \end{bmatrix} $$
```{r}
v1 <- sqrt(1/2) * c(1,1)
v2 <- sqrt(1/2) * c(1,-1)
V <- cbind(v1, v2)
V
```
We would like the vector $v_1$ to behave like the first unit vector $[1, 0]^T$. This is what the matrix inverse is for (and since $V$ is orthogonal, $V^{-1} = V^T$):
$$ \begin{bmatrix} 1 \\ 0 \end{bmatrix} = V^{-1} v_1$$
```{r}
t(V) %*% v1
t(V) %*% v2
```
Assume the unrotated covariance matrix:
```{r}
Sigma <- rbind(c(10,0),
               c(0,1))
Sigma
```
The plot looks like this:
```{r}
samp <- mvrnorm(3000, mu = mu, Sigma = Sigma)
prc <- prcomp(samp)
plot(samp, type="p", asp=1); grid()
abline(0, prc$rotation[2,1]/prc$rotation[1,1])
abline(0, prc$rotation[2,2]/prc$rotation[1,2])
```
We are now ready to construct the rotated covariance matrix:
```{r}
newSigma <- V %*% Sigma %*% t(V)
newSigma
```
Let's plot it:
```{r}
samp <- mvrnorm(3000, mu = mu, Sigma = newSigma)
prc <- prcomp(samp)
plot(samp, type="p", asp=1); grid()
abline(0, prc$rotation[2,1]/prc$rotation[1,1])
abline(0, prc$rotation[2,2]/prc$rotation[1,2])
```
How does PCA recover the rotation? In two easy steps:

1. Compute the *covariance matrix* from the data.
2. Compute the eigenvectors of that matrix. We are looking for a linear transformation of the features that gives us a diagonal matrix.

Let's start with the second step. If we have our covariance matrix, we can compute the eigenvalues and eigenvectors, which satisfy:
$$ A x = \lambda x $$
The eigenvectors can be computed as follows:
```{r}
eigen(Sigma)
```
Think of the eigenvectors as directions in which the matrix behaves as if it were diagonal. A very nice property of symmetric matrices (such as covariance matrices) is that their eigenvectors are *orthogonal*, so the matrix of eigenvectors can be inverted just by transposing it.

Now we can diagonalize the matrix using the eigenvectors:
$$ V^{-1} \Sigma V = D$$
where $D$ is a diagonal matrix of *eigenvalues* and $V$ is the matrix of *eigenvectors*. Because the eigenvectors of a symmetric matrix are *orthogonal*, we get:
$$ V^T \Sigma V = D$$

**Question**: Show how this is the same thing as when we constructed the covariance matrix before.
```{r}
Sigma
```
Let's see:
```{r}
E <- eigen(Sigma)
t(E$vectors) %*% Sigma %*% E$vectors
```
What about our rotated `newSigma`?
```{r}
newSigma
```
```{r}
E <- eigen(newSigma)
E$vectors
```
```{r}
t(E$vectors) %*% newSigma %*% E$vectors
```
Nice, we were able to recover the rotation.

**Question**: How can we compute the covariance matrix from data?

**Homework**: Show that the following estimate is correct (recall that we assumed the mean is 0):
$$ \Sigma = \frac{1}{n} X^T X $$
Let's check numerically that it works.
```{r}
(t(samp) %*% samp) / nrow(samp)
```
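For comparison (a small sketch, not in the original notebook), R's built-in `cov` function gives essentially the same matrix here; it centers each column and divides by $n-1$ instead of $n$, which makes almost no difference for 3000 zero-mean samples.
```{r}
# Built-in sample covariance (centers columns, divides by n - 1)
cov(samp)
newSigma
```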