In [1]:
from lec_utils import *

Discussion Slides: Multiple Linear Regression

Agenda 📆¶

  • The design matrix, observation vector, and parameter vector.
  • "Solving" the normal equations.

Example: Lifespan¶

  • Consider the DataFrame lifespan, which contains the 'lifespan', average 'hours_exercised' per day, and average 'packs_cigs' (packs of cigarettes smoked per day) for various individuals.
In [2]:
lifespan = pd.DataFrame({
    'lifespan': [86, 82, 72, 60],
    'hours_exercised': [2, 1.5, 1, 0.5],
    'packs_cigs': [0, 0, 1, 4]
})
lifespan
Out[2]:
lifespan hours_exercised packs_cigs
0 86 2.0 0
1 82 1.5 0
2 72 1.0 1
3 60 0.5 4
  • Suppose we want to predict the 'lifespan' of an individual as a linear function of their 'hours_exercised' and 'packs_cigs'. In other words:
$$\text{predicted lifespan}_i = H(\text{hours exercised}_i, \text{packs cigs}_i) = w_0 + w_1 \cdot \text{hours exercised}_i + w_2 \cdot \text{packs cigs}_i$$
  • How do we find the optimal values of $w_0^*$, $w_1^*$, and $w_2^*$?

Augmented feature vectors¶

  • A feature vector, $\vec x_i$, contains the information used to make a prediction for a single individual.
  • In our example, $\vec x_i = \begin{bmatrix} \text{hours exercised}_i \\ \text{packs cigs}_i \end{bmatrix}$.
  • An augmented feature vector, $\text{Aug}(\vec x_i)$, inserts a 1 at the start of the feature vector, $\vec x_i$.
$$\text{Aug}(\vec x_i) = \begin{bmatrix} 1 \\ \text{hours exercised}_i \\ \text{packs cigs}_i \end{bmatrix}$$
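  • As a quick illustration (a sketch added here, not from the lecture; the helper name aug is our own), augmenting a feature vector just means prepending a 1:
In [ ]:
import numpy as np

def aug(x):
    # Prepend a 1 so that the intercept term w_0 multiplies a constant feature.
    return np.concatenate([[1], x])

x_1 = np.array([2.0, 0])  # hours_exercised and packs_cigs for individual 1
aug(x_1)                  # array([1., 2., 0.])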

Design matrices, observation vectors, and parameter vectors¶

  • Suppose we want to build a multiple linear regression model that uses multiple – specifically, $d$ – features to make predictions.
$$H(\vec x_i) = w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + ... + w_d x_i^{(d)}$$
  • Define the design matrix $\color{#007aff} X \in \mathbb{R}^{n \times (d + 1)}$, observation vector $\color{orange}{\vec{y}} \in \mathbb{R}^n$, and parameter vector $\vec{w} \in \mathbb{R}^{d+1}$ as:
$${\color{#007aff}{ X= \begin{bmatrix} {1} & { x^{(1)}_1} & { x^{(2)}_1} & \dots & { x^{(d)}_1} \\\\ { 1} & { x^{(1)}_2} & { x^{(2)}_2} & \dots & { x^{(d)}_2} \\\\ \vdots & \vdots & \vdots & & \vdots \\\\ { 1} & { x^{(1)}_n} & { x^{(2)}_n} & \dots & { x^{(d)}_n} \end{bmatrix} = \begin{bmatrix} \text{Aug}({\vec{x_1}})^T \\\\ \text{Aug}({\vec{x_2}})^T \\\\ \vdots \\\\ \text{Aug}({\vec{x_n}})^T \end{bmatrix}}} \qquad {\color{orange}{\vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}} \qquad \vec{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{bmatrix}}$$
  • The design matrix, $\color{#007aff} X$, represents our entire dataset, whereas an augmented feature vector $\text{Aug}(\vec x_i)$ represents a single point.
    • We include a 1 in the first column of our design matrix – and as the first element in our augmented feature vector – so that the intercept (bias) term $w_0$ can be incorporated into our parameter vector.
  • The observation vector contains all of the target (or response) values corresponding to each observation in the dataset.
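  • To make the correspondence concrete, here's a small numpy sketch (ours, not from the slides; the names X_sketch and y_sketch are our own) that stacks the augmented feature vectors of the lifespan data into a design matrix and collects the targets into an observation vector:
In [ ]:
import numpy as np

features = np.array([
    [2.0, 0],   # hours_exercised, packs_cigs for each individual
    [1.5, 0],
    [1.0, 1],
    [0.5, 4],
])

# Each row of the design matrix is Aug(x_i)^T: a leading 1 followed by individual i's features.
X_sketch = np.column_stack([np.ones(len(features)), features])

y_sketch = np.array([86, 82, 72, 60])  # the observation vector of lifespans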

Minimizing mean squared error¶

  • Our goal is to find the optimal parameter vector, $\vec w^*$, which minimizes mean squared error.
$$\begin{align*} R_\text{sq}(\vec w) &= \frac{1}{n} \sum_{i = 1}^n (y_i - H(\vec x_i))^2 \\ &= \frac{1}{n} \sum_{i = 1}^n \left(y_i - (w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + ... + w_d x_i^{(d)} ) \right)^2 \\ &= \frac{1}{n} \sum_{i = 1}^n \left(y_i - \text{Aug}(\vec x_i) \cdot \vec{w} \right)^2 \end{align*}$$
  • If we were to use a calculus-based approach, we'd need to take the partial derivative of $R_\text{sq}(\vec w)$ with respect to $w_0$, and with respect to $w_1$, and with respect to $w_2$, and so on, set them all to 0, and solve the resulting system of equations. That's infeasible!
  • The reason for introducing the concept of a design matrix, $X$, and observation vector, $\vec y$, was so that we could rewrite $R_\text{sq}(\vec w)$ as follows:
$$R_\text{sq}(\vec w) = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2$$
  • How does this help us?
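  • Before we answer that, here's a quick check (a sketch of ours, using the lifespan numbers) that the vectorized form $\frac{1}{n} \lVert \vec y - X \vec w \rVert^2$ really computes the same value as the averaged sum of squared errors:
In [ ]:
import numpy as np

def mse_loop(w, X, y):
    # R_sq(w) written as an average of squared errors, one term per individual.
    return np.mean([(y[i] - X[i] @ w) ** 2 for i in range(len(y))])

def mse_vectorized(w, X, y):
    # The same quantity written as (1/n) * ||y - Xw||^2.
    return np.linalg.norm(y - X @ w) ** 2 / len(y)

X_sketch = np.array([[1, 2.0, 0], [1, 1.5, 0], [1, 1.0, 1], [1, 0.5, 4]])
y_sketch = np.array([86, 82, 72, 60])
w_guess = np.array([60.0, 10.0, -3.0])  # an arbitrary, non-optimal parameter vector

np.isclose(mse_loop(w_guess, X_sketch, y_sketch),
           mse_vectorized(w_guess, X_sketch, y_sketch))  # True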

The normal equations¶

  • Finding the optimal parameter vector, $\vec w^*$, boils down to finding the $\vec w$ that minimizes:
$$R_\text{sq}(\vec w) = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2$$
  • Intuitively, this means we want $X \vec w$ to be "as close" to $\vec y$ as possible. Remember, the only unknown is $\vec w$; $X$ and $\vec y$ come from our data.
  • As mentioned in this week's lectures, this can be done by choosing $\vec w^*$ such that:
    • the error vector, $\vec e = \vec y - X \vec w^*$,
    • is orthogonal to the columns of $X$.
  • The above condition can be expressed as:
$$X^T (\vec y - X \vec w^*) = 0$$
  • Expanding, we have:
$$X^T \vec y - X^TX \vec w^* = 0 \implies \boxed{X^T \vec y = X^TX \vec w^*}$$
  • The boxed equation above is known as the normal equations. The $\vec w^*$ that minimizes mean squared error is the one that satisfies the boxed condition.


Why are they called the normal equations? Because any minimizing $\vec w^*$ makes the error vector $\vec y - X \vec w^*$ normal (that is, orthogonal) to the columns of $X$.

  • If $X^TX$ is invertible, there's a unique solution for $\vec w^*$:
$$\vec w^* = (X^TX)^{-1}X^T \vec y$$
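  • In practice, explicitly inverting $X^TX$ is usually avoided for numerical reasons; here's a sketch (ours, with the lifespan numbers) of two common alternatives in numpy:
In [ ]:
import numpy as np

X_sketch = np.array([[1, 2.0, 0], [1, 1.5, 0], [1, 1.0, 1], [1, 0.5, 4]])
y_sketch = np.array([86, 82, 72, 60])

# Solve the normal equations X^T X w* = X^T y directly, without forming the inverse.
w_solve = np.linalg.solve(X_sketch.T @ X_sketch, X_sketch.T @ y_sketch)

# Or let lstsq minimize ||y - Xw||^2 without forming X^T X at all.
w_lstsq, *_ = np.linalg.lstsq(X_sketch, y_sketch, rcond=None)

np.allclose(w_solve, w_lstsq)  # True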

Big takeaway¶

  • We chose $\vec w^*$ so that:
    • The vector of predictions, $\vec h^* = X \vec w^*$,
    • is the orthogonal projection of $\vec y$
    • onto the span of the columns of the design matrix, $X$.
  • This $\vec w^*$ is guaranteed to minimize mean squared error.

Example: Lifespan¶

  • Let's return to our example from earlier. Remember, the goal is to find the best choices for $w_0^*$, $w_1^*$, and $w_2^*$ in:
$$\text{predicted lifespan}_i = H(\text{hours exercised}_i, \text{packs cigs}_i) = w_0 + w_1 \cdot \text{hours exercised}_i + w_2 \cdot \text{packs cigs}_i$$
In [3]:
lifespan
Out[3]:
lifespan hours_exercised packs_cigs
0 86 2.0 0
1 82 1.5 0
2 72 1.0 1
3 60 0.5 4
  • Our design matrix, observation vector, and parameter vector are defined as follows:
$$ {\color{black} {X = \begin{bmatrix} {1} & {2.0} & {0} \\ {1} & {1.5} & {0} \\ {1} & {1.0} & {1} \\ {1} & {0.5} & {4} \end{bmatrix}}} \qquad {\color{black} {\vec{y} = \begin{bmatrix} 86 \\ 82 \\ 72 \\ 60 \end{bmatrix}}} \qquad \vec{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix} $$

Solving for the optimal parameter vector¶

  • We'll find $\vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \\ w_2^* \end{bmatrix}$ using code. First, let's construct our design matrix, $X$.
In [4]:
X = lifespan[['hours_exercised', 'packs_cigs']].copy()
X['1'] = 1
X = X.iloc[:, [-1, 0, 1]]
X
Out[4]:
1 hours_exercised packs_cigs
0 1 2.0 0
1 1 1.5 0
2 1 1.0 1
3 1 0.5 4
  • Our observation vector, $\vec y$, is the 'lifespan' column of the lifespan DataFrame, converted to an array.
In [5]:
y = lifespan['lifespan'].to_numpy()
y
Out[5]:
array([86, 82, 72, 60])
  • Recall that, since $X^TX$ is invertible here, the optimal parameter vector, $\vec w^*$, is given by:
$$\vec w^* = (X^TX)^{-1}X^T \vec y$$
In [6]:
w_star = np.linalg.inv(X.T @ X) @ X.T @ y 
w_star
Out[6]:
0    64.35
1    11.04
2    -2.52
dtype: float64
  • This is telling us that the optimal way to predict 'lifespan' as a function of 'hours_exercised' and 'packs_cigs' is:
$$\text{predicted lifespan}_i = H(\text{hours exercised}_i, \text{packs cigs}_i) = 64.35 + 11.04 \cdot \text{hours exercised}_i -2.52 \cdot \text{packs cigs}_i$$
  • We can now use this parameter vector to make predictions!
In [7]:
# My predicted lifespan if I exercise 3 hours a day and don't smoke.
# Equivalent to plugging in hours_exercised_i = 3 and packs_cigs_i = 0 into the equation above.
np.dot(w_star, np.array([1, 3, 0]))
Out[7]:
97.47826086956516
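  • As a final sanity check (a sketch added here, reusing the X, y, and w_star defined in the cells above), the error vector $\vec y - X \vec w^*$ should be orthogonal to every column of $X$, which is exactly the condition the normal equations encode. Since X is a DataFrame, we convert it to an array first:
In [ ]:
# Verify the orthogonality condition: X^T (y - X w*) should be (numerically) zero.
X_arr = X.to_numpy()
errors = y - X_arr @ np.asarray(w_star)

# One entry per column of X; each should be very close to 0.
X_arr.T @ errors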