In [1]:
from lec_utils import *

Discussion Slides: Multiple Linear Regression

Agenda 📆¶

  • The design matrix, observation vector, and parameter vector.
  • "Solving" the normal equations.

Example: Lifespan¶

  • Consider the DataFrame lifespan, which contains the 'lifespan', average 'hours_exercised' per day, and average 'packs_cigs' (packs of cigarettes smoked per day) for various individuals.
In [2]:
lifespan = pd.DataFrame({
    'lifespan': [86, 82, 72, 60],
    'hours_exercised': [2, 1.5, 1, 0.5],
    'packs_cigs': [0, 0, 1, 4]
})
lifespan
Out[2]:
lifespan hours_exercised packs_cigs
0 86 2.0 0
1 82 1.5 0
2 72 1.0 1
3 60 0.5 4
  • Suppose we want to predict the 'lifespan' of an individual as a linear function of their 'hours_exercised' and 'packs_cigs'. In other words:
$$\text{predicted lifespan}_i = H(\text{hours exercised}_i, \text{packs cigs}_i) = w_0 + w_1 \cdot \text{hours exercised}_i + w_2 \cdot \text{packs cigs}_i$$
  • How do we find the optimal values of $w_0^*$, $w_1^*$, and $w_2^*$?

Augmented feature vectors¶

  • A feature vector, $\vec x_i$, contains the information used to make a prediction for a single individual.
  • In our example, $\vec x_i = \begin{bmatrix} \text{hours exercised}_i \\ \text{packs cigs}_i \end{bmatrix}$.
  • An augmented feature vector, $\text{Aug}(\vec x_i)$, inserts a 1 at the start of the feature vector, $\vec x_i$.
$$\text{Aug}(\vec x_i) = \begin{bmatrix} 1 \\ \text{hours exercised}_i \\ \text{packs cigs}_i \end{bmatrix}$$
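  • As a quick illustration (a sketch added here, not from the lecture; the helper name aug is our own), augmenting a feature vector just means prepending a 1:
In [ ]:
import numpy as np

def aug(x):
    # Prepend a 1 so that the intercept term w_0 multiplies a constant feature.
    return np.concatenate([[1], x])

x_1 = np.array([2.0, 0])  # hours_exercised and packs_cigs for individual 1
aug(x_1)                  # array([1., 2., 0.])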

Design matrices, observation vectors, and parameter vectors¶

  • Suppose we want to build a multiple linear regression model that uses multiple – specifically, $d$ – features to make predictions.
$$H(\vec x_i) = w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + ... + w_d x_i^{(d)}$$
  • Define the design matrix $\color{#007aff} X \in \mathbb{R}^{n \times (d + 1)}$, observation vector $\color{orange}{\vec{y}} \in \mathbb{R}^n$, and parameter vector $\vec{w} \in \mathbb{R}^{d+1}$ as:
$${\color{#007aff}{ X= \begin{bmatrix} {1} & { x^{(1)}_1} & { x^{(2)}_1} & \dots & { x^{(d)}_1} \\\\ { 1} & { x^{(1)}_2} & { x^{(2)}_2} & \dots & { x^{(d)}_2} \\\\ \vdots & \vdots & \vdots & & \vdots \\\\ { 1} & { x^{(1)}_n} & { x^{(2)}_n} & \dots & { x^{(d)}_n} \end{bmatrix} = \begin{bmatrix} \text{Aug}({\vec{x_1}})^T \\\\ \text{Aug}({\vec{x_2}})^T \\\\ \vdots \\\\ \text{Aug}({\vec{x_n}})^T \end{bmatrix}}} \qquad {\color{orange}{\vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}} \qquad \vec{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{bmatrix}}$$
  • The design matrix, $\color{#007aff} X$, represents our entire dataset, whereas an augmented feature vector $\text{Aug}(\vec x_i)$ represents a single point.
    • We include a 1 in the first column of our design matrix – and as the first element in our augmented feature vector – so that the intercept (bias) term $w_0$ can be incorporated into our parameter vector.
  • The observation vector contains all of the target (or response) values corresponding to each observation in the dataset.
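  • To make the correspondence concrete, here's a small numpy sketch (ours, not from the slides; the names X_sketch and y_sketch are our own) that stacks the augmented feature vectors of the lifespan data into a design matrix and collects the targets into an observation vector:
In [ ]:
import numpy as np

features = np.array([
    [2.0, 0],   # hours_exercised, packs_cigs for each individual
    [1.5, 0],
    [1.0, 1],
    [0.5, 4],
])

# Each row of the design matrix is Aug(x_i)^T: a leading 1 followed by individual i's features.
X_sketch = np.column_stack([np.ones(len(features)), features])

y_sketch = np.array([86, 82, 72, 60])  # the observation vector of lifespans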

Minimizing mean squared error¶

  • Our goal is to find the optimal parameter vector, $\vec w^*$, which minimizes mean squared error.
$$\begin{align*} R_\text{sq}(\vec w) &= \frac{1}{n} \sum_{i = 1}^n (y_i - H(\vec x_i))^2 \\ &= \frac{1}{n} \sum_{i = 1}^n \left(y_i - (w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + ... + w_d x_i^{(d)} ) \right)^2 \\ &= \frac{1}{n} \sum_{i = 1}^n \left(y_i - \text{Aug}(\vec x_i) \cdot \vec{w} \right)^2 \end{align*}$$
  • If we were to use a calculus-based approach, we'd need to take the partial derivative of $R_\text{sq}(\vec w)$ with respect to $w_0$, and with respect to $w_1$, and with respect to $w_2$, and so on, set them all to 0, and solve the resulting system of equations. That's infeasible!
  • The reason for introducing the concept of a design matrix, $X$, and observation vector, $\vec y$, was so that we could rewrite $R_\text{sq}(\vec w)$ as follows:
$$R_\text{sq}(\vec w) = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2$$
  • How does this help us?
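  • Before we answer that, here's a quick check (a sketch of ours, using the lifespan numbers) that the vectorized form $\frac{1}{n} \lVert \vec y - X \vec w \rVert^2$ really computes the same value as the averaged sum of squared errors:
In [ ]:
import numpy as np

def mse_loop(w, X, y):
    # R_sq(w) written as an average of squared errors, one term per individual.
    return np.mean([(y[i] - X[i] @ w) ** 2 for i in range(len(y))])

def mse_vectorized(w, X, y):
    # The same quantity written as (1/n) * ||y - Xw||^2.
    return np.linalg.norm(y - X @ w) ** 2 / len(y)

X_sketch = np.array([[1, 2.0, 0], [1, 1.5, 0], [1, 1.0, 1], [1, 0.5, 4]])
y_sketch = np.array([86, 82, 72, 60])
w_guess = np.array([60.0, 10.0, -3.0])  # an arbitrary, non-optimal parameter vector

np.isclose(mse_loop(w_guess, X_sketch, y_sketch),
           mse_vectorized(w_guess, X_sketch, y_sketch))  # True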

The normal equations¶

  • Finding the optimal parameter vector, $\vec w^*$, boils down to finding the $\vec w$ that minimizes:
$$R_\text{sq}(\vec w) = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2$$
  • Intuitively, this means we want $X \vec w$ to be "as close" to $\vec y$ as possible. Remember, the only unknown is $\vec w$; $X$ and $\vec y$ come from our data.
  • As mentioned in this week's lectures, this can be done by choosing $\vec w^*$ such that:
    • the error vector, $\vec e = \vec y - X \vec w^*$,
    • is orthogonal to the columns of $X$.
  • The above condition can be expressed as:
$$X^T (\vec y - X \vec w^*) = 0$$
  • Expanding, we have:
$$X^T \vec y - X^TX \vec w^* = 0 \implies \boxed{X^T \vec y = X^TX \vec w^*}$$
  • The boxed equation above is known as the normal equations. The $\vec w^*$ that minimizes mean squared error is the one that satisfies the boxed condition.


Why are they called the normal equations? Because any minimizing $\vec w^*$ makes the error vector $\vec y - X \vec w^*$ normal (that is, orthogonal) to the columns of $X$.

  • If $X^TX$ is invertible, there's a unique solution for $\vec w^*$:
$$\vec w^* = (X^TX)^{-1}X^T \vec y$$
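  • In practice, explicitly inverting $X^TX$ is usually avoided for numerical reasons; here's a sketch (ours, with the lifespan numbers) of two common alternatives in numpy:
In [ ]:
import numpy as np

X_sketch = np.array([[1, 2.0, 0], [1, 1.5, 0], [1, 1.0, 1], [1, 0.5, 4]])
y_sketch = np.array([86, 82, 72, 60])

# Solve the normal equations X^T X w* = X^T y directly, without forming the inverse.
w_solve = np.linalg.solve(X_sketch.T @ X_sketch, X_sketch.T @ y_sketch)

# Or let lstsq minimize ||y - Xw||^2 without forming X^T X at all.
w_lstsq, *_ = np.linalg.lstsq(X_sketch, y_sketch, rcond=None)

np.allclose(w_solve, w_lstsq)  # True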

Big takeaway¶

  • We chose $\vec w^*$ so that:
    • The vector of predictions, $\vec h^* = X \vec w^*$,
    • is the orthogonal projection of $\vec y$
    • onto the span of the columns of the design matrix, $X$.
  • This $\vec w^*$ is guaranteed to minimize mean squared error.

Example: Lifespan¶

  • Let's return to our example from earlier. Remember, the goal is to find the best choices for $w_0^*$, $w_1^*$, and $w_2^*$ in:
$$\text{predicted lifespan}_i = H(\text{hours exercised}_i, \text{packs cigs}_i) = w_0 + w_1 \cdot \text{hours exercised}_i + w_2 \cdot \text{packs cigs}_i$$
In [3]:
lifespan
Out[3]:
lifespan hours_exercised packs_cigs
0 86 2.0 0
1 82 1.5 0
2 72 1.0 1
3 60 0.5 4
  • Our design matrix, observation vector, and parameter vector are defined as follows:
$$ {\color{black} {X = \begin{bmatrix} {1} & {2.0} & {0} \\ {1} & {1.5} & {0} \\ {1} & {1.0} & {1} \\ {1} & {0.5} & {4} \end{bmatrix}}} \qquad {\color{black} {\vec{y} = \begin{bmatrix} 86 \\ 82 \\ 72 \\ 60 \end{bmatrix}}} \qquad \vec{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix} $$

Solving for the optimal parameter vector¶

  • We'll find $\vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \\ w_2^* \end{bmatrix}$ using code. First, let's construct our design matrix, $X$.
In [4]:
X = lifespan[['hours_exercised', 'packs_cigs']].copy()
X['1'] = 1
X = X.iloc[:, [-1, 0, 1]]
X
Out[4]:
1 hours_exercised packs_cigs
0 1 2.0 0
1 1 1.5 0
2 1 1.0 1
3 1 0.5 4
  • Our observation vector, $\vec y$, is the 'lifespan' column of the lifespan DataFrame, converted to an array.
In [5]:
y = lifespan['lifespan'].to_numpy()
y
Out[5]:
array([86, 82, 72, 60])
  • Recall that, since $X^TX$ is invertible here, the optimal parameter vector, $\vec w^*$, is given by:
$$\vec w^* = (X^TX)^{-1}X^T \vec y$$
In [6]:
w_star = np.linalg.inv(X.T @ X) @ X.T @ y 
w_star
Out[6]:
0    64.35
1    11.04
2    -2.52
dtype: float64
  • This is telling us that the optimal way to predict 'lifespan' as a function of 'hours_exercised' and 'packs_cigs' is:
$$\text{predicted lifespan}_i = H(\text{hours exercised}_i, \text{packs cigs}_i) = 64.35 + 11.04 \cdot \text{hours exercised}_i -2.52 \cdot \text{packs cigs}_i$$
  • We can now use this parameter vector to make predictions!
In [7]:
# My predicted lifespan if I exercise 3 hours a day and don't smoke.
# Equivalent to plugging in hours_exercised_i = 3 and packs_cigs_i = 0 into the equation above.
np.dot(w_star, np.array([1, 3, 0]))
Out[7]:
97.47826086956516
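  • As a final sanity check (a sketch added here, reusing the X, y, and w_star defined in the cells above), the error vector $\vec y - X \vec w^*$ should be orthogonal to every column of $X$, which is exactly the condition the normal equations encode. Since X is a DataFrame, we convert it to an array first:
In [ ]:
# Verify the orthogonality condition: X^T (y - X w*) should be (numerically) zero.
X_arr = X.to_numpy()
errors = y - X_arr @ np.asarray(w_star)

# One entry per column of X; each should be very close to 0.
X_arr.T @ errors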