In [1]:
from lec_utils import *
Discussion Slides: Multiple Linear Regression
Agenda 📆¶
- The design matrix, observation vector, and parameter vector.
- "Solving" the normal equations.
Example: Lifespan¶
- Consider the DataFrame lifespan, which contains the 'lifespan', average 'hours_exercised' per day, and average 'packs_cigs' (packs of cigarettes smoked per day) for various individuals.
In [2]:
lifespan = pd.DataFrame({
    'lifespan': [86, 82, 72, 60],
    'hours_exercised': [2, 1.5, 1, 0.5],
    'packs_cigs': [0, 0, 1, 4]
})
lifespan
Out[2]:
|   | lifespan | hours_exercised | packs_cigs |
|---|---|---|---|
| 0 | 86 | 2.0 | 0 |
| 1 | 82 | 1.5 | 0 |
| 2 | 72 | 1.0 | 1 |
| 3 | 60 | 0.5 | 4 |
- Suppose we want to predict the 'lifespan' of an individual as a linear function of their 'hours_exercised' and 'packs_cigs'. In other words:

$$\text{predicted lifespan}_i = w_0 + w_1 \cdot \text{hours exercised}_i + w_2 \cdot \text{packs cigs}_i$$
- How do we find the optimal values of $w_0^*$, $w_1^*$, and $w_2^*$?
Augmented feature vectors¶
- A feature vector, $\vec x_i$, contains the information used to make a prediction for a single individual.
- In our example, $\vec x_i = \begin{bmatrix} \text{hours exercised}_i \\ \text{packs cigs}_i \end{bmatrix}$.
- An augmented feature vector, $\text{Aug}(\vec x_i)$, is the feature vector $\vec x_i$ with a 1 inserted at the start. In our example, $\text{Aug}(\vec x_i) = \begin{bmatrix} 1 \\ \text{hours exercised}_i \\ \text{packs cigs}_i \end{bmatrix}$.
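For instance, here's a minimal numpy sketch of augmentation, using the feature values from row 0 of lifespan (2.0 hours exercised, 0 packs of cigarettes):

```python
import numpy as np

# Feature vector for individual 0: hours exercised and packs of cigarettes.
x_0 = np.array([2.0, 0.0])

# Augmented feature vector: a 1 is prepended so the intercept w_0
# can be treated like any other parameter.
aug_x_0 = np.concatenate([[1.0], x_0])
aug_x_0   # array([1., 2., 0.])
```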
Design matrices, observation vectors, and parameter vectors¶
- Suppose we want to build a multiple linear regression model that uses multiple – specifically, $d$ – features to make predictions.
- Define the design matrix $\color{#007aff} X \in \mathbb{R}^{n \times (d + 1)}$, observation vector $\color{orange}{\vec{y}} \in \mathbb{R}^n$, and parameter vector $\vec{w} \in \mathbb{R}^{d+1}$ as:

$${\color{#007aff} X = \begin{bmatrix} \text{Aug}(\vec x_1)^T \\ \text{Aug}(\vec x_2)^T \\ \vdots \\ \text{Aug}(\vec x_n)^T \end{bmatrix}} \qquad {\color{orange} \vec y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}} \qquad \vec w = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{bmatrix}$$
- The design matrix, $\color{#007aff} X$, represents our entire dataset, whereas an augmented feature vector $\text{Aug}(\vec x_i)$ represents a single point.
- We include a 1 in the first column of our design matrix – and as the first element in our augmented feature vector – so that the intercept (bias) term $w_0$ can be incorporated into our parameter vector.
- The observation vector contains all of the target (or response) values corresponding to each observation in the dataset.
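To make these definitions concrete, here's a minimal sketch (using the lifespan data from above) that stacks the augmented feature vectors into a design matrix and shows that $X \vec w$ produces one prediction per row:

```python
import numpy as np

# Each row of the design matrix is an augmented feature vector, Aug(x_i)^T.
X = np.vstack([
    [1, 2.0, 0],
    [1, 1.5, 0],
    [1, 1.0, 1],
    [1, 0.5, 4],
])

# The observation vector holds the corresponding 'lifespan' values.
y = np.array([86, 82, 72, 60])

# For any parameter vector w = [w0, w1, w2], X @ w is the vector of
# predictions, with one entry per individual.
w = np.array([60.0, 10.0, -3.0])   # arbitrary, not optimal
X @ w                              # array([80., 75., 67., 53.])
```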
Minimizing mean squared error¶
- Our goal is to find the optimal parameter vector, $\vec w^*$, which minimizes mean squared error.
- If we were to use a calculus-based approach, we'd need to take the partial derivative of $R_\text{sq}(\vec w)$ with respect to $w_0$, and with respect to $w_1$, and with respect to $w_2$, and so on, set them all to 0, and solve the resulting system of equations. That's infeasible!
- The reason for introducing the concept of a design matrix, $X$, and observation vector, $\vec y$, was so that we could rewrite $R_\text{sq}(\vec w)$ as follows (a short numerical check appears after this list):

$$R_\text{sq}(\vec w) = \frac{1}{n} \sum_{i = 1}^n \left( y_i - \text{Aug}(\vec x_i) \cdot \vec w \right)^2 = \frac{1}{n} \lVert \vec y - X \vec w \rVert^2$$
- How does this help us?
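As a quick numerical check of the rewrite above, here's a minimal sketch (using the lifespan data and an arbitrary, non-optimal parameter vector) showing that the point-by-point mean of squared errors matches $\frac{1}{n} \lVert \vec y - X \vec w \rVert^2$:

```python
import numpy as np

X = np.array([[1, 2.0, 0], [1, 1.5, 0], [1, 1.0, 1], [1, 0.5, 4]])
y = np.array([86, 82, 72, 60])
w = np.array([60.0, 10.0, -3.0])   # arbitrary, not optimal

# Mean squared error, averaging one squared error per individual...
mse_loop = np.mean([(y[i] - X[i] @ w) ** 2 for i in range(len(y))])

# ...and the same quantity via the vectorized formula (1/n) * ||y - Xw||^2.
mse_vector = (1 / len(y)) * np.linalg.norm(y - X @ w) ** 2

mse_loop, mse_vector   # both 39.75
```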
The normal equations¶
- Finding the optimal parameter vector, $\vec w^*$, boils down to finding the $\vec w$ that minimizes:

$$\frac{1}{n} \lVert \vec y - X \vec w \rVert^2$$
- Intuitively, this means we want $X \vec w$ to be "as close" to $\vec y$ as possible. Remember, the only unknown is $\vec w$; $X$ and $\vec y$ come from our data.
- As mentioned in this week's lectures, this can be done by choosing $\vec w^*$ such that:
- the error vector, $\vec e = \vec y - X \vec w^*$,
- is orthogonal to the columns of $X$.
- The above condition can be expressed as:

$$X^T (\vec y - X \vec w^*) = \vec 0$$
- Expanding, we have:

$$X^T \vec y - X^T X \vec w^* = \vec 0 \implies \boxed{X^T X \vec w^* = X^T \vec y}$$
- The boxed equation above is known as the normal equations. The $\vec w^*$ that minimizes mean squared error is the one that satisfies the boxed condition.
Why are they called the normal equations? Because they say that the error vector, $\vec e = \vec y - X \vec w^*$, must be normal (orthogonal) to the columns of $X$.
- If $X^TX$ is invertible, there's a unique solution for $\vec w^*$:

$$\vec w^* = (X^TX)^{-1} X^T \vec y$$
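On our example data, we can verify that this invertibility condition holds (a minimal sketch; the rank and determinant checks are just for illustration):

```python
import numpy as np

X = np.array([[1, 2.0, 0], [1, 1.5, 0], [1, 1.0, 1], [1, 0.5, 4]])

# X^T X is invertible exactly when X has full column rank (rank 3 here),
# in which case the normal equations have a unique solution w*.
np.linalg.matrix_rank(X)   # 3
np.linalg.det(X.T @ X)     # approximately 11.5, nonzero, so X^T X is invertible
```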
Big takeaway¶
- We chose $\vec w^*$ so that:
- The vector of predictions, $\vec h^* = X \vec w^*$,
- is the orthogonal projection of $\vec y$
- onto the span of the columns of the design matrix, $X$.
- This $\vec w^*$ is guaranteed to minimize mean squared error.
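Here's a minimal sketch (again using the lifespan data) that verifies this orthogonality numerically: the error vector $\vec e = \vec y - X \vec w^*$ should satisfy $X^T \vec e \approx \vec 0$.

```python
import numpy as np

X = np.array([[1, 2.0, 0], [1, 1.5, 0], [1, 1.0, 1], [1, 0.5, 4]])
y = np.array([86, 82, 72, 60])

# Optimal parameters, from solving the normal equations X^T X w* = X^T y.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# The error vector should be orthogonal to every column of X,
# so each entry of X^T e should be 0 (up to floating-point error).
e = y - X @ w_star
X.T @ e   # all entries numerically 0
```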
Example: Lifespan¶
- Let's return to our example from earlier. Remember, the goal is to find the best choices for $w_0^*$, $w_1^*$, and $w_2^*$ in:

$$\text{predicted lifespan}_i = w_0 + w_1 \cdot \text{hours exercised}_i + w_2 \cdot \text{packs cigs}_i$$
In [3]:
lifespan
Out[3]:
|   | lifespan | hours_exercised | packs_cigs |
|---|---|---|---|
| 0 | 86 | 2.0 | 0 |
| 1 | 82 | 1.5 | 0 |
| 2 | 72 | 1.0 | 1 |
| 3 | 60 | 0.5 | 4 |
- Our design matrix, observation vector, and parameter vector are defined as follows:

$${\color{#007aff} X = \begin{bmatrix} 1 & 2 & 0 \\ 1 & 1.5 & 0 \\ 1 & 1 & 1 \\ 1 & 0.5 & 4 \end{bmatrix}} \qquad {\color{orange} \vec y = \begin{bmatrix} 86 \\ 82 \\ 72 \\ 60 \end{bmatrix}} \qquad \vec w = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}$$
Solving for the optimal parameter vector¶
- We'll find $\vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \\ w_2^* \end{bmatrix}$ using code. First, let's construct our design matrix, $X$.
In [4]:
# Start with the two feature columns, then add a column of 1s
# and move it to the front, matching the design matrix above.
X = lifespan[['hours_exercised', 'packs_cigs']].copy()
X['1'] = 1
X = X.iloc[:, [-1, 0, 1]]
X
Out[4]:
|   | 1 | hours_exercised | packs_cigs |
|---|---|---|---|
| 0 | 1 | 2.0 | 0 |
| 1 | 1 | 1.5 | 0 |
| 2 | 1 | 1.0 | 1 |
| 3 | 1 | 0.5 | 4 |
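As an aside, here's an equivalent way to build the same design matrix as a numpy array (a sketch reusing the lifespan DataFrame from above): place a column of 1s in front of the two feature columns.

```python
import numpy as np

# A column of 1s, followed by the two feature columns, side by side.
X_alt = np.column_stack([
    np.ones(len(lifespan)),
    lifespan['hours_exercised'],
    lifespan['packs_cigs'],
])
X_alt
```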
- Our observation vector, $\vec y$, is the 'lifespan' column of the lifespan DataFrame, converted to an array.
In [5]:
y = lifespan['lifespan'].to_numpy()
y
Out[5]:
array([86, 82, 72, 60])
- Recall that the optimal parameter vector, $\vec w^*$, is defined as:

$$\vec w^* = (X^TX)^{-1} X^T \vec y$$
In [6]:
w_star = np.linalg.inv(X.T @ X) @ X.T @ y
w_star
Out[6]:
0    64.35
1    11.04
2    -2.52
dtype: float64
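As a quick sanity check (not how the slides compute it), np.linalg.lstsq solves the same least squares problem directly from $X$ and $\vec y$, reusing the X and y defined above, and should return the same parameters:

```python
# Least squares solution computed without explicitly forming (X^T X)^{-1};
# should match w_star above (approximately [64.35, 11.04, -2.52]).
np.linalg.lstsq(X.to_numpy(), y, rcond=None)[0]
```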
- This is telling us that the optimal way to predict 'lifespan' as a function of 'hours_exercised' and 'packs_cigs' is:

$$\text{predicted lifespan}_i = 64.35 + 11.04 \cdot \text{hours exercised}_i - 2.52 \cdot \text{packs cigs}_i$$
- We can now use this parameter vector to make predictions!
In [7]:
# My predicted lifespan if I exercise 3 hours a day and don't smoke.
# Equivalent to plugging in hours_exercised_i = 3 and packs_cigs_i = 0 into the equation above.
np.dot(w_star, np.array([1, 3, 0]))
Out[7]:
97.47826086956516
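We can also compute predictions for every individual in the dataset at once (a sketch reusing X, w_star, and y from above); each entry of $X \vec w^*$ is that row's predicted 'lifespan', which we can compare to the actual values in y.

```python
# One prediction per row of the design matrix: the vector X w*.
# (w_star is a Series whose index doesn't match X's column labels, so we
# convert it to an array before multiplying.)
all_preds = X @ w_star.to_numpy()
all_preds   # approximately 86.43, 80.91, 72.87, 59.78 -- close to y
```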