Multiple Linear Regression

The problems in this worksheet are taken from past exams in similar classes. Work on them on paper, since the exams you take in this course will also be on paper.

Problem 1

Billy’s aunt owns a jewellery store, and gives him data on 5000 of the diamonds in her store. For each diamond, we have:

carat	length	width	price
0.40	4.81	4.76	1323
1.04	6.58	6.53	5102
0.40	4.74	4.76	696
0.40	4.67	4.65	798
0.50	4.90	4.95	987

Billy has enlisted our help in predicting the price of a diamond given various other features.

Problem 1.1

Suppose we want to fit a linear prediction rule that uses two features, carat and length, to predict price. Specifically, our prediction rule will be of the form

We will use least squares to find \vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \\ w_2^* \end{bmatrix}.

Write out the first 5 rows of the design matrix, X. Your matrix should not have any variables in it.

Answer: X = \begin{bmatrix} 1 & 0.40 & 4.81 \\ 1 & 1.04 & 6.58 \\ 1 & 0.40 & 4.74 \\ 1 & 0.40 & 4.67 \\ 1 & 0.50 & 4.90 \end{bmatrix}

In this design matrix X - the first column consists of all 1s (for the intercept term w_0), the second column contains the carat values (for the coefficient w_1), and the third column contains the length values (for the coefficient w_2).

Problem 1.2

What is the predicted price of a diamond with 0.65 carats and a length of 4 centimeters? Show your work.

Answer: The predicted price is 4500 dollars.

From our optimal parameter vector we know that w_0^* = 2000, w_1^* = 10000, and w_2^* = -1000. We can compute the predicted price using our linear model:

\text{predicted price} = \begin{bmatrix} 1 & 0.65 & 4 \end{bmatrix} \cdot \begin{bmatrix} 2000 \\ 10000 \\ -1000 \end{bmatrix}

Computing: 2000 + 10000 \cdot 0.65 - 1000 \cdot 4 = 2000 + 6500 - 4000 = 4500

Problem 1.3

Suppose \vec{e} = \begin{bmatrix} e_1 \\ e_2 \\ ... \\ e_n \end{bmatrix} is the error/residual vector, defined as

For each of the following quantities, state whether they are guaranteed to be equal to 0 the scalar, \vec{0} the vector of all 0s, or neither. No justification is necessary.

Answer:

\sum_{i = 1}^n e_i: Yes, this is guaranteed to be 0. This was discussed in Homework 7; it is a consequence of the fact that X^T (y - X \vec{w}^*) = 0 and that we have an intercept term in our prediction rule (and hence a column of all 1s in our design matrix, X).
|| \vec{y} - X \vec{w}^* ||^2: No, this is not guaranteed to be 0. This is the mean squared error of our prediction rule, multiplied by n. \vec{w}^* is found by minimizing mean squared error, but the minimum value of mean squared error isn’t necessarily 0 — in fact, this quantity is only 0 if we can write \vec{y} exactly as X \vec{w}^* with no prediction errors.
X^TX \vec{w}^*: No, this is not guaranteed to be 0, either.
2X^TX \vec{w}^* - 2X^T\vec{y}: Yes, this is guaranteed to be 0. Recall, the optimal parameter vector \vec{w}^* satisfies the normal equations X^TX\vec{w}^* = X^T \vec{y}. Subtracting X^T \vec{y} from both sides of this equation and multiplying both sides by 2 yields the desired result.

Problem 1.4

Suppose we also decide to remove the intercept term of our prediction rule. With all of these changes, our prediction rule is now

\text{predicted price} = w_1 \cdot \text{carat} + w_2 \cdot \text{length} + w_3 \cdot \text{width} + w_4 \cdot (\text{length} \cdot \text{width})

Answer:

X = \begin{bmatrix} 0.40 & 4.81 & 4.76 & 4.81 \cdot 4.76 \\ 1.04 & 6.58 & 6.53 & 6.58 \cdot 6.53 \end{bmatrix}
No, it’s not guaranteed that the \vec{w}_1^* for this new prediction rule is equal to the \vec{w}_1^* for the original prediction rule. The value of \vec{w}_1^* in the new prediction rule will be influenced by the fact that there’s no longer an intercept term and that there are two new features (width and area) that weren’t previously there.

Problem 2

Note that this is not the simple linear regression hypothesis function, H(x_i) = w_0 + w_1x_i.

To do so, we will find the optimal parameter vector \vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \end{bmatrix} that satisfies the normal equations. The first 5 rows of our dataset are as follows, though note that our dataset has n rows in total.

x	y
2	4
-1	4
3	4
-7	4
3	4

Suppose that x_1, x_2, ..., x_n have a mean of \bar{x} = 2 and a variance of \sigma_x^2 = 10.

Problem 2.1

Answer: X = \begin{bmatrix} 1 & 4 \\ 1 & 1 \\ 1 & 9 \\ 1 & 49 \\ 1 & 9 \end{bmatrix}

Recall our hypothesis function is H(x_i) = w_0 + w_1x_i^2. Since there is an intercept term present, and w_0 is the first parameter, the first column of our design matrix will be all 1s. Our second column should contain x_1^2, x_2^2, ..., x_n^2. This means we take each datapoint x_i and square it to form the second column of X.

Problem 2.2

Suppose, just in part (b), that after solving the normal equations, we find \vec{w}^* = \begin{bmatrix} 2 \\ -5 \end{bmatrix}. What is the predicted y value for x = 2? Give your answer as an integer with no variables. Show your work.

Answer: -18

(2)(1)+(-5)(4)=-18

To find the predicted y value, we plug in x_i = 2 into the hypothesis function H(x_i) = w_0 + w_1x_i^2, or take the dot product of \vec{w}^* with \begin{bmatrix}1 \\ 2^2\end{bmatrix}.

\begin{align*} &\begin{bmatrix} 2 \\ -5 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 4 \end{bmatrix}\\ &(2)(1)+(-5)(4)\\ &2 - 20\\ &-18 \end{align*}

Problem 2.3

Let X_\text{tri} = 3 X. Using the fact that \sum_{i = 1}^n x_i^2 = n \sigma_x^2 + n \bar{x}^2, determine the value of the bottom-left value in the matrix X_\text{tri}^T X_\text{tri}, i.e. the value in the second row and first column. Give your answer as an expression involving n. Show your work.

Answer: 126n

X = \begin{bmatrix} 1 & x_1^2 \\ 1 & x_2^2 \\ \vdots & \vdots \\ 1 & x_n^2 \end{bmatrix}

X_{\text{tri}} = \begin{bmatrix} 3 & 3x_1^2 \\ 3 & 3x_2^2 \\ \vdots & \vdots \\ 3 & 3x_n^2 \end{bmatrix}

We want to know what the bottom left value of X_\text{tri}^T X_\text{tri} is. We figure this out with matrix multiplication!

\begin{align*} X_\text{tri}^T X_\text{tri} &= \begin{bmatrix} 3 & 3 & ... & 3\\ 3x_1^2 & 3x_2^2 & ... & 3x_n^2 \end{bmatrix} \begin{bmatrix} 3 & 3x_1^2 \\ 3 & 3x_2^2 \\ \vdots & \vdots \\ 3 & 3x_n^2 \end{bmatrix}\\ &= \begin{bmatrix} \sum_{i = 1}^n 3(3) & \sum_{i = 1}^n 3(3x_i^2) \\ \sum_{i = 1}^n 3(3x_i^2) & \sum_{i = 1}^n (3x_i^2)(3x_i^2)\end{bmatrix}\\ &= \begin{bmatrix} \sum_{i = 1}^n 9 & \sum_{i = 1}^n 9x_i^2 \\ \sum_{i = 1}^n 9x_i^2 & \sum_{i = 1}^n (3x_i^2)^2 \end{bmatrix} \end{align*}

We can see that the bottom left element should be \sum_{i = 1}^n 9x_i^2.

From here we can use the fact given to us in the directions: \sum_{i = 1}^n x_i^2 = n \sigma_x^2 + n \bar{x}^2.

\begin{align*} \sum_{i = 1}^n 9x_i^2 &= 9\sum_{i = 1}^n x_i^2\\ &= 9(n \sigma_x^2 + n \bar{x}^2)\\ &= 9(10n + 2^2n)\\ &= 9(10n + 4n)\\ &= 9(14n) = 126n \end{align*}

Problem 3

\vec{u} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \qquad \vec{v} = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}

We define X \in \mathbb{R}^{3 \times 2} to be the matrix whose first column is \vec u and whose second column is \vec v.

Problem 3.1

Find a scalar k such that \vec{y} is in \text{span}(\vec u, \vec v). Give your answer as a constant with no variables.

Answer: 252.

Vectors in \text{span}(\vec u, \vec v) are all linear combinations of \vec{u} and \vec{v}, meaning they have the form:

a \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} + b \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} a \\ b \\ b \end{bmatrix}

From this, any vector in the span must have its second and third components equal. Since \vec{y} has its third component as 252, so k must equal to 252.

Problem 3.2

Show that: (X^TX)^{-1}X^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix}

Hint: If A = \begin{bmatrix} a_1 & 0 \\ 0 & a_2 \end{bmatrix}, then A^{-1} = \begin{bmatrix} \frac{1}{a_1} & 0 \\ 0 & \frac{1}{a_2} \end{bmatrix}.

Answer: We can construct the following series of matrices to get (X^TX)^{-1}X^T.

X = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}
X^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \end{bmatrix}
X^TX = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}
(X^TX)^{-1} = \begin{bmatrix} 1 & 0 \\ 0 & \frac{1}{2} \end{bmatrix}
(X^TX)^{-1}X^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix}

Problem 3.3

Find scalars a and b such that a \vec u + b \vec v is the vector in \text{span}(\vec u, \vec v) that is as close to \vec{y} as possible. Give your answers as constants with no variables.

Answer: a = 4, b = 5.

The result from the part (b) implies that when using the normal equations to find coefficients for \vec u and \vec v – which we know from lecture produce an error vector whose length is minimized – the coefficient on \vec u must be y_1 and the coefficient on \vec v must be \frac{y_2 + y_3}{2}. This can be shown by taking the result from part (b), \begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix}, and multiplying it by the vector \vec y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}.

Here, y_1 = 4, so a = 4. We also know y_2 = 2 and y_3 = 8, so b = \frac{2+8}{2} = 5.

Problem 3.4

Let \vec{e} = \vec{y} - (a \vec u + b \vec v), where a and b are the values you found in part (c).

Answer: 3 \sqrt{2}.

The correct value of a \vec u + b \vec v = \begin{bmatrix} 4 \\ 5 \\ 5\end{bmatrix}. Then, \vec{e} = \begin{bmatrix} 4 \\ 2 \\ 8 \end{bmatrix} - \begin{bmatrix} 4 \\ 5 \\ 5 \end{bmatrix} = \begin{bmatrix} 0 \\ -3 \\ 3 \end{bmatrix}, which has a length of \sqrt{0^2 + (-3)^2 + 3^2} = 3\sqrt{2}.

Problem 3.5

Is it true that, for any vector \vec{y} \in \mathbb{R}^3, we can find scalars c and d such that the sum of the entries in the vector \vec{y} - (c \vec u + d \vec v) is 0?

Answer: Yes, but for a reason that isn’t listed here.

Here’s the full reason:

We can use the normal equations to find c and d, no matter what \vec{y} is.
The error vector \vec e that results from using the normal equations is such that \vec e is orthogonal to the span of the columns of X.
The columns of X are just \vec u and \vec v. So, \vec e is orthogonal to any linear combination of \vec u and \vec v.
One of the many linear combinations of \vec u and \vec v is \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}.
This means that the vector \vec e is orthogonal to \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, which means that \vec{1}^T \vec{e} = 0 \implies \sum_{i = 1}^3 e_i = 0.

Problem 3.6

Suppose that Q \in \mathbb{R}^{100 \times 12}, \vec{s} \in \mathbb{R}^{100}, and \vec{f} \in \mathbb{R}^{12}. What are the dimensions of the following product?

Answer: Scalar.

\vec{s}^T: 1 x 100
Q: 100 x 12.
\vec{f}: 12 x 1.

The inner dimensions of 100 and 12 cancel, and so \vec{s}^T Q \vec{f} is of shape 1 x 1.

Problem 4

Suppose we want to predict how long it takes to run a Jupyter notebook on Datahub. For 100 different Jupyter notebooks, we collect the following 5 pieces of information:

H(\text{cells}_i, \text{lines}_i, \text{max iterations}_i, \text{variables}_i) = w_0 + w_1 \cdot \text{cells}_i \cdot \text{lines}_i + w_2 \cdot (\text{max iterations}_i)^{\text{variables}_i - 10}

Problem 4.1

Answer:

100 \text{ rows} \times 3 \text{ columns}

There should be 100 rows because there are 100 different Jupyter notebooks with different information within them. There should be 3 columns, one for each w_i. In this case we have w_0, which means X will have a column of ones, w_1, which means X will have a second column of \text{cells} \cdot \text{lines}, and w_2, which will be the last column in X containing (\text{max iterations}_i)^{\text{variables}_i - 10}.

Problem 4.2

In one sentence, what does the entry in row 3, column 2 of the design matrix X represent? (Count rows and columns starting at 1, not 0).

Answer:

This entry represents the product of the number of cells and number of lines of code for the third Jupyter notebook in the training dataset.

Problem 5

Problem 5.1

x^{(1)}	x^{(2)}	x^{(3)}	y
0	6	8	-5
3	4	5	7
5	-1	-3	4
0	2	1	2

We want to use multiple regression to fit a prediction rule of the form H(x_i^{(1)}, x_i^{(2)}, x_i^{(3)}) = w_0 + w_1 x_i^{(1)} x_i^{(3)} + w_2 (x_i^{(2)} - x_i^{(3)})^2. Write down the design matrix X and observation vector \vec{y} for this scenario. No justification needed.

The design matrix X and observation vector \vec{y} are given by:

X = \begin{bmatrix} 1 & 0 & 4\\ 1 & 15 & 1\\ 1 & -15 & 4\\ 1 & 0 & 1 \end{bmatrix}

\vec{y} = \begin{bmatrix} -5\\ 7\\ 4\\ 2 \end{bmatrix}

The observation vector \vec{y} contains the target values from the dataset.

For the design matrix X, each row corresponds to one data point in our dataset, where x_i^{(1)}, x_i^{(2)}, and x_i^{(3)} represent three separate features for the i-th data point. Each row of X has the form \begin{bmatrix}1 & x_i^{(1)}x_i^{(3)} & (x_i^{(2)}-x_i^{(3)})^2\end{bmatrix}. The first column consists of all 1’s for the bias term w_0, which is not affected by the feature values.

Problem 5.2

For the X and \vec{y} that you have written down, let \vec{w} be the optimal parameter vector, which comes from solving the normal equations X^TX\vec{w}=X^T\vec{y}. Let \vec{e} = \vec{y} - X \vec{w} be the error vector, and let e_i be the ith component of this error vector. Show that 4e_1+e_2+4e_3+e_4=0.

The key to this problem is the fact that the error vector, \vec{e}, is orthogonal to the columns of the design matrix, X. As a refresher, if \vec{w^*} satisfies the normal equations, then:

We can rewrite the normal equation (X^TX\vec{w}=X^T\vec{y}) to allow substitution for \vec{e} = \vec{y} - X \vec{w}.

X^TX\vec{w}=X^T\vec{y} 0 = X^T\vec{y} - X^TX\vec{w} 0 = X^T(\vec{y}-X\vec{w}) 0 = X^T\vec{e}

The first step is to find X^T, which is easy because we found X above: \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 15 & -15 & 0 \\ 4 & 1 & 4 & 1 \end{bmatrix}

And now we can plug X^T and \vec e into our equation 0 = X^T\vec{e}. It might be easiest to find the right side first:

X^T\vec{e} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 15 & -15 & 0 \\ 4 & 1 & 4 & 1 \end{bmatrix} \cdot \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4\end{bmatrix}

= \begin{bmatrix} e_1 + e_2 + e_3 + e_4 \\ 15e_2 - 15e_3 \\ 4e_1 + e_2 + 4e_3 + e_4\end{bmatrix}

Finally, we set it equal to zero! 0 = e_1 + e_2 + e_3 + e_4 0 = 15e_2 - 15e_3 0 = 4e_1 + e_2 + 4e_3 + e_4

With this we have shown that 4e_1+e_2+4e_3+e_4=0.

Problem 6

Let X be a design matrix with 4 columns, such that the first column is a column of all 1s. Let \vec{y} be an observation vector. Let \vec{w}^* = (X^TX)^{-1}X^T\vec{y}. We’ll name the components of \vec{w}^* as follows:

In this problem, we’ll consider various modifications to the design matrix and see how they affect the solution to the normal equations.

Problem 6.1

Let X_a be the design matrix that comes from interchanging the first two columns of X. Let \vec{v}^* = (X_a^TX_a)^{-1}X_a^T\vec{y}. Express the components \vec{v}^* in terms of w_0^*, w_1^*, w_2^*, and w_3^* (which were the components of \vec{w}^*).

Answer: \vec{v}^* = \begin{bmatrix} w_1^* \\ w_0^* \\ w_2^* \\ w_3^* \end{bmatrix}

Suppose our original prediction rule was of the form: H(x_i^{(1)}, x_i^{(2)}, x_i^{(3)}) = w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + w_3 x_i^{(3)}.

Because the span of the resulting design matrix has not changed, the optimal predictions themselves will not change, because the optimal predictions come from projecting \vec{y} onto span(X). So, the problem boils down to figuring out how to choose the coefficients in \vec{v}^* so that the predictions of the resulting model are the same as those in the original model.

By swapping the first two columns of our design matrix, this changes the prediction rule to be of the form: H(x_i^{(1)}, x_i^{(2)}, x_i^{(3)}) = v_1 + v_0 x_i^{(1)} + v_2 x_i^{(2)} + v_3 x_i^{(3)}.

Therefore the optimal parameters for the new model are related to the optimal parameters for the original model by: \begin{aligned} v_0^* &= w_1^* \\ v_1^* &= w_0^* \\ v_2^* &= w_2^* \\ v_3^* &= w_3^* \end{aligned}

Intuitively, when we interchange two columns of our design matrix, all that does is interchange the terms in the prediction rule, which interchanges those weights in the parameter vector.

Problem 6.2

Let X_b be the design matrix that comes from adding one to each entry of the first column of X. Let \vec{v}^* = (X_b^TX_b)^{-1}X_b^T\vec{y}. Express the components \vec{v}^* in terms of w_0^*, w_1^*, w_2^*, and w_3^* (which were the components of \vec{w}^*).

Answer: \vec{v}^* = \begin{bmatrix} \dfrac{w_0^*}{2} \\ w_1^* \\ w_2^* \\ w_3^*\end{bmatrix}

Suppose our original prediction rule was of the form: H(x_i^{(1)}, x_i^{(2)}, x_i^{(3)}) = w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + w_3 x_i^{(3)}.

By adding one to each entry of the first column of the design matrix, we are changing the column of 1s to be a column of 2s. This changes the prediction rule to be of the form: H(x_i^{(1)}, x_i^{(2)}, x_i^{(3)}) = v_0 \cdot 2 + v_1 x_i^{(1)} + v_2 x_i^{(2)} + v_3 x_i^{(3)}.

In order to compensate for these changes to our coefficients, we need to “offset” any alterations made to our coefficients. Therefore the optimal parameters for the new model are related to the optimal parameters for the original model by: \begin{aligned} v_0^* &= \dfrac{w_0^*}{2} \\ v_1^* &= w_1^* \\ v_2^* &= w_2^* \\ v_3^* &= w_3^* \end{aligned}

This is saying we just halve the intercept term. For example, imagine fitting a line to data in \mathbb{R}^2 and finding that the best-fitting line is y=12+3x. If we had to write this in the form y=v_0\cdot 2 + v_1x, we would find that the best choice for v_0 is 6 and the best choice for v_1 is 3.

Problem 6.3

Let X_c be the design matrix that comes from adding one to each entry of the third column of X. Let \vec{v}^* = (X_c^TX_c)^{-1}X_c^T\vec{y}. Express the components \vec{v}^* in terms of w_0^*, w_1^*, w_2^*, and w_3^*, which were the components of \vec{w}^*.

Answer: \vec{v}^* = \begin{bmatrix} w_0^* - w_2^* \\ w_1^* \\ w_2^* \\ w_3^* \end{bmatrix}

Suppose our original prediction rule was of the form: H(x_i^{(1)}, x_i^{(2)}, x_i^{(3)}) = w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + w_3 x_i^{(3)}.

By adding one to each entry of the third column of the design matrix, this changes the prediction rule to be of the form: \begin{aligned} H(x_i^{(1)}, x_i^{(2)}, x_i^{(3)}) &= v_0 + v_1 x_i^{(1)} + v_2(x_i^{(2)}+1) + v_3 x_i^{(3)} \\ &= (v_0 + v_2) + v_1 x_i^{(1)} + v_2 x_i^{(2)} + v_3 x_i^{(3)} \end{aligned}

In order to compensate for these changes to our coefficients, we need to “offset” any alterations made to our coefficients. Therefore the optimal parameters for the new model are related to the optimal parameters for the original model by \begin{aligned} v_0^* &= w_0^* - w_2^* \\ v_1^* &= w_1^* \\ v_2^* &= w_2^* \\ v_3^* &= w_3^* \end{aligned}

One way to think about this is that if we replace x_i^{(2)} with x_i^{(2)}+1, then our predictions will increase by the coefficient of x_i^{(2)}. In order to keep our predictions the same, we would need to adjust our intercept term by subtracting this same amount.

Problem 1

Problem 1.1

Click to view the solution.

Problem 1.2

Click to view the solution.

Problem 1.3

Click to view the solution.

Problem 1.4

Click to view the solution.

Problem 2

Problem 2.1

Click to view the solution.

Problem 2.2

Click to view the solution.

Problem 2.3

Click to view the solution.

Problem 3

Problem 3.1

Click to view the solution.

Problem 3.2

Click to view the solution.

Problem 3.3

Click to view the solution.

Problem 3.4

Click to view the solution.

Problem 3.5

Click to view the solution.

Problem 3.6

Click to view the solution.

Problem 4

Problem 4.1

Click to view the solution.

Problem 4.2

Click to view the solution.

Problem 5

Problem 5.1

Click to view the solution.

Problem 5.2

Click to view the solution.

Problem 6

Problem 6.1

Click to view the solution.

Problem 6.2

Click to view the solution.

Problem 6.3

Click to view the solution.