The problems in this worksheet are taken from past exams in similar
classes. Work on them on paper, since the exams you
take in this course will also be on paper.
We encourage you to
complete this worksheet in a live discussion section. Solutions will be
made available after all discussion sections have concluded. You don’t
need to submit your answers anywhere.
Note: We do not plan to
cover all problems here in the live discussion section; the problems
we don’t cover can be used for extra practice.
Billy’s aunt owns a jewellery store, and gives him data on 5000 of the diamonds in her store. For each diamond, we have its weight (in carats), its length and width (in centimeters), and its price (in dollars).
The first 5 rows of the 5000-row dataset are shown below:
carat | length | width | price |
---|---|---|---|
0.40 | 4.81 | 4.76 | 1323 |
1.04 | 6.58 | 6.53 | 5102 |
0.40 | 4.74 | 4.76 | 696 |
0.40 | 4.67 | 4.65 | 798 |
0.50 | 4.90 | 4.95 | 987 |
Billy has enlisted our help in predicting the price of a diamond
given various other features.
Suppose we want to fit a linear prediction rule that uses two features, carat and length, to predict price. Specifically, our prediction rule will be of the form
\text{predicted price} = w_0 + w_1 \cdot \text{carat} + w_2 \cdot \text{length}
We will use least squares to find \vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \\ w_2^* \end{bmatrix}.
Write out the first 5 rows of the design matrix, X. Your matrix should not have any variables in it.
Answer: X = \begin{bmatrix} 1 & 0.40 & 4.81 \\ 1 & 1.04 & 6.58 \\ 1 & 0.40 & 4.74 \\ 1 & 0.40 & 4.67 \\ 1 & 0.50 & 4.90 \end{bmatrix}
In this design matrix X, the first column consists of all 1s (for the intercept term w_0), the second column contains the carat values (for the coefficient w_1), and the third column contains the length values (for the coefficient w_2).
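If you'd like to check this construction numerically, here is a minimal numpy sketch (not part of the original solution) that builds those first five rows:

```python
import numpy as np

# First 5 rows of the dataset, taken from the table above.
carat  = np.array([0.40, 1.04, 0.40, 0.40, 0.50])
length = np.array([4.81, 6.58, 4.74, 4.67, 4.90])

# Column of 1s for the intercept, then carat, then length.
X = np.column_stack([np.ones(len(carat)), carat, length])
print(X.shape)  # (5, 3)
print(X)        # matches the matrix above
```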
Suppose the optimal parameter vector \vec{w}^* is given by
\vec{w}^* = \begin{bmatrix} 2000 \\ 10000 \\ -1000 \end{bmatrix}
What is the predicted price of a diamond with 0.65 carats and a length of 4 centimeters? Show your work.
Answer: The predicted price is 4500 dollars.
From our optimal parameter vector we know that w_0^* = 2000, w_1^* = 10000, and w_2^* = -1000. We can compute the predicted price using our linear model:
\text{predicted price} = \begin{bmatrix} 1 & 0.65 & 4 \end{bmatrix} \cdot \begin{bmatrix} 2000 \\ 10000 \\ -1000 \end{bmatrix}
Computing: 2000 + 10000 \cdot 0.65 - 1000 \cdot 4 = 2000 + 6500 - 4000 = 4500
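As a quick numerical sanity check (a sketch, not part of the original solution), the same prediction computed as a dot product in numpy:

```python
import numpy as np

w_star   = np.array([2000, 10000, -1000])
features = np.array([1, 0.65, 4])   # [1, carat, length]

print(features @ w_star)            # 4500.0
```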
Suppose \vec{e} = \begin{bmatrix} e_1 \\ e_2 \\ ... \\ e_n \end{bmatrix} is the error/residual vector, defined as
\vec{e} = \vec{y} - X \vec{w}^*
where \vec{y} is the observation vector containing the prices for each diamond.
For each of the following quantities, state whether they are guaranteed to be equal to 0 the scalar, \vec{0} the vector of all 0s, or neither. No justification is necessary.
Answer:
Suppose we introduce two more features: the width of each diamond, and the product of its length and width.
Suppose we also decide to remove the intercept term of our prediction rule. With all of these changes, our prediction rule is now
\text{predicted price} = w_1 \cdot \text{carat} + w_2 \cdot \text{length} + w_3 \cdot \text{width} + w_4 \cdot (\text{length} \cdot \text{width})
Answer:
Suppose we want to fit a hypothesis function of the form:
H(x_i) = w_0 + w_1 x_i^2
Note that this is not the simple linear regression hypothesis function, H(x_i) = w_0 + w_1x_i.
To do so, we will find the optimal parameter vector \vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \end{bmatrix} that satisfies the normal equations. The first 5 rows of our dataset are as follows, though note that our dataset has n rows in total.
x | y |
---|---|
2 | 4 |
-1 | 4 |
3 | 4 |
-7 | 4 |
3 | 4 |
Suppose that x_1, x_2, ..., x_n have a mean of \bar{x} = 2 and a variance of \sigma_x^2 = 10.
Write out the first 5 rows of the design matrix, X.
Answer: X = \begin{bmatrix} 1 & 4 \\ 1 & 1 \\ 1 & 9 \\ 1 & 49 \\ 1 & 9 \end{bmatrix}
Recall our hypothesis function is H(x_i) = w_0 + w_1x_i^2. Since there is an intercept term present, and w_0 is the first parameter, the first column of our design matrix will be all 1s. Our second column should contain x_1^2, x_2^2, ..., x_n^2. This means we take each datapoint x_i and square it to form the second column of X.
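A short numpy sketch (not part of the original solution) that squares the x values and stacks them next to a column of 1s:

```python
import numpy as np

x = np.array([2, -1, 3, -7, 3])
X = np.column_stack([np.ones(len(x)), x ** 2])
print(X)
# [[ 1.  4.]
#  [ 1.  1.]
#  [ 1.  9.]
#  [ 1. 49.]
#  [ 1.  9.]]
```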
Suppose, just in part (b), that after solving the normal equations, we find \vec{w}^* = \begin{bmatrix} 2 \\ -5 \end{bmatrix}. What is the predicted y value for x = 2? Give your answer as an integer with no variables. Show your work.
Answer: -18
(2)(1)+(-5)(4)=-18
To find the predicted y value, we plug in x_i = 2 into the hypothesis function H(x_i) = w_0 + w_1x_i^2, or take the dot product of \vec{w}^* with \begin{bmatrix}1 \\ 2^2\end{bmatrix}.
\begin{align*} &\begin{bmatrix} 2 \\ -5 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 4 \end{bmatrix}\\ &(2)(1)+(-5)(4)\\ &2 - 20\\ &-18 \end{align*}
Let X_\text{tri} = 3 X. Using the fact that \sum_{i = 1}^n x_i^2 = n \sigma_x^2 + n \bar{x}^2, determine the value of the bottom-left value in the matrix X_\text{tri}^T X_\text{tri}, i.e. the value in the second row and first column. Give your answer as an expression involving n. Show your work.
Answer: 126n
X = \begin{bmatrix} 1 & x_1^2 \\ 1 & x_2^2 \\ \vdots & \vdots \\ 1 & x_n^2 \end{bmatrix}
X_{\text{tri}} = \begin{bmatrix} 3 & 3x_1^2 \\ 3 & 3x_2^2 \\ \vdots & \vdots \\ 3 & 3x_n^2 \end{bmatrix}
We want to know what the bottom left value of X_\text{tri}^T X_\text{tri} is. We figure this out with matrix multiplication!
\begin{align*} X_\text{tri}^T X_\text{tri} &= \begin{bmatrix} 3 & 3 & ... & 3\\ 3x_1^2 & 3x_2^2 & ... & 3x_n^2 \end{bmatrix} \begin{bmatrix} 3 & 3x_1^2 \\ 3 & 3x_2^2 \\ \vdots & \vdots \\ 3 & 3x_n^2 \end{bmatrix}\\ &= \begin{bmatrix} \sum_{i = 1}^n 3(3) & \sum_{i = 1}^n 3(3x_i^2) \\ \sum_{i = 1}^n 3(3x_i^2) & \sum_{i = 1}^n (3x_i^2)(3x_i^2)\end{bmatrix}\\ &= \begin{bmatrix} \sum_{i = 1}^n 9 & \sum_{i = 1}^n 9x_i^2 \\ \sum_{i = 1}^n 9x_i^2 & \sum_{i = 1}^n (3x_i^2)^2 \end{bmatrix} \end{align*}
We can see that the bottom left element should be \sum_{i = 1}^n 9x_i^2.
From here we can use the fact given to us in the directions: \sum_{i = 1}^n x_i^2 = n \sigma_x^2 + n \bar{x}^2.
\begin{align*} \sum_{i = 1}^n 9x_i^2 &= 9\sum_{i = 1}^n x_i^2\\ &= 9(n \sigma_x^2 + n \bar{x}^2)\\ &= 9(10n + 2^2n)\\ &= 9(10n + 4n)\\ &= 9(14n) = 126n \end{align*}
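If you want to convince yourself of both the identity \sum_{i=1}^n x_i^2 = n \sigma_x^2 + n \bar{x}^2 and the matrix computation, here is a numpy sketch (not part of the original solution). It uses arbitrary data, since the identity holds for any dataset when \sigma_x^2 is the population variance (numpy's default for np.var):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
x = rng.normal(size=n)                   # any dataset works for this check

X     = np.column_stack([np.ones(n), x ** 2])
X_tri = 3 * X

bottom_left = (X_tri.T @ X_tri)[1, 0]    # second row, first column
claimed     = 9 * (n * np.var(x) + n * np.mean(x) ** 2)

print(np.isclose(bottom_left, claimed))  # True
```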
Consider the vectors \vec{u} and \vec{v}, defined below.
\vec{u} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \qquad \vec{v} = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}
We define X \in \mathbb{R}^{3 \times 2} to be the matrix whose first column is \vec u and whose second column is \vec v.
In this part only, let \vec{y} = \begin{bmatrix} -1 \\ k \\ 252 \end{bmatrix}.
Find a scalar k such that \vec{y} is in \text{span}(\vec u, \vec v). Give your answer as a constant with no variables.
Answer: 252.
Vectors in \text{span}(\vec u, \vec v) are all linear combinations of \vec{u} and \vec{v}, meaning they have the form:
a \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} + b \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} a \\ b \\ b \end{bmatrix}
From this, any vector in the span must have its second and third components equal. Since the third component of \vec{y} is 252, k must equal 252.
Show that: (X^TX)^{-1}X^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix}
Hint: If A = \begin{bmatrix} a_1 & 0 \\ 0 & a_2 \end{bmatrix}, then A^{-1} = \begin{bmatrix} \frac{1}{a_1} & 0 \\ 0 & \frac{1}{a_2} \end{bmatrix}.
Answer: We can construct the following series of matrices to get (X^TX)^{-1}X^T.

X^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \end{bmatrix}, \qquad X^TX = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}

Using the hint, (X^TX)^{-1} = \begin{bmatrix} 1 & 0 \\ 0 & \frac{1}{2} \end{bmatrix}, so

(X^TX)^{-1}X^T = \begin{bmatrix} 1 & 0 \\ 0 & \frac{1}{2} \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix}
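To double-check the matrix algebra numerically, a short numpy sketch (not part of the original solution):

```python
import numpy as np

u = np.array([1, 0, 0])
v = np.array([0, 1, 1])
X = np.column_stack([u, v])

print(np.linalg.inv(X.T @ X) @ X.T)
# [[1.  0.  0. ]
#  [0.  0.5 0.5]]
```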
In parts 3 and 4 only, let \vec{y} = \begin{bmatrix} 4 \\ 2 \\ 8 \end{bmatrix}.
Find scalars a and b such that a \vec u + b \vec v is the vector in \text{span}(\vec u, \vec v) that is as close to \vec{y} as possible. Give your answers as constants with no variables.
Answer: a = 4, b = 5.
The result from part (b) implies that when using the normal equations to find coefficients for \vec u and \vec v (which we know from lecture produce an error vector whose length is minimized), the coefficient on \vec u must be y_1 and the coefficient on \vec v must be \frac{y_2 + y_3}{2}. This can be shown by taking the result from part (b), \begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix}, and multiplying it by the vector \vec y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}.
Here, y_1 = 4, so a = 4. We also know y_2 = 2 and y_3 = 8, so b = \frac{2+8}{2} = 5.
Let \vec{e} = \vec{y} - (a \vec u + b \vec v), where a and b are the values you found in part (c).
What is \lVert \vec{e} \rVert?
0
3 \sqrt{2}
4 \sqrt{2}
6
6 \sqrt{2}
2\sqrt{21}
Answer: 3 \sqrt{2}.
With a = 4 and b = 5, we have a \vec u + b \vec v = \begin{bmatrix} 4 \\ 5 \\ 5\end{bmatrix}. Then, \vec{e} = \begin{bmatrix} 4 \\ 2 \\ 8 \end{bmatrix} - \begin{bmatrix} 4 \\ 5 \\ 5 \end{bmatrix} = \begin{bmatrix} 0 \\ -3 \\ 3 \end{bmatrix}, which has a length of \sqrt{0^2 + (-3)^2 + 3^2} = \sqrt{18} = 3\sqrt{2}.
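A quick numpy check of this norm (a sketch, not part of the original solution):

```python
import numpy as np

y    = np.array([4, 2, 8])
proj = np.array([4, 5, 5])    # a*u + b*v with a = 4, b = 5
e    = y - proj

print(np.linalg.norm(e))      # 4.242640687..., which is 3 * sqrt(2)
```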
Is it true that, for any vector \vec{y} \in \mathbb{R}^3, we can find scalars c and d such that the sum of the entries in the vector \vec{y} - (c \vec u + d \vec v) is 0?
Yes, because \vec{u} and \vec{v} are linearly independent.
Yes, because \vec{u} and \vec{v} are orthogonal.
Yes, but for a reason that isn’t listed here.
No, because \vec{y} is not necessarily in \text{span}(\vec{u}, \vec{v}).
No, because neither \vec{u} nor \vec{v} is equal to the vector \begin{bmatrix} 1 & 1 & 1 \end{bmatrix}^T.
No, but for a reason that isn’t listed here.
Answer: Yes, but for a reason that isn’t listed here.
Here’s the full reason: for any scalars c and d, \vec{y} - (c \vec u + d \vec v) = \begin{bmatrix} y_1 - c \\ y_2 - d \\ y_3 - d \end{bmatrix}, so the sum of its entries is (y_1 + y_2 + y_3) - c - 2d. No matter what \vec{y} is, we can choose c and d so that c + 2d = y_1 + y_2 + y_3 (for instance, c = y_1 + y_2 + y_3 and d = 0), which makes the sum 0. Note that this does not require \vec{y} to be in \text{span}(\vec{u}, \vec{v}): we only need the sum of the entries of the residual to be 0, not the residual itself to be \vec{0}.
Suppose that Q \in \mathbb{R}^{100 \times 12}, \vec{s} \in \mathbb{R}^{100}, and \vec{f} \in \mathbb{R}^{12}. What are the dimensions of the following product?
\vec{s}^T Q \vec{f}
scalar
12 \times 1 vector
100 \times 1 vector
100 \times 12 matrix
12 \times 12 matrix
12 \times 100 matrix
undefined
Answer: Scalar.
The inner dimensions match: \vec{s}^T is 1 \times 100, Q is 100 \times 12, and \vec{f} is 12 \times 1, so \vec{s}^T Q \vec{f} is of shape 1 \times 1, i.e. a scalar.
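A quick shape check in numpy (a sketch, not part of the original solution), treating \vec{s} and \vec{f} as column vectors:

```python
import numpy as np

Q = np.ones((100, 12))
s = np.ones((100, 1))     # column vector in R^100
f = np.ones((12, 1))      # column vector in R^12

result = s.T @ Q @ f      # (1 x 100)(100 x 12)(12 x 1)
print(result.shape)       # (1, 1) -- a single number, i.e. a scalar
```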
Suppose we want to predict how long it takes to run a Jupyter notebook on Datahub. For 100 different Jupyter notebooks, we collect the following 5 pieces of information:
cells: number of cells in the notebook
lines: number of lines of code
max iterations: largest number of iterations in any loop in the notebook, or 1 if there are no loops
variables: number of variables defined in the notebook
runtime: number of seconds for the notebook to run on Datahub
Then we use multiple regression to fit a prediction rule of the form
H(\text{cells}_i, \text{lines}_i, \text{max iterations}_i, \text{variables}_i) = w_0 + w_1 \cdot \text{cells}_i \cdot \text{lines}_i + w_2 \cdot (\text{max iterations}_i)^{\text{variables}_i - 10}
What are the dimensions of the design matrix X?
\begin{bmatrix} & & & \\ & & & \\ & & & \\ \end{bmatrix}_{r \times c}
That is, what should r and c be, where the design matrix has r rows and c columns?
Answer:
100 \text{ rows} \times 3 \text{ columns}
There should be 100 rows because there are 100 different Jupyter notebooks, each contributing one row of data. There should be 3 columns, one for each parameter: w_0 gives a column of all 1s, w_1 gives a column containing \text{cells}_i \cdot \text{lines}_i, and w_2 gives a column containing (\text{max iterations}_i)^{\text{variables}_i - 10}.
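To make the construction concrete, here is a numpy/pandas sketch (not part of the original solution). The DataFrame, its column names, and the toy values are all made up for illustration; only the shape and the three columns matter:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data -- the real 100-notebook dataset isn't shown here.
notebooks = pd.DataFrame({
    'cells':          [10, 25, 40],
    'lines':          [80, 200, 150],
    'max_iterations': [1.0, 500.0, 30.0],
    'variables':      [12, 9, 15],
})

n = len(notebooks)
X = np.column_stack([
    np.ones(n),                                                    # column for w_0
    notebooks['cells'] * notebooks['lines'],                       # column for w_1
    notebooks['max_iterations'] ** (notebooks['variables'] - 10),  # column for w_2
])
print(X.shape)   # (3, 3) here; (100, 3) with the full dataset
```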
In one sentence, what does the entry in row 3, column 2 of the design matrix X represent? (Count rows and columns starting at 1, not 0).
Answer:
This entry represents the product of the number of cells and number of lines of code for the third Jupyter notebook in the training dataset.
Consider the dataset shown below.
x^{(1)} | x^{(2)} | x^{(3)} | y |
---|---|---|---|
0 | 6 | 8 | -5 |
3 | 4 | 5 | 7 |
5 | -1 | -3 | 4 |
0 | 2 | 1 | 2 |
We want to use multiple regression to fit a prediction rule of the form H(x_i^{(1)}, x_i^{(2)}, x_i^{(3)}) = w_0 + w_1 x_i^{(1)} x_i^{(3)} + w_2 (x_i^{(2)} - x_i^{(3)})^2. Write down the design matrix X and observation vector \vec{y} for this scenario. No justification needed.
Answer: The design matrix X and observation vector \vec{y} are given by:
X = \begin{bmatrix} 1 & 0 & 4\\ 1 & 15 & 1\\ 1 & -15 & 4\\ 1 & 0 & 1 \end{bmatrix}
\vec{y} = \begin{bmatrix} -5\\ 7\\ 4\\ 2 \end{bmatrix}
The observation vector \vec{y} contains the target values from the dataset.
For the design matrix X, each row corresponds to one data point in our dataset, where x_i^{(1)}, x_i^{(2)}, and x_i^{(3)} represent three separate features for the i-th data point. Each row of X has the form \begin{bmatrix}1 & x_i^{(1)}x_i^{(3)} & (x_i^{(2)}-x_i^{(3)})^2\end{bmatrix}. The first column consists of all 1’s for the bias term w_0, which is not affected by the feature values.
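A short numpy sketch (not part of the original solution) that builds this X and \vec{y} directly from the table:

```python
import numpy as np

data = np.array([
    [0,  6,  8, -5],
    [3,  4,  5,  7],
    [5, -1, -3,  4],
    [0,  2,  1,  2],
])
x1, x2, x3, y = data.T

X = np.column_stack([np.ones(len(y)), x1 * x3, (x2 - x3) ** 2])
print(X)
# [[  1.   0.   4.]
#  [  1.  15.   1.]
#  [  1. -15.   4.]
#  [  1.   0.   1.]]
print(y)   # [-5  7  4  2]
```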
For the X and \vec{y} that you have written down, let \vec{w} be the optimal parameter vector, which comes from solving the normal equations X^TX\vec{w}=X^T\vec{y}. Let \vec{e} = \vec{y} - X \vec{w} be the error vector, and let e_i be the ith component of this error vector. Show that 4e_1+e_2+4e_3+e_4=0.
Answer: The key to this problem is the fact that the error vector, \vec{e}, is orthogonal to the columns of the design matrix, X. As a refresher, if \vec{w} satisfies the normal equations, then X^T\vec{e} = \vec{0}. Here's why:
We can rewrite the normal equation (X^TX\vec{w}=X^T\vec{y}) to allow substitution for \vec{e} = \vec{y} - X \vec{w}.
\begin{align*} X^TX\vec{w} &= X^T\vec{y} \\ \vec{0} &= X^T\vec{y} - X^TX\vec{w} \\ \vec{0} &= X^T(\vec{y}-X\vec{w}) \\ \vec{0} &= X^T\vec{e} \end{align*}
The first step is to find X^T, which is easy because we found X above: \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 15 & -15 & 0 \\ 4 & 1 & 4 & 1 \end{bmatrix}
And now we can plug X^T and \vec e into our equation 0 = X^T\vec{e}. It might be easiest to find the right side first:
X^T\vec{e} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 15 & -15 & 0 \\ 4 & 1 & 4 & 1 \end{bmatrix} \cdot \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4\end{bmatrix}
= \begin{bmatrix} e_1 + e_2 + e_3 + e_4 \\ 15e_2 - 15e_3 \\ 4e_1 + e_2 + 4e_3 + e_4\end{bmatrix}
Finally, we set it equal to \vec{0}:
\begin{align*} 0 &= e_1 + e_2 + e_3 + e_4 \\ 0 &= 15e_2 - 15e_3 \\ 0 &= 4e_1 + e_2 + 4e_3 + e_4 \end{align*}
With this we have shown that 4e_1+e_2+4e_3+e_4=0.
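You can also verify this numerically with a numpy sketch (not part of the original solution) that solves the normal equations for this dataset and checks that X^T\vec{e} is (approximately) \vec{0}:

```python
import numpy as np

X = np.array([[1,   0, 4],
              [1,  15, 1],
              [1, -15, 4],
              [1,   0, 1]], dtype=float)
y = np.array([-5, 7, 4, 2], dtype=float)

w = np.linalg.solve(X.T @ X, X.T @ y)    # solve the normal equations
e = y - X @ w

print(X.T @ e)                           # all entries ~0 (up to floating point)
print(4*e[0] + e[1] + 4*e[2] + e[3])     # ~0
```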
Let X be a design matrix with 4 columns, such that the first column is a column of all 1s. Let \vec{y} be an observation vector. Let \vec{w}^* = (X^TX)^{-1}X^T\vec{y}. We’ll name the components of \vec{w}^* as follows:
\vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \\ w_2^* \\ w_3^* \end{bmatrix}
In this problem, we’ll consider various modifications to the design matrix and see how they affect the solution to the normal equations.
Let X_a be the design matrix that comes from interchanging the first two columns of X. Let \vec{v}^* = (X_a^TX_a)^{-1}X_a^T\vec{y}. Express the components \vec{v}^* in terms of w_0^*, w_1^*, w_2^*, and w_3^* (which were the components of \vec{w}^*).
Answer: \vec{v}^* = \begin{bmatrix} w_1^* \\ w_0^* \\ w_2^* \\ w_3^* \end{bmatrix}
Suppose our original prediction rule was of the form: H(x_i^{(1)}, x_i^{(2)}, x_i^{(3)}) = w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + w_3 x_i^{(3)}.
Because the span of the resulting design matrix has not changed, the optimal predictions themselves will not change, because the optimal predictions come from projecting \vec{y} onto span(X). So, the problem boils down to figuring out how to choose the coefficients in \vec{v}^* so that the predictions of the resulting model are the same as those in the original model.
Swapping the first two columns of our design matrix changes the prediction rule to be of the form: H(x_i^{(1)}, x_i^{(2)}, x_i^{(3)}) = v_1 + v_0 x_i^{(1)} + v_2 x_i^{(2)} + v_3 x_i^{(3)}.
Therefore the optimal parameters for the new model are related to the optimal parameters for the original model by: \begin{aligned} v_0^* &= w_1^* \\ v_1^* &= w_0^* \\ v_2^* &= w_2^* \\ v_3^* &= w_3^* \end{aligned}
Intuitively, when we interchange two columns of our design matrix, all that does is interchange the terms in the prediction rule, which interchanges those weights in the parameter vector.
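A quick numerical illustration (a sketch, not part of the original solution, using randomly generated data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])
y = rng.normal(size=20)
w = np.linalg.solve(X.T @ X, X.T @ y)

X_a = X[:, [1, 0, 2, 3]]                    # interchange the first two columns
v = np.linalg.solve(X_a.T @ X_a, X_a.T @ y)

print(np.allclose(v, w[[1, 0, 2, 3]]))      # True: the first two weights swap
```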
Let X_b be the design matrix that comes from adding one to each entry of the first column of X. Let \vec{v}^* = (X_b^TX_b)^{-1}X_b^T\vec{y}. Express the components \vec{v}^* in terms of w_0^*, w_1^*, w_2^*, and w_3^* (which were the components of \vec{w}^*).
Answer: \vec{v}^* = \begin{bmatrix} \dfrac{w_0^*}{2} \\ w_1^* \\ w_2^* \\ w_3^*\end{bmatrix}
Suppose our original prediction rule was of the form: H(x_i^{(1)}, x_i^{(2)}, x_i^{(3)}) = w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + w_3 x_i^{(3)}.
Because the span of the resulting design matrix has not changed, the optimal predictions themselves will not change, because the optimal predictions come from projecting \vec{y} onto span(X). So, the problem boils down to figuring out how to choose the coefficients in \vec{v}^* so that the predictions of the resulting model are the same as those in the original model.
By adding one to each entry of the first column of the design matrix, we are changing the column of 1s to be a column of 2s. This changes the prediction rule to be of the form: H(x_i^{(1)}, x_i^{(2)}, x_i^{(3)}) = v_0 \cdot 2 + v_1 x_i^{(1)} + v_2 x_i^{(2)} + v_3 x_i^{(3)}.
To keep the predictions the same, we need to offset this change in the coefficients. Therefore the optimal parameters for the new model are related to the optimal parameters for the original model by: \begin{aligned} v_0^* &= \dfrac{w_0^*}{2} \\ v_1^* &= w_1^* \\ v_2^* &= w_2^* \\ v_3^* &= w_3^* \end{aligned}
This is saying we just halve the intercept term. For example, imagine fitting a line to data in \mathbb{R}^2 and finding that the best-fitting line is y=12+3x. If we had to write this in the form y=v_0\cdot 2 + v_1x, we would find that the best choice for v_0 is 6 and the best choice for v_1 is 3.
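The same kind of numerical check works here (a sketch, not part of the original solution, using randomly generated data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])
y = rng.normal(size=20)
w = np.linalg.solve(X.T @ X, X.T @ y)

X_b = X.copy()
X_b[:, 0] += 1                              # the column of 1s becomes a column of 2s
v = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

print(np.allclose(v, [w[0] / 2, w[1], w[2], w[3]]))   # True: the intercept halves
```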
Let X_c be the design matrix that comes from adding one to each entry of the third column of X. Let \vec{v}^* = (X_c^TX_c)^{-1}X_c^T\vec{y}. Express the components \vec{v}^* in terms of w_0^*, w_1^*, w_2^*, and w_3^*, which were the components of \vec{w}^*.
Answer: \vec{v}^* = \begin{bmatrix} w_0^* - w_2^* \\ w_1^* \\ w_2^* \\ w_3^* \end{bmatrix}
Suppose our original prediction rule was of the form: H(x_i^{(1)}, x_i^{(2)}, x_i^{(3)}) = w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + w_3 x_i^{(3)}.
Because the span of the resulting design matrix has not changed, the optimal predictions themselves will not change, because the optimal predictions come from projecting \vec{y} onto span(X). So, the problem boils down to figuring out how to choose the coefficients in \vec{v}^* so that the predictions of the resulting model are the same as those in the original model.
Adding one to each entry of the third column of the design matrix changes the prediction rule to be of the form: \begin{aligned} H(x_i^{(1)}, x_i^{(2)}, x_i^{(3)}) &= v_0 + v_1 x_i^{(1)} + v_2(x_i^{(2)}+1) + v_3 x_i^{(3)} \\ &= (v_0 + v_2) + v_1 x_i^{(1)} + v_2 x_i^{(2)} + v_3 x_i^{(3)} \end{aligned}
To keep the predictions the same, we need to offset this change in the coefficients. Therefore the optimal parameters for the new model are related to the optimal parameters for the original model by \begin{aligned} v_0^* &= w_0^* - w_2^* \\ v_1^* &= w_1^* \\ v_2^* &= w_2^* \\ v_3^* &= w_3^* \end{aligned}
One way to think about this is that if we replace x_i^{(2)} with x_i^{(2)}+1, then our predictions will increase by the coefficient of x_i^{(2)}. In order to keep our predictions the same, we would need to adjust our intercept term by subtracting this same amount.
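One last numerical check for this part (a sketch, not part of the original solution, using randomly generated data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])
y = rng.normal(size=20)
w = np.linalg.solve(X.T @ X, X.T @ y)

X_c = X.copy()
X_c[:, 2] += 1                              # add one to each entry of the third column
v = np.linalg.solve(X_c.T @ X_c, X_c.T @ y)

print(np.allclose(v, [w[0] - w[2], w[1], w[2], w[3]]))   # True: intercept shifts by -w_2
```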