← return to study.practicaldsc.org
The problems in this worksheet are taken from past exams in similar
classes. Work on them on paper, since the exams you
take in this course will also be on paper.
We encourage you to
complete this worksheet in a live discussion section. Solutions will be
made available after all discussion sections have concluded. You don’t
need to submit your answers anywhere.
Note: We do not plan to
cover all problems here in the live discussion section; the problems
we don’t cover can be used for extra practice.
Billy decides to take on a part-time job as a waiter at the Panda Express in Pierpont. For two months, he kept track of all of the total bills he gave out to customers along with the tips they then gave him, all in dollars. Below is a scatter plot of Billy’s tips and total bills.
Throughout this question, assume we are trying to fit a linear prediction rule H(x) = w_0 + w_1x that uses total bills to predict tips, and assume we are finding optimal parameters by minimizing mean squared error.
Which of these is the most likely value for r, the correlation between total bill and tips? Why?
-1 \qquad -0.75 \qquad -0.25 \qquad 0 \qquad 0.25 \qquad 0.75 \qquad 1
0.75.
It seems like there is a pretty strong, but not perfect, linear association between total bills and tips.
The variance of the tip amounts is 2.1. Let M be the mean squared error of the best linear prediction rule on this dataset (under squared loss). Is M less than, equal to, or greater than 2.1? How can you tell?
M is less than 2.1. The variance is equal to the MSE of the constant prediction rule.
Note that the MSE of the best linear prediction rule will always be less than or equal to the MSE of the best constant prediction rule h. The only case in which these two MSEs are the same is when the best linear prediction rule is a flat line with slope 0, which is the same as a constant prediction. In all other cases, the linear prediction rule will make better predictions and hence have a lower MSE than the constant prediction rule.
In this case, the best linear prediction rule is clearly not flat, so M < 2.1.
Suppose we use the formulas from class on Billy’s dataset and calculate the optimal slope w_1^* and intercept w_0^* for this prediction rule.
Suppose we add the value of 1 to every total bill x, effectively shifting the scatter plot 1 unit to the right. Note that doing this does not change the value of w_1^*. What amount should we add to each tip y so that the value of w_0^* also does not change? Your answer should involve one or more of \bar{x}, \bar{y}, w_0^*, w_1^*, and any constants.
Note: To receive full points, you must provide a rigorous explanation, though this explanation only takes a few lines. However, we will award partial credit to solutions with the correct answer, and it’s possible to arrive at the correct answer by drawing a picture and thinking intuitively about what happens.
We should add w_1^* to each tip y.
First, we present the rigorous solution.
Let \bar{x}_\text{old} represent the previous mean of the x’s and \bar{x}_\text{new} represent the new mean of the x’s. Then, we know that \bar{x}_\text{new} = \bar{x}_\text{old} + 1.
Also, let \bar{y}_\text{old} and \bar{y}_\text{new} represent the old and new mean of the y’s. We will try and find a relationship between these two quantities.
We want the two intercepts to be the same. The intercept for the old line is \bar{y}_\text{old} - w_1^* \bar{x}_\text{old} and the intercept for the new line is \bar{y}_\text{new} - w_1^* \bar{x}_\text{new}. Setting these equal yields
\begin{aligned} \bar{y}_\text{new} - w_1^* \bar{x}_\text{new} &= \bar{y}_\text{old} - w_1^* \bar{x}_\text{old} \\ \bar{y}_\text{new} - w_1^* (\bar{x}_\text{old} + 1) &= \bar{y}_\text{old} - w_1^* \bar{x}_\text{old} \\ \bar{y}_\text{new} &= \bar{y}_\text{old} - w_1^* \bar{x}_\text{old} + w_1^* (\bar{x}_\text{old} + 1) \\ \bar{y}_{\text{new}} &= \bar{y}_\text{old} + w_1^* \end{aligned}
Thus, in order for the intercepts to be equal, we need the mean of the new y’s to be w_1^* greater than the mean of the old y’s. Since we’re told we’re adding the same constant to each y that constant is w_1^*.
Another way to approach the question is as follows: consider any point that lies directly on a line with slope w_1^* and intercept w_0^*. Consider how the slope between two points on a line is calculated: \text{slope} = \frac{y_2 - y_1}{x_2 - x_1}. If x_2 - x_1 = 1, in order for the slope to remain fixed we must have that y_2 - y_1 = \text{slope}. For a concrete example, think of the line y = 5x + 2. The point (1, 7) is on the line, as is the point (1 + 1, 7 + 5) = (2, 12).
In our case, none of our points are guaranteed to be on the line defined by slope w_1^* and intercept w_0^*. Instead, we just want to be guaranteed that the points have the same regression line after being shifted. If we follow the same principle, though, and add 1 to every x and w_1^* to every y, the points’ relative positions to the line will not change (i.e. the vertical distance from each point to the line will not change), and so that will remain the line with the lowest MSE, and hence w_0^* and w_1^* won’t change.
Suppose we have a dataset of n houses that were recently sold in the Ann Arbor area. For each house, we have its square footage and most recent sale price. The correlation between square footage and price is r.
First, we minimize mean squared error to fit a linear prediction rule that uses square footage to predict price. The resulting prediction rule has an intercept of w_0^* and slope of w_1^*. In other words,
\text{predicted price} = w_0^* + w_1^* \cdot \text{square footage}
We’re now interested in minimizing mean squared error to fit a linear prediction rule that uses price to predict square footage. Suppose this new regression line has an intercept of \beta_0^* and slope of \beta_1^*.
What is \beta_1^*? Give your answer in terms of one or more of n, r, w_0^*, and w_1^*. Show your work.
\beta_1^* = \frac{r^2}{w_1^*}
Throughout this solution, let x represent square footage and y represent price.
We know that w_1^* = r \frac{\sigma_y}{\sigma_x}. But what about \beta_1^*?
When we take a rule that predicts price from square footage and transform it into a rule that predicts square footage from price, the roles of x and y have swapped; suddenly, square footage is no longer our independent variable, but our dependent variable, and vice versa for price. This means that the altered dataset we work with when using our new prediction rule has \sigma_x standard deviation for its dependent variable (square footage), and \sigma_y for its independent variable (price). So, we can write the formula for \beta_1^* as follows: \beta_1^* = r \frac{\sigma_x}{\sigma_y}
In essence, swapping the independent and dependent variables of a dataset changes the slope of the regression line from r \frac{\sigma_y}{\sigma_x} to r \frac{\sigma_x}{\sigma_y}.
From here, we can use a little algebra to get our \beta_1^* in terms of one or more n, r, w_0^*, and w_1^*:
\begin{align*} \beta_1^* &= r \frac{\sigma_x}{\sigma_y} \\ w_1^* \cdot \beta_1^* &= w_1^* \cdot r \frac{\sigma_x}{\sigma_y} \\ w_1^* \cdot \beta_1^* &= ( r \frac{\sigma_y}{\sigma_x}) \cdot r \frac{\sigma_x}{\sigma_y} \end{align*}
The fractions \frac{\sigma_y}{\sigma_x} and \frac{\sigma_x}{\sigma_y} cancel out and we get:
\begin{align*} w_1^* \cdot \beta_1^* &= r^2 \\ \beta_1^* &= \frac{r^2}{w_1^*} \end{align*}
For this part only, assume that the following quantities hold:
Given this information, what is \beta_0^*? Give your answer as a constant, rounded to two decimal places. Show your work.
\beta_0^* = 1278.56
We start with the formula for the intercept of the regression line. Note that x and y are opposite what they’d normally be since we’re using price to predict square footage.
\beta_0^* = \bar{x} - \beta_1^* \bar{y}
We’re told that the average square footage of homes in the dataset is 2000, so \bar{x} = 2000. We also know from part (a) that \beta_1^* = \frac{r^2}{w_1^*}, and from the information given in this part this is \beta_1^* = \frac{r^2}{w_1^*} = \frac{0.6^2}{250}.
Finally, we need the average price of all homes in the dataset, \bar{y}. We aren’t given this information directly, but we can use the fact that (\bar{x}, \bar{y}) are on the regression line that uses square footage to predict price to find \bar{y}. Specifically, we have that \bar{y} = w_0^* + w_1^* \bar{x}; we know that w_0^* = 1000, \bar{x} = 2000, and w_1^* = 250, so \bar{y} = 1000 + 2000 \cdot 250 = 501000.
Putting these pieces together, we have
\begin{align*} \beta_0^* &= \bar{x} - \beta_1^* \bar{y} \\ &= 2000 - \frac{0.6^2}{250} \cdot 501000 \\ &= 2000 - 0.6^2 \cdot 2004 \\ &= 1278.56 \end{align*}
Suppose we want to fit a hypothesis function of the form:
H(x) = w_0 + w_1 x^2
Note that this is not the simple linear regression hypothesis function, H(x) = w_0 + w_1x.
To do so, we will find the optimal parameter vector \vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \end{bmatrix} that satisfies the normal equations. The first 5 rows of our dataset are as follows, though note that our dataset has n rows in total.
x | y |
---|---|
2 | 4 |
-1 | 4 |
3 | 4 |
-7 | 4 |
3 | 4 |
Suppose that x_1, x_2, ..., x_n have a mean of \bar{x} = 2 and a variance of \sigma_x^2 = 10.
Write out the first 5 rows of the design matrix, X.
X = \begin{bmatrix} 1 & 4 \\ 1 & 1 \\ 1 & 9 \\ 1 & 49 \\ 1 & 9 \end{bmatrix}
Recall our hypothesis function is H(x) = w_0 + w_1x^2. Since there is a w_0 present our X matrix should contain a column of ones. This means that our first column will be ones. Our second column should be x^2. This means we take each datapoint x and square it inside of X.
The average score on this problem was 84%.
Suppose, just in part (b), that after solving the normal equations, we find \vec{w}^* = \begin{bmatrix} 2 \\ -5 \end{bmatrix}. What is the predicted y value for x = 2? Give your answer as an integer with no variables. Show your work.
(2)(1)+(-5)(4)=-18
To find the predicted y value all you need to do is plug x = 2 into the hypothesis function H(x) = w_0 + w_1x^2, or take the dot product of \vec{w}^* with \begin{bmatrix}1 \\ 2^2\end{bmatrix}.
\begin{align*} &\begin{bmatrix} 2 \\ -5 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 4 \end{bmatrix}\\ &(2)(1)+(-5)(4)\\ &2 - 20\\ &-18 \end{align*}
The average score on this problem was 78%.
Let X_\text{tri} = 3 X. Using the fact that \sum_{i = 1}^n x_i^2 = n \sigma_x^2 + n \bar{x}^2, determine the value of the bottom-left value in the matrix X_\text{tri}^T X_\text{tri}, i.e. the value in the second row and first column. Give your answer as an expression involving n. Show your work.
126n
To figure out a pattern it can be easier to use variables instead of numbers. Like so:
X = \begin{bmatrix} 1 & x_1^2 \\ 1 & x_2^2 \\ \vdots & \vdots \\ 1 & x_n^2 \end{bmatrix}
We can now create X_{\text{tri}}:
X_{\text{tri}} = \begin{bmatrix} 3 & 3x_1^2 \\ 3 & 3x_2^2 \\ \vdots & \vdots \\ 3 & 3x_n^2 \end{bmatrix}
We want to know what the bottom left value of X_\text{tri}^T X_\text{tri} is. We figure this out with matrix multiplication!
\begin{align*} X_\text{tri}^T X_\text{tri} &= \begin{bmatrix} 3 & 3 & ... & 3\\ 3x_1^2 & 3x_2^2 & ... & 3x_n^2 \end{bmatrix} \begin{bmatrix} 3 & 3x_1^2 \\ 3 & 3x_2^2 \\ \vdots & \vdots \\ 3 & 3x_n^2 \end{bmatrix}\\ &= \begin{bmatrix} \sum_{i = 1}^n 3(3) & \sum_{i = 1}^n 3(3x_i^2) \\ \sum_{i = 1}^n 3(3x_i^2) & \sum_{i = 1}^n (3x_i^2)(3x_i^2)\end{bmatrix}\\ &= \begin{bmatrix} \sum_{i = 1}^n 9 & \sum_{i = 1}^n 9x_i^2 \\ \sum_{i = 1}^n 9x_i^2 & \sum_{i = 1}^n (3x_i^2)^2 \end{bmatrix} \end{align*}
We can see that the bottom left element should be \sum_{i = 1}^n 9x_i^2.
From here we can use the fact given to us in the directions: \sum_{i = 1}^n x_i^2 = n \sigma_x^2 + n \bar{x}^2.
\begin{align*} &\sum_{i = 1}^n 9x_i^2\\ &9\sum_{i = 1}^n x_i^2\\ &\text{Notice now we can replace $\sum_{i = 1}^n x_i^2$ with $n \sigma_x^2 + n \bar{x}^2$.}\\ &9(n \sigma_x^2 + n \bar{x}^2)\\ &\text{We know that $\sigma_x^2 = 10$ and $\bar x = 2$ fron the directions before part a.}\\ &9(10n + 2^2n)\\ &9(10n + 4n)\\ &9(14n) = 126n \end{align*}
The average score on this problem was 39%.
Suppose we’re given a dataset of n points, (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where \bar{x} is the mean of x_1, x_2, ..., x_n and \bar{y} is the mean of y_1, y_2, ..., y_n.
Using this dataset, we create a transformed dataset of n points, (x_1', y_1'), (x_2', y_2'), ..., (x_n', y_n'), where:
x_i' = 4x_i - 3 \qquad y_i' = y_i + 24
That is, the transformed dataset is of the form (4x_1 - 3, y_1 + 24), ..., (4x_n - 3, y_n + 24).
We decide to fit a simple linear hypothesis function H(x') = w_0 + w_1x' on the transformed dataset using squared loss. We find that w_0^* = 7 and w_1^* = 2, so H^*(x') = 7 + 2x'.
Suppose we were to fit a simple linear hypothesis function through the original dataset, (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), again using squared loss. What would the optimal slope be?
2
4
6
8
11
12
24
8.
Relative to the dataset with x', the dataset with x has an x-variable that’s “compressed” by a factor of 4, so the slope increases by a factor of 4 to 2 \cdot 4 = 8.
Concretely, this can be shown by looking at the formula 2 = r\frac{SD(y')}{SD(x')}, recognizing that SD(y') = SD(y) since the y values have the same spread in both datasets, and that SD(x') = 4 SD(x).
Recall, the hypothesis function H^* was fit on the transformed dataset,
(x_1', y_1'), (x_2', y_2'), ..., (x_n', y_n'). H^* happens to pass through the point (\bar{x}, \bar{y}). What is the value of \bar{x}? Give your answer as an integer with no variables.
5.
The key idea is that the regression line always passes through (\text{mean } x, \text{mean } y) in the dataset we used to fit it. So, we know that: 2 \bar{x'} + 7 = \bar{y'}. This first equation can be rewritten as: 2 \cdot (4\bar{x} - 3) + 7 = \bar{y} + 24.
We’re also told this line passes through (\bar{x}, \bar{y}), which means that it’s also true that: 2 \bar{x} + 7 = \bar{y}.
Now we have a system of two equations:
\begin{cases} 2 \cdot (4\bar{x} - 3) + 7 = \bar{y} + 24 \\ 2 \bar{x} + 7 = \bar{y} \end{cases}
\dots and solving our system of two equations gives: \bar{x} = 5.
Suppose you have a dataset \{(x_1, y_1), (x_2,y_2), \dots, (x_8, y_8)\} with n=8 ordered pairs such that the variance of \{x_1, x_2, \dots, x_8\} is 50. Let m be the slope of the regression line fit to this data.
Suppose now we fit a regression line to the dataset \{(x_1, y_2), (x_2,y_1), \dots, (x_8, y_8)\} where the first two y-values have been swapped. Let m' be the slope of this new regression line.
If x_1 = 3, y_1 =7, x_2=8, and y_2=2, what is the difference between the new slope and the old slope? That is, what is m' - m? The answer you get should be a number with no variables.
Hint: There are many equivalent formulas for the slope of the regression line. We recommend using the version of the formula without \overline{y}.
m' - m = \dfrac{1}{16}
Using the formula for the slope of the regression line, we have:
\begin{aligned} m &= \frac{\sum_{i=1}^n (x_i - \overline x)y_i}{\sum_{i=1}^n (x_i - \overline x)^2}\\ &= \frac{\sum_{i=1}^n (x_i - \overline x)y_i}{n\cdot \sigma_x^2}\\ &= \frac{(3-\bar{x})\cdot 7 + (8 - \bar{x})\cdot 2 + \sum_{i=3}^n (x_i - \overline x)y_i}{8\cdot 50}. \\ \end{aligned}
Note that by switching the first two y-values, the terms in the sum from i=3 to n, the number of data points n, and the variance of the x-values are all unchanged.
So the slope becomes:
\begin{aligned} m' &= \frac{(3-\bar{x})\cdot 2 + (8 - \bar{x})\cdot 7 + \sum_{i=3}^n (x_i - \overline x)y_i}{8\cdot 50} \\ \end{aligned}
and the difference between these slopes is given by:
\begin{aligned} m'-m &= \frac{(3-\bar{x})\cdot 2 + (8 - \bar{x})\cdot 7 - ((3-\bar{x})\cdot 7 + (8 - \bar{x})\cdot 2)}{8\cdot 50}\\ &= \frac{(3-\bar{x})\cdot 2 + (8 - \bar{x})\cdot 7 - (3-\bar{x})\cdot 7 - (8 - \bar{x})\cdot 2}{8\cdot 50}\\ &= \frac{(3-\bar{x})\cdot (-5) + (8 - \bar{x})\cdot 5}{8\cdot 50}\\ &= \frac{ -15+5\bar{x} + 40 -5\bar{x}}{8\cdot 50}\\ &= \frac{ 25}{8\cdot 50}\\ &= \frac{ 1}{16} \end{aligned}
Suppose we are given a dataset of points \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\} and for some reason, we want to make predictions using a prediction rule of the form H(x) = 17 + w_1x.
Write down an expression for the mean squared error of a prediction rule of this form, as a function of the parameter w_1.
MSE(w_1) = \dfrac1n \displaystyle\sum_{i=1}^n (y_i - (17 + w_1x_i))^2
Minimize the function MSE(w_1) to find the parameter w_1^* which defines the optimal prediction rule H^*(x) = 17 + w_1^*x. Show all your work and explain your steps.
Fill in your final answer below:
w_1^* = \dfrac{\displaystyle\sum_{i=1}^n x_i(y_i - 17)}{\displaystyle\sum_{i=1}^n x_i^2}
To minimize a function of one variable, we need to take the derivative, set it equal to zero, and solve. \begin{aligned} MSE(w_1) &= \dfrac1n \displaystyle\sum_{i=1}^n (y_i - 17 - w_1x_i)^2 \\ MSE'(w_1) &= \dfrac1n \displaystyle\sum_{i=1}^n -2x_i(y_i - 17 - w_1x_i)) \qquad \text{using the chain rule} \\ 0 &= \dfrac1n \displaystyle\sum_{i=1}^n -2x_i(y_i - 17) + \dfrac1n \displaystyle\sum_{i=1}^n 2x_i^2w_1 \qquad \text{splitting up the sum} \\ 0 &= \displaystyle\sum_{i=1}^n -x_i(y_i - 17) + \displaystyle\sum_{i=1}^n x_i^2w_1 \qquad \text{multiplying through by } \frac{n}{2} \\ w_1 \displaystyle\sum_{i=1}^n x_i^2 &= \displaystyle\sum_{i=1}^n x_i(y_i - 17) \qquad \text{rearranging terms and pulling out } w_1 \\ w_1 & = \dfrac{\displaystyle\sum_{i=1}^n x_i(y_i - 17)}{\displaystyle\sum_{i=1}^n x_i^2} \end{aligned}
True or False: For an arbitrary dataset, the prediction rule H^*(x) = 17 + w_1^*x goes through the point (\bar x, \bar y).
True
False
False.
When we fit a prediction rule of the form H(x) = w_0+w_1x using simple linear regression, the formula for the intercept w_0 is designed to make sure the regression line passes through the point (\bar x, \bar y). Here, we don’t have the freedom to control our intercept, as it’s forced to be 17. This means we can’t guarantee that the prediction rule H^*(x) = 17 + w_1^*x goes through the point (\bar x, \bar y).
A simple example shows that this is the case. Consider the dataset (-2, 0) and (2, 0). The point (\bar x, \bar y) is the origin, but the prediction rule H^*(x) does not pass through the origin because it has an intercept of 17.
True or False: For an arbitrary dataset, the mean squared error associated with H^*(x) is greater than or equal to the mean squared error associated with the regression line.
True
False
True.
The regression line is the prediction rule of the form H(x) = w_0+w_1x with the smallest mean squared error (MSE). H^*(x) is one example of a prediction rule of that form so unless it happens to be the regression line itself, the regression line will have lower MSE because it was designed to have the lowest possible MSE. This means the MSE associated with H^*(x) is greater than or equal to the MSE associated with the regression line.
Albert collected 400 data points from a radiation detector. Each data point contains 3 features: feature A, feature B and feature C. The true particle energy E is also reported. Albert wants to design a linear regression algorithm to predict the energy E of each particle, given a combination of one or more of feature A, B, and C. As the first step, Albert calculated the correlation coefficients among A, B, C and E. He wrote it down in the following table, where each cell of the table represents the correlaton of two terms:
A | B | C | E | |
---|---|---|---|---|
A | 1 | -0.99 | 0.13 | 0.8 |
B | -0.99 | 1 | 0.25 | -0.95 |
C | 0.13 | 0.25 | 1 | 0.72 |
E | 0.8 | -0.95 | 0.72 | 1 |
Albert wants to start with a simple model: fitting only a single feature to obtain the true energy (i.e. y = w_0+w_1 x). Which feature should he choose as x to get the lowest mean square error?
A
B
C
B
B is the correct answer, because it has the highest absolute correlation (0.95), the negative sign in front of B just means it is negatively correlated to energy, and it can be compensated by a negative sign in the weight.
Albert wants to add another feature to his linear regression in part (a) to further boost the model’s performance. (i.e. y = w_0 + w_1 x + w_2 x_2) Which feature should he choose as x_2 to make additional improvements?
A
B
C
C
C is the correct answer, because although A has a higher correlation with energy, it also has an extremely high correlation with B (-0.99), that means adding A into the fit will not be too useful, since it provides almost the same information as B.
Albert further refines his algorithm by fitting a prediction rule of the form: \begin{aligned} H(A,B,C) = w_0 + w_1 \cdot A\cdot C + w_2 \cdot B^{C-7} \end{aligned}
Given this prediction rule, what are the dimensions of the design matrix X?
\begin{bmatrix} & & & \\ & & & \\ & & & \\ \end{bmatrix}_{r \times c}
So, what are r and c in r \text{ rows} \times c \text{ columns}?
400 \text{ rows} \times 3 \text{ columns}
Recall there are 400 data points, which means there will be 400 rows. There will be 3 columns; one is the bias column of all 1s, one is for the feature A\cdot C, and one is for the feature B^{C-7}.
Note that we have two simplified closed form expressions for the estimated slope w in simple linear regression that you have already seen in discussions and lectures:
\begin{align*} w &= \frac{\sum_i (x_i - \overline{x}) y_i}{\sum_i (x_i - \overline{x})^2} \\ \\ w &= \frac{\sum_i (y_i - \overline{y}) x_i }{\sum_i (x_i - \overline{x})^2} \end{align*}
where we have dataset D = [(x_1,y_1), \ldots, (x_n,y_n)] and sample means \overline{x} = {1 \over n} \sum_{i} x_i, \quad \overline{y} = {1 \over n} \sum_{i} y_i. Without further explanation, \sum_i means \sum_{i=1}^n
Are (1) and (2) equivalent? That is, is the following equality true? Prove or disprove it. \sum_i (x_i - \overline{x}) y_i = \sum_i (y_i - \overline{y}) x_i
True.
\begin{align*} & \sum_i (x_i - \overline{x}) y_i = \sum_i (y_i - \overline{y}) x_i \\ & \Leftrightarrow \sum_i x_i y_i - \overline{x} \sum_i y_i = \sum_i x_i y_i - \overline{y} \sum_i x_i \\ & \Leftrightarrow \overline{x} \sum_i y_i = \overline{y} \sum_i x_i \\ & \Leftrightarrow {1 \over n} \sum_i x_i \sum_i y_i = {1 \over n} \sum_i y_i \sum_i x_i \\ \end{align*}
True or False: If the dataset shifted right by a constant distance a, that is, we have the new dataset D_a = (x_1 + a,y_1), \ldots, (x_n + a,y_n), then will the estimated slope w change or not?
True
False
False. By (1) in part (a), we can view w as only being affected by x_i - \overline{x}, which is unchanged after shifting horizontally. Therefore, w is unchanged.
True or False: If the dataset shifted up by a constant distance b, that is, we have the new dataset D_b = [(x_1,y_1 + b), \ldots, (x_n,y_n + b)], then will the estimated slope w change or not?
True
False
False. By (2) in part (a), we can view w as only being affected by y_i - \overline{y}, which is unchanged after shifting vertically. Therefore, w is unchanged.
Consider a dataset that consists of y_1, \cdots, y_n. In class, we used calculus to minimize mean squared error, R_{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (h - y_i)^2. In this problem, we want you to apply the same approach to a slightly different loss function defined below: L_{\text{midterm}}(y,h)=(\alpha y - h)^2+\lambda h
Write down the empiricial risk R_{\text{midterm}}(h) by using the above loss function.
R_{\text{midterm}}(h)=\frac{1}{n}\sum_{i=1}^{n}[(\alpha y_i - h)^2+\lambda h]=[\frac{1}{n}\sum_{i=1}^{n}(\alpha y_i - h)^2] +\lambda h
The mean of dataset is \bar{y}, i.e. \bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i. Find h^* that minimizes R_{\text{midterm}}(h) using calculus. Your result should be in terms of \bar{y}, \alpha and \lambda.
h^*=\alpha \bar{y} - \frac{\lambda}{2}
\begin{align*} \frac{d}{dh}R_{\text{midterm}}(h)&= [\frac{2}{n}\sum_{i=1}^{n}(h- \alpha y_i )] +\lambda \\ &=2 h-2\alpha \bar{y} + \lambda. \end{align*}
By setting \frac{d}{dh}R_{\text{midterm}}(h)=0 we get 2 h^*-2\alpha \bar{y} + \lambda=0 \Rightarrow h^*=\alpha \bar{y} - \frac{\lambda}{2}.
For a given dataset \{y_1, y_2, \dots, y_n\}, let M_{abs}(h) represent the median absolute error of the constant prediction h on that dataset (as opposed to the mean absolute error R_{abs}(h)).
For the dataset \{4, 9, 10, 14, 15\}, what is M_{abs}(9)?
5
The first step is to calculate the absolute errors (|y_i - h|).
\begin{align*} \text{Absolute Errors} &= \{|4-9|, |9-9|, |10-9|, |14-9|, |15-9|\} \\ \text{Absolute Errors} &= \{|-5|, |0|, |1|, |5|, |6|\} \\ \text{Absolute Errors} &= \{5, 0, 1, 5, 6\} \end{align*}
Now we have to order the values inside of the absolute errors: \{0, 1, 5, 5, 6\}. We can see the median is 5, so M_{abs}(9) =5.
For the same dataset \{4, 9, 10, 14, 15\}, find another integer h such that M_{abs}(9) = M_{abs}(h).
5 or 15
Our goal is to find another number that will give us the same median of absolute errors as in part (a).
One way to do this is to plug in a number and guess. Another way requires noticing you can modify 10 (the middle element) to become 5 in either direction (negative or positive) because of the absolute value.
We can solve this equation to get |10-x| = 5 \rightarrow x = 15 \text{ and } x = 5.
We can then test this by following the same steps as we did in part (a).
For x = 15:
\begin{align*} \text{Absolute Errors} &= \{|4-15|, |9-15|, |10-15|, |14-15|, |15-15|\} \\ \text{Absolute Errors} &= \{|-11|, |-6|, |-5|, |-1|, |0|\} \\ \text{Absolute Errors} &= \{11, 6, 5, 1, 0\} \end{align*}
Then we order the elements to get the absolute errors: \{0, 1, 5, 6, 11\}. We can see the median is 5, so M_{abs}(15) =5.
For x = 5:
\begin{align*} \text{Absolute Errors} &= \{|4-5|, |9-5|, |10-5|, |14-5|, |15-5|\} \\ \text{Absolute Errors} &= \{|-1|, |4|, |5|, |9|, |10|\} \\ \text{Absolute Errors} &= \{1, 4, 5, 9, 10\} \end{align*}
We do not have to re-order the elements because they are in order already. We can see the median is 5, so M_{abs}(5) =5.
Based on your answers to parts (a) and (b), discuss in at most two sentences what is problematic about using the median absolute error to make predictions.
The numbers 5 and 15 are clearly bad predictions (close to the extreme values in the dataset), yet they are considered just as good a prediction by this metric as the number 9, which is roughly in the center of the dataset. Intuitively, 9 is a much better prediction, but this way of measuring the quality of a prediction does not recognize that.
Billy’s aunt owns a jewellery store, and gives him data on 5000 of the diamonds in her store. For each diamond, we have:
The first 5 rows of the 5000-row dataset are shown below:
carat | length | width | price |
---|---|---|---|
0.40 | 4.81 | 4.76 | 1323 |
1.04 | 6.58 | 6.53 | 5102 |
0.40 | 4.74 | 4.76 | 696 |
0.40 | 4.67 | 4.65 | 798 |
0.50 | 4.90 | 4.95 | 987 |
Billy has enlisted our help in predicting the price of a diamond
given various other features.
Suppose we want to fit a linear prediction rule that uses two features, carat and length, to predict price. Specifically, our prediction rule will be of the form
\text{predicted price} = w_0 + w_1 \cdot \text{carat} + w_2 \cdot \text{length}
We will use least squares to find \vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \\ w_2^* \end{bmatrix}.
Write out the first 5 rows of the design matrix, X. Your matrix should not have any variables in it.
X = \begin{bmatrix} 1 & 0.40 & 4.81 \\ 1 & 1.04 & 6.58 \\ 1 & 0.40 & 4.74 \\ 1 & 0.40 & 4.67 \\ 1 & 0.50 & 4.90 \end{bmatrix}
Suppose the optimal parameter vector \vec{w}^* is given by
\vec{w}^* = \begin{bmatrix} 2000 \\ 10000 \\ -1000 \end{bmatrix}
What is the predicted price of a diamond with 0.65 carats and a length of 4 centimeters? Show your work.
The predicted price is 4500 dollars.
2000 + 10000 \cdot 0.65 - 1000 \cdot 4 = 4500
Suppose \vec{e} = \begin{bmatrix} e_1 \\ e_2 \\ ... \\ e_n \end{bmatrix} is the error/residual vector, defined as
\vec{e} = \vec{y} - X \vec{w}^*
where \vec{y} is the observation vector containing the prices for each diamond.
For each of the following quantities, state whether they are guaranteed to be equal to 0 the scalar, \vec{0} the vector of all 0s, or neither. No justification is necessary.
Suppose we introduce two more features:
Suppose we also decide to remove the intercept term of our prediction rule. With all of these changes, our prediction rule is now
\text{predicted price} = w_1 \cdot \text{carat} + w_2 \cdot \text{length} + w_3 \cdot \text{width} + w_4 \cdot (\text{length} \cdot \text{width})
Let X be a design matrix with 4 columns, such that the first column is a column of all 1s. Let \vec{y} be an observation vector. Let \vec{w}^* = (X^TX)^{-1}X^T\vec{y}. We’ll name the components of \vec{w}^* as follows:
\vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \\ w_2^* \\ w_3^* \end{bmatrix}
In this problem, we’ll consider various modifications to the design matrix and see how they affect the solution to the normal equations.
Let X_a be the design matrix that comes from interchanging the first two columns of X. Let \vec{w_a}^* = (X_a^TX_a)^{-1}X_a^T\vec{y}. Express the components \vec{w_a}^* in terms of w_0^*, w_1^*, w_2^*, and w_3^* (which were the components of \vec{w}^*).
\vec{w_a}^* = \begin{bmatrix} w_1^* \\ w_0^* \\ w_2^* \\ w_3^* \end{bmatrix}
Suppose our original prediction rule was of the form: H(\vec{x}) = w_0 + w_1x_1+ w_2x_2+ w_3x_3.
Where: \vec{w_a}^* = \begin{bmatrix} v_0^* \\ v_1^* \\ v_2^* \\ v_3^* \end{bmatrix}
By swapping the first two columns of our design matrix, this changes the prediction rule to be of the form: H_2(\vec{x}) = v_1 + v_0x_1 + v_2x_2+ v_3x_3.
Therefore the optimal parameters for H_2 are related to the optimal parameters for H by: \begin{aligned} v_0^* &= w_1^* \\ v_1^* &= w_0^* \\ v_2^* &= w_2^* \\ v_3^* &= w_3^* \end{aligned}
Intuitively, when we interchange two columns of our design matrix, all that does is interchange the terms in the prediction rule, which interchanges those weights in the parameter vector.
Let X_b be the design matrix that comes from adding one to each entry of the first column of X. Let \vec{w_b}^* = (X_b^TX_b)^{-1}X_b^T\vec{y}. Express the components \vec{w_b}^* in terms of w_0^*, w_1^*, w_2^*, and w_3^* (which were the components of \vec{w}^*).
\vec{w_b}^* = \begin{bmatrix} \dfrac{w_0^*}{2} \\ w_1^* \\ w_2^* \\ w_3^*\end{bmatrix}
Suppose our original prediction rule was of the form: H(\vec{x}) = w_0 + w_1x_1+ w_2x_2+ w_3x_3.
Where: \vec{w_b}^* = \begin{bmatrix} v_0^* \\ v_1^* \\ v_2^* \\ v_3^* \end{bmatrix}
By adding one to each entry of the first column of the design matrix, we are changing the column of 1s to be a column of 2s. This changes the prediction rule to be of the form: H_2(\vec{x}) = v_0\cdot 2+ v_1x_1 + v_2x_2+ v_3x_3.
In order to compensate for these changes to our coefficients, we need to “offset” any alterations made to our coefficients. Therefore the optimal parameters for H_2 are related to the optimal parameters for H by: \begin{aligned} v_0^* &= \dfrac{w_0^*}{2} \\ v_1^* &= w_1^* \\ v_2^* &= w_2^* \\ v_3^* &= w_3^* \end{aligned}
This is saying we just halve the intercept term. For example, imagine fitting a line to data in \mathbb{R}^2 and finding that the best-fitting line is y=12+3x. If we had to write this in the form y=v_0\cdot 2 + v_1x, we would find that the best choice for v_0 is 6 and the best choice for v_1 is 3.
Let X_c be the design matrix that comes from adding one to each entry of the third column of X. Let \vec{w_c}^* = (X_c^TX_c)^{-1}X_c^T\vec{y}. Express the components \vec{w_c}^* in terms of w_0^*, w_1^*, w_2^*, and w_3^*, which were the components of \vec{w}^*.
\vec{w_c}^* = \begin{bmatrix} w_0^* - w_2^* \\ w_1^* \\ w_2^* \\ w_3^* \end{bmatrix}
Suppose our original prediction rule was of the form: H(\vec{x}) = w_0 + w_1x_1+ w_2x_2+ w_3x_3.
Where: \vec{w_c}^* = \begin{bmatrix} v_0^* \\ v_1^* \\ v_2^* \\ v_3^* \end{bmatrix}
By adding one to each entry of the third column of the design matrix, this changes the prediction rule to be of the form: \begin{aligned} H_2(\vec{x}) &= v_0+ v_1x_1 + v_2(x_2+1)+ v_3x_3 \\ &= (v_0 + v_2) + v_1x_1 + v_2x_2+ v_3x_3 \end{aligned}
In order to compensate for these changes to our coefficients, we need to “offset” any alterations made to our coefficients. Therefore the optimal parameters for H_2 are related to the optimal parameters for H by \begin{aligned} v_0^* &= w_0^* - w_2^* \\ v_1^* &= w_1^* \\ v_2^* &= w_2^* \\ v_3^* &= w_3^* \end{aligned}
One way to think about this is that if we replace x_2 with x_2+1, then our predictions will increase by the coefficient of x_2. In order to keep our predictions the same, we would need to adjust our intercept term by subtracting this same amount.
Suppose we want to predict how long it takes to run a Jupyter notebook on Datahub. For 100 different Jupyter notebooks, we collect the following 5 pieces of information:
cells: number of cells in the notebook
lines: number of lines of code
max iterations: largest number of iterations in any loop in the notebook, or 1 if there are no loops
variables: number of variables defined in the notebook
runtime: number of seconds for the notebook to run on Datahub
Then we use multiple regression to fit a prediction rule of the form H(\text{cells, lines, max iterations, variables}) = w_0 + w_1 \cdot \text{cells} \cdot \text{lines} + w_2 \cdot (\text{max iterations})^{\text{variables} - 10}
What are the dimensions of the design matrix X?
\begin{bmatrix} & & & \\ & & & \\ & & & \\ \end{bmatrix}_{r \times c}
So, what should r and c be for: r rows \times c columns.
100 \text{ rows} \times 3 \text{ columns}
There should be 100 rows because there are 100 different Jupyter notebooks with different information within them. There should be 3 columns, one for each w_i. In this case we have w_0, which means X will have a column of ones, w_1, which means X will have a second column of \text{cells} \cdot \text{lines}, and w_2, which will be the last column in X containing \text{max iterations})^{\text{variables} - 10}.
In one sentence, what does the entry in row 3, column 2 of the design matrix X represent? (Count rows and columns starting at 1, not 0).
This entry represents the product of the number of cells and number of lines of code for the third Jupyter notebook in the training dataset.
Consider the vectors \vec{u} and \vec{v}, defined below.
\vec{u} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \qquad \vec{v} = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}
We define X \in \mathbb{R}^{3 \times 2} to be the matrix whose first column is \vec u and whose second column is \vec v.
In this part only, let \vec{y} = \begin{bmatrix} -1 \\ k \\ 252 \end{bmatrix}.
Find a scalar k such that \vec{y} is in \text{span}(\vec u, \vec v). Give your answer as a constant with no variables.
252.
Vectors in \text{span}(\vec u, \vec v) must have an equal 2nd and 3rd component, and the third component is 252, so the second must be as well.
Show that: (X^TX)^{-1}X^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix}
Hint: If A = \begin{bmatrix} a_1 & 0 \\ 0 & a_2 \end{bmatrix}, then A^{-1} = \begin{bmatrix} \frac{1}{a_1} & 0 \\ 0 & \frac{1}{a_2} \end{bmatrix}.
We can construct the following series of matrices to get (X^TX)^{-1}X^T.
In parts (c) and (d) only, let \vec{y} = \begin{bmatrix} 4 \\ 2 \\ 8 \end{bmatrix}.
Find scalars a and b such that a \vec u + b \vec v is the vector in \text{span}(\vec u, \vec v) that is as close to \vec{y} as possible. Give your answers as constants with no variables.
a = 4, b = 5.
The result from the part (b) implies that when using the normal equations to find coefficients for \vec u and \vec v – which we know from lecture produce an error vector whose length is minimized – the coefficient on \vec u must be y_1 and the coefficient on \vec v must be \frac{y_2 + y_3}{2}. This can be shown by taking the result from part (b), \begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix}, and multiplying it by the vector \vec y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}.
Here, y_1 = 4, so a = 4. We also know y_2 = 2 and y_3 = 8, so b = \frac{2+8}{2} = 5.
Let \vec{e} = \vec{y} - (a \vec u + b \vec v), where a and b are the values you found in part (c).
What is \lVert \vec{e} \rVert?
0
3 \sqrt{2}
4 \sqrt{2}
6
6 \sqrt{2}
2\sqrt{21}
3 \sqrt{2}.
The correct value of a \vec u + b \vec v = \begin{bmatrix} 4 \\ 5 \\ 5\end{bmatrix}. Then, \vec{e} = \begin{bmatrix} 4 \\ 2 \\ 8 \end{bmatrix} - \begin{bmatrix} 4 \\ 5 \\ 5 \end{bmatrix} = \begin{bmatrix} 0 \\ -3 \\ 3 \end{bmatrix}, which has a length of \sqrt{0^2 + (-3)^2 + 3^2} = 3\sqrt{2}.
Is it true that, for any vector \vec{y} \in \mathbb{R}^3, we can find scalars c and d such that the sum of the entries in the vector \vec{y} - (c \vec u + d \vec v) is 0?
Yes, because \vec{u} and \vec{v} are linearly independent.
Yes, because \vec{u} and \vec{v} are orthogonal.
Yes, but for a reason that isn’t listed here.
No, because \vec{y} is not necessarily in
No, because neither \vec{u} nor \vec{v} is equal to the vector
No, but for a reason that isn’t listed here.
Yes, but for a reason that isn’t listed here.
Here’s the full reason:
Suppose that Q \in \mathbb{R}^{100 \times 12}, \vec{s} \in \mathbb{R}^{100}, and \vec{f} \in \mathbb{R}^{12}. What are the dimensions of the following product?
\vec{s}^T Q \vec{f}
scalar
12 \times 1 vector
100 \times 1 vector
100 \times 12 matrix
12 \times 12 matrix
12 \times 100 matrix
undefined
Correct: Scalar.
The inner dimensions of 100 and 12 cancel, and so \vec{s}^T Q \vec{f} is of shape 1 x 1.