The problems in this worksheet are taken from past exams in similar
classes. Work on them on paper, since the exams you
take in this course will also be on paper.
We encourage you to
complete this worksheet in a live discussion section. Solutions will be
made available after all discussion sections have concluded. You don’t
need to submit your answers anywhere.
Note: We do not plan to
cover all problems here in the live discussion section; the problems
we don’t cover can be used for extra practice.
Every week, Lauren goes to her local grocery store and buys a varying amount of vegetables but always buys exactly one pound of meat (either beef, fish, or chicken). We use a linear regression model to predict her total grocery bill. We’ve collected a dataset containing the pounds of vegetables bought, the type of meat bought, and the total bill. Below we display the first few rows of the dataset and two plots generated using the entire training set.
Suppose we fit the following linear regression models to predict
'total'
using the squared loss function. Based on the data
and visualizations shown above, for each of the following models H(x), determine whether each fitted
model coefficient w^* is
positive (+), negative (-), or exactly 0. The notation \text{meat=beef} refers to the one-hot
encoded 'meat'
column with value 1 if the original value in the
'meat'
column was 'beef'
and 0 otherwise. Likewise, \text{meat=chicken} and \text{meat=fish} are the one-hot encoded
'meat'
columns for 'chicken'
and
'fish'
, respectively.
For example, in part (iv), you’ll need to provide three answers: one for w_0^* (either positive, negative, or 0), one for w_1^* (either positive, negative, or 0), and one for w_2^* (either positive, negative, or 0).
Model i. H(x) = w_0
Answer: w_0^* must
be positive
If H(x) = w_0, then w_0^* will be the mean 'total'
value in our dataset, and since all of our observed 'total'
values are positive, their mean – and hence, w_0^* – must also be positive.
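For a quick refresher on why (this derivation is not part of the original solution): the constant prediction that minimizes mean squared error is the mean of the observed values, since
\begin{aligned} R(w_0) &= \frac{1}{n} \sum_{i=1}^n (y_i - w_0)^2 \\ \frac{dR}{dw_0} &= -\frac{2}{n} \sum_{i=1}^n (y_i - w_0) = 0 \implies w_0^* = \frac{1}{n} \sum_{i=1}^n y_i = \bar{y}. \end{aligned}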
Model ii. H(x) = w_0 + w_1 \cdot
\text{veg}
Answer: w_0^* must be positive; w_1^* must be positive
Of the three graphs provided, the middle one shows the relationship
between 'total'
and 'veg'
. We see that if we
were to draw a line of best fit, the y-intercept (w_0^*) would be positive, and so would the
slope (w_1^*).
Model iii. H(x) = w_0 + w_1 \cdot
\text{(meat=chicken)}
Answer: w_0^* must be positive; w_1^* must be negative
Here’s the key to solving this part (and the following few): if the input x has a 'meat' of 'chicken', then H(x) = w_0 + w_1. If the input x has a 'meat' of something other than 'chicken', then H(x) = w_0. So:

- w_0^* needs to be a 'total' prediction that makes sense for non-chicken inputs, and
- w_0^* + w_1^* needs to be a 'total' prediction that makes sense for chickens.

For all three 'meat'
categories, the average observed
'total'
value is positive, so it would make sense that for
non-chickens, the constant prediction w_0^* is positive. Based on the third graph,
it seems that 'chicken'
s tend to have a lower
'total'
value on average than the other two categories, so
if the input x is a
'chicken'
(that is, if \text{meat=chicken} = 1), then the constant
'total'
prediction should be less than the constant
'total'
prediction for other 'meat'
s. Since we
want w_0^* + w_1^*
to be less than w_0^*, w_1^*
must then be negative.
Model iv. H(x) = w_0 + w_1 \cdot
\text{(meat=beef)} + w_2 \cdot \text{(meat=chicken)}
Answer: w_0^* must
be positive; w_1^* must be negative;
w_2^* must be negative
H(x) makes one of three predictions:

- If the input has a 'meat' value of 'chicken', then it predicts w_0 + w_2.
- If the input has a 'meat' value of 'beef', then it predicts w_0 + w_1.
- If the input has a 'meat' value of 'fish', then it predicts w_0.

Think of w_1^* and w_2^* – the optimal w_1 and w_2
– as being adjustments to the mean 'total'
amount for
'fish'
. Per the third graph, 'fish'
have the
highest mean 'total'
amount of the three
'meat'
types, so w_0^*
should be positive while w_1^* and
w_2^* should be negative.
Model v. H(x) = w_0 + w_1 \cdot
\text{(meat=beef)} + w_2 \cdot \text{(meat=chicken)} + w_3 \cdot
\text{(meat=fish)}
Answer: Not
enough information for any of the four coefficients!
Like in the previous part, H(x) makes one of three predictions:

- If the input has a 'meat' value of 'chicken', then it predicts w_0 + w_2.
- If the input has a 'meat' value of 'beef', then it predicts w_0 + w_1.
- If the input has a 'meat' value of 'fish', then it predicts w_0 + w_3.

Since the mean minimizes mean squared error for the constant model,
we’d expect w_0^* + w_2^* to be the
mean 'total'
for 'chicken'
, w_0^* + w_1^* to be the mean
'total'
for 'beef'
, and w_0^* + w_3^* to be the mean total for
'fish'
. The issue is that there are infinitely many
combinations of w_0^*, w_1^*, w_2^*,
w_3^* that allow this to happen!
Pretend, for example, that:

- The mean 'total' for 'chicken' is 8.
- The mean 'total' for 'beef' is 12.
- The mean 'total' for 'fish' is 15.

Then, both w_0^* = -10, w_1^* = 22, w_2^* = 18, w_3^* = 25 and w_0^* = 20, w_1^* = -8, w_2^* = -12, w_3^* = -5 work, but the signs of the coefficients differ between the two solutions. As such, it’s impossible to tell!
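To see the issue concretely, here’s a minimal numpy sketch (not part of the original solution, using made-up rows) showing that Model v’s design matrix has linearly dependent columns, so the normal equations don’t have a unique solution:

import numpy as np

# Hypothetical design matrix for Model v: [intercept, meat=beef, meat=chicken, meat=fish].
X = np.array([
    [1, 1, 0, 0],   # a week where Lauren bought beef
    [1, 0, 1, 0],   # chicken
    [1, 0, 0, 1],   # fish
    [1, 0, 1, 0],   # chicken
    [1, 1, 0, 0],   # beef
])

# The three one-hot columns sum to the intercept column, so X is not full rank.
print(np.linalg.matrix_rank(X))   # 3, even though X has 4 columns
print(np.linalg.det(X.T @ X))     # 0 (up to floating point), so (X^T X)^{-1} doesn't exist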
Suppose we fit the model H(x) = w_0 + w_1 \cdot \text{veg} + w_2 \cdot \text{(meat=beef)} + w_3 \cdot \text{(meat=fish)}. After fitting, we find that \vec{w}^*=[-3, 5, 8, 12].
What is the prediction of this model on the first point in our dataset?
-3
2
5
10
13
22
25
Answer: 10
Plugging in our weights \vec{w}^* to the model H(x) and filling in data from the row
veg | meat | total |
---|---|---|
1 | beef | 13 |
gives us -3 + 5(1) + 8(1) + 12(0) = 10.
Following the same model H(x) and weights from the previous problem, what is the loss of this model on the second point in our dataset, using squared error loss?
0
1
5
6
8
24
25
169
Answer: 25
The squared loss for a single point is (\text{actual} - \text{predicted})^2. Here,
our actual 'total'
value is 19, and our predicted 'total' value is -3 + 5(3) + 8(0)
+ 12(1) = -3 + 15 + 12 = 24, so the squared loss is (19 - 24)^2 = (-5)^2 = 25.
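As a quick numerical check (a sketch, not part of the original solution; the second row’s feature values, veg=3 and meat='fish' with total 19, are the ones implied by the computation above):

import numpy as np

w = np.array([-3, 5, 8, 12])        # [w0, w1 (veg), w2 (meat=beef), w3 (meat=fish)]

x_first  = np.array([1, 1, 1, 0])   # veg=1, meat=beef
x_second = np.array([1, 3, 0, 1])   # veg=3, meat=fish (implied by the solution above)

print(w @ x_first)                  # 10, the prediction on the first row
print((19 - w @ x_second) ** 2)     # 25, the squared loss on the second row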
Billy’s aunt owns a jewellery store, and gives him data on 5000 of the diamonds in her store. For each diamond, we have its carat, its length and width (in centimeters), and its price (in dollars).
The first 5 rows of the 5000-row dataset are shown below:
carat | length | width | price |
---|---|---|---|
0.40 | 4.81 | 4.76 | 1323 |
1.04 | 6.58 | 6.53 | 5102 |
0.40 | 4.74 | 4.76 | 696 |
0.40 | 4.67 | 4.65 | 798 |
0.50 | 4.90 | 4.95 | 987 |
Billy has enlisted our help in predicting the price of a diamond
given various other features.
Suppose we want to fit a linear prediction rule that uses two features, carat and length, to predict price. Specifically, our prediction rule will be of the form
\text{predicted price} = w_0 + w_1 \cdot \text{carat} + w_2 \cdot \text{length}
We will use least squares to find \vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \\ w_2^* \end{bmatrix}.
Write out the first 5 rows of the design matrix, X. Your matrix should not have any variables in it.
X = \begin{bmatrix} 1 & 0.40 & 4.81 \\ 1 & 1.04 & 6.58 \\ 1 & 0.40 & 4.74 \\ 1 & 0.40 & 4.67 \\ 1 & 0.50 & 4.90 \end{bmatrix}
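If you wanted to build this matrix programmatically, a minimal sketch (using the five carat and length values from the table above) looks like:

import numpy as np

carat  = np.array([0.40, 1.04, 0.40, 0.40, 0.50])
length = np.array([4.81, 6.58, 4.74, 4.67, 4.90])

# Prepend a column of 1s for the intercept term, then the two feature columns.
X = np.column_stack([np.ones(len(carat)), carat, length])
print(X)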
Suppose the optimal parameter vector \vec{w}^* is given by
\vec{w}^* = \begin{bmatrix} 2000 \\ 10000 \\ -1000 \end{bmatrix}
What is the predicted price of a diamond with 0.65 carats and a length of 4 centimeters? Show your work.
The predicted price is 4500 dollars.
2000 + 10000 \cdot 0.65 - 1000 \cdot 4 = 4500
Suppose \vec{e} = \begin{bmatrix} e_1 \\ e_2 \\ ... \\ e_n \end{bmatrix} is the error/residual vector, defined as
\vec{e} = \vec{y} - X \vec{w}^*
where \vec{y} is the observation vector containing the prices for each diamond.
For each of the following quantities, state whether they are guaranteed to be equal to the scalar 0, the vector \vec{0} of all 0s, or neither. No justification is necessary.
Suppose we introduce two more features: the width, and the product of the length and width.
Suppose we also decide to remove the intercept term of our prediction rule. With all of these changes, our prediction rule is now
\text{predicted price} = w_1 \cdot \text{carat} + w_2 \cdot \text{length} + w_3 \cdot \text{width} + w_4 \cdot (\text{length} \cdot \text{width})
The DataFrame new_releases contains the following information for songs that were recently released:

- "genre": the genre of the song (one of the following 5 possibilities: "Hip-Hop/Rap", "Pop", "Country", "Alternative", or "International")
- "rec_label": the record label of the artist who released the song (one of the following 4 possibilities: "EMI", "SME", "UMG", or "WMG")
- "danceability": how easy the song is to dance to, according to the Spotify API (between 0 and 1)
- "speechiness": what proportion of the song is made up of spoken words, according to the Spotify API (between 0 and 1)
- "first_month": the number of total streams the song had on Spotify in the first month it was released
The first few rows of new_releases
are shown below
(though new_releases
has many more rows than are shown
below).
We decide to build a linear regression model that predicts
"first_month"
given all other information. To start, we
conduct a train-test split, splitting new_releases
into
X_train
, X_test
, y_train
, and
y_test
.
We then fit two linear models (with intercept terms) to the training data:
Model 1 (lr_one
): Uses "danceability"
only.
Model 2 (lr_two
): Uses "danceability"
and "speechiness"
only.
Consider the following outputs.
>>> X_train.shape[0]
50
>>> np.sum((y_train - lr_two.predict(X_train)) ** 2)
500000 # five hundred thousand
What is Model 2 (lr_two
)’s training RMSE (square root of
mean squared error)? Give your answer as an integer.
Answer: 100
We are given that there are n=50 data points, and that the sum of squared errors \sum_{i = 1}^n (y_i - H(x_i))^2 is 500{,}000. Then:
\begin{aligned} \text{RMSE} &= \sqrt{\frac{1}{n} \sum_{i = 1}^n (y_i - H(x_i))^2} \\ &= \sqrt{\frac{1}{50} \cdot 500{,}000} \\ &= \sqrt{10{,}000} \\ &= 100\end{aligned}
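The same computation in code (a sketch; the commented line assumes the X_train, y_train, and lr_two objects from the problem are available):

import numpy as np

n, sse = 50, 500_000        # values given in the problem
print(np.sqrt(sse / n))     # 100.0

# Equivalently, straight from the fitted model:
# rmse = np.sqrt(np.mean((y_train - lr_two.predict(X_train)) ** 2))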
Now, suppose we fit two more linear models (with intercept terms) to the training data:
Model 3 (lr_drop): Uses "danceability" and "speechiness" as-is, and one-hot encodes "genre" and "rec_label", using OneHotEncoder(drop="first").
Model 4 (lr_no_drop): Uses "danceability" and "speechiness" as-is, and one-hot encodes "genre" and "rec_label", using OneHotEncoder().
Note that the only difference between Model 3 and Model 4 is the fact
that Model 3 uses drop="first"
.
How many one-hot encoded columns are used in each model? In other words, how many binary columns are used in each model? Give both answers as integers.
Hint: Make sure to look closely at the description of
new_releases
at the top of the previous page, and don’t
include the already-quantitative features.
Number of one-hot encoded columns in Model 3 (lr_drop) =
Number of one-hot encoded columns in Model 4 (lr_no_drop) =
Answer: 7 and 9
There are 5 unique values of "genre"
and 4 unique values
of "rec_label"
, so if we create a single one-hot encoded
column for each one, there would be 5 + 4 =
9 one-hot encoded columns (which there are in
lr_no_drop
).
If we drop one one-hot-encoded column per category, which is what
drop="first"
does, then we only have (5 - 1) + (4 - 1) = 7 one-hot encoded columns
(which there are in lr_drop
).
Recall, in Model 4 (lr_no_drop
) we one-hot encoded
"genre"
and "rec_label"
, and did not use
drop="first"
when instantiating our
OneHotEncoder
.
Suppose we are given the following coefficients in Model 4:
The coefficient on "genre_Pop"
is 2000.
The coefficient on "genre_Country"
is 1000.
The coefficient on "danceability"
is 10^6 = 1{,}000{,}000.
Daisy and Billy are two artists signed to the same
"rec_label"
who each just released a new song with the same
"speechiness"
. Daisy is a "Pop"
artist while
Billy is a "Country"
artist.
Model 4 predicted that Daisy’s song and Billy’s song will have the
same "first_month"
streams. What is the absolute
difference between Daisy’s song’s "danceability"
and Billy’s song’s "danceability"
? Give your answer as a
simplified fraction.
Answer: \frac{1}{1000}
“My favorite problem on the exam!” -Suraj
Model 4 is made up of 11 features, i.e. 11 columns.
4 of the columns correspond to the different values of
"rec_label"
. Since Daisy and Billy have the same
"rec_label"
, their values in these four columns are all the
same.
One of the columns corresponds to "speechiness"
.
Since Daisy’s song and Billy’s song have the same
"speechiness"
, their values in this column are the
same.
5 of the columns correspond to the different values of
"genre"
. Daisy is a "Pop"
artist, so she has a
1 in the "genre_Pop"
column and a 0 in the other four
"genre_"
columns, and similarly Billy has a 1 in the
"genre_Country"
column and 0s in the others.
One of the columns corresponds to "danceability"
,
and Daisy and Billy have different quantitative values in this
column.
The key is in recognizing that all features in Daisy’s prediction and
Billy’s prediction are the same, other than the coefficients on
"genre_Pop"
, "genre_Country"
, and
"danceability"
. Let’s let d_1 be Daisy’s song’s
"danceability"
, and let d_2 be Billy’s song’s
"danceability"
. Then:
\begin{aligned} 2000 + 10^{6} \cdot d_1 &= 1000 + 10^{6} \cdot d_2 \\ 1000 &= 10^{6} (d_2 - d_1) \\ \frac{1}{1000} &= d_2 - d_1\end{aligned}
Thus, the absolute difference between their songs’
"danceability"
s is \frac{1}{1000}.
Consider the dataset shown below.
x^{(1)} | x^{(2)} | x^{(3)} | y |
---|---|---|---|
0 | 6 | 8 | -5 |
3 | 4 | 5 | 7 |
5 | -1 | -3 | 4 |
0 | 2 | 1 | 2 |
We want to use multiple regression to fit a prediction rule of the form H(x^{(1)}, x^{(2)}, x^{(3)}) = w_0 + w_1 x^{(1)}x^{(3)} + w_2 (x^{(2)}-x^{(3)})^2. Write down the design matrix X and observation vector \vec{y} for this scenario. No justification needed.
The design matrix X and observation vector \vec{y} are given by:
\begin{align*} X &= \begin{bmatrix} 1 & 0 & 4\\ 1 & 15 & 1\\ 1 & -15 & 4\\ 1 & 0 & 1 \end{bmatrix} \\ \vec{y} &= \begin{bmatrix} -5\\ 7\\ 4\\ 2 \end{bmatrix} \end{align*}
We got \vec{y} directly from the y column of the dataset.
The matrix X was found by looking at the equation for H. You can think of each row of X as \begin{bmatrix}1 & x^{(1)}x^{(3)} & (x^{(2)}-x^{(3)})^2\end{bmatrix}. Recall that the bias term isn’t multiplied by any feature, but it still exists, so the first element of each row is always 1. The remaining entries can then be computed directly from the dataset.
For the X and \vec{y} that you have written down, let \vec{w} be the optimal parameter vector, which comes from solving the normal equations X^TX\vec{w}=X^T\vec{y}. Let \vec{e} = \vec{y} - X \vec{w} be the error vector, and let e_i be the ith component of this error vector. Show that 4e_1+e_2+4e_3+e_4=0.
The key to this problem is the fact that the error vector, \vec{e}, is orthogonal to the columns of the design matrix, X, whenever \vec{w} satisfies the normal equations. To see why, we can rewrite the normal equations (X^TX\vec{w}=X^T\vec{y}) and substitute \vec{e} = \vec{y} - X \vec{w}.
\begin{align*} X^TX\vec{w}&=X^T\vec{y} \\ 0 &= X^T\vec{y} - X^TX\vec{w} \\ 0 &= X^T(\vec{y}-X\vec{w}) \\ 0 &= X^T\vec{e} \end{align*}
The first step is to find X^T, which is easy because we found X above: \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 15 & -15 & 0 \\ 4 & 1 & 4 & 1 \end{bmatrix}
And now we can plug X^T and \vec e into our equation 0 = X^T\vec{e}. It might be easiest to find the right side first: \begin{align*} X^T\vec{e} &= \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 15 & -15 & 0 \\ 4 & 1 & 4 & 1 \end{bmatrix} \cdot \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4\end{bmatrix} \\ &= \begin{bmatrix} e_1 + e_2 + e_3 + e_4 \\ 15e_2 - 15e_3 \\ 4e_1 + e_2 + 4e_3 + e_4\end{bmatrix} \end{align*}
Finally, we set it equal to zero! \begin{align*} 0 &= e_1 + e_2 + e_3 + e_4 \\ 0 &= 15e_2 - 15e_3 \\ 0 &= 4e_1 + e_2 + 4e_3 + e_4 \end{align*}
With this we have shown that 4e_1+e_2+4e_3+e_4=0.
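A quick numerical sanity check (a sketch, not required for the proof): solve the normal equations for this X and \vec{y} and evaluate both orthogonality conditions directly.

import numpy as np

X = np.array([[1,   0, 4],
              [1,  15, 1],
              [1, -15, 4],
              [1,   0, 1]])
y = np.array([-5, 7, 4, 2])

# Solve the normal equations X^T X w = X^T y, then form the error vector.
w = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ w

print(X.T @ e)                          # approximately the zero vector
print(4*e[0] + e[1] + 4*e[2] + e[3])    # approximately 0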
Jasmine and Aritra are trying to build models that predict the number of open rooms a hotel has. To do so, they use price, the average listing price of rooms at the hotel, along with a one-hot encoded version of the hotel’s chain. For the
purposes of this question, assume the only possible hotel chains are
Marriott, Hilton, and Other.
First, Jasmine fits a linear model without an intercept term. Her prediction rule, H_1, looks like:
H_{1}(x) = w_1 \cdot \texttt{price} + w_2 \cdot \texttt{is\_Marriott} + w_3 \cdot \texttt{is\_Hilton} + w_4 \cdot \texttt{is\_Other}
After fitting her model, \vec{w}^* = \begin{bmatrix}-0.5 \\ 200 \\ 300 \\ 50 \end{bmatrix}.
Answers:
If we plug in the weights from \vec{w}^* into the equation for H_1(x), noting that the values of the one-hot variables are either 0 or 1, we get -0.5 \cdot 250 + 200 \cdot 1 + 300 \cdot 0 + 50 \cdot 0, or -125 + 200, or 75.
For part 2, the difference between predicted and actual number of rooms is 75 - 45 = 30, and the squared loss is simply 30^2 = 900.
For part 3, the answer is False, since the fact that our best-fit line has some error at a given input does not mean that this input was absent from the training data. In other words, the line of best fit does not have to pass through all of the training points (and to avoid overfitting, in most cases you don’t want it to).
As a reminder,
H_{1}(x) = w_1 \cdot \texttt{price} + w_2 \cdot \texttt{is\_Marriott} + w_3 \cdot \texttt{is\_Hilton} + w_4 \cdot \texttt{is\_Other}
\vec{w}^* = \begin{bmatrix}-0.5 \\ 200 \\ 300 \\ 50 \end{bmatrix}
Aritra now fits a linear model that does include an intercept term. His prediction rule, H_2, looks like:
H_2(x) = \beta_0 + \beta_1 \cdot \texttt{price} + \beta_2 \cdot \texttt{is\_Marriott} + \beta_3 \cdot \texttt{is\_Hilton}
After fitting his model, Aritra finds \beta_{0}^{*} = 50. Given that, what are \beta_{1}^{*}, \beta_{2}^{*}, and \beta_{3}^{*}? Give your answers as numbers with no variables.
Answers:
As we saw in the previous question, these models should yield an
equivalent best-fit line. The relationship between price
and predicted outcome will be the same, so \beta_{1}^{*} will be -0.5, the same weight as in H_1.
If \beta_{0}^{*} = 50, this means we are adding 50 to every prediction, so the “adjustments” we make for is_Marriott and is_Hilton must each be reduced by 50 to compensate. Therefore, the new weights are \beta_{2}^{*} = 200 - 50 = 150 and \beta_{3}^{*} = 300 - 50 = 250.
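As a sanity check (a sketch, not part of the original solution), the two fitted models make identical predictions for every chain at any given price; price = 250 below is an arbitrary illustrative value:

import numpy as np

price = 250   # arbitrary; any price gives matching predictions

w    = np.array([-0.5, 200, 300, 50])   # H1: [price, is_Marriott, is_Hilton, is_Other]
beta = np.array([50, -0.5, 150, 250])   # H2: [intercept, price, is_Marriott, is_Hilton]

chains = {
    "Marriott": ([price, 1, 0, 0], [1, price, 1, 0]),
    "Hilton":   ([price, 0, 1, 0], [1, price, 0, 1]),
    "Other":    ([price, 0, 0, 1], [1, price, 0, 0]),
}
for chain, (x1, x2) in chains.items():
    print(chain, w @ np.array(x1), beta @ np.array(x2))   # the two predictions agree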
Suppose we want to predict how long it takes to run a Jupyter notebook on Datahub. For 100 different Jupyter notebooks, we collect the following 5 pieces of information:
- cells: number of cells in the notebook
- lines: number of lines of code
- max iterations: largest number of iterations in any loop in the notebook, or 1 if there are no loops
- variables: number of variables defined in the notebook
- runtime: number of seconds for the notebook to run on Datahub
Then we use multiple regression to fit a prediction rule of the form H(\text{cells, lines, max iterations, variables}) = w_0 + w_1 \cdot \text{cells} \cdot \text{lines} + w_2 \cdot (\text{max iterations})^{\text{variables} - 10}
What are the dimensions of the design matrix X?
In other words, if X is an r \times c matrix (r rows and c columns), what are r and c?
100 \text{ rows} \times 3 \text{ columns}
There should be 100 rows because there are 100 different Jupyter notebooks, each contributing one row of data. There should be 3 columns, one for each w_i: w_0 means X has a column of all 1s, w_1 means X has a second column containing \text{cells} \cdot \text{lines}, and w_2 means X has a third column containing (\text{max iterations})^{\text{variables} - 10}.
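In code, the design matrix could be assembled like this (a sketch; the four feature arrays below are hypothetical stand-ins for the real data, each with 100 entries):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature arrays, one entry per notebook.
cells     = rng.integers(1, 50, size=100)
lines     = rng.integers(10, 500, size=100)
max_iters = rng.integers(1, 1000, size=100)
variables = rng.integers(1, 30, size=100)

# Columns: 1 (intercept), cells * lines, (max iterations)^(variables - 10).
X = np.column_stack([
    np.ones(100),
    cells * lines,
    max_iters.astype(float) ** (variables - 10),
])
print(X.shape)   # (100, 3)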
In one sentence, what does the entry in row 3, column 2 of the design matrix X represent? (Count rows and columns starting at 1, not 0).
This entry represents the product of the number of cells and number of lines of code for the third Jupyter notebook in the training dataset.
Now, with \vec{w}^* the optimal parameter vector for this prediction rule and \vec{w}^{\circ} some other parameter vector, consider the two prediction rules
\begin{aligned} H^*(\text{cells, lines, max iterations, variables}) &= w_0^* + w_1^* \cdot \text{cells} \cdot \text{lines} + w_2^* \cdot (\text{max iterations})^{\text{variables} - 10}\\ H^{\circ}(\text{cells, lines, max iterations, variables}) &= w_0^{\circ} + w_1^{\circ} \cdot \text{cells} \cdot \text{lines} + w_2^{\circ} \cdot (\text{max iterations})^{\text{variables} - 10} \end{aligned}
Let \text{MSE} represent the mean squared error of a prediction rule, and let \text{MAE} represent the mean absolute error of a prediction rule. Select the symbol that should go in each blank.
\text{MSE}(H^*) ___ \text{MSE}(H^{\circ})
\leq
\geq
=
Answer: \leq
It is given that \vec{w}^* is the optimal parameter vector. Here’s one thing we know about the optimal parameter vector \vec{w}^*: it is optimal, which means that any change made to it will, at best, keep our predictions of the exact same quality, and, at worst, reduce the quality of our predictions and increase our error. Since \vec{w}^{\circ} is just the optimal parameter vector with some small changes to the weights, it stands to reason that \vec{w}^{\circ} will produce equal or greater error.
In other words, \vec{w}^{\circ} is a (possibly) slightly worse version of \vec{w}^*, meaning that H^{\circ}(x) is a (possibly) slightly worse version of H^*(x). So, H^{\circ}(x) will have equal or higher mean squared error than H^*(x).
Hence: \text{MSE}(H^*) \leq \text{MSE}(H^{\circ})
Let X be a design matrix with 4 columns, such that the first column is a column of all 1s. Let \vec{y} be an observation vector. Let \vec{w}^* = (X^TX)^{-1}X^T\vec{y}. We’ll name the components of \vec{w}^* as follows:
\vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \\ w_2^* \\ w_3^* \end{bmatrix}
In this problem, we’ll consider various modifications to the design matrix and see how they affect the solution to the normal equations.
Let X_a be the design matrix that comes from interchanging the first two columns of X. Let \vec{w_a}^* = (X_a^TX_a)^{-1}X_a^T\vec{y}. Express the components \vec{w_a}^* in terms of w_0^*, w_1^*, w_2^*, and w_3^* (which were the components of \vec{w}^*).
\vec{w_a}^* = \begin{bmatrix} w_1^* \\ w_0^* \\ w_2^* \\ w_3^* \end{bmatrix}
Suppose our original prediction rule was of the form H(\vec{x}) = w_0 + w_1x_1+ w_2x_2+ w_3x_3, and write \vec{w_a}^* = \begin{bmatrix} v_0^* \\ v_1^* \\ v_2^* \\ v_3^* \end{bmatrix}.
Swapping the first two columns of our design matrix changes the prediction rule to the form H_2(\vec{x}) = v_1 + v_0x_1 + v_2x_2+ v_3x_3.
Therefore the optimal parameters for H_2 are related to the optimal parameters for H by: \begin{aligned} v_0^* &= w_1^* \\ v_1^* &= w_0^* \\ v_2^* &= w_2^* \\ v_3^* &= w_3^* \end{aligned}
Intuitively, when we interchange two columns of our design matrix, all that does is interchange the terms in the prediction rule, which interchanges those weights in the parameter vector.
Let X_b be the design matrix that comes from adding one to each entry of the first column of X. Let \vec{w_b}^* = (X_b^TX_b)^{-1}X_b^T\vec{y}. Express the components \vec{w_b}^* in terms of w_0^*, w_1^*, w_2^*, and w_3^* (which were the components of \vec{w}^*).
\vec{w_b}^* = \begin{bmatrix} \dfrac{w_0^*}{2} \\ w_1^* \\ w_2^* \\ w_3^*\end{bmatrix}
Suppose our original prediction rule was of the form H(\vec{x}) = w_0 + w_1x_1+ w_2x_2+ w_3x_3, and write \vec{w_b}^* = \begin{bmatrix} v_0^* \\ v_1^* \\ v_2^* \\ v_3^* \end{bmatrix}.
By adding one to each entry of the first column of the design matrix, we are changing the column of 1s to be a column of 2s. This changes the prediction rule to be of the form: H_2(\vec{x}) = v_0\cdot 2+ v_1x_1 + v_2x_2+ v_3x_3.
In order to compensate for these changes to our coefficients, we need to “offset” any alterations made to our coefficients. Therefore the optimal parameters for H_2 are related to the optimal parameters for H by: \begin{aligned} v_0^* &= \dfrac{w_0^*}{2} \\ v_1^* &= w_1^* \\ v_2^* &= w_2^* \\ v_3^* &= w_3^* \end{aligned}
This is saying we just halve the intercept term. For example, imagine fitting a line to data in \mathbb{R}^2 and finding that the best-fitting line is y=12+3x. If we had to write this in the form y=v_0\cdot 2 + v_1x, we would find that the best choice for v_0 is 6 and the best choice for v_1 is 3.
Let X_c be the design matrix that comes from adding one to each entry of the third column of X. Let \vec{w_c}^* = (X_c^TX_c)^{-1}X_c^T\vec{y}. Express the components \vec{w_c}^* in terms of w_0^*, w_1^*, w_2^*, and w_3^*, which were the components of \vec{w}^*.
\vec{w_c}^* = \begin{bmatrix} w_0^* - w_2^* \\ w_1^* \\ w_2^* \\ w_3^* \end{bmatrix}
Suppose our original prediction rule was of the form H(\vec{x}) = w_0 + w_1x_1+ w_2x_2+ w_3x_3, and write \vec{w_c}^* = \begin{bmatrix} v_0^* \\ v_1^* \\ v_2^* \\ v_3^* \end{bmatrix}.
Adding one to each entry of the third column of the design matrix changes the prediction rule to the form: \begin{aligned} H_2(\vec{x}) &= v_0+ v_1x_1 + v_2(x_2+1)+ v_3x_3 \\ &= (v_0 + v_2) + v_1x_1 + v_2x_2+ v_3x_3 \end{aligned}
In order to compensate for these changes to our coefficients, we need to “offset” any alterations made to our coefficients. Therefore the optimal parameters for H_2 are related to the optimal parameters for H by \begin{aligned} v_0^* &= w_0^* - w_2^* \\ v_1^* &= w_1^* \\ v_2^* &= w_2^* \\ v_3^* &= w_3^* \end{aligned}
One way to think about this is that if we replace x_2 with x_2+1, then our predictions will increase by the coefficient of x_2. In order to keep our predictions the same, we would need to adjust our intercept term by subtracting this same amount.
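All three parts can also be verified numerically; here’s a minimal sketch (with a randomly generated X and \vec{y}, purely for illustration):

import numpy as np

rng = np.random.default_rng(1)

# A made-up design matrix whose first column is all 1s, plus observations.
X = np.column_stack([np.ones(6), rng.normal(size=(6, 3))])
y = rng.normal(size=6)

def solve_normal_equations(A):
    return np.linalg.solve(A.T @ A, A.T @ y)

w = solve_normal_equations(X)

X_a = X[:, [1, 0, 2, 3]]            # (a) swap the first two columns
X_b = X.copy(); X_b[:, 0] += 1      # (b) add 1 to every entry of the first column
X_c = X.copy(); X_c[:, 2] += 1      # (c) add 1 to every entry of the third column

print(w)
print(solve_normal_equations(X_a))  # [w1, w0, w2, w3]
print(solve_normal_equations(X_b))  # [w0 / 2, w1, w2, w3]
print(solve_normal_equations(X_c))  # [w0 - w2, w1, w2, w3]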
The two plots below show the total number of boots (top) and sandals
(bottom) purchased per month in the df
table. Assume that
there is one data point per month.
For each of the following regression models, use the visualizations shown above to select the value that is closest to the fitted model weights. If it is not possible to determine the model weight, select “Not enough info”. For the models below:
- boot refers to the number of boots sold.
- sandal refers to the number of sandals sold.
- summer=1 is a column with value 1 if the month is between March (03) and August (08), inclusive.
- winter=1 is a column with value 1 if the month is between September (09) and February (02), inclusive.

boot = w_0
w_0:
0
50
100
Not enough info
Answer: 50
boot = w_0 + w_1 \cdot \text{sandal}
w_0:
-100
-1
0
1
100
Not enough info
w_1:
-100
-1
1
100
Not enough info
Answer:
w_0: 100
w_1: -1
boot
= w_0 + w_1 \cdot
(\text{summer=1})
w_0:
-100
-1
0
1
100
Not enough info
w_1:
-80
-1
0
1
80
Not enough info
Answer:
w_0: 100
w_1: -80
sandal
= w_0 + w_1 \cdot
(\text{summer=1})
w_0:
-20
-1
0
1
20
Not enough info
w_1:
-80
-1
0
1
80
Not enough info
Answer:
w_0: 20
w_1: 80
sandal
= w_0 + w_1 \cdot
(\text{summer=1}) + w_2 \cdot (\text{winter=1})
w_0:
-20
-1
0
1
20
Not enough info
w_1:
-80
-1
0
1
80
Not enough info
w_2:
-80
-1
0
1
80
Not enough info
Answer:
w_0: Not enough info
w_1: Not enough info
w_2: Not enough info
Reggie and Essie are given a dataset of real features x_i \in \mathbb{R} and observations y_i. Essie proposes the linear prediction rule H_1(\alpha_0,\alpha_1) = \alpha_0 + \alpha_1 x_i, and Reggie proposes to use v_i=(x_i)^2 and the prediction rule H_2(\gamma_0,\gamma_1) = \gamma_0 + \gamma_1 v_i.
Give an example of a dataset \{(x_i,y_i)\}_{i=1}^n for which minimum MSE(H_2) < minimum MSE(H_1). Explain.
Example: If the data points follow a quadratic form, y_i=x_i^2 for all i, then the H_2 prediction rule achieves zero error, while the minimum MSE of H_1 is greater than 0, since the data do not follow a linear form (assuming the x_i take at least three distinct values, so that the points aren’t collinear).
Give an example of a dataset \{(x_i,y_i)\}_{i=1}^n for which minimum MSE(H_2) = minimum MSE(H_1). Explain.
Example 1: If the response variables are constant, y_i=c for all i, then by setting \alpha_0=\gamma_0=c and \alpha_1=\gamma_1=0, both prediction rules achieve an MSE of 0.
Example 2: If every x_i in the dataset satisfies x_i = x_i^2 (which occurs when every x_i is 0 or 1), then the two prediction rules have the same design matrix, so their optimal parameters and minimum MSEs coincide.
A new feature z has been added to the dataset.
Essie proposes a linear regression model using two predictor variables x,z as H_3(w_0,w_1,w_2) = w_0 + w_1 x_i +w_2 z_i.
Determine whether the following statement is True or False (prove it or provide a counterexample).
Reggie claims that having more features will lead to a smaller error, therefore the following prediction rule will give a smaller MSE: H_4(\alpha_0,\alpha_1,\alpha_2,\alpha_3) = \alpha_0 + \alpha_1 x_i +\alpha_2 z_i + \alpha_3 (2x_i-z_i)
H_4 can be rewritten as H_4(\alpha_0,\alpha_1,\alpha_2,\alpha_3) = \alpha_0 + (\alpha_1+2\alpha_3) x_i +(\alpha_2 - \alpha_3)z_i. Setting \tilde{\alpha}_1=\alpha_1+2\alpha_3 and \tilde{\alpha}_2= \alpha_2 - \alpha_3, this becomes
\alpha_0 + \tilde{\alpha}_1 x_i +\tilde{\alpha}_2 z_i,
which has exactly the same form as H_3. In other words, H_4 and H_3 can express exactly the same set of prediction rules, so they achieve the same minimum MSE, and the statement is False.
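To see this numerically, here’s a small sketch (with made-up x, z, and y, not part of the original solution) fitting both H_3 and H_4 by least squares; the two minimum MSEs agree:

import numpy as np

rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=n)
z = rng.normal(size=n)
y = 3 + 2*x - z + rng.normal(size=n)     # made-up observations

X3 = np.column_stack([np.ones(n), x, z])             # features for H_3
X4 = np.column_stack([np.ones(n), x, z, 2*x - z])    # features for H_4 (last column is redundant)

def min_mse(X):
    # lstsq handles the rank-deficient X4 gracefully.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ w) ** 2)

print(min_mse(X3), min_mse(X4))   # equal (up to floating point)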