The problems in this worksheet are taken from past exams in similar
classes. Work on them on paper, since the exams you
take in this course will also be on paper.
This video 🎥, recorded in office hours, gives an overview of Loss Functions, the Constant Model, Mean, and Variance, while this video 🎥 overviews Simple Linear Regression.
Problem 1
Biff the Wolverine just made an Instagram account and has been
keeping track of the number of likes his posts have received so far.
His first 7 posts have received a mean of 16 likes; the specific like
counts in sorted order are
$8, 12, 12, 15, 18, 20, 27$
Biff the Wolverine wants to predict the number of likes his next post will receive, using a constant prediction rule $h$. For each loss function $L(y_i, h)$, determine the constant prediction $h^*$ that minimizes average loss. If you believe there are multiple minimizers, specify them all. If you believe you need more information to answer the question or that there is no minimizer, state that clearly. Give a brief justification for each answer.
Problem 1.1
$L(y_i, h) = |y_i - h|$
This is absolute loss, and hence we're looking for the minimizer of mean absolute error, which is the median, $15$.
Problem 1.2
$L(y_i, h) = (y_i - h)^2$
This is squared loss, and hence we're looking for the minimizer of mean squared error, which is the mean, $16$.
Problem 1.3
$L(y_i, h) = 4(y_i - h)^2$
This is squared loss, multiplied by a constant. Note that when we go to minimize empirical risk for this loss function, we will take the derivative of empirical risk and set it equal to 0; at that point the constant factor of 4 can be divided from both sides, so this problem boils down to minimizing ordinary mean squared error, whose minimizer is the mean, $16$. The only difference is that the graph of mean squared error will be stretched vertically by a factor of 4; the minimizing value will be in the same place.
For more justification, here we consider any general re-scaling $\alpha(y_i - h)^2$ with $\alpha > 0$:
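$$R(h) = \frac{\alpha}{n}\sum_{i=1}^n (y_i - h)^2 \qquad \frac{dR}{dh} = -\frac{2\alpha}{n}\sum_{i=1}^n (y_i - h) = 0 \implies \sum_{i=1}^n (y_i - h) = 0 \implies h^* = \bar{y}$$
The positive constant $\alpha$ drops out when we set the derivative equal to 0, so the minimizer is the mean regardless of the scaling.
Problem 1.4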
This is a scaled version of 0-1 loss. We know that empirical risk for 0-1 loss is minimized at the mode, and as shown above, multiplying a loss by a positive constant does not change the minimizer, so that also applies here. The mode, i.e. the most common value, is $12$.
Problem 1.5
$L(y_i, h) = (3y_i - 4h)^2$
Note that we can write $(3y - 4h)^2$ as $\left(3\left(y - \frac{4}{3}h\right)\right)^2 = 9\left(y - \frac{4}{3}h\right)^2$. As we've seen, the constant factor out front has no impact on the minimizing value. Using the same principle as in the last part, we can say that $$\frac{4}{3}h^* = \bar{y} \implies h^* = \frac{3}{4}\bar{y} = \frac{3}{4} \cdot 16 = 12$$
Problem 1.6
$L(y_i, h) = (y_i - h)^3$
Hint: Do not spend too long on this subpart.
No minimizer.
Note that unlike $|y_i - h|$, $(y_i - h)^2$, and all of the other loss functions we've seen, $(y_i - h)^3$ tends towards $-\infty$, rather than having a minimum output of 0. This means that there is no $h$ that minimizes $\frac{1}{n}\sum_{i=1}^n (y_i - h)^3$; the larger we make $h$, the more negative (and hence "smaller") this empirical risk becomes.
Problem 2
You may find the following properties of logarithms helpful in this
question. Assume that all logarithms in this question are natural
logarithms, i.e. of base $e$.
$e^{\log(x)} = x$
$\log(a) + \log(b) = \log(a \cdot b)$
$\log(a) - \log(b) = \log\left(\frac{a}{b}\right)$
$\log(a^c) = c\log(a)$
$\frac{d}{dx}\log x = \frac{1}{x}$
Billy is trying his hand at coming up with loss functions. He comes
up with the Billy loss, $L_B(y_i, h)$, defined as follows:
$$L_B(y_i, h) = \left[\log\left(\frac{y_i}{h}\right)\right]^2$$
Throughout this problem, assume that all $y_i$'s are positive.
Problem 2.1
Show that: $$\frac{d}{dh} L_B(y_i, h) = -\frac{2}{h}\log\left(\frac{y_i}{h}\right)$$
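Using the fact that $\log\left(\frac{y_i}{h}\right) = \log(y_i) - \log(h)$, we can write $L_B(y_i, h) = \left[\log(y_i) - \log(h)\right]^2$. Then, by the chain rule, $$\frac{d}{dh} L_B(y_i, h) = 2\left[\log(y_i) - \log(h)\right] \cdot \left(-\frac{1}{h}\right) = -\frac{2}{h}\log\left(\frac{y_i}{h}\right)$$
Problem 3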
Billy decides to take on a part-time job as a waiter at the Panda Express in Pierpont. For two months, he kept track of all of the total bills he gave out to customers along with the tips they then gave him, all in dollars. Below is a scatter plot of Billy's tips and total bills.
Throughout this question, assume we are trying to fit a linear prediction rule $H(x) = w_0 + w_1x$ that uses total bills to predict tips, and assume we are finding optimal parameters by minimizing mean squared error.
Problem 3.1
Which of these is the most likely value for $r$, the correlation between total bill and tips? Why?
$-1 \qquad -0.75 \qquad -0.25 \qquad 0 \qquad 0.25 \qquad 0.75 \qquad 1$
$0.75$.
It seems like there is a pretty strong, but not perfect, positive linear association between total bills and tips.
Problem 3.2
The variance of the tip amounts is $2.1$. Let $M$ be the mean squared error of the best linear prediction rule on this dataset (under squared loss). Is $M$ less than, equal to, or greater than $2.1$? How can you tell?
$M$ is less than $2.1$. The variance is equal to the MSE of the best constant prediction rule.
Note that the MSE of the best linear prediction rule will always be less than or equal to the MSE of the best constant prediction rule $h$. The only case in which these two MSEs are the same is when the best linear prediction rule is a flat line with slope 0, which is the same as a constant prediction. In all other cases, the linear prediction rule will make better predictions and hence have a lower MSE than the constant prediction rule.
In this case, the best linear prediction rule is clearly not flat, so $M < 2.1$.
Problem 3.3
Suppose we use the formulas from class on Billy's dataset and calculate the optimal slope $w_1^*$ and intercept $w_0^*$ for this prediction rule.
Suppose we add the value of 1 to every total bill $x$, effectively shifting the scatter plot 1 unit to the right. Note that doing this does not change the value of $w_1^*$. What amount should we add to each tip $y$ so that the value of $w_0^*$ also does not change? Your answer should involve one or more of $\bar{x}$, $\bar{y}$, $w_0^*$, $w_1^*$, and any constants.
Note: To receive full points, you must provide a rigorous explanation, though this explanation only takes a few lines. However, we will award partial credit to solutions with the correct answer, and it's possible to arrive at the correct answer by drawing a picture and thinking intuitively about what happens.
We should add $w_1^*$ to each tip $y$.
First, we present the rigorous solution.
Let $\bar{x}_{\text{old}}$ represent the previous mean of the $x$'s and $\bar{x}_{\text{new}}$ represent the new mean of the $x$'s. Then, we know that $\bar{x}_{\text{new}} = \bar{x}_{\text{old}} + 1$.
Also, let $\bar{y}_{\text{old}}$ and $\bar{y}_{\text{new}}$ represent the old and new means of the $y$'s. We will try to find a relationship between these two quantities.
We want the two intercepts to be the same. The intercept for the old line is $\bar{y}_{\text{old}} - w_1^*\bar{x}_{\text{old}}$ and the intercept for the new line is $\bar{y}_{\text{new}} - w_1^*\bar{x}_{\text{new}}$. Setting these equal yields:
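$$\bar{y}_{\text{old}} - w_1^*\bar{x}_{\text{old}} = \bar{y}_{\text{new}} - w_1^*\bar{x}_{\text{new}} = \bar{y}_{\text{new}} - w_1^*(\bar{x}_{\text{old}} + 1) \implies \bar{y}_{\text{new}} = \bar{y}_{\text{old}} + w_1^*$$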
Thus, in order for the intercepts to be equal, we need the mean of the new $y$'s to be $w_1^*$ greater than the mean of the old $y$'s. Since we're told we're adding the same constant to each $y$, that constant is $w_1^*$.
Another way to approach the question is as follows: consider any point that lies directly on a line with slope $w_1^*$ and intercept $w_0^*$. Consider how the slope between two points on a line is calculated: $\text{slope} = \frac{y_2 - y_1}{x_2 - x_1}$. If $x_2 - x_1 = 1$, in order for the slope to remain fixed we must have that $y_2 - y_1 = \text{slope}$. For a concrete example, think of the line $y = 5x + 2$. The point $(1, 7)$ is on the line, as is the point $(1 + 1, 7 + 5) = (2, 12)$.
In our case, none of our points are guaranteed to be on the line defined by slope $w_1^*$ and intercept $w_0^*$. Instead, we just want to be guaranteed that the points have the same regression line after being shifted. If we follow the same principle, though, and add 1 to every $x$ and $w_1^*$ to every $y$, the points' relative positions to the line will not change (i.e. the vertical distance from each point to the line will not change), and so that will remain the line with the lowest MSE, and hence $w_0^*$ and $w_1^*$ won't change.
Problem 4
Suppose we have a dataset of $n$ houses that were recently sold in the Ann Arbor area. For each house, we have its square footage and most recent sale price. The correlation between square footage and price is $r$.
Problem 4.1
First, we minimize mean squared error to fit a linear prediction rule that uses square footage to predict price. The resulting prediction rule has an intercept of $w_0^*$ and slope of $w_1^*$. In other words,
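$$\text{predicted price} = w_0^* + w_1^* \cdot (\text{square footage})$$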
We're now interested in minimizing mean squared error to fit a linear prediction rule that uses price to predict square footage. Suppose this new regression line has an intercept of $\beta_0^*$ and slope of $\beta_1^*$.
What is $\beta_1^*$? Give your answer in terms of one or more of $n$, $r$, $w_0^*$, and $w_1^*$. Show your work.
$\beta_1^* = \frac{r^2}{w_1^*}$
Throughout this solution, let $x$ represent square footage and $y$ represent price.
We know that $w_1^* = r\frac{\sigma_y}{\sigma_x}$. But what about $\beta_1^*$?
When we take a rule that predicts price from square footage and transform it into a rule that predicts square footage from price, the roles of $x$ and $y$ have swapped; suddenly, square footage is no longer our independent variable, but our dependent variable, and vice versa for price. This means that the altered dataset we work with when using our new prediction rule has $\sigma_x$ as the standard deviation of its dependent variable (square footage), and $\sigma_y$ as the standard deviation of its independent variable (price). So, we can write the formula for $\beta_1^*$ as follows: $$\beta_1^* = r\frac{\sigma_x}{\sigma_y}$$
In essence, swapping the independent and dependent variables of a dataset changes the slope of the regression line from $r\frac{\sigma_y}{\sigma_x}$ to $r\frac{\sigma_x}{\sigma_y}$.
From here, we can use a little algebra to get our $\beta_1^*$ in terms of one or more of $n$, $r$, $w_0^*$, and $w_1^*$:
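$$\beta_1^* = r\,\frac{\sigma_x}{\sigma_y} = \frac{r^2}{r\,\frac{\sigma_y}{\sigma_x}} = \frac{r^2}{w_1^*}$$
Problem 4.2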
For this part only, assume that the following quantities hold:
$n = 100$
$r = 0.6$
$w_0^* = 1000$
$w_1^* = 250$
The average square footage of homes in the dataset is 2000
Given this information, what is $\beta_0^*$? Give your answer as a constant, rounded to two decimal places. Show your work.
$\beta_0^* = 1278.56$
We start with the formula for the intercept of the regression line. Note that $x$ and $y$ are opposite what they'd normally be, since we're using price to predict square footage.
$$\beta_0^* = \bar{x} - \beta_1^*\bar{y}$$
We're told that the average square footage of homes in the dataset is 2000, so $\bar{x} = 2000$. We also know from part (a) that $\beta_1^* = \frac{r^2}{w_1^*}$, and from the information given in this part, this is $\beta_1^* = \frac{r^2}{w_1^*} = \frac{0.6^2}{250}$.
Finally, we need the average price of all homes in the dataset, $\bar{y}$. We aren't given this information directly, but we can use the fact that $(\bar{x}, \bar{y})$ is on the regression line that uses square footage to predict price to find $\bar{y}$. Specifically, we have that $\bar{y} = w_0^* + w_1^*\bar{x}$; we know that $w_0^* = 1000$, $\bar{x} = 2000$, and $w_1^* = 250$, so $\bar{y} = 1000 + 2000 \cdot 250 = 501000$.
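Putting it all together: $$\beta_0^* = \bar{x} - \beta_1^*\bar{y} = 2000 - \frac{0.36}{250} \cdot 501000 = 2000 - 721.44 = 1278.56$$
Problem 5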
The mean of 12 non-negative numbers is 45. Suppose we remove 2 of these numbers. What is the largest possible value of the mean of the remaining 10 numbers? Show your work.
54.
To maximize the mean of the remaining 10 numbers, we want to minimize the numbers that are removed. The smallest possible non-negative number is 0, so to maximize the mean of the remaining 10, we should remove two 0s from the set of numbers. Recall that the sum of all 12 numbers is $12 \cdot 45$; then, the maximum possible mean of the remaining 10 is $$\frac{12 \cdot 45 - 2 \cdot 0}{10} = \frac{6}{5} \cdot 45 = 54$$
Problem 6
Let $R_{\text{sq}}(h)$ represent the mean squared error of a constant prediction $h$ for a given dataset. Find a dataset $\{y_1, y_2\}$ such that the graph of $R_{\text{sq}}(h)$ has its minimum at the point $(7, 16)$.
The dataset is $\{3, 11\}$.
We've already learned that $R_{\text{sq}}(h)$ is minimized at the mean of the data, and the minimum value of $R_{\text{sq}}(h)$ is the variance of the data. So we need to provide a dataset of two points with a mean of $7$ and a variance of $16$. Recall that the variance is the average squared distance of each data point to the mean. Since we want a variance of $16$, we can make each point $4$ units away from the mean. Therefore, our dataset can be $y_1 = 3, y_2 = 11$.
In fact, this is the only solution.
A more calculative approach uses the formulas for mean and variance and solves a system of two equations:
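$$\frac{y_1 + y_2}{2} = 7 \qquad \frac{(y_1 - 7)^2 + (y_2 - 7)^2}{2} = 16$$
Substituting $y_2 = 14 - y_1$ into the second equation gives $2(y_1 - 7)^2 = 32$, so $y_1 - 7 = \pm 4$, i.e. $\{y_1, y_2\} = \{3, 11\}$.
Problem 7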
Consider a dataset $D$ with 5 data points $\{7, 5, 1, 2, a\}$, where $a$ is a positive real number. Note that $a$ is not necessarily an integer.
Problem 7.1
Express the mean of $D$ as a function of $a$, and simplify the expression as much as possible.
$\text{Mean}(D) = \frac{a}{5} + 3$
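$$\text{Mean}(D) = \frac{7 + 5 + 1 + 2 + a}{5} = \frac{15 + a}{5} = \frac{a}{5} + 3$$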
Problem 7.2
Depending on the range of $a$, the median of $D$ could assume one of three possible values. Write out all possible medians of $D$, along with the corresponding range of $a$ for each case. Express the ranges using double inequalities, e.g. $3 < a \le 8$:
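When $0 < a \le 2$, the sorted dataset is $\{a, 1, 2, 5, 7\}$ or $\{1, a, 2, 5, 7\}$, so $\text{Median}(D) = 2$. When $2 < a < 5$, the sorted dataset is $\{1, 2, a, 5, 7\}$, so $\text{Median}(D) = a$. When $a \ge 5$, the sorted dataset is $\{1, 2, 5, a, 7\}$ or $\{1, 2, 5, 7, a\}$, so $\text{Median}(D) = 5$.
Problem 7.3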
Determine the range of $a$ that satisfies: $$\text{Mean}(D) < \text{Median}(D)$$ Make sure to show your work.
$\frac{15}{4} < a < 10$
Since there are 3 possible median values, we will have to discuss each situation separately.
In case 1, when $0 < a \le 2$, $\text{Median}(D) = 2$. So, we have: $$\text{Mean}(D) < \text{Median}(D) \implies 3 + \frac{a}{5} < 2 \implies a < -5$$
But $a < -5$ is in conflict with the condition $0 < a \le 2$, therefore there is no solution in this situation, and $\text{Median}(D) = 2$ is impossible.
In case 2, when $2 < a < 5$, $\text{Median}(D) = a$. So, we have: $$\text{Mean}(D) < \text{Median}(D) \implies 3 + \frac{a}{5} < a \implies 3 < \frac{4}{5}a \implies a > \frac{15}{4}$$
So $a$ has to be larger than $\frac{15}{4}$. But remember from the prerequisite condition that $2 < a < 5$. To satisfy both conditions, we must have $\frac{15}{4} < a < 5$.
In case 3, when $a \ge 5$, $\text{Median}(D) = 5$. So, we have: $$\text{Mean}(D) < \text{Median}(D) \implies 3 + \frac{a}{5} < 5 \implies a < 10$$
Combining with the prerequisite condition, we have $5 \le a < 10$.
Combining the ranges of all three cases, we have $\frac{15}{4} < a < 10$ as our final answer.
Problem 8
Consider a dataset of $n$ integers, $y_1, y_2, ..., y_n$, whose histogram is given below:
Problem 8.1
Which of the following is closest to the constant prediction $h^*$ that minimizes: $$\frac{1}{n}\sum_{i=1}^n \begin{cases} 0 & y_i = h \\ 1 & y_i \neq h \end{cases}$$
1
5
6
7
11
15
30
30.
The minimizer of empirical risk for the constant model when using
zero-one loss is the mode.
Problem 8.2
Which of the following is closest to the constant prediction $h^*$ that minimizes: $$\frac{1}{n}\sum_{i=1}^n |y_i - h|$$
1
5
6
7
11
15
30
7.
The minimizer of empirical risk for the constant model when using absolute loss is the median. If the bar at 30 wasn't there, the median would be 6, but the existence of that bar drags the "halfway" point up slightly, to 7.
Problem 8.3
Which of the following is closest to the constant prediction $h^*$ that minimizes: $$\frac{1}{n}\sum_{i=1}^n (y_i - h)^2$$
1
5
6
7
11
15
30
11.
The minimizer of empirical risk for the constant model when using squared loss is the mean. The mean is heavily influenced by the presence of outliers, of which there are many at 30, dragging the mean up to 11. While you can't calculate the mean here, given the large right tail, this question can be answered by understanding that the mean must be larger than the median, which is 7, and 11 is the next biggest option.
Problem 8.4
Which of the following is closest to the constant prediction $h^*$ that minimizes: $$\lim_{p \to \infty} \frac{1}{n}\sum_{i=1}^n |y_i - h|^p$$
1
5
6
7
11
15
30
15.
The minimizer of empirical risk for the constant model when using
infinity loss is the midrange, i.e. halfway between the min and max.
Problem 9
Suppose there is a dataset containing 10000 integers:
2500 of them are 3s
2500 of them are 5s
4500 of them are 7s
500 of them are 9s.
Problem 9.1
Calculate the median of this dataset.
6
We know there is an even number of integers in this dataset because $10000 \% 2 = 0$. We can find the middle of the dataset as follows: $\frac{10000}{2} = 5000$. This means the elements in the 5000th and 5001st positions determine the median. The element at the 5000th position is a $5$ because $2500 + 2500 = 5000$. The element at the 5001st position is a $7$ because the next number after $5$ is $7$. We can then plug $5$ and $7$ into the equation: $$\frac{x_{5000} + x_{5001}}{2} = \frac{5 + 7}{2} = 6$$
Problem 9.2
How does the mean of this dataset compare to its median?
The mean is larger than the median
The mean is smaller than the median
The mean and the median are equal
The mean is smaller than the median.
We can calculate the mean as follows: $$\frac{2500 \cdot 3 + 2500 \cdot 5 + 4500 \cdot 7 + 500 \cdot 9}{10000} = 5.6$$ Using part (a), we know that $5.6 < 6$, which means the mean is smaller than the median.
Problem 10
Define the extreme mean ($EM$) of a dataset to be the average of its largest and smallest values. Let $f(x) = -3x + 4$. Show that for any dataset $x_1 \le x_2 \le \cdots \le x_n$, $$EM(f(x_1), f(x_2), \ldots, f(x_n)) = f(EM(x_1, x_2, \ldots, x_n))$$
This linear transformation reverses the order of the data, because if $a < b$, then $-3a > -3b$, and so adding four to both sides gives $f(a) > f(b)$. Since $x_1 \le x_2 \le \cdots \le x_n$, this means that the smallest of $f(x_1), f(x_2), \ldots, f(x_n)$ is $f(x_n)$ and the largest is $f(x_1)$. Therefore,
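$$EM(f(x_1), \ldots, f(x_n)) = \frac{f(x_1) + f(x_n)}{2} = \frac{(-3x_1 + 4) + (-3x_n + 4)}{2} = -3 \cdot \frac{x_1 + x_n}{2} + 4 = f\left(\frac{x_1 + x_n}{2}\right) = f(EM(x_1, \ldots, x_n))$$
Problem 11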
Consider a dataset of $n$ values, $y_1, y_2, ..., y_n$, all of which are non-negative. We're interested in fitting a constant model, $H(x) = h$, to the data, using the new "Wolverine" loss function:
$$L_{\text{wolverine}}(y_i, h) = w_i(y_i^2 - h^2)^2$$
Here, $w_i$ corresponds to the "weight" assigned to the data point $y_i$, the idea being that different data points can be weighted differently when finding the optimal constant prediction, $h^*$.
For example, for the dataset $y_1 = 1, y_2 = 5, y_3 = 2$, we will end up with different values of $h^*$ when we use the weights $w_1 = w_2 = w_3 = 1$ and when we use the weights $w_1 = 8, w_2 = 4, w_3 = 3$.
Problem 11.1
Find $\frac{\partial L_{\text{wolverine}}}{\partial h}$, the derivative of the Wolverine loss function with respect to $h$. Show your work.
$$\frac{\partial L}{\partial h} = -4w_ih(y_i^2 - h^2)$$
To solve this problem we simply take the derivative of $L_{\text{wolverine}}(y_i, h) = w_i(y_i^2 - h^2)^2$.
We can use the chain rule to find the derivative. The chain rule is: $$\frac{\partial}{\partial h}[f(g(h))] = f'(g(h))g'(h)$$
Note that $(y_i^2 - h^2)^2$ is the part of $L_{\text{wolverine}}(y_i, h) = w_i(y_i^2 - h^2)^2$ we need the chain rule for, because that is where $h$ appears.
In this case $f(h) = h^2$ and $g(h) = y_i^2 - h^2$. We can then take the derivative of both to get $f'(h) = 2h$ and $g'(h) = -2h$.
This tells us the derivative is: $\frac{\partial L}{\partial h} = w_i \cdot 2(y_i^2 - h^2) \cdot (-2h)$, which can be simplified to $\frac{\partial L}{\partial h} = -4w_ih(y_i^2 - h^2)$.
Problem 11.2
Prove that the constant prediction that minimizes average loss for the Wolverine loss function is:
$$h^* = \sqrt{\frac{\sum_{i=1}^n w_iy_i^2}{\sum_{i=1}^n w_i}}$$
The recipe for minimizing average loss is to find the derivative of the risk function, set it equal to zero, and solve for $h^*$.
We know that average loss follows the equation $R(h) = \frac{1}{n}\sum_{i=1}^n L(y_i, h)$. This means that $R_{\text{wolverine}}(h) = \frac{1}{n}\sum_{i=1}^n w_i(y_i^2 - h^2)^2$.
Recall we have already found the derivative of $L_{\text{wolverine}}(y_i, h) = w_i(y_i^2 - h^2)^2$, and that $\frac{\partial R}{\partial h} = \frac{1}{n}\sum_{i=1}^n \frac{\partial L}{\partial h}$. So we have $$\frac{\partial R_{\text{wolverine}}}{\partial h} = \frac{1}{n}\sum_{i=1}^n -4hw_i(y_i^2 - h^2)$$
We can now do the last two steps: $$\begin{aligned} 0 &= \frac{1}{n}\sum_{i=1}^n -4hw_i(y_i^2 - h^2) \\ 0 &= \frac{-4h}{n}\sum_{i=1}^n w_i(y_i^2 - h^2) \\ 0 &= \sum_{i=1}^n w_i(y_i^2 - h^2) \\ 0 &= \sum_{i=1}^n \left(w_iy_i^2 - w_ih^2\right) \\ 0 &= \sum_{i=1}^n w_iy_i^2 - \sum_{i=1}^n w_ih^2 \\ \sum_{i=1}^n w_ih^2 &= \sum_{i=1}^n w_iy_i^2 \\ h^2\sum_{i=1}^n w_i &= \sum_{i=1}^n w_iy_i^2 \\ h^2 &= \frac{\sum_{i=1}^n w_iy_i^2}{\sum_{i=1}^n w_i} \\ h^* &= \sqrt{\frac{\sum_{i=1}^n w_iy_i^2}{\sum_{i=1}^n w_i}} \end{aligned}$$
Problem 11.3
For a dataset of non-negative values $y_1, y_2, ..., y_n$ with weights $w_1, 1, ..., 1$, evaluate: $$\lim_{w_1 \to \infty} h^*$$
The maximum of $y_1, y_2, ..., y_n$
The mean of $y_1, y_2, ..., y_{n-1}$
The mean of $y_2, y_3, ..., y_n$
The mean of $y_2, y_3, ..., y_n$, multiplied by $\frac{n}{n-1}$
$y_1$
$y_n$
$y_1$.
Recall from part (b) that $h^* = \sqrt{\frac{\sum_{i=1}^n w_iy_i^2}{\sum_{i=1}^n w_i}}$.
The problem is asking us to evaluate $\lim_{w_1 \to \infty} \sqrt{\frac{\sum_{i=1}^n w_iy_i^2}{\sum_{i=1}^n w_i}}$.
With weights $w_1, 1, ..., 1$, we can rewrite this as $\lim_{w_1 \to \infty} \sqrt{\frac{w_1y_1^2 + \sum_{i=2}^n y_i^2}{w_1 + (n - 1)}}$. Note that $\sum_{i=2}^n y_i^2$ and $n - 1$ are insignificant because they are constants; constants compared to infinity can be ignored. In the limit, we are left with $\sqrt{\frac{w_1y_1^2}{w_1}}$. We can cancel out the $w_1$ to get $\sqrt{y_1^2}$, which becomes $y_1$ since the values are non-negative.
Problem 12
Suppose we're given a dataset of $n$ points, $(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)$, where $\bar{x}$ is the mean of $x_1, x_2, ..., x_n$ and $\bar{y}$ is the mean of $y_1, y_2, ..., y_n$.
Using this dataset, we create a transformed dataset of $n$ points, $(x_1', y_1'), (x_2', y_2'), ..., (x_n', y_n')$, where:
$$x_i' = 4x_i - 3 \qquad y_i' = y_i + 24$$
That is, the transformed dataset is of the form $(4x_1 - 3, y_1 + 24), ..., (4x_n - 3, y_n + 24)$.
We decide to fit a simple linear hypothesis function $H(x') = w_0 + w_1x'$ on the transformed dataset using squared loss. We find that $w_0^* = 7$ and $w_1^* = 2$, so $H^*(x') = 7 + 2x'$.
Problem 12.1
Suppose we were to fit a simple linear hypothesis function through the original dataset, $(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)$, again using squared loss. What would the optimal slope be?
2
4
6
8
11
12
24
8.
Relative to the dataset with $x'$, the dataset with $x$ has an $x$-variable that's "compressed" by a factor of 4, so the slope increases by a factor of 4 to $2 \cdot 4 = 8$.
Concretely, this can be shown by looking at the formula $2 = r\frac{\text{SD}(y')}{\text{SD}(x')}$, recognizing that $\text{SD}(y') = \text{SD}(y)$ since the $y$ values have the same spread in both datasets, and that $\text{SD}(x') = 4\,\text{SD}(x)$.
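Putting these together: $$\text{slope on original dataset} = r\frac{\text{SD}(y)}{\text{SD}(x)} = r\frac{\text{SD}(y')}{\text{SD}(x')/4} = 4 \cdot 2 = 8$$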
Problem 12.2
Recall, the hypothesis function $H^*$ was fit on the transformed dataset, $(x_1', y_1'), (x_2', y_2'), ..., (x_n', y_n')$. $H^*$ happens to pass through the point $(\bar{x}, \bar{y})$. What is the value of $\bar{x}$? Give your answer as an integer with no variables.
5.
The key idea is that the regression line always passes through $(\text{mean } x, \text{mean } y)$ in the dataset we used to fit it. So, we know that: $2\bar{x}' + 7 = \bar{y}'$. This first equation can be rewritten as: $2 \cdot (4\bar{x} - 3) + 7 = \bar{y} + 24$.
We're also told this line passes through $(\bar{x}, \bar{y})$, which means that it's also true that: $2\bar{x} + 7 = \bar{y}$.
Now we have a system of two equations:
$$\begin{cases} 2 \cdot (4\bar{x} - 3) + 7 = \bar{y} + 24 \\ 2\bar{x} + 7 = \bar{y} \end{cases}$$
... and solving our system of two equations gives: $\bar{x} = 5$.
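Explicitly, the first equation simplifies to $8\bar{x} + 1 = \bar{y} + 24$; substituting $\bar{y} = 2\bar{x} + 7$ gives $8\bar{x} + 1 = 2\bar{x} + 31$, so $6\bar{x} = 30$ and $\bar{x} = 5$.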
Problem 13
For a given dataset $\{y_1, y_2, \ldots, y_n\}$, let $M_{\text{abs}}(h)$ represent the median absolute error of the constant prediction $h$ on that dataset (as opposed to the mean absolute error, $R_{\text{abs}}(h)$).
Problem 13.1
For the dataset $\{4, 9, 10, 14, 15\}$, what is $M_{\text{abs}}(9)$?
$5$
The first step is to calculate the absolute errors $|y_i - h|$:
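$$\{|4 - 9|, |9 - 9|, |10 - 9|, |14 - 9|, |15 - 9|\} = \{5, 0, 1, 5, 6\}$$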
Now we have to order the absolute errors: $\{0, 1, 5, 5, 6\}$. We can see the median is $5$, so $M_{\text{abs}}(9) = 5$.
Problem 13.2
For the same dataset $\{4, 9, 10, 14, 15\}$, find another integer $h$ such that $M_{\text{abs}}(9) = M_{\text{abs}}(h)$.
$5$ or $15$
Our goal is to find another number that will give us the same median of absolute errors as in part (a).
One way to do this is to plug in a number and guess. Another way requires noticing that the median error of $5$ can come from the middle element, $10$: we need $|10 - h| = 5$, which can be satisfied in either direction (negative or positive) because of the absolute value.
We can solve this equation to get $|10 - h| = 5 \implies h = 15$ or $h = 5$.
We can then test this by following the same steps as we did in part (a). For $h = 5$, the absolute errors are:
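$$\{|4 - 5|, |9 - 5|, |10 - 5|, |14 - 5|, |15 - 5|\} = \{1, 4, 5, 9, 10\}$$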
We do not have to re-order the elements because they are already in order. We can see the median is $5$, so $M_{\text{abs}}(5) = 5$.
Problem 13.3
Based on your answers to parts (a) and (b), discuss in at
most two sentences what is problematic about using the median
absolute error to make predictions.
The numbers $5$ and $15$ are clearly bad predictions (close to the extreme values in the dataset), yet they are considered just as good a prediction by this metric as the number $9$, which is roughly in the center of the dataset. Intuitively, $9$ is a much better prediction, but this way of measuring the quality of a prediction does not recognize that.
Problem 14
Suppose we are given a dataset of points $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ and, for some reason, we want to make predictions using a prediction rule of the form $H(x) = 17 + w_1x$.
Problem 14.1
Write down an expression for the mean squared error of a prediction rule of this form, as a function of the parameter $w_1$.
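$$\text{MSE}(w_1) = \frac{1}{n}\sum_{i=1}^n (y_i - 17 - w_1x_i)^2$$
Problem 14.2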
Minimize the function $\text{MSE}(w_1)$ to find the parameter $w_1^*$ which defines the optimal prediction rule $H^*(x) = 17 + w_1^*x$. Show all your work and explain your steps.
To minimize a function of one variable, we need to take the derivative, set it equal to zero, and solve.
$$\begin{aligned} \text{MSE}(w_1) &= \frac{1}{n}\sum_{i=1}^n (y_i - 17 - w_1x_i)^2 \\ \text{MSE}'(w_1) &= \frac{1}{n}\sum_{i=1}^n -2x_i(y_i - 17 - w_1x_i) && \text{using the chain rule} \\ 0 &= \frac{1}{n}\sum_{i=1}^n -2x_i(y_i - 17) + \frac{1}{n}\sum_{i=1}^n 2x_i^2w_1 && \text{splitting up the sum} \\ 0 &= -\sum_{i=1}^n x_i(y_i - 17) + \sum_{i=1}^n x_i^2w_1 && \text{multiplying through by } \tfrac{n}{2} \\ w_1\sum_{i=1}^n x_i^2 &= \sum_{i=1}^n x_i(y_i - 17) && \text{rearranging terms and pulling out } w_1 \\ w_1^* &= \frac{\sum_{i=1}^n x_i(y_i - 17)}{\sum_{i=1}^n x_i^2} \end{aligned}$$
Problem 14.3
True or False: For an arbitrary dataset, the prediction rule $H^*(x) = 17 + w_1^*x$ goes through the point $(\bar{x}, \bar{y})$.
True
False
False.
When we fit a prediction rule of the form $H(x) = w_0 + w_1x$ using simple linear regression, the formula for the intercept $w_0$ is designed to make sure the regression line passes through the point $(\bar{x}, \bar{y})$. Here, we don't have the freedom to control our intercept, as it's forced to be $17$. This means we can't guarantee that the prediction rule $H^*(x) = 17 + w_1^*x$ goes through the point $(\bar{x}, \bar{y})$.
A simple example shows that this is the case. Consider the dataset $(-2, 0)$ and $(2, 0)$. The point $(\bar{x}, \bar{y})$ is the origin, but the prediction rule $H^*(x)$ does not pass through the origin because it has an intercept of $17$.
Problem 14.4
True or False: For an arbitrary dataset, the mean squared error associated with $H^*(x)$ is greater than or equal to the mean squared error associated with the regression line.
True
False
True.
The regression line is the prediction rule of the form $H(x) = w_0 + w_1x$ with the smallest mean squared error (MSE). $H^*(x)$ is one example of a prediction rule of that form, so unless it happens to be the regression line itself, the regression line will have a lower MSE, because it was designed to have the lowest possible MSE. This means the MSE associated with $H^*(x)$ is greater than or equal to the MSE associated with the regression line.
Problem 15
Suppose you have a dataset $\{(x_1, y_1), (x_2, y_2), \ldots, (x_8, y_8)\}$ with $n = 8$ ordered pairs such that the variance of $\{x_1, x_2, \ldots, x_8\}$ is $50$. Let $m$ be the slope of the regression line fit to this data.
Suppose now we fit a regression line to the dataset $\{(x_1, y_2), (x_2, y_1), \ldots, (x_8, y_8)\}$ where the first two $y$-values have been swapped. Let $m'$ be the slope of this new regression line.
If $x_1 = 3$, $y_1 = 7$, $x_2 = 8$, and $y_2 = 2$, what is the difference between the new slope and the old slope? That is, what is $m' - m$? The answer you get should be a number with no variables.
Hint: There are many equivalent formulas for the slope of the regression line. We recommend using the version of the formula without $\bar{y}$.
$m' - m = \frac{1}{16}$
Using the formula for the slope of the regression line, we have:
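$$m = \frac{\sum_{i=1}^8 (x_i - \bar{x})y_i}{\sum_{i=1}^8 (x_i - \bar{x})^2}$$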
Note that by switching the first two $y$-values, the terms in the sum from $i = 3$ to $n$, the number of data points $n$, and the variance of the $x$-values are all unchanged.
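So only the first two terms of the numerator change, and the denominator is $n \cdot \text{Var}(x) = 8 \cdot 50 = 400$: $$m' - m = \frac{\left[(x_1 - \bar{x})y_2 + (x_2 - \bar{x})y_1\right] - \left[(x_1 - \bar{x})y_1 + (x_2 - \bar{x})y_2\right]}{400} = \frac{(x_2 - x_1)(y_1 - y_2)}{400} = \frac{(8 - 3)(7 - 2)}{400} = \frac{25}{400} = \frac{1}{16}$$
Problem 16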
Note that we have two simplified closed-form expressions for the estimated slope $w$ in simple linear regression that you have already seen in discussions and lectures:
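$$w = \frac{\sum_i (x_i - \bar{x})y_i}{\sum_i (x_i - \bar{x})^2} \tag{1}$$
$$w = \frac{\sum_i (y_i - \bar{y})x_i}{\sum_i (x_i - \bar{x})^2} \tag{2}$$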
where we have the dataset $D = [(x_1, y_1), \ldots, (x_n, y_n)]$ and sample means $\bar{x} = \frac{1}{n}\sum_i x_i$, $\bar{y} = \frac{1}{n}\sum_i y_i$. Without further explanation, $\sum_i$ means $\sum_{i=1}^n$.
Problem 16.1
Are (1) and (2) equivalent? That is, is the following equality true? Prove or disprove it. $$\sum_i (x_i - \bar{x})y_i = \sum_i (y_i - \bar{y})x_i$$
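The equality is true, so (1) and (2) are equivalent. Expanding the left-hand side: $$\sum_i (x_i - \bar{x})y_i = \sum_i x_iy_i - \bar{x}\sum_i y_i = \sum_i x_iy_i - n\bar{x}\bar{y}$$ Expanding the right-hand side gives the same thing: $$\sum_i (y_i - \bar{y})x_i = \sum_i x_iy_i - \bar{y}\sum_i x_i = \sum_i x_iy_i - n\bar{x}\bar{y}$$
Problem 16.2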
True or False: If the dataset is shifted right by a constant distance $a$, that is, we have the new dataset $D_a = [(x_1 + a, y_1), \ldots, (x_n + a, y_n)]$, then the estimated slope $w$ will change.
True
False
False. By (1) in part (a), we can view $w$ as only being affected by $x_i - \bar{x}$, which is unchanged after shifting horizontally. Therefore, $w$ is unchanged.
Problem 16.3
True or False: If the dataset is shifted up by a constant distance $b$, that is, we have the new dataset $D_b = [(x_1, y_1 + b), \ldots, (x_n, y_n + b)]$, then the estimated slope $w$ will change.
True
False
False. By (2) in part (a), we can view $w$ as only being affected by $y_i - \bar{y}$, which is unchanged after shifting vertically. Therefore, $w$ is unchanged.
Problem 17
Consider a dataset that consists of $y_1, \cdots, y_n$. In class, we used calculus to minimize mean squared error, $R_{\text{sq}}(h) = \frac{1}{n}\sum_{i=1}^n (h - y_i)^2$. In this problem, we want you to apply the same approach to a slightly different loss function, defined below: $$L_{\text{midterm}}(y, h) = (\alpha y - h)^2 + \lambda h$$
Problem 17.1
Write down the empirical risk $R_{\text{midterm}}(h)$ by using the above loss function.
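$$R_{\text{midterm}}(h) = \frac{1}{n}\sum_{i=1}^n \left[(\alpha y_i - h)^2 + \lambda h\right] = \frac{1}{n}\sum_{i=1}^n (\alpha y_i - h)^2 + \lambda h$$
Problem 17.2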
The mean of the dataset is $\bar{y}$, i.e. $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$. Find $h^*$ that minimizes $R_{\text{midterm}}(h)$ using calculus. Your result should be in terms of $\bar{y}$, $\alpha$ and $\lambda$.
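Taking the derivative of the empirical risk and setting it equal to zero: $$\frac{dR_{\text{midterm}}}{dh} = \frac{1}{n}\sum_{i=1}^n -2(\alpha y_i - h) + \lambda = -2\alpha\bar{y} + 2h + \lambda = 0 \implies h^* = \alpha\bar{y} - \frac{\lambda}{2}$$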