Loss Functions and Simple Linear Regression



The problems in this worksheet are taken from past exams in similar classes. Work on them on paper, since the exams you take in this course will also be on paper.

This video 🎥, recorded in office hours, gives an overview of Loss Functions, the Constant Model, Mean, and Variance, while this video 🎥 overviews Simple Linear Regression.


Problem 1

Biff the Wolverine just made an Instagram account and has been keeping track of the number of likes his posts have received so far.

His first 7 posts have received a mean of 16 likes; the specific like counts in sorted order are

8, 12, 12, 15, 18, 20, 27

Biff the Wolverine wants to predict the number of likes his next post will receive, using a constant prediction rule $h$. For each loss function $L(y_i, h)$, determine the constant prediction $h^*$ that minimizes average loss. If you believe there are multiple minimizers, specify them all. If you believe you need more information to answer the question or that there is no minimizer, state that clearly. Give a brief justification for each answer.


Problem 1.1

$L(y_i, h) = |y_i - h|$

This is absolute loss, and hence we're looking for the minimizer of mean absolute error, which is the median, 15.


Problem 1.2

$L(y_i, h) = (y_i - h)^2$

This is squared loss, and hence we're looking for the minimizer of mean squared error, which is the mean, 16.
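As a quick numerical sanity check of these two answers, here is a short sketch in Python; the grid of candidate predictions is our own choice, not part of the problem.

```python
import numpy as np

y = np.array([8, 12, 12, 15, 18, 20, 27])   # Biff's like counts
hs = np.linspace(0, 40, 4001)               # candidate constant predictions h, step 0.01

# Empirical risk of each candidate h under absolute loss and squared loss.
mae = np.mean(np.abs(y[:, None] - hs[None, :]), axis=0)
mse = np.mean((y[:, None] - hs[None, :]) ** 2, axis=0)

print(hs[np.argmin(mae)])   # ≈ 15, the median
print(hs[np.argmin(mse)])   # ≈ 16, the mean
```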


Problem 1.3

$L(y_i, h) = 4(y_i - h)^2$

This is squared loss, multiplied by a constant. Note that when we go to minimize empirical risk for this loss function, we will take the derivative of empirical risk and set it equal to 0; at that point the constant factor of 4 can be divided from both sides, so this problem boils down to minimizing ordinary mean squared error. The only difference is that the graph of mean squared error will be stretched vertically by a factor of 4; the minimizing value will be in the same place.

For more justification, here we consider any general re-scaling $\alpha (y_i - h)^2$:

$$
\begin{aligned}
R_{sq}(h) &= \frac{1}{n} \sum_{i = 1}^n \alpha (y_i - h)^2 \\
&= \alpha \cdot \frac{1}{n} \sum_{i = 1}^n (y_i - h)^2 \\
\frac{d}{dh} R_{sq}(h) &= \alpha \cdot \frac{1}{n} \sum_{i = 1}^n 2(y_i - h)(-1) = 0 \\
&\implies -\frac{2\alpha}{n}\sum_{i = 1}^n (y_i - h) = 0 \\
&\implies \sum_{i = 1}^n (y_i - h) = 0 \\
&\implies h^* = \frac{1}{n} \sum_{i = 1}^n y_i
\end{aligned}
$$


Problem 1.4

$$L(y_i, h) = \begin{cases} 0 & h = y_i \\ 100 & h \neq y_i \end{cases}$$

This is a scaled version of 0-1 loss. We know that empirical risk for 0-1 loss is minimized at the mode, so that also applies here. The mode, i.e. the most common value, is 12.


Problem 1.5

$L(y_i, h) = (3y_i - 4h)^2$

Note that we can write $(3y_i - 4h)^2$ as $\left( 3 \left( y_i - \frac{4}{3}h \right) \right)^2 = 9 \left( y_i - \frac{4}{3}h \right)^2$. As we've seen, the constant factor out front has no impact on the minimizing value. Using the same principle as in the last part, we can say that $\frac{4}{3} h^* = \bar{y} \implies h^* = \frac{3}{4} \bar{y} = \frac{3}{4} \cdot 16 = 12$.


Problem 1.6

$L(y_i, h) = (y_i - h)^3$

Hint: Do not spend too long on this subpart.

No minimizer.

Note that unlike $|y_i - h|$, $(y_i - h)^2$, and all of the other loss functions we've seen, $(y_i - h)^3$ tends towards $-\infty$, rather than having a minimum output of 0. This means that there is no $h$ that minimizes $\frac{1}{n} \sum_{i = 1}^n (y_i - h)^3$; the larger we make $h$, the more negative (and hence "smaller") this empirical risk becomes.



Problem 2

You may find the following properties of logarithms helpful in this question: $\log\left(\frac{a}{b}\right) = \log a - \log b$, $\log(ab) = \log a + \log b$, and $\log(a^n) = n \log a$. Assume that all logarithms in this question are natural logarithms, i.e. of base $e$.

Billy is trying his hand at coming up with loss functions. He comes up with the Billy loss, $L_B(y_i, h)$, defined as follows:

$$L_B(y_i, h) = \left[ \log \left( \frac{y_i}{h} \right) \right]^2$$

Throughout this problem, assume that all $y_i$s are positive.


Problem 2.1

Show that: $\frac{d}{dh} L_B(y_i, h) = - \frac{2}{h} \log \left( \frac{y_i}{h} \right)$

$$
\begin{aligned}
\frac{d}{dh} L_B(y_i, h) &= \frac{d}{dh} \left[ \log \left( \frac{y_i}{h} \right) \right]^2 \\
&= 2 \cdot \log \left( \frac{y_i}{h} \right) \cdot \frac{d}{dh} \log \left( \frac{y_i}{h} \right) \\
&= 2 \cdot \log \left( \frac{y_i}{h} \right) \cdot \frac{d}{dh} \left( \log(y_i) - \log(h) \right) \\
&= 2 \cdot \log \left( \frac{y_i}{h} \right) \cdot \left( - \frac{1}{h} \right) \\
&= -\frac{2}{h} \log \left( \frac{y_i}{h} \right)
\end{aligned}
$$


Problem 2.2

Show that the constant prediction $h^*$ that minimizes average Billy loss for the constant model is:

$$h^* = \left(y_1 \cdot y_2 \cdot ... \cdot y_n \right)^{\frac{1}{n}}$$

You do not need to perform a second derivative test, but otherwise you must show your work.

Hint: To confirm that you're interpreting the result correctly, $h^*$ for the dataset 3, 5, 16 is $(3 \cdot 5 \cdot 16)^{\frac{1}{3}} = 240^{\frac{1}{3}} \approx 6.214$.

$$
\begin{aligned}
R_B(h) &= \frac{1}{n} \sum_{i = 1}^n \left[ \log \left( \frac{y_i}{h} \right) \right]^2 \\
\frac{d}{dh} R_B(h) &= \frac{1}{n} \sum_{i = 1}^n \frac{d}{dh} \left[ \log \left( \frac{y_i}{h} \right) \right]^2 \\
&= \frac{1}{n} \sum_{i = 1}^n -\frac{2}{h} \log \left( \frac{y_i}{h} \right) \\
&= -\frac{2}{nh} \sum_{i = 1}^n \log \left( \frac{y_i}{h} \right) = 0 \\
0 &= \sum_{i = 1}^n \log \left( \frac{y_i}{h} \right) = \sum_{i = 1}^n \left( \log(y_i) - \log(h)\right) \\
0 &= \sum_{i = 1}^n \log(y_i) - \log(h) \sum_{i = 1}^n 1 \\
0 &= \left( \log(y_1) + \log(y_2) + ... + \log(y_n) \right) - n \log(h) \\
\log(h^n) &= \log(y_1 \cdot y_2 \cdot ... \cdot y_n) \\
h^n &= y_1 \cdot y_2 \cdot ... \cdot y_n \\
h^* &= (y_1 \cdot y_2 \cdot ... \cdot y_n)^{\frac{1}{n}}
\end{aligned}
$$
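As a numerical check of this result, here is a sketch that grid-searches the average Billy loss on the small dataset from the hint; the grid itself is our own choice.

```python
import numpy as np

y = np.array([3.0, 5.0, 16.0])            # dataset from the hint
hs = np.linspace(0.1, 20, 19901)          # positive candidate predictions, step 0.001

# Average Billy loss for every candidate h.
risk = np.mean(np.log(y[:, None] / hs[None, :]) ** 2, axis=0)

print(hs[np.argmin(risk)])                # ≈ 6.214
print(np.prod(y) ** (1 / len(y)))         # geometric mean: 240 ** (1/3) ≈ 6.214
```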



Problem 3

Billy decides to take on a part-time job as a waiter at the Panda Express in Pierpont. For two months, he kept track of all of the total bills he gave out to customers along with the tips they then gave him, all in dollars. Below is a scatter plot of Billy's tips and total bills.

Throughout this question, assume we are trying to fit a linear prediction rule $H(x) = w_0 + w_1x$ that uses total bills to predict tips, and assume we are finding optimal parameters by minimizing mean squared error.


Problem 3.1

Which of these is the most likely value for $r$, the correlation between total bill and tips? Why?

$$-1 \qquad -0.75 \qquad -0.25 \qquad 0 \qquad 0.25 \qquad 0.75 \qquad 1$$

0.75.

It seems like there is a pretty strong, but not perfect, linear association between total bills and tips.


Problem 3.2

The variance of the tip amounts is 2.1. Let $M$ be the mean squared error of the best linear prediction rule on this dataset (under squared loss). Is $M$ less than, equal to, or greater than 2.1? How can you tell?

$M$ is less than 2.1. The variance is equal to the MSE of the best constant prediction rule, i.e. of predicting the mean.

Note that the MSE of the best linear prediction rule will always be less than or equal to the MSE of the best constant prediction rule $h$. The only case in which these two MSEs are the same is when the best linear prediction rule is a flat line with slope 0, which is the same as a constant prediction. In all other cases, the linear prediction rule will make better predictions and hence have a lower MSE than the constant prediction rule.

In this case, the best linear prediction rule is clearly not flat, so $M < 2.1$.


Problem 3.3

Suppose we use the formulas from class on Billy's dataset and calculate the optimal slope $w_1^*$ and intercept $w_0^*$ for this prediction rule.

Suppose we add the value of 1 to every total bill $x$, effectively shifting the scatter plot 1 unit to the right. Note that doing this does not change the value of $w_1^*$. What amount should we add to each tip $y$ so that the value of $w_0^*$ also does not change? Your answer should involve one or more of $\bar{x}$, $\bar{y}$, $w_0^*$, $w_1^*$, and any constants.

Note: To receive full points, you must provide a rigorous explanation, though this explanation only takes a few lines. However, we will award partial credit to solutions with the correct answer, and it's possible to arrive at the correct answer by drawing a picture and thinking intuitively about what happens.

We should add $w_1^*$ to each tip $y$.

First, we present the rigorous solution.

Let $\bar{x}_\text{old}$ represent the previous mean of the $x$'s and $\bar{x}_\text{new}$ represent the new mean of the $x$'s. Then, we know that $\bar{x}_\text{new} = \bar{x}_\text{old} + 1$.

Also, let $\bar{y}_\text{old}$ and $\bar{y}_\text{new}$ represent the old and new mean of the $y$'s. We will try and find a relationship between these two quantities.

We want the two intercepts to be the same. The intercept for the old line is $\bar{y}_\text{old} - w_1^* \bar{x}_\text{old}$ and the intercept for the new line is $\bar{y}_\text{new} - w_1^* \bar{x}_\text{new}$. Setting these equal yields

$$
\begin{aligned}
\bar{y}_\text{new} - w_1^* \bar{x}_\text{new} &= \bar{y}_\text{old} - w_1^* \bar{x}_\text{old} \\
\bar{y}_\text{new} - w_1^* (\bar{x}_\text{old} + 1) &= \bar{y}_\text{old} - w_1^* \bar{x}_\text{old} \\
\bar{y}_\text{new} &= \bar{y}_\text{old} - w_1^* \bar{x}_\text{old} + w_1^* (\bar{x}_\text{old} + 1) \\
\bar{y}_\text{new} &= \bar{y}_\text{old} + w_1^*
\end{aligned}
$$

Thus, in order for the intercepts to be equal, we need the mean of the new $y$'s to be $w_1^*$ greater than the mean of the old $y$'s. Since we're told we're adding the same constant to each $y$, that constant is $w_1^*$.

Another way to approach the question is as follows: consider any point that lies directly on a line with slope $w_1^*$ and intercept $w_0^*$. Consider how the slope between two points on a line is calculated: $\text{slope} = \frac{y_2 - y_1}{x_2 - x_1}$. If $x_2 - x_1 = 1$, in order for the slope to remain fixed we must have that $y_2 - y_1 = \text{slope}$. For a concrete example, think of the line $y = 5x + 2$. The point $(1, 7)$ is on the line, as is the point $(1 + 1, 7 + 5) = (2, 12)$.

In our case, none of our points are guaranteed to be on the line defined by slope $w_1^*$ and intercept $w_0^*$. Instead, we just want to be guaranteed that the points have the same regression line after being shifted. If we follow the same principle, though, and add 1 to every $x$ and $w_1^*$ to every $y$, the points' relative positions to the line will not change (i.e. the vertical distance from each point to the line will not change), and so that will remain the line with the lowest MSE, and hence $w_0^*$ and $w_1^*$ won't change.
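To see this concretely, here is a small simulation (with synthetic bills and tips of our own invention, not Billy's actual data) checking that adding 1 to every $x$ and $w_1^*$ to every $y$ leaves both optimal parameters unchanged:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(5, 50, size=100)                 # synthetic total bills
y = 0.15 * x + rng.normal(0, 1, size=100)        # synthetic tips

w1, w0 = np.polyfit(x, y, 1)                     # optimal slope and intercept

# Shift every x by 1 and every y by w1: the fitted line should be identical.
w1_new, w0_new = np.polyfit(x + 1, y + w1, 1)

print(np.allclose([w1, w0], [w1_new, w0_new]))   # True
```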



Problem 4

Suppose we have a dataset of $n$ houses that were recently sold in the Ann Arbor area. For each house, we have its square footage and most recent sale price. The correlation between square footage and price is $r$.


Problem 4.1

First, we minimize mean squared error to fit a linear prediction rule that uses square footage to predict price. The resulting prediction rule has an intercept of $w_0^*$ and slope of $w_1^*$. In other words,

$$\text{predicted price} = w_0^* + w_1^* \cdot \text{square footage}$$

We're now interested in minimizing mean squared error to fit a linear prediction rule that uses price to predict square footage. Suppose this new regression line has an intercept of $\beta_0^*$ and slope of $\beta_1^*$.

What is $\beta_1^*$? Give your answer in terms of one or more of $n$, $r$, $w_0^*$, and $w_1^*$. Show your work.

$$\beta_1^* = \frac{r^2}{w_1^*}$$

Throughout this solution, let $x$ represent square footage and $y$ represent price.

We know that $w_1^* = r \frac{\sigma_y}{\sigma_x}$. But what about $\beta_1^*$?

When we take a rule that predicts price from square footage and transform it into a rule that predicts square footage from price, the roles of $x$ and $y$ have swapped; suddenly, square footage is no longer our independent variable, but our dependent variable, and vice versa for price. This means that the altered dataset we work with when using our new prediction rule has $\sigma_x$ standard deviation for its dependent variable (square footage), and $\sigma_y$ for its independent variable (price). So, we can write the formula for $\beta_1^*$ as follows: $\beta_1^* = r \frac{\sigma_x}{\sigma_y}$

In essence, swapping the independent and dependent variables of a dataset changes the slope of the regression line from $r \frac{\sigma_y}{\sigma_x}$ to $r \frac{\sigma_x}{\sigma_y}$.

From here, we can use a little algebra to get our $\beta_1^*$ in terms of one or more of $n$, $r$, $w_0^*$, and $w_1^*$:

$$
\begin{aligned}
\beta_1^* &= r \frac{\sigma_x}{\sigma_y} \\
w_1^* \cdot \beta_1^* &= w_1^* \cdot r \frac{\sigma_x}{\sigma_y} \\
w_1^* \cdot \beta_1^* &= \left( r \frac{\sigma_y}{\sigma_x} \right) \cdot r \frac{\sigma_x}{\sigma_y}
\end{aligned}
$$

The fractions $\frac{\sigma_y}{\sigma_x}$ and $\frac{\sigma_x}{\sigma_y}$ cancel out and we get:

$$
\begin{aligned}
w_1^* \cdot \beta_1^* &= r^2 \\
\beta_1^* &= \frac{r^2}{w_1^*}
\end{aligned}
$$
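A quick simulation (with made-up square footages and prices, not the actual Ann Arbor data) confirms the relationship $\beta_1^* = \frac{r^2}{w_1^*}$:

```python
import numpy as np

rng = np.random.default_rng(0)
sqft = rng.uniform(800, 3500, size=200)                  # synthetic square footages
price = 150 * sqft + rng.normal(0, 50_000, size=200)     # synthetic prices

w1, w0 = np.polyfit(sqft, price, 1)     # predicts price from square footage
b1, b0 = np.polyfit(price, sqft, 1)     # predicts square footage from price
r = np.corrcoef(sqft, price)[0, 1]

print(np.isclose(b1, r ** 2 / w1))      # True
```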


Problem 4.2

For this part only, assume that the following quantities hold: $r = 0.6$, $w_0^* = 1000$, $w_1^* = 250$, and the average square footage of homes in the dataset is 2000.

Given this information, what is $\beta_0^*$? Give your answer as a constant, rounded to two decimal places. Show your work.

$$\beta_0^* = 1278.56$$

We start with the formula for the intercept of the regression line. Note that $x$ and $y$ are opposite what they'd normally be since we're using price to predict square footage.

$$\beta_0^* = \bar{x} - \beta_1^* \bar{y}$$

We're told that the average square footage of homes in the dataset is 2000, so $\bar{x} = 2000$. We also know from part (a) that $\beta_1^* = \frac{r^2}{w_1^*}$, and from the information given in this part this is $\beta_1^* = \frac{r^2}{w_1^*} = \frac{0.6^2}{250}$.

Finally, we need the average price of all homes in the dataset, $\bar{y}$. We aren't given this information directly, but we can use the fact that $(\bar{x}, \bar{y})$ is on the regression line that uses square footage to predict price to find $\bar{y}$. Specifically, we have that $\bar{y} = w_0^* + w_1^* \bar{x}$; we know that $w_0^* = 1000$, $\bar{x} = 2000$, and $w_1^* = 250$, so $\bar{y} = 1000 + 2000 \cdot 250 = 501000$.

Putting these pieces together, we have

$$
\begin{aligned}
\beta_0^* &= \bar{x} - \beta_1^* \bar{y} \\
&= 2000 - \frac{0.6^2}{250} \cdot 501000 \\
&= 2000 - 0.6^2 \cdot 2004 \\
&= 1278.56
\end{aligned}
$$



Problem 5

The mean of 12 non-negative numbers is 45. Suppose we remove 2 of these numbers. What is the largest possible value of the mean of the remaining 10 numbers? Show your work.

54.

To maximize the mean of the remaining 10 numbers, we want to minimize the numbers that are removed. The smallest possible non-negative number is 0, so to maximize the mean of the remaining 10, we should remove two 0s from the set of numbers. Recall that the sum of the 12-number set is $12 \cdot 45$; then, the maximum possible mean of the remaining 10 is

$$\frac{12 \cdot 45 - 2 \cdot 0}{10} = \frac{6}{5} \cdot 45 = 54$$


Problem 6

Let $R_{sq}(h)$ represent the mean squared error of a constant prediction $h$ for a given dataset. Find a dataset $\{y_1, y_2\}$ such that the graph of $R_{sq}(h)$ has its minimum at the point $(7, 16)$.

The dataset is $\{3, 11\}$.

We've already learned that $R_{sq}(h)$ is minimized at the mean of the data, and the minimum value of $R_{sq}(h)$ is the variance of the data. So we need to provide a dataset of two points with a mean of 7 and a variance of 16. Recall that the variance is the average squared distance of each data point to the mean. Since we want a variance of 16, we can make each point 4 units away from the mean. Therefore, our dataset can be $y_1 = 3, y_2 = 11$. In fact, this is the only solution.

A more computational approach uses the formulas for mean and variance and solves a system of two equations:

$$
\begin{aligned}
\frac{y_1+y_2}{2} &= 7 \\
\frac12 \left((y_1 - 7)^2 + (y_2 - 7)^2 \right) &= 16
\end{aligned}
$$
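A quick check of this answer (a sketch; the grid of candidate $h$ values is our own choice):

```python
import numpy as np

y = np.array([3.0, 11.0])
hs = np.linspace(0, 14, 1401)                            # step 0.01
risk = np.mean((y[:, None] - hs[None, :]) ** 2, axis=0)  # R_sq(h) for each h

print(hs[np.argmin(risk)], risk.min())                   # 7.0 16.0
```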


Problem 7

Consider a dataset $D$ with 5 data points $\{7, 5, 1, 2, a\}$, where $a$ is a positive real number. Note that $a$ is not necessarily an integer.


Problem 7.1

Express the mean of $D$ as a function of $a$, and simplify the expression as much as possible.

$$\text{Mean}(D) = \frac{a}{5} + 3$$


Problem 7.2

Depending on the range of $a$, the median of $D$ could assume one of three possible values. Write out all possible medians of $D$ along with the corresponding range of $a$ for each case. Express each range using double inequalities, e.g. $3 < a \leq 8$:

$$
\begin{cases}
\text{Median}(D) = 2 & \text{if } 0 < a \leq 2 \\
\text{Median}(D) = a & \text{if } 2 < a \leq 5 \\
\text{Median}(D) = 5 & \text{if } 5 < a \leq \infty
\end{cases}
$$


Problem 7.3

Determine the range of $a$ that satisfies $\text{Mean}(D) < \text{Median}(D)$. Make sure to show your work.

$$\frac{15}{4} < a < 10$$

Since there are 3 possible median values, we will have to discuss each situation separately.

In case 1, when $0 < a \leq 2$, $\text{Median}(D) = 2$. So, we have:

$$
\begin{aligned}
\text{Mean}(D) &< \text{Median}(D) \\
3 + \frac{a}{5} &< 2 \\
a &< -5
\end{aligned}
$$

But $a < -5$ is in conflict with the condition $0 < a \leq 2$, therefore there is no solution in this situation, and $\text{Median}(D) = 2$ is impossible.

In case 2, when $2 < a < 5$, $\text{Median}(D) = a$. So, we have:

$$
\begin{aligned}
\text{Mean}(D) &< \text{Median}(D) \\
3 + \frac{a}{5} &< a \\
3 &< \frac{4}{5} a \\
a &> \frac{15}{4}
\end{aligned}
$$

So $a$ has to be larger than $\frac{15}{4}$. But remember from the prerequisite condition that $2 < a < 5$.

To satisfy both conditions, we must have $\frac{15}{4} < a < 5$.

In case 3, when $a \geq 5$, $\text{Median}(D) = 5$. So, we have:

$$
\begin{aligned}
\text{Mean}(D) &< \text{Median}(D) \\
3 + \frac{a}{5} &< 5 \\
a &< 10
\end{aligned}
$$

Combining with the prerequisite condition, we have $5 \leq a < 10$.

Combining the ranges from all three cases, we have $\frac{15}{4} < a < 10$ as our final answer.
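The boundaries can also be checked numerically by scanning values of $a$ (a sketch; the grid resolution is arbitrary):

```python
import numpy as np

def mean_less_than_median(a):
    data = np.array([7, 5, 1, 2, a], dtype=float)
    return data.mean() < np.median(data)

a_grid = np.arange(0.01, 20, 0.01)
ok = np.array([mean_less_than_median(a) for a in a_grid])

print(a_grid[ok].min(), a_grid[ok].max())   # ≈ 3.76 and ≈ 9.99, matching 15/4 < a < 10
```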



Problem 8

Consider a dataset of $n$ integers, $y_1, y_2, ..., y_n$, whose histogram is given below:


Problem 8.1

Which of the following is closest to the constant prediction $h^*$ that minimizes:

$$\frac{1}{n} \sum_{i = 1}^n \begin{cases} 0 & y_i = h \\ 1 & y_i \neq h \end{cases}$$

30.

The minimizer of empirical risk for the constant model when using zero-one loss is the mode.


Problem 8.2

Which of the following is closest to the constant prediction $h^*$ that minimizes: $\frac{1}{n} \sum_{i = 1}^n |y_i - h|$

7.

The minimizer of empirical risk for the constant model when using absolute loss is the median. If the bar at 30 wasn't there, the median would be 6, but the existence of that bar drags the "halfway" point up slightly, to 7.


Problem 8.3

Which of the following is closest to the constant prediction $h^*$ that minimizes: $\frac{1}{n} \sum_{i = 1}^n (y_i - h)^2$

11.

The minimizer of empirical risk for the constant model when using squared loss is the mean. The mean is heavily influenced by the presence of outliers, of which there are many at 30, dragging the mean up to 11. While you can't calculate the mean here, given the large right tail, this question can be answered by understanding that the mean must be larger than the median, which is 7, and 11 is the next biggest option.


Problem 8.4

Which of the following is closest to the constant prediction $h^*$ that minimizes: $\lim_{p \rightarrow \infty} \frac{1}{n} \sum_{i = 1}^n |y_i - h|^p$

15.

The minimizer of empirical risk for the constant model when using infinity loss is the midrange, i.e. halfway between the min and max.



Problem 9

Suppose there is a dataset containing 10000 integers: 2500 values equal to 3, 2500 values equal to 5, 4500 values equal to 7, and 500 values equal to 9.


Problem 9.1

Calculate the median of this dataset.

6

We know there is an even number of integers in this dataset because $10000 \% 2 = 0$. We can find the middle of the dataset as follows: $\frac{10000}{2} = 5000$. This means the elements in the 5000th and 5001st positions give us our median. The element at the 5000th position is a 5 because $2500 + 2500 = 5000$. The element at the 5001st position is a 7 because the next number after 5 is 7. We can then plug 5 and 7 into the equation: $\frac{x_{5000} + x_{5001}}{2} = \frac{5 + 7}{2} = 6$


Problem 9.2

How does the mean of this dataset compare to its median?

The mean is smaller than the median.

We can calculate the mean as follows: $\frac{2500 \cdot 3 + 2500 \cdot 5 + 4500 \cdot 7 + 500 \cdot 9}{10000} = 5.6$. Using part (a), we know that $5.6 < 6$, which means the mean is smaller than the median.
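Reconstructing the dataset from the counts used above, numpy agrees (sketch):

```python
import numpy as np

# 2500 threes, 2500 fives, 4500 sevens, and 500 nines -- 10000 values in total.
data = np.repeat([3, 5, 7, 9], [2500, 2500, 4500, 500])

print(len(data))         # 10000
print(np.median(data))   # 6.0
print(data.mean())       # 5.6
```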



Problem 10

Define the extreme mean ($EM$) of a dataset to be the average of its largest and smallest values. Let $f(x) = -3x + 4$. Show that for any dataset $x_1 \leq x_2 \leq \dots \leq x_n$, $$EM(f(x_1), f(x_2), \dots, f(x_n)) = f(EM(x_1, x_2, \dots, x_n)).$$

This linear transformation reverses the order of the data because if $a < b$, then $-3a > -3b$ and so adding four to both sides gives $f(a) > f(b)$. Since $x_1 \leq x_2 \leq \dots \leq x_n$, this means that the smallest of $f(x_1), f(x_2), \dots, f(x_n)$ is $f(x_n)$ and the largest is $f(x_1)$. Therefore,

$$
\begin{aligned}
EM(f(x_1), f(x_2), \dots, f(x_n)) &= \dfrac{f(x_n) + f(x_1)}{2} \\
&= \dfrac{-3x_n + 4 - 3x_1 + 4}{2} \\
&= \dfrac{-3x_n - 3x_1}{2} + 4 \\
&= -3\left(\dfrac{x_1+x_n}{2}\right) + 4 \\
&= -3EM(x_1, x_2, \dots, x_n) + 4 \\
&= f(EM(x_1, x_2, \dots, x_n)).
\end{aligned}
$$


Problem 11

Consider a dataset of $n$ values, $y_1, y_2, ..., y_n$, all of which are non-negative. We're interested in fitting a constant model, $H(x) = h$, to the data, using the new "Wolverine" loss function:

$$L_\text{wolverine}(y_i, h) = w_i \left( y_i^2 - h^2 \right)^2$$

Here, $w_i$ corresponds to the "weight" assigned to the data point $y_i$, the idea being that different data points can be weighted differently when finding the optimal constant prediction, $h^*$.

For example, for the dataset $y_1 = 1, y_2 = 5, y_3 = 2$, we will end up with different values of $h^*$ when we use the weights $w_1 = w_2 = w_3 = 1$ and when we use weights $w_1 = 8, w_2 = 4, w_3 = 3$.


Problem 11.1

Find $\frac{\partial L_\text{wolverine}}{\partial h}$, the derivative of the Wolverine loss function with respect to $h$. Show your work.

$$\frac{\partial L}{\partial h} = -4w_i h(y_i^2 - h^2)$$

To solve this problem we simply take the derivative of $L_\text{wolverine}(y_i, h) = w_i( y_i^2 - h^2 )^2$.

We can use the chain rule to find the derivative. The chain rule is: $\frac{\partial}{\partial h}[f(g(h))] = f'(g(h))g'(h)$.

Note that $(y_i^2 - h^2)^2$ is the part of $L_\text{wolverine}(y_i, h) = w_i( y_i^2 - h^2 )^2$ we care about, because that is where $h$ appears. In this case $f(h) = h^2$ and $g(h) = y_i^2 - h^2$. We can then take the derivative of both to get $f'(h) = 2h$ and $g'(h) = -2h$.

This tells us the derivative is $\frac{\partial L}{\partial h} = w_i \cdot 2(y_i^2 - h^2) \cdot (-2h)$, which can be simplified to $\frac{\partial L}{\partial h} = -4w_i h(y_i^2 - h^2)$.


Problem 11.2

Prove that the constant prediction that minimizes average loss for the Wolverine loss function is:

$$h^* = \sqrt{\frac{\sum_{i = 1}^n w_i y_i^2}{\sum_{i = 1}^n w_i}}$$

The recipe for minimizing average loss is to find the derivative of the risk function, set it equal to zero, and solve for $h^*$.

We know that average loss follows the equation $R(L(y_i, h)) = \frac{1}{n} \sum_{i=1}^n L(y_i, h)$. This means that $R_\text{wolverine}(h) = \frac{1}{n} \sum_{i = 1}^n w_i (y_i^2 - h^2)^2$.

Recall we have already found the derivative of $L_\text{wolverine}(y_i, h) = w_i ( y_i^2 - h^2)^2$, and that $\frac{\partial R}{\partial h}(h) = \frac{1}{n} \sum_{i = 1}^n \frac{\partial L}{\partial h}(h)$. So we have $\frac{\partial}{\partial h} R_\text{wolverine}(h) = \frac{1}{n} \sum_{i = 1}^n -4hw_i(y_i^2 - h^2)$.

We can now do the last two steps:

$$
\begin{aligned}
0 &= \frac{1}{n} \sum_{i = 1}^n -4hw_i(y_i^2 - h^2) \\
0 &= \frac{-4h}{n} \sum_{i = 1}^n w_i(y_i^2 - h^2) \\
0 &= \sum_{i = 1}^n w_i(y_i^2 - h^2) \qquad \text{(dividing by } -\tfrac{4h}{n} \text{, assuming } h \neq 0\text{)} \\
0 &= \sum_{i = 1}^n w_iy_i^2 - w_ih^2 \\
0 &= \sum_{i = 1}^n w_iy_i^2 - \sum_{i = 1}^n w_ih^2 \\
\sum_{i = 1}^n w_ih^2 &= \sum_{i = 1}^n w_iy_i^2 \\
h^2\sum_{i = 1}^n w_i &= \sum_{i = 1}^n w_iy_i^2 \\
h^2 &= \frac{\sum_{i = 1}^n w_iy_i^2}{\sum_{i = 1}^n w_i} \\
h^* &= \sqrt{\frac{\sum_{i = 1}^n w_iy_i^2}{\sum_{i = 1}^n w_i}}
\end{aligned}
$$
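To double-check the closed form, we can compare it against a brute-force grid search on the example dataset and weights from the problem statement (a sketch; the grid is our own choice):

```python
import numpy as np

y = np.array([1.0, 5.0, 2.0])    # example dataset from the problem statement
w = np.array([8.0, 4.0, 3.0])    # example weights from the problem statement

hs = np.linspace(0.01, 10, 9991)                                        # step ≈ 0.001
risk = np.mean(w[:, None] * (y[:, None] ** 2 - hs[None, :] ** 2) ** 2,  # average loss
               axis=0)

h_grid = hs[np.argmin(risk)]
h_closed = np.sqrt(np.sum(w * y ** 2) / np.sum(w))

print(h_grid, h_closed)          # both ≈ 2.83
```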


Problem 11.3

For a dataset of non-negative values $y_1, y_2, ..., y_n$ with weights $w_1, 1, ..., 1$, evaluate: $\displaystyle \lim_{w_1 \rightarrow \infty} h^*$

$y_1$

Recall from part (b) that $h^* = \sqrt{\frac{\sum_{i = 1}^n w_i y_i^2}{\sum_{i = 1}^n w_i}}$.

The problem is asking us to evaluate $\displaystyle \lim_{w_1 \rightarrow \infty} \sqrt{\frac{\sum_{i = 1}^n w_i y_i^2}{\sum_{i = 1}^n w_i}}$.

With weights $w_1, 1, \ldots, 1$, we can rewrite this as $\displaystyle \lim_{w_1 \rightarrow \infty} \sqrt{\frac{w_1 y_1^2 + \sum_{i=2}^{n} y_i^2}{w_1 + (n-1)}}$. As $w_1 \rightarrow \infty$, the constants $\sum_{i=2}^{n} y_i^2$ and $n - 1$ become negligible compared to the terms involving $w_1$, so the expression behaves like $\sqrt{\frac{w_1 y_1^2}{w_1}}$. Cancelling the $w_1$'s gives $\sqrt{y_1^2}$, which is $y_1$ since the $y_i$ are non-negative.
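Numerically, with a small made-up dataset (our own, just for illustration), $h^*$ indeed approaches $y_1$ as $w_1$ grows:

```python
import numpy as np

y = np.array([4.0, 10.0, 6.0, 2.0])    # made-up non-negative data; here y_1 = 4

def h_star(w1):
    w = np.array([w1, 1.0, 1.0, 1.0])  # weights w_1, 1, ..., 1
    return np.sqrt(np.sum(w * y ** 2) / np.sum(w))

for w1 in [1, 100, 10_000, 1_000_000]:
    print(w1, h_star(w1))              # h* approaches y_1 = 4 as w_1 grows
```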



Problem 12

Suppose we're given a dataset of $n$ points, $(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)$, where $\bar{x}$ is the mean of $x_1, x_2, ..., x_n$ and $\bar{y}$ is the mean of $y_1, y_2, ..., y_n$.

Using this dataset, we create a transformed dataset of $n$ points, $(x_1', y_1'), (x_2', y_2'), ..., (x_n', y_n')$, where:

$$x_i' = 4x_i - 3 \qquad y_i' = y_i + 24$$

That is, the transformed dataset is of the form $(4x_1 - 3, y_1 + 24), ..., (4x_n - 3, y_n + 24)$.

We decide to fit a simple linear hypothesis function $H(x') = w_0 + w_1x'$ on the transformed dataset using squared loss. We find that $w_0^* = 7$ and $w_1^* = 2$, so $H^*(x') = 7 + 2x'$.


Problem 12.1

Suppose we were to fit a simple linear hypothesis function through the original dataset, $(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)$, again using squared loss. What would the optimal slope be?

8.

Relative to the dataset with $x'$, the dataset with $x$ has an $x$-variable that's "compressed" by a factor of 4, so the slope increases by a factor of 4 to $2 \cdot 4 = 8$.

Concretely, this can be shown by looking at the formula $2 = r\frac{SD(y')}{SD(x')}$, recognizing that $SD(y') = SD(y)$ since the $y$ values have the same spread in both datasets, and that $SD(x') = 4 SD(x)$.


Problem 12.2

Recall, the hypothesis function $H^*$ was fit on the transformed dataset, $(x_1', y_1'), (x_2', y_2'), ..., (x_n', y_n')$. $H^*$ happens to pass through the point $(\bar{x}, \bar{y})$. What is the value of $\bar{x}$? Give your answer as an integer with no variables.

5.

The key idea is that the regression line always passes through $(\text{mean } x, \text{mean } y)$ in the dataset we used to fit it. So, we know that $2 \bar{x'} + 7 = \bar{y'}$. This first equation can be rewritten as $2 \cdot (4\bar{x} - 3) + 7 = \bar{y} + 24$.

We're also told this line passes through $(\bar{x}, \bar{y})$, which means that it's also true that $2 \bar{x} + 7 = \bar{y}$.

Now we have a system of two equations:

$$\begin{cases} 2 \cdot (4\bar{x} - 3) + 7 = \bar{y} + 24 \\ 2 \bar{x} + 7 = \bar{y} \end{cases}$$

Substituting the second equation into the first gives $8\bar{x} + 1 = 2\bar{x} + 31$, so $6\bar{x} = 30$ and $\bar{x} = 5$.
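If you'd rather not solve the system by hand, sympy reproduces the same answer (a sketch):

```python
from sympy import Eq, solve, symbols

x_bar, y_bar = symbols('x_bar y_bar')

# H*(x') = 7 + 2x' passes through (mean of x', mean of y') and through (x_bar, y_bar).
eq1 = Eq(2 * (4 * x_bar - 3) + 7, y_bar + 24)
eq2 = Eq(2 * x_bar + 7, y_bar)

print(solve([eq1, eq2], [x_bar, y_bar]))   # {x_bar: 5, y_bar: 17}
```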



Problem 13

For a given dataset $\{y_1, y_2, \dots, y_n\}$, let $M_{abs}(h)$ represent the median absolute error of the constant prediction $h$ on that dataset (as opposed to the mean absolute error $R_{abs}(h)$).


Problem 13.1

For the dataset $\{4, 9, 10, 14, 15\}$, what is $M_{abs}(9)$?

5

The first step is to calculate the absolute errors ($|y_i - h|$).

$$
\begin{aligned}
\text{Absolute Errors} &= \{|4-9|, |9-9|, |10-9|, |14-9|, |15-9|\} \\
&= \{|-5|, |0|, |1|, |5|, |6|\} \\
&= \{5, 0, 1, 5, 6\}
\end{aligned}
$$

Now we have to order the values inside of the absolute errors: $\{0, 1, 5, 5, 6\}$. We can see the median is 5, so $M_{abs}(9) = 5$.


Problem 13.2

For the same dataset $\{4, 9, 10, 14, 15\}$, find another integer $h$ such that $M_{abs}(9) = M_{abs}(h)$.

5 or 15

Our goal is to find another number that will give us the same median of absolute errors as in part (a).

One way to do this is to plug in a number and guess. Another way is to notice that we can choose $h$ so that the absolute error at 10 (the middle element) equals 5, approaching from either direction (below or above 10), because of the absolute value.

We can solve the equation $|10 - h| = 5$ to get $h = 15$ or $h = 5$.

We can then test this by following the same steps as we did in part (a).

For $h = 15$:

$$
\begin{aligned}
\text{Absolute Errors} &= \{|4-15|, |9-15|, |10-15|, |14-15|, |15-15|\} \\
&= \{|-11|, |-6|, |-5|, |-1|, |0|\} \\
&= \{11, 6, 5, 1, 0\}
\end{aligned}
$$

Then we order the elements to get the absolute errors: $\{0, 1, 5, 6, 11\}$. We can see the median is 5, so $M_{abs}(15) = 5$.

For $h = 5$:

$$
\begin{aligned}
\text{Absolute Errors} &= \{|4-5|, |9-5|, |10-5|, |14-5|, |15-5|\} \\
&= \{|-1|, |4|, |5|, |9|, |10|\} \\
&= \{1, 4, 5, 9, 10\}
\end{aligned}
$$

We do not have to re-order the elements because they are already in order. We can see the median is 5, so $M_{abs}(5) = 5$.
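A quick check of all of these computations (a sketch):

```python
import numpy as np

y = np.array([4, 9, 10, 14, 15])

def m_abs(h):
    """Median absolute error of the constant prediction h on the dataset."""
    return np.median(np.abs(y - h))

print(m_abs(9), m_abs(5), m_abs(15))   # 5.0 5.0 5.0
```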


Problem 13.3

Based on your answers to parts (a) and (b), discuss in at most two sentences what is problematic about using the median absolute error to make predictions.

The numbers 5 and 15 are clearly bad predictions (close to the extreme values in the dataset), yet they are considered just as good a prediction by this metric as the number 9, which is roughly in the center of the dataset. Intuitively, 9 is a much better prediction, but this way of measuring the quality of a prediction does not recognize that.



Problem 14

Suppose we are given a dataset of points $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$ and for some reason, we want to make predictions using a prediction rule of the form $H(x) = 17 + w_1x$.


Problem 14.1

Write down an expression for the mean squared error of a prediction rule of this form, as a function of the parameter $w_1$.

$$MSE(w_1) = \dfrac1n \displaystyle\sum_{i=1}^n (y_i - (17 + w_1x_i))^2$$


Problem 14.2

Minimize the function $MSE(w_1)$ to find the parameter $w_1^*$ which defines the optimal prediction rule $H^*(x) = 17 + w_1^*x$. Show all your work and explain your steps.

Fill in your final answer below:

$$w_1^* = \dfrac{\displaystyle\sum_{i=1}^n x_i(y_i - 17)}{\displaystyle\sum_{i=1}^n x_i^2}$$

To minimize a function of one variable, we need to take the derivative, set it equal to zero, and solve.

$$
\begin{aligned}
MSE(w_1) &= \dfrac1n \displaystyle\sum_{i=1}^n (y_i - 17 - w_1x_i)^2 \\
MSE'(w_1) &= \dfrac1n \displaystyle\sum_{i=1}^n -2x_i(y_i - 17 - w_1x_i) \qquad \text{using the chain rule} \\
0 &= \dfrac1n \displaystyle\sum_{i=1}^n -2x_i(y_i - 17) + \dfrac1n \displaystyle\sum_{i=1}^n 2x_i^2w_1 \qquad \text{splitting up the sum} \\
0 &= \displaystyle\sum_{i=1}^n -x_i(y_i - 17) + \displaystyle\sum_{i=1}^n x_i^2w_1 \qquad \text{multiplying through by } \frac{n}{2} \\
w_1 \displaystyle\sum_{i=1}^n x_i^2 &= \displaystyle\sum_{i=1}^n x_i(y_i - 17) \qquad \text{rearranging terms and pulling out } w_1 \\
w_1 & = \dfrac{\displaystyle\sum_{i=1}^n x_i(y_i - 17)}{\displaystyle\sum_{i=1}^n x_i^2}
\end{aligned}
$$
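As a sanity check, the closed form matches a brute-force grid search on synthetic data (the data and grid below are our own, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 3, size=50)
y = 17 + 2.5 * x + rng.normal(0, 1, size=50)     # synthetic data

w1_closed = np.sum(x * (y - 17)) / np.sum(x ** 2)

# Grid search over the MSE of H(x) = 17 + w1 * x.
ws = np.linspace(0, 5, 50001)
mse = np.mean((y[:, None] - (17 + ws[None, :] * x[:, None])) ** 2, axis=0)

print(w1_closed, ws[np.argmin(mse)])             # both ≈ 2.5
```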


Problem 14.3

True or False: For an arbitrary dataset, the prediction rule $H^*(x) = 17 + w_1^*x$ goes through the point $(\bar x, \bar y)$.

False.

When we fit a prediction rule of the form $H(x) = w_0 + w_1x$ using simple linear regression, the formula for the intercept $w_0$ is designed to make sure the regression line passes through the point $(\bar x, \bar y)$. Here, we don't have the freedom to control our intercept, as it's forced to be 17. This means we can't guarantee that the prediction rule $H^*(x) = 17 + w_1^*x$ goes through the point $(\bar x, \bar y)$.

A simple example shows that this is the case. Consider the dataset $(-2, 0)$ and $(2, 0)$. The point $(\bar x, \bar y)$ is the origin, but the prediction rule $H^*(x)$ does not pass through the origin because it has an intercept of 17.


Problem 14.4

True or False: For an arbitrary dataset, the mean squared error associated with $H^*(x)$ is greater than or equal to the mean squared error associated with the regression line.

True.

The regression line is the prediction rule of the form $H(x) = w_0 + w_1x$ with the smallest mean squared error (MSE). $H^*(x)$ is one example of a prediction rule of that form, so unless it happens to be the regression line itself, the regression line will have a lower MSE, because it was designed to have the lowest possible MSE. This means the MSE associated with $H^*(x)$ is greater than or equal to the MSE associated with the regression line.



Problem 15

Suppose you have a dataset $\{(x_1, y_1), (x_2, y_2), \dots, (x_8, y_8)\}$ with $n = 8$ ordered pairs such that the variance of $\{x_1, x_2, \dots, x_8\}$ is 50. Let $m$ be the slope of the regression line fit to this data.

Suppose now we fit a regression line to the dataset $\{(x_1, y_2), (x_2, y_1), \dots, (x_8, y_8)\}$ where the first two $y$-values have been swapped. Let $m'$ be the slope of this new regression line.

If $x_1 = 3$, $y_1 = 7$, $x_2 = 8$, and $y_2 = 2$, what is the difference between the new slope and the old slope? That is, what is $m' - m$? The answer you get should be a number with no variables.

Hint: There are many equivalent formulas for the slope of the regression line. We recommend using the version of the formula without $\overline{y}$.

$$m' - m = \dfrac{1}{16}$$

Using the formula for the slope of the regression line, we have:

$$
\begin{aligned}
m &= \frac{\sum_{i=1}^n (x_i - \overline x)y_i}{\sum_{i=1}^n (x_i - \overline x)^2} \\
&= \frac{\sum_{i=1}^n (x_i - \overline x)y_i}{n\cdot \sigma_x^2} \\
&= \frac{(3-\bar{x})\cdot 7 + (8 - \bar{x})\cdot 2 + \sum_{i=3}^n (x_i - \overline x)y_i}{8\cdot 50}.
\end{aligned}
$$

Note that by switching the first two $y$-values, the terms in the sum from $i = 3$ to $n$, the number of data points $n$, and the variance of the $x$-values are all unchanged.

So the slope becomes:

$$
\begin{aligned}
m' &= \frac{(3-\bar{x})\cdot 2 + (8 - \bar{x})\cdot 7 + \sum_{i=3}^n (x_i - \overline x)y_i}{8\cdot 50}
\end{aligned}
$$

and the difference between these slopes is given by:

$$
\begin{aligned}
m'-m &= \frac{(3-\bar{x})\cdot 2 + (8 - \bar{x})\cdot 7 - ((3-\bar{x})\cdot 7 + (8 - \bar{x})\cdot 2)}{8\cdot 50} \\
&= \frac{(3-\bar{x})\cdot 2 + (8 - \bar{x})\cdot 7 - (3-\bar{x})\cdot 7 - (8 - \bar{x})\cdot 2}{8\cdot 50} \\
&= \frac{(3-\bar{x})\cdot (-5) + (8 - \bar{x})\cdot 5}{8\cdot 50} \\
&= \frac{-15 + 5\bar{x} + 40 - 5\bar{x}}{8\cdot 50} \\
&= \frac{25}{8\cdot 50} \\
&= \frac{1}{16}
\end{aligned}
$$
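The same cancellation can be checked numerically. The data below are made up (their $x$-variance is not 50), so the snippet verifies the general relationship $m' - m = \frac{(y_2 - y_1)(x_1 - x_2)}{n \sigma_x^2}$ implied by the algebra above, which reduces to $\frac{25}{8 \cdot 50} = \frac{1}{16}$ when the variance is 50:

```python
import numpy as np

x = np.array([3.0, 8.0, 1.0, 4.0, 6.0, 9.0, 2.0, 7.0])   # x_1 = 3, x_2 = 8
y = np.array([7.0, 2.0, 5.0, 3.0, 8.0, 6.0, 1.0, 4.0])   # y_1 = 7, y_2 = 2

m = np.polyfit(x, y, 1)[0]                  # original slope
y_swapped = y.copy()
y_swapped[[0, 1]] = y[[1, 0]]               # swap y_1 and y_2
m_prime = np.polyfit(x, y_swapped, 1)[0]    # new slope

predicted = (y[1] - y[0]) * (x[0] - x[1]) / (len(x) * np.var(x))
print(np.isclose(m_prime - m, predicted))   # True
```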


Problem 16

Note that we have two simplified closed form expressions for the estimated slope $w$ in simple linear regression that you have already seen in discussions and lectures:

$$
w = \frac{\sum_i (x_i - \overline{x}) y_i}{\sum_i (x_i - \overline{x})^2} \tag{1}
$$

$$
w = \frac{\sum_i (y_i - \overline{y}) x_i}{\sum_i (x_i - \overline{x})^2} \tag{2}
$$

where we have dataset $D = [(x_1,y_1), \ldots, (x_n,y_n)]$ and sample means $\overline{x} = \frac{1}{n} \sum_{i} x_i$, $\overline{y} = \frac{1}{n} \sum_{i} y_i$. Unless otherwise stated, $\sum_i$ means $\sum_{i=1}^n$.


Problem 16.1

Are (1) and (2) equivalent? That is, is the following equality true? Prove or disprove it. $$\sum_i (x_i - \overline{x}) y_i = \sum_i (y_i - \overline{y}) x_i$$

True.

$$
\begin{aligned}
& \sum_i (x_i - \overline{x}) y_i = \sum_i (y_i - \overline{y}) x_i \\
& \Leftrightarrow \sum_i x_i y_i - \overline{x} \sum_i y_i = \sum_i x_i y_i - \overline{y} \sum_i x_i \\
& \Leftrightarrow \overline{x} \sum_i y_i = \overline{y} \sum_i x_i \\
& \Leftrightarrow \frac{1}{n} \sum_i x_i \sum_i y_i = \frac{1}{n} \sum_i y_i \sum_i x_i
\end{aligned}
$$

The final statement is always true, so the original equality holds and (1) and (2) are equivalent.


Problem 16.2

True or False: If the dataset is shifted right by a constant distance $a$, that is, we have the new dataset $D_a = [(x_1 + a, y_1), \ldots, (x_n + a, y_n)]$, then the estimated slope $w$ will change.

False. By (1) in part (a), we can view $w$ as only being affected by $x_i - \overline{x}$, which is unchanged after shifting horizontally. Therefore, $w$ is unchanged.


Problem 16.3

True or False: If the dataset is shifted up by a constant distance $b$, that is, we have the new dataset $D_b = [(x_1, y_1 + b), \ldots, (x_n, y_n + b)]$, then the estimated slope $w$ will change.

False. By (2) in part (a), we can view $w$ as only being affected by $y_i - \overline{y}$, which is unchanged after shifting vertically. Therefore, $w$ is unchanged.
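Both invariances are easy to confirm numerically (a sketch with synthetic data and an arbitrary shift of 10):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=30)
y = 1.5 * x + rng.normal(scale=0.3, size=30)   # synthetic data

w = np.polyfit(x, y, 1)[0]
w_shift_right = np.polyfit(x + 10, y, 1)[0]    # D_a with a = 10
w_shift_up = np.polyfit(x, y + 10, 1)[0]       # D_b with b = 10

print(np.allclose([w, w], [w_shift_right, w_shift_up]))   # True: slope unchanged
```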



Problem 17

Consider a dataset that consists of $y_1, \cdots, y_n$. In class, we used calculus to minimize mean squared error, $R_{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (h - y_i)^2$. In this problem, we want you to apply the same approach to a slightly different loss function, defined below: $$L_{\text{midterm}}(y,h) = (\alpha y - h)^2 + \lambda h$$


Problem 17.1

Write down the empirical risk $R_{\text{midterm}}(h)$ by using the above loss function.

$$R_{\text{midterm}}(h) = \frac{1}{n}\sum_{i=1}^{n}\left[(\alpha y_i - h)^2 + \lambda h\right] = \left[\frac{1}{n}\sum_{i=1}^{n}(\alpha y_i - h)^2\right] + \lambda h$$


Problem 17.2

The mean of the dataset is $\bar{y}$, i.e. $\bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i$. Find $h^*$ that minimizes $R_{\text{midterm}}(h)$ using calculus. Your result should be in terms of $\bar{y}$, $\alpha$ and $\lambda$.

$$h^* = \alpha \bar{y} - \frac{\lambda}{2}$$

$$
\begin{aligned}
\frac{d}{dh}R_{\text{midterm}}(h) &= \left[\frac{2}{n}\sum_{i=1}^{n}(h - \alpha y_i)\right] + \lambda \\
&= 2h - 2\alpha \bar{y} + \lambda.
\end{aligned}
$$

By setting $\frac{d}{dh}R_{\text{midterm}}(h) = 0$ we get $2h^* - 2\alpha \bar{y} + \lambda = 0 \Rightarrow h^* = \alpha \bar{y} - \frac{\lambda}{2}$.
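A final numerical check of the result (the dataset, $\alpha$, $\lambda$, and the grid here are our own choices for illustration):

```python
import numpy as np

y = np.array([2.0, 4.0, 9.0, 5.0])     # made-up dataset, so y_bar = 5
alpha, lam = 3.0, 0.5                  # made-up values of alpha and lambda

hs = np.linspace(-5, 30, 350001)       # step 0.0001
risk = np.mean((alpha * y[:, None] - hs[None, :]) ** 2 + lam * hs[None, :], axis=0)

print(hs[np.argmin(risk)])             # ≈ 14.75
print(alpha * y.mean() - lam / 2)      # 3 * 5 - 0.25 = 14.75
```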