Final Review: Post-Midterm Content



The problems in this worksheet are taken from past exams in similar classes. Work on them on paper, since the exams you take in this course will also be on paper.

We encourage you to attempt these problems before Tuesday’s exam review session, so that we have enough time to walk through the solutions to all of the problems.

We will enable the solutions here after the review session, though you can find the written solutions to these problems in other discussion worksheets.


Problem 1

Consider a dataset of n values, y_1, y_2, ..., y_n, all of which are non-negative. We’re interested in fitting a constant model, H(x) = h, to the data, using the new “Wolverine” loss function:

L_\text{wolverine}(y_i, h) = w_i \left( y_i^2 - h^2 \right)^2

Here, w_i corresponds to the “weight” assigned to the data point y_i, the idea being that different data points can be weighted differently when finding the optimal constant prediction, h^*.

For example, for the dataset y_1 = 1, y_2 = 5, y_3 = 2, we will end up with different values of h^* when we use the weights w_1 = w_2 = w_3 = 1 and when we use weights w_1 = 8, w_2 = 4, w_3 = 3.


Problem 1.1

Find \frac{\partial L_\text{wolverine}}{\partial h}, the derivative of the Wolverine loss function with respect to h. Show your work.

\frac{\partial L}{\partial h} = -4w_ih(y_i^2 -h^2)

To solve this problem we simply take the derivative of L_\text{wolverine}(y_i, h) = w_i( y_i^2 - h^2 )^2.

We can use the chain rule to find the derivative. The chain rule is: \frac{\partial}{\partial h}[f(g(h))]=f'(g(h))g'(h).

Note that (y_i^2 -h^2)^2 is the piece of L_\text{wolverine}(y_i, h) = w_i( y_i^2 - h^2 )^2 we need the chain rule for, because that is where h appears. In this case, f(u) = u^2 and g(h) = y_i^2 - h^2, so f'(u) = 2u and g'(h) = -2h.

This tells us the derivative is: \frac{\partial L}{\partial h} = (w_i) * 2(y_i^2 -h^2) * (-2h), which can be simplified to \frac{\partial L}{\partial h} = -4w_ih(y_i^2 -h^2).
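As a quick sanity check on this derivative, we can compare it to a numerical (finite-difference) approximation. Below is a minimal sketch assuming NumPy; the data point, weight, and value of h are arbitrary:

import numpy as np

def wolverine_loss(y, h, w):
    # Wolverine loss for a single data point y with weight w.
    return w * (y ** 2 - h ** 2) ** 2

def wolverine_loss_derivative(y, h, w):
    # Our derivative: dL/dh = -4 w h (y^2 - h^2).
    return -4 * w * h * (y ** 2 - h ** 2)

# Compare the closed-form derivative to a centered finite difference.
y, w, h, eps = 5.0, 4.0, 2.0, 1e-6
numerical = (wolverine_loss(y, h + eps, w) - wolverine_loss(y, h - eps, w)) / (2 * eps)
print(numerical)                            # ≈ -672.0
print(wolverine_loss_derivative(y, h, w))   # -672.0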


Difficulty: ⭐️⭐️

The average score on this problem was 88%.


Problem 1.2

Prove that the constant prediction that minimizes average loss for the Wolverine loss function is:

h^* = \sqrt{\frac{\sum_{i = 1}^n w_i y_i^2}{\sum_{i = 1}^n w_i}}

The recipe for minimizing average loss is to find the derivative of the risk (average loss) function, set it equal to zero, and solve for h^*.

We know that average loss (empirical risk) follows the equation R(h) = \frac{1}{n} \sum_{i=1}^n L(y_i, h). This means that R_\text{wolverine}(h) = \frac{1}{n} \sum_{i = 1}^n w_i (y_i^2 - h^2)^2.

Recall that we have already found the derivative of L_\text{wolverine}(y_i, h) = w_i ( y_i^2 - h^2)^2, and that the derivative of a sum is the sum of the derivatives, so \frac{\partial R}{\partial h}(h) = \frac{1}{n} \sum_{i = 1}^n \frac{\partial L}{\partial h}(h). This gives us \frac{\partial R_\text{wolverine}}{\partial h}(h) = \frac{1}{n} \sum_{i = 1}^n -4hw_i(y_i^2 -h^2).

We can now do the last two steps. (Dividing through by \frac{-4h}{n} assumes h \neq 0; this is safe, since h = 0 only minimizes the risk in the trivial case where every y_i = 0.) \begin{align*} 0 &= \frac{1}{n} \sum_{i = 1}^n -4hw_i(y_i^2 -h^2)\\ 0&= \frac{-4h}{n} \sum_{i = 1}^n w_i(y_i^2 -h^2)\\ 0&= \sum_{i = 1}^n w_i(y_i^2 -h^2)\\ 0&= \sum_{i = 1}^n w_iy_i^2 -w_ih^2\\ 0&= \sum_{i = 1}^n w_iy_i^2 - \sum_{i = 1}^n w_ih^2\\ \sum_{i = 1}^n w_ih^2 &= \sum_{i = 1}^n w_iy_i^2\\ h^2\sum_{i = 1}^n w_i &= \sum_{i = 1}^n w_iy_i^2\\ h^2 &= \frac{\sum_{i = 1}^n w_iy_i^2}{\sum_{i = 1}^n w_i}\\ h^* &= \sqrt{\frac{\sum_{i = 1}^n w_iy_i^2}{\sum_{i = 1}^n w_i}} \end{align*}
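We can also verify the closed-form h^* numerically, by minimizing the average Wolverine loss with a brute-force optimizer and comparing. This is a minimal sketch assuming NumPy and SciPy, using the example dataset and weights from the problem statement:

import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1.0, 5.0, 2.0])
w = np.array([8.0, 4.0, 3.0])

def wolverine_risk(h):
    # Average Wolverine loss for a constant prediction h.
    return np.mean(w * (y ** 2 - h ** 2) ** 2)

# Closed-form optimal constant prediction from the derivation above.
h_star = np.sqrt(np.sum(w * y ** 2) / np.sum(w))

# Numerical minimization over a range that contains h_star.
result = minimize_scalar(wolverine_risk, bounds=(0, y.max()), method="bounded")

print(h_star)      # ≈ 2.8284
print(result.x)    # should match h_star closely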


Difficulty: ⭐️⭐️

The average score on this problem was 77%.


Problem 1.3

For a dataset of non-negative values y_1, y_2, ..., y_n with weights w_1, 1, ..., 1, evaluate: \displaystyle \lim_{w_1 \rightarrow \infty} h^*

y_1

Recall from part b h^* = \sqrt{\frac{\sum_{i = 1}^n w_i y_i^2}{\sum_{i = 1}^n w_i}}.

The problem is asking us \lim_{w_1 \rightarrow \infty} \sqrt{\frac{\sum_{i = 1}^n w_i y_i^2}{\sum_{i = 1}^n w_i}}.

We can separate out the terms involving w_1: \lim_{w_1 \rightarrow \infty} \sqrt{\frac{w_1 y_1^2 + \sum_{i=2}^{n}y_i^2}{w_1 + (n-1)}}. As w_1 \rightarrow \infty, the constants \sum_{i=2}^{n}y_i^2 and n-1 become negligible compared to the terms involving w_1, so the expression behaves like \sqrt{\frac{w_1y_1^2}{w_1}}. We can cancel out the w_1 to get \sqrt{y_1^2}, which becomes y_1 (since y_1 is non-negative).
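We can also see this limit numerically by making w_1 larger and larger while keeping the other weights at 1, and watching h^* approach y_1. A minimal sketch assuming NumPy; the data values are arbitrary:

import numpy as np

y = np.array([1.0, 5.0, 2.0])   # y_1 = 1

def h_star(w):
    return np.sqrt(np.sum(w * y ** 2) / np.sum(w))

for w1 in [1, 100, 10_000, 1_000_000]:
    w = np.array([w1, 1.0, 1.0])
    print(w1, h_star(w))   # approaches y_1 = 1 as w1 grows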


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 48%.



Problem 2

Suppose we want to fit a hypothesis function of the form:

H(x) = w_0 + w_1 x^2

Note that this is not the simple linear regression hypothesis function, H(x) = w_0 + w_1x.

To do so, we will find the optimal parameter vector \vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \end{bmatrix} that satisfies the normal equations. The first 5 rows of our dataset are as follows, though note that our dataset has n rows in total.

\begin{array}{|c|c|} \hline \textbf{x} & \textbf{y} \\ \hline 2 & 4 \\ \hline -1 & 4 \\ \hline 3 & 4 \\ \hline -7 & 4 \\ \hline 3 & 4 \\ \hline \end{array}

Suppose that x_1, x_2, ..., x_n have a mean of \bar{x} = 2 and a variance of \sigma_x^2 = 10.


Problem 2.1

Write out the first 5 rows of the design matrix, X.

X = \begin{bmatrix} 1 & 4 \\ 1 & 1 \\ 1 & 9 \\ 1 & 49 \\ 1 & 9 \end{bmatrix}

Recall our hypothesis function is H(x) = w_0 + w_1x^2. Since there is a w_0 present our X matrix should contain a column of ones. This means that our first column will be ones. Our second column should be x^2. This means we take each datapoint x and square it inside of X.
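In code, one way to build this design matrix is to stack a column of ones next to the squared x values. A minimal sketch assuming NumPy:

import numpy as np

x = np.array([2, -1, 3, -7, 3])

# Design matrix for H(x) = w_0 + w_1 x^2: a column of ones and a column of x^2.
X = np.column_stack([np.ones(len(x)), x ** 2])
print(X)
# [[ 1.  4.]
#  [ 1.  1.]
#  [ 1.  9.]
#  [ 1. 49.]
#  [ 1.  9.]]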


Difficulty: ⭐️⭐️

The average score on this problem was 84%.


Problem 2.2

Suppose, just in part (b), that after solving the normal equations, we find \vec{w}^* = \begin{bmatrix} 2 \\ -5 \end{bmatrix}. What is the predicted y value for x = 2? Give your answer as an integer with no variables. Show your work.

(2)(1)+(-5)(4)=-18

To find the predicted y value all you need to do is plug x = 2 into the hypothesis function H(x) = w_0 + w_1x^2, or take the dot product of \vec{w}^* with \begin{bmatrix}1 \\ 2^2\end{bmatrix}.

\begin{align*} &\begin{bmatrix} 2 \\ -5 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 4 \end{bmatrix}\\ &(2)(1)+(-5)(4)\\ &2 - 20\\ &-18 \end{align*}


Difficulty: ⭐️⭐️

The average score on this problem was 78%.


Problem 2.3

Let X_\text{tri} = 3 X. Using the fact that \sum_{i = 1}^n x_i^2 = n \sigma_x^2 + n \bar{x}^2, determine the value of the bottom-left value in the matrix X_\text{tri}^T X_\text{tri}, i.e. the value in the second row and first column. Give your answer as an expression involving n. Show your work.

126n

To figure out a pattern it can be easier to use variables instead of numbers. Like so:

X = \begin{bmatrix} 1 & x_1^2 \\ 1 & x_2^2 \\ \vdots & \vdots \\ 1 & x_n^2 \end{bmatrix}

We can now create X_{\text{tri}}:

X_{\text{tri}} = \begin{bmatrix} 3 & 3x_1^2 \\ 3 & 3x_2^2 \\ \vdots & \vdots \\ 3 & 3x_n^2 \end{bmatrix}

We want to know what the bottom left value of X_\text{tri}^T X_\text{tri} is. We figure this out with matrix multiplication!

\begin{align*} X_\text{tri}^T X_\text{tri} &= \begin{bmatrix} 3 & 3 & ... & 3\\ 3x_1^2 & 3x_2^2 & ... & 3x_n^2 \end{bmatrix} \begin{bmatrix} 3 & 3x_1^2 \\ 3 & 3x_2^2 \\ \vdots & \vdots \\ 3 & 3x_n^2 \end{bmatrix}\\ &= \begin{bmatrix} \sum_{i = 1}^n 3(3) & \sum_{i = 1}^n 3(3x_i^2) \\ \sum_{i = 1}^n 3(3x_i^2) & \sum_{i = 1}^n (3x_i^2)(3x_i^2)\end{bmatrix}\\ &= \begin{bmatrix} \sum_{i = 1}^n 9 & \sum_{i = 1}^n 9x_i^2 \\ \sum_{i = 1}^n 9x_i^2 & \sum_{i = 1}^n (3x_i^2)^2 \end{bmatrix} \end{align*}

We can see that the bottom left element should be \sum_{i = 1}^n 9x_i^2.

From here we can use the fact given to us in the directions: \sum_{i = 1}^n x_i^2 = n \sigma_x^2 + n \bar{x}^2.

\begin{align*} &\sum_{i = 1}^n 9x_i^2\\ &9\sum_{i = 1}^n x_i^2\\ &\text{Notice now we can replace $\sum_{i = 1}^n x_i^2$ with $n \sigma_x^2 + n \bar{x}^2$.}\\ &9(n \sigma_x^2 + n \bar{x}^2)\\ &\text{We know that $\sigma_x^2 = 10$ and $\bar x = 2$ from the directions before part a.}\\ &9(10n + 2^2n)\\ &9(10n + 4n)\\ &9(14n) = 126n \end{align*}
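We can also check the 126n result numerically by constructing a dataset whose mean is exactly 2 and whose variance is exactly 10, building X_\text{tri}, and inspecting the bottom-left entry of X_\text{tri}^T X_\text{tri}. A minimal sketch assuming NumPy; the particular dataset is arbitrary, since only its mean and variance matter:

import numpy as np

rng = np.random.default_rng(0)
n = 50

# Build x with mean exactly 2 and (population) variance exactly 10.
raw = rng.normal(size=n)
x = 2 + np.sqrt(10) * (raw - raw.mean()) / raw.std()

X = np.column_stack([np.ones(n), x ** 2])   # design matrix for H(x) = w_0 + w_1 x^2
X_tri = 3 * X

bottom_left = (X_tri.T @ X_tri)[1, 0]
print(bottom_left)   # ≈ 6300.0
print(126 * n)       # 6300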


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 39%.



Problem 3

Suppose we have one qualitative variable that we convert to numerical values using one-hot encoding. We’ve shown the first four rows of the resulting design matrix below:


Problem 3.1

Say we train a linear model m_1 on these data. Then, we replace all of the 1 values in column a with 3’s and all of the 1 values in column b with 2’s and train a new linear model m_2. Neither m_1 nor m_2 have an intercept term. On the training data, the average squared loss for m_1 will be ________ that of m_2.

Answers:

The answer is equal to.

Note that we can just re-scale our weights accordingly. Any model we can get with m_1 we can also get with m_2 (and vice versa).


Problem 3.2

To account for the intercept term, we add a column of all ones to our design matrix from part a. That is, the resulting design matrix has four columns: a with 3’s instead of 1’s, b with 2’s instead of 1’s, c, and a column of all ones. What is the rank of the new design matrix with these four columns?

Answers:

The answer is 3.

Note that column c = intercept column - \frac{1}{3}a - \frac{1}{2}b. Hence, there is a linear dependence relationship, meaning that one of the columns is redundant and that the rank of the new design matrix is 3.
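We can verify the rank-3 claim numerically by building a small version of this design matrix (the scaled one-hot columns plus a column of ones) and asking NumPy for its rank. A sketch with hypothetical rows; any assignment of rows to categories gives the same conclusion:

import numpy as np

# Columns: a (3's instead of 1's), b (2's instead of 1's), c, intercept.
# Each row has exactly one of a, b, c "turned on".
X = np.array([
    [3, 0, 0, 1],   # category a
    [0, 2, 0, 1],   # category b
    [0, 0, 1, 1],   # category c
    [3, 0, 0, 1],   # category a
    [0, 0, 1, 1],   # category c
])

print(np.linalg.matrix_rank(X))   # 3

# The linear dependence: c = intercept - (1/3) a - (1/2) b.
a, b, c, ones = X[:, 0], X[:, 1], X[:, 2], X[:, 3]
print(np.allclose(c, ones - a / 3 - b / 2))   # True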


Problem 3.3

Suppose we divide our sampling frame into three clusters of people, numbered 1, 2, and 3. After we survey people, along with our survey results, we save their cluster number as a new feature in our design matrix. Before training a model, what should we do with the cluster column? (Note: This part is independent of parts a and b.)

Answers:

The cluster number is a categorical variable, so it should be one-hot encoded.
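In code, one way to do this (a sketch assuming pandas; the cluster values are hypothetical) is:

import pandas as pd

df = pd.DataFrame({"cluster": [1, 3, 2, 1, 3]})

# One-hot encode the cluster number; treating it as a quantitative feature
# would wrongly impose an ordering and magnitude on the cluster labels.
one_hot = pd.get_dummies(df["cluster"], prefix="cluster")
print(one_hot)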



Problem 4

One piece of information that may be useful as a feature is the proportion of SAT test takers in a state in a given year that qualify for free lunches in school. The Series lunch_props contains 8 values, each of which is either "low", "medium", or "high". Since we can’t use strings as features in a model, we decide to encode these strings using the following Pipeline:

# Note: The FunctionTransformer is only needed to change the result
# of the OneHotEncoder from a "sparse" matrix to a regular matrix
# so that it can be used with StandardScaler;
# it doesn't change anything mathematically.
pl = Pipeline([
    ("ohe", OneHotEncoder(drop="first")),
    ("ft", FunctionTransformer(lambda X: X.toarray())),
    ("ss", StandardScaler())
])

After calling pl.fit(lunch_props), pl.transform(lunch_props) evaluates to the following array:

array([[ 1.29099445, -0.37796447],
       [-0.77459667, -0.37796447],
       [-0.77459667, -0.37796447],
       [-0.77459667,  2.64575131],
       [ 1.29099445, -0.37796447],
       [ 1.29099445, -0.37796447],
       [-0.77459667, -0.37796447],
       [-0.77459667, -0.37796447]])

and pl.named_steps["ohe"].get_feature_names() evaluates to the following array:

array(["x0_low", "x0_med"], dtype=object)

Fill in the blanks: Given the above information, we can conclude that lunch_props has (a) value(s) equal to "low", (b) value(s) equal to "medium", and (c) value(s) equal to "high". (Note: You should write one positive integer in each box such that the numbers add up to 8.)

What goes in the blanks?

Answer: 3, 1, 4

The first column of the transformed array corresponds to the standardized one-hot-encoded low column. There are 3 values that are positive, which means there are 3 values that were originally 1 in that column pre-standardization. This means that 3 of the values in lunch_props were originally "low".

The second column of the transformed array corresponds to the standardized one-hot-encoded med column. There is only 1 value in the transformed column that is positive, which means only 1 of the values in lunch_props was originally "medium".

The Series lunch_props has 8 values, 3 of which were identified as "low" in subpart 1, and 1 of which was identified as "medium" in subpart 2. The number of values being "high" must therefore be 8 - 3 - 1 = 4.
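As a sanity check, we can rebuild a lunch_props Series with 3 "low", 1 "medium", and 4 "high" values, run it through the same Pipeline, and confirm that the transformed array matches the one shown above. This is a sketch assuming scikit-learn and pandas; the particular ordering of the values below is an assumption chosen to be consistent with the array above:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Rows with 1.29099445 in the first column are "low", the row with 2.64575131
# in the second column is "medium", and the remaining rows are "high".
lunch_props = pd.Series(
    ["low", "high", "high", "medium", "low", "low", "high", "high"]
)

pl = Pipeline([
    ("ohe", OneHotEncoder(drop="first")),
    ("ft", FunctionTransformer(lambda X: X.toarray())),
    ("ss", StandardScaler()),
])

# OneHotEncoder expects 2D input, so we pass the Series as a one-column frame.
pl.fit(lunch_props.to_frame())
print(pl.transform(lunch_props.to_frame()))   # matches the array above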


Problem 5

Consider the least squares regression model, \vec{h} = X \vec{w}. Assume that X and \vec{h} refer to the design matrix and hypothesis vector for our training data, and \vec y is the true observation vector.

Let \vec{w}_\text{OLS}^* be the parameter vector that minimizes mean squared error without regularization. Specifically:

\vec{w}_\text{OLS}^* = \arg\underset{\vec{w}}{\min} \frac{1}{n} \| \vec{y} - X \vec{w} \|^2_2

Let \vec{w}_\text{ridge}^* be the parameter vector that minimizes mean squared error with L_2 regularization, using a non-negative regularization hyperparameter \lambda (i.e. ridge regression). Specifically:

\vec{w}_\text{ridge}^* = \arg\underset{\vec{w}}{\min} \frac{1}{n} \| \vec y - X \vec{w} \|^2_2 + \lambda \sum_{j=1}^{p} w_j^2

For each of the following problems, fill in the blank.


Problem 5.1

If we set \lambda = 0, then \Vert \vec{w}_\text{OLS}^* \Vert^2_2 is ____ \Vert \vec{w}_\text{ridge}^* \Vert^2_2

Answers:

equal to


Problem 5.2

For each of the remaining parts, you can assume that \lambda is set such that the predicted response vectors for our two models (\vec{h} = X \vec{w}_\text{OLS}^* and \vec{h} = X \vec{w}_\text{ridge}^*) are different.

The training MSE of the model \vec{h} = X \vec{w}_\text{OLS}^* is ____ that of the model \vec{h} = X \vec{w}_\text{ridge}^*.

Answers:

less than
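A quick empirical illustration of this on synthetic data (a sketch assuming scikit-learn and NumPy): since \vec{w}_\text{OLS}^* directly minimizes training MSE, its training MSE is never higher than the ridge model's.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=2.0, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # alpha plays the role of lambda

print(mean_squared_error(y, ols.predict(X)))     # training MSE of OLS (lower)
print(mean_squared_error(y, ridge.predict(X)))   # training MSE of ridge (higher)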


Problem 5.3

Now, assume we’ve fit both models using our training data, and evaluate both models on some unseen testing data.

The test MSE of the model \vec{h} = X \vec{w}_\text{OLS}^* is ____ that of the model \vec{h} = X \vec{w}_\text{ridge}^*.

Answers:

impossible to tell


Problem 5.4

Assume that our design matrix X contains a column of all ones. The sum of the residuals of our model \vec{h} = X \vec{w}_\text{ridge}^* ____.

Answers:

not necessarily equal to 0


Problem 5.5

As we increase \lambda, the bias of the model \vec{h} = X \vec{w}_\text{ridge}^* tends to ____.

Answers:

increase


Problem 5.6

As we increase \lambda, the model variance of the model \vec{h} = X \vec{w}_\text{ridge}^* tends to ____.

Answers:

decrease


Problem 5.7

As we increase \lambda, the observation variance of the model \vec{h} = X \vec{w}_\text{ridge}^* tends to ____.

Answers:

stay the same



Problem 6

Neerad wants to build a model that predicts the number of open rooms a hotel has, given various other features. He has a training set with 1200 rows available to him for the purposes of training his model.


Problem 6.1

Neerad fits a regression model using the GPTRegression class. GPTRegression models have several hyperparameters that can be tuned, including context_length and sentience.

To choose between 5 possible values of the hyperparameter context_length, Neerad performs k-fold cross-validation.

  1. How many total times is a GPTRegression model fit?
  2. Suppose that every time a GPTRegression model is fit, it appends the number of points in its training set to the list sizes. Note that after performing cross-validation, len(sizes) is equal to your answer to the previous subpart.

What is sum(sizes)?

Answers:

  1. 5k
  2. 6000(k-1)

When we do k-fold cross-validation for one single hyperparameter value, we split the dataset into k folds, and in each iteration i, train the model on the remaining k-1 folds and evaluate on fold i. Since every fold is left out and evaluated on once, the model is fit in total k times. We do this once for every hyperparameter value we want to test, so the total number of model fits required is 5k.

In part 2, we can note that each model fit is performed on the same size of data – the size of the remaining k-1 folds when we hold out a single fold. This size is 1 - \frac{1}{k} = \frac{k-1}{k} times the size of the entire dataset, in this case, 1200 \cdot \frac{k-1}{k}, and we fit a model on a dataset of this size 5k times. So, the sum of the training sizes for each model fit is:

5k \cdot \frac{k-1}{k} \cdot 1200 = 6000(k-1)
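GPTRegression is specific to this problem, but we can mimic the counting argument with any placeholder model. A sketch assuming scikit-learn's KFold: with 1200 rows, 5 candidate hyperparameter values, and k folds, we should see 5k fits and sum(sizes) = 6000(k - 1).

import numpy as np
from sklearn.model_selection import KFold

n, k = 1200, 4
X = np.arange(n).reshape(-1, 1)       # placeholder training set with 1200 rows
candidate_values = [1, 2, 4, 8, 16]   # 5 candidate values of context_length

sizes = []
for value in candidate_values:
    for train_idx, val_idx in KFold(n_splits=k).split(X):
        # A GPTRegression model would be fit on X[train_idx] here.
        sizes.append(len(train_idx))

print(len(sizes), 5 * k)            # 20 20
print(sum(sizes), 6000 * (k - 1))   # 18000 18000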


Problem 6.2

The average training error and validation error for all 5 candidate values of context_length are given below.

Fill in the blanks: As context_length increases, model complexity __(i)__. The optimal choice of context_length is __(ii)__; if we choose a context_length any higher than that, our model will __(iii)__.

  1. What goes in blank (i)?
  2. What goes in blank (ii)?
  3. What goes in blank (iii)?

Answers:

  1. decreases
  2. 100
  3. underfit the training data and have high bias

In part 1, we can see that as context_length increases, the training error increases, meaning the model fits the training data less well. In general, higher model complexity leads to lower training error, so here, increasing context_length must be reducing model complexity.

In part 2, we will choose a context_length of 100, since this parameterization leads to the best validation performance. If we increase context_length further, the validation error increases.

In part 3, since increased context_length means less complexity and worse training performance, increasing context_length further would lead to underfitting, as the model would lack the expressiveness (e.g. the number of parameters) required to capture the data. Since high training error reflects high model bias, and since high variance is associated with overfitting rather than underfitting, a further increase in context_length would mean a more biased model.



Problem 7

Suppose we want to use logistic regression to classify whether a person survived the sinking of the Titanic. The first 5 rows of our dataset are given below.

\begin{array}{|c|c|c|c|} \hline & \textbf{Age} & \textbf{Survived} & \textbf{Female} \\ \hline 0 & 22.0 & 0 & 0 \\ \hline 1 & 38.0 & 1 & 1 \\ \hline 2 & 26.0 & 1 & 1 \\ \hline 3 & 35.0 & 1 & 1 \\ \hline 4 & 35.0 & 0 & 0 \\ \hline \end{array}

Suppose after training our logistic regression model we get \vec{w}^* = \begin{bmatrix} -1.2 \\ -0.005 \\ 2.5 \end{bmatrix}, where -1.2 is an intercept term, -0.005 is the optimal parameter corresponding to passenger’s age, and 2.5 is the optimal parameter corresponding to sex (1 if female, 0 otherwise).


Problem 7.1

Consider Sı̄lānah Iskandar Nāsı̄f Abı̄ Dāghir Yazbak, a 20 year old female. What chance did she have to survive the sinking of the Titanic according to our model? Give your answer as a probability in terms of \sigma. If there is not enough information, write “not enough information.”

Answer: P(y = 1 | \text{age} = 20, \text{female} = 1) = \sigma(1.2)

Our augmented feature vector is of the form \text{Aug}(\vec{x}) = \begin{bmatrix} 1 \\ 20 \\ 1 \end{bmatrix}. Then \vec{w}^* \cdot \text{Aug}(\vec x) = 1(-1.2) + 20(-0.005) + 1(2.5) = 1.2, so:

P(y = 1 | \vec{x}) = \sigma \left( \vec{w}^* \cdot \text{Aug}(\vec x) \right) = \boxed{\sigma (1.2)}


Problem 7.2

Sı̄lānah Iskandar Nāsı̄f Abı̄ Dāghir Yazbak actually survived. What is the cross-entropy loss for our prediction in the previous part?

Answer: -\log (\sigma (1.2))

Here y_i=1 and p_i = \sigma (1.2). The formula for cross entropy loss is:

L_\text{ce}(y_i, p_i) = -y_i\log (p_i) - (1 - y_i)\log (1 - p_i) = \boxed{-\log (\sigma (1.2))}
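Numerically, the predicted probability and its cross-entropy loss work out as follows (a minimal sketch assuming NumPy):

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

w = np.array([-1.2, -0.005, 2.5])   # intercept, age, female
x = np.array([1, 20, 1])            # augmented feature vector: 20 year old female

p = sigmoid(w @ x)                  # sigma(1.2)
print(p)                            # ≈ 0.7685

# Cross-entropy loss when the true label is y = 1.
y = 1
loss = -y * np.log(p) - (1 - y) * np.log(1 - p)
print(loss)                         # ≈ 0.2633, i.e. -log(sigma(1.2))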


Problem 7.3

At what age would we predict that a female passenger is more likely to have survived the Titanic than not? In other words, at what age is the probability of survival for a female passenger greater than 0.5?

Hint: Since \sigma(0) = 0.5, we have that \sigma \left( \vec{w}^* \cdot \text{Aug}(\vec x) \right) = 0.5 \implies \vec{w}^* \cdot \text{Aug}(\vec x) = 0.

Answer: 260 years old

The probability that a female passenger of age a survives the Titanic is:

P(y = 1 | \text{age} = a, \text{female} = 1) = \sigma(-1.2 - 0.005 a + 2.5) = \sigma(1.3 - 0.005a)

In order for \sigma(1.3 - 0.005a) = 0.5, we need 1.3 - 0.005a = 0. This means that:

0.005a = 1.3 \implies a = \frac{1.3}{0.005} = 1.3 \cdot 200 = 260

Since the age coefficient is negative, the predicted probability of survival decreases with age: it is above 0.5 for any female passenger younger than 260 years old (i.e. every realistic passenger), and drops to exactly 0.5 at age 260. Note that \text{age} = 260 can be interpreted as a decision boundary; since we’ve fixed a value for the \text{female} feature, there’s only one remaining feature, which is \text{age}.
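To see the boundary numerically (a minimal sketch assuming NumPy), we can evaluate the predicted survival probability for a female passenger at a few ages on either side of 260:

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def survival_prob_female(age):
    # w* = [-1.2, -0.005, 2.5], with female = 1.
    return sigmoid(-1.2 - 0.005 * age + 2.5)

for age in [20, 100, 260, 300]:
    print(age, survival_prob_female(age))
# Ages below 260 give probabilities above 0.5; age 260 gives exactly 0.5.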


Problem 7.4

Let m be the odds of a given non-female passenger’s survival according to our logistic regression model, i.e., if the passenger had an 80% chance of survival, m would be 4, since their odds of survival are \frac{0.8}{0.2} = 4.

It turns out we can compute f, the odds of survival for a female passenger of the same age, in terms of m. Give an expression for f in terms of m.

Let p_m be the probability that the non-female passenger survives, and let p_f be the probability that the female passenger of the same age survives. Then, we have that:

p_m = \sigma(-1.2 - 0.005 \cdot \text{age} + 2.5 \cdot 0)

p_f = \sigma(-1.2 - 0.005 \cdot \text{age} + 2.5 \cdot 1)

Now, recall from Lecture 24 that:

  • If p_i is the probability of an event, then the odds of the event are \frac{p_i}{1 - p_i}.
  • If p_i = \sigma(t), then t = \sigma^{-1}(p_i) = \log \left( \frac{p_i}{1 - p_i} \right). In other words, the inverse of p_i = \sigma(t) is the log odds of p_i, i.e. \sigma^{-1}(p_i) = \log\left(\text{odds}(p_i)\right).

What does this all have to do with the question? Well, we can take the two equations at the start of the solution and apply \sigma^{-1} to both sides, yielding:

\sigma^{-1}(p_m) = -1.2 - 0.005 \cdot \text{age} + 2.5 \cdot 0

\sigma^{-1}(p_f) = -1.2 - 0.005 \cdot \text{age} + 2.5 \cdot 1

But, \sigma^{-1}(p_m) = \log \left( \text{odds}(p_m) \right) = \log(m) (using the definition in the problem) and \sigma^{-1}(p_f) = \log \left( \text{odds}(p_f) \right) = \log(f), so we have that:

\log(m) = -1.2 - 0.005 \cdot \text{age} + 2.5 \cdot 0

\log(f) = -1.2 - 0.005 \cdot \text{age} + 2.5 \cdot 1

Finally, if we exponentiate both sides (i.e. raise e to the power of each side), we’ll be able to directly write f in terms of m! Remember that e^{\log(m)} = m and e^{\log(f)} = f, since we’re using the natural logarithm. Then:

m = e^{-1.2 - 0.005 \cdot \text{age} + 2.5 \cdot 0}

f = e^{-1.2 - 0.005 \cdot \text{age} + 2.5 \cdot 1}

So, f in terms of m is:

\frac{f}{m} = \frac{e^{-1.2 - 0.005 \cdot \text{age} + 2.5 \cdot 1}}{e^{-1.2 - 0.005 \cdot \text{age} + 2.5 \cdot 0}} = e^{2.5}

Or, in other words:

\boxed{f = e^{2.5}m}
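We can confirm the e^{2.5} odds ratio numerically for any fixed age (a minimal sketch assuming NumPy; the age of 30 is arbitrary):

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

age = 30
p_m = sigmoid(-1.2 - 0.005 * age + 2.5 * 0)   # non-female passenger
p_f = sigmoid(-1.2 - 0.005 * age + 2.5 * 1)   # female passenger, same age

m = p_m / (1 - p_m)   # odds of survival, non-female
f = p_f / (1 - p_f)   # odds of survival, female

print(f / m)          # ≈ 12.18
print(np.exp(2.5))    # ≈ 12.18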



Problem 8

We decide to build a classifier that takes in a state’s demographic information and predicts whether, in a given year:


Problem 8.1

The simplest possible classifier we could build is one that predicts the same label (1 or 0) every time, independent of all other features.

Consider the following statement:

If a > b, then the constant classifier that maximizes training accuracy predicts 1 every time; otherwise, it predicts 0 every time.

For which combination of a and b is the above statement not guaranteed to be true?

Note: Treat sat as our training set.

Option 1:

a = (sat['Math'] > sat['Verbal']).mean()
b = 0.5

Option 2:

a = (sat['Math'] - sat['Verbal']).mean()
b = 0

Option 3:

a = (sat['Math'] - sat['Verbal'] > 0).mean()
b = 0.5

Option 4:

a = ((sat['Math'] / sat['Verbal']) > 1).mean() - 0.5
b = 0

Answer: Option 2

Conceptually, we’re looking for a combination of a and b such that when a > b, it’s true that in more than 50% of states, the "Math" value is larger than the "Verbal" value. Let’s look at all four options through this lens:

  • Option 1: sat['Math'] > sat['Verbal'] is a Series of Boolean values, containing True for all states where the "Math" value is larger than the "Verbal" value and False for all other states. The mean of this Series, then, is the proportion of states satisfying this criterion, and since b is 0.5, a > b is True only when more than 50% of states have a larger "Math" value than "Verbal" value.
  • Option 3 is the same as Option 1 – note that x > y is equivalent to x - y > 0.
  • Option 4: sat['Math'] / sat['Verbal'] is a Series that contains values greater than 1 whenever a state’s "Math" value is larger than its "Verbal" value and less than or equal to 1 in all other cases. As in the other options that work, (sat['Math'] / sat['Verbal']) > 1 is a Boolean Series with True for all states with a larger "Math" value than "Verbal" value; a > b compares the proportion of True values in this Series to 0.5. (Here, p - 0.5 > 0 is the same as p > 0.5.)

Then, by process of elimination, Option 2 must be the correct option – that is, it must be the only option that doesn’t work. But why? sat['Math'] - sat['Verbal'] is a Series containing the difference between each state’s "Math" and "Verbal" values, and .mean() computes the mean of these differences. The issue is that here, we don’t care about how different each state’s "Math" and "Verbal" values are; rather, we just care about the proportion of states with a bigger "Math" value than "Verbal" value. It could be the case that 90% of states have a larger "Math" value than "Verbal" value, but one state has such a big "Verbal" value that it makes the mean difference between "Math" and "Verbal" scores negative. (A property you’ll learn about in future probability courses is that this is equal to the difference in the mean "Math" value for all states and the mean "Verbal" value for all states – this is called the “linearity of expectation” – but you don’t need to know that to answer this question.)
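Here is a tiny, hypothetical example (a sketch assuming pandas) where Option 2 gives the wrong signal: most states have a larger "Math" value than "Verbal" value, but one extreme state drags the mean difference below 0.

import pandas as pd

# Hypothetical training set: 3 of 4 states have Math > Verbal, but one
# state's huge Verbal score makes the mean difference negative.
sat = pd.DataFrame({
    "Math":   [600, 610, 620, 400],
    "Verbal": [590, 600, 610, 800],
})

print((sat["Math"] > sat["Verbal"]).mean())   # 0.75  -> majority has Math > Verbal
print((sat["Math"] - sat["Verbal"]).mean())   # -92.5 -> Option 2's a < 0 anyway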


Problem 8.2

Suppose we train a classifier, named Classifier 1, and it achieves an accuracy of \frac{5}{9} on our training set.

Typically, root mean squared error (RMSE) is used as a performance metric for regression models, but mathematically, nothing is stopping us from using it as a performance metric for classification models as well.

What is the RMSE of Classifier 1 on our training set? Give your answer as a simplified fraction.

Answer: \frac{2}{3}

An accuracy of \frac{5}{9} means that out of every 9 predictions the model makes, 5 match the actual label. By extension, 4 out of every 9 predictions do not match the actual 0 or 1 label.

Remember, RMSE is defined as

\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^n (y_i - H(x_i))^2}

where y_i represents the ith actual value and H(x_i) represents the ith prediction. Here, y_i is either 0 or 1 and H(x_i) is also either 0 or 1. We’re told that \frac{5}{9} of the time, y_i and H(x_i) are the same; in those cases, (y_i - H(x_i))^2 = 0^2 = 0. We’re also told that \frac{4}{9} of the time, y_i and H(x_i) are different; in those cases, (y_i - H(x_i))^2 = 1. So,

\text{RMSE} = \sqrt{\frac{5}{9} \cdot 0 + \frac{4}{9} \cdot 1} = \sqrt{\frac{4}{9}} = \frac{2}{3}
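A quick numerical check with a hypothetical training set of 9 labels where exactly 5 predictions are correct (a minimal sketch assuming NumPy):

import numpy as np

y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1])   # correct on 5 of the 9 labels

accuracy = np.mean(y_true == y_pred)
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(accuracy)   # 0.5555... = 5/9
print(rmse)       # 0.6666... = 2/3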


Problem 8.3

While Classifier 1’s accuracy on our training set is \frac{5}{9}, its accuracy on our test set is \frac{1}{4}. Which of the following scenarios is most likely?

Answer: Option 2

Since the accuracy of Classifier 1 is much higher on the dataset used to train it than the dataset it was tested on, it’s likely Classifier 1 overfit to the training set because it was too complex. To fix the issue, we need to decrease its complexity, so that it focuses on learning the general structure of the data in the training set and not too much on the random noise in the training set.


For the remainder of this question, suppose we train another classifier, named Classifier 2, again on our training set. Its performance on the training set is described in the confusion matrix below. Note that the columns of the confusion matrix have been separately normalized so that each has a sum of 1.


Problem 8.4

Suppose conf is the DataFrame above. Which of the following evaluates to a Series of length 2 whose only unique value is the number 1?

Answer: Option 1

Note that the columns of conf sum to 1 – 0.9 + 0.1 = 1, and 0.4 + 0.6 = 1. To create a Series with just the value 1, then, we need to sum the columns of conf, which we can do using conf.sum(axis=0). conf.sum(axis=1) would sum the rows of conf.
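To make the axis argument concrete, here is the column-normalized confusion matrix as a DataFrame (a sketch assuming pandas; the row and column labels are assumptions consistent with the description above):

import pandas as pd

conf = pd.DataFrame(
    [[0.9, 0.4],
     [0.1, 0.6]],
    index=["Predicted 0", "Predicted 1"],
    columns=["Actually 0", "Actually 1"],
)

print(conf.sum(axis=0))   # sums each column: both entries are 1.0
print(conf.sum(axis=1))   # sums each row: 1.3 and 0.7, not all 1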


Problem 8.5

Fill in the blank: the ___ of Classifier 2 is guaranteed to be 0.6.

Answer: recall

The number 0.6 appears in the bottom-right corner of conf. Since conf is column-normalized, the value 0.6 represents the proportion of values in the second column that were predicted to be 1. The second column contains values that were actually 1, so 0.6 is really the proportion of values that were actually 1 that were predicted to be 1, that is, \frac{\text{actually 1 and predicted 1}}{\text{actually 1}}. This is the definition of recall!

If you’d like to think in terms of true positives, etc., then remember that:

  • True Positives (TP) are values that were actually 1 and were predicted to be 1.
  • True Negatives (TN) are values that were actually 0 and were predicted to be 0.
  • False Positives (FP) are values that were actually 0 and were predicted to be 1.
  • False Negatives (FN) are values that were actually 1 and were predicted to be 0.

Recall is \frac{\text{TP}}{\text{TP} + \text{FN}}.


For your convenience, we show the column-normalized confusion matrix from the previous page below. You will need to use the specific numbers in this matrix when answering the following subpart.


Problem 8.6

Suppose a fraction \alpha of the labels in the training set are actually 1 and the remaining 1 - \alpha are actually 0. The accuracy of Classifier 2 is 0.65. What is the value of \alpha?

Hint: If you’re unsure on how to proceed, here are some guiding questions:

Answer: \frac{5}{6}

Here is one way to solve this problem:

accuracy = \frac{TP + TN}{TP + TN + FP + FN}

Given the values from the confusion matrix:

accuracy = \frac{0.6 \cdot \alpha + 0.9 \cdot (1 - \alpha)}{\alpha + (1 - \alpha)}
accuracy = \frac{0.6 \cdot \alpha + 0.9 - 0.9 \cdot \alpha}{1}
accuracy = 0.9 - 0.3 \cdot \alpha

Therefore:

0.65 = 0.9 - 0.3 \cdot \alpha
0.3 \cdot \alpha = 0.9 - 0.65
0.3 \cdot \alpha = 0.25
\alpha = \frac{0.25}{0.3}
\alpha = \frac{5}{6}
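We can verify \alpha = \frac{5}{6} by plugging it back into the accuracy expression (a minimal sketch in plain Python):

# Column-normalized confusion matrix: 0.9 of actually-0 points and
# 0.6 of actually-1 points are classified correctly.
alpha = 5 / 6
accuracy = 0.6 * alpha + 0.9 * (1 - alpha)
print(accuracy)             # 0.65

# Or solve 0.65 = 0.9 - 0.3 * alpha directly:
print((0.9 - 0.65) / 0.3)   # 0.8333... = 5/6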



Problem 9

Let \vec{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}. Consider the function g(\vec{x}) = (x_1 - 3)^2 + (x_1^2 - x_2)^2.


Problem 9.1

Find \nabla g(\vec{x}), the gradient of g(\vec{x}), and use it to show that \nabla g\left( \begin{bmatrix} -1 \\ 1 \end{bmatrix} \right) = \begin{bmatrix} -8 \\ 0 \end{bmatrix}.

\nabla g(\vec{x}) = \begin{bmatrix} 2x_1 -6 + 4x_1(x_1^2 - x_2) \\ -2(x_1^2 - x_2) \end{bmatrix}

We can find \nabla g(\vec{x}) by finding the partial derivatives of g(\vec{x}):

\frac{\partial g}{\partial x_1} = 2(x_1 - 3) + 2(x_1^2 - x_2)(2 x_1)

\frac{\partial g}{\partial x_2} = 2(x_1^2 - x_2)(-1)

\nabla g(\vec{x}) = \begin{bmatrix} 2(x_1 - 3) + 2(x_1^2 - x_2)(2 x_1) \\ 2(x_1^2 - x_2)(-1) \end{bmatrix}

\nabla g\left(\begin{bmatrix} - 1 \\ 1 \end{bmatrix}\right) = \begin{bmatrix} 2(-1 - 3) + 2((-1)^2 - 1)(2(-1)) \\ 2((-1)^2 - 1)(-1) \end{bmatrix} = \begin{bmatrix} -8 \\ 0 \end{bmatrix}.


Problem 9.2

We’d like to find the vector \vec{x}^* that minimizes g(\vec{x}) using gradient descent. Perform one iteration of gradient descent by hand, using the initial guess \vec{x}^{(0)} = \begin{bmatrix} -1 \\ 1 \end{bmatrix} and the learning rate \alpha = \frac{1}{2}. In other words, what is \vec{x}^{(1)}?

\vec x^{(1)} = \begin{bmatrix} 3 \\ 1 \end{bmatrix}

Here’s the general form of gradient descent: \vec x^{(1)} = \vec{x}^{(0)} - \alpha \nabla g(\vec{x}^{(0)})

We can substitute \alpha = \frac{1}{2} and \vec{x}^{(0)} = \begin{bmatrix} -1 \\ 1 \end{bmatrix} to get:

\vec x^{(1)} = \begin{bmatrix} -1 \\ 1 \end{bmatrix} - \frac{1}{2} \nabla g(\vec x ^{(0)})

\vec{x}^{(1)} = \begin{bmatrix} -1 \\ 1 \end{bmatrix} - \frac{1}{2} \begin{bmatrix} -8 \\ 0 \end{bmatrix} = \begin{bmatrix} 3 \\ 1 \end{bmatrix}
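Both the gradient and the single gradient descent step are easy to check in code (a minimal sketch assuming NumPy):

import numpy as np

def grad_g(x):
    x1, x2 = x
    return np.array([
        2 * (x1 - 3) + 2 * (x1 ** 2 - x2) * (2 * x1),
        -2 * (x1 ** 2 - x2),
    ])

x0 = np.array([-1.0, 1.0])
alpha = 0.5

print(grad_g(x0))               # [-8.  0.]
print(x0 - alpha * grad_g(x0))  # [3. 1.]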


Problem 9.3

Consider the function f(t) = (t - 3)^2 + (t^2 - 1)^2. Select the true statement below.

f(t) is not convex, but has a global minimum.

f(t) isn’t convex, which can be verified using the second derivative test:

f'(t) = 2(t - 3) + 2(t^2 - 1)(2t) = 2t - 6 + 4t^3 - 4t = 4t^3 - 2t - 6

f''(t) = 12t^2 - 2

Since f''(t) < 0 for some values of t (e.g. t = 0), f(t) is not convex.

However, f(t) does have a global minimum: its output is never less than 0, because it can be expressed as the sum of two squares, (t - 3)^2 and (t^2 - 1)^2, both of which are greater than or equal to 0. Since f is also continuous and f(t) \to \infty as t \to \pm \infty, its minimum value is actually attained at some t, so the global minimum exists.
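A quick numerical check of both claims (a sketch assuming SciPy):

from scipy.optimize import minimize_scalar

def f(t):
    return (t - 3) ** 2 + (t ** 2 - 1) ** 2

def f_second(t):
    return 12 * t ** 2 - 2

print(f_second(0))   # -2 < 0, so f is not convex

# The global minimum exists; locate it numerically.
result = minimize_scalar(f, bounds=(-5, 5), method="bounded")
print(result.x, result.fun)   # minimum near t ≈ 1.3, with f(t) ≈ 3.4 > 0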


