Classifier Evaluation and Logistic Regression



The problems in this worksheet are taken from past exams in similar classes. Work on them on paper, since the exams you take in this course will also be on paper.

We encourage you to complete this worksheet in a live discussion section. Solutions will be made available after all discussion sections have concluded. You don’t need to submit your answers anywhere.

Note: We do not plan to cover all problems here in the live discussion section; the problems we don’t cover can be used for extra practice.


Problem 1

After fitting a BillyClassifier on our training set, we use it to make predictions on an unseen test set. Our results are summarized in the following confusion matrix.


Problem 1.1

What is the recall of our classifier? Give your answer as a fraction (it does not need to be simplified).

Answer: \frac{35}{57}

There are 105 true positives and 66 false negatives. Hence, the recall is \frac{105}{105 + 66} = \frac{105}{171} = \frac{35}{57}.


Problem 1.2

The accuracy of our classifier is \frac{69}{117}. How many true negatives did our classifier have? Give your answer as an integer.

Answer: 33

Let x be the number of true negatives. The number of correctly classified data points is 105 + x, and the total number of data points is 105 + 30 + 66 + x = 201 + x. Hence, this boils down to solving for x in \frac{69}{117} = \frac{105 + x}{201 + x}.

It may be tempting to cross-multiply here, but that’s not necessary (in fact, we picked the numbers specifically so you would not have to)! Multiply \frac{69}{117} by \frac{2}{2} to yield \frac{138}{234}. Then, conveniently, setting x = 33 in \frac{105 + x}{201 + x} also yields \frac{138}{234}, so x = 33 and hence the number of true negatives our classifier has is 33.
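
If you'd like to check this programmatically, here's a minimal sketch using Python's fractions module (the TP, FP, and FN counts are read off the confusion matrix above):

from fractions import Fraction

TP, FP, FN = 105, 30, 66              # counts from the confusion matrix
target_accuracy = Fraction(69, 117)

# Find the number of true negatives x that produces the stated accuracy.
for x in range(1000):
    if Fraction(TP + x, TP + FP + FN + x) == target_accuracy:
        print(x)                      # prints 33
        break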


Problem 1.3

True or False: In order for a binary classifier’s precision and recall to be equal, the number of mistakes it makes must be an even number.

Answer: True

Remember that \text{precision} = \frac{TP}{TP + FP} and \text{recall} = \frac{TP}{TP + FN}. In order for precision to be the same as recall, it must be the case that FP = FN, i.e. that our classifier makes the same number of false positives and false negatives. The only kinds of “errors” or “mistakes” a classifier can make are false positives and false negatives; thus, we must have

\text{mistakes} = FP + FN = FP + FP = 2 \cdot FP

2 times any integer must be an even integer, so the number of mistakes must be even.


Problem 1.4

Suppose we are building a classifier that listens to an audio source (say, from your phone’s microphone) and predicts whether or not it is Soulja Boy’s 2008 classic “Kiss Me thru the Phone”. Our classifier is pretty good at detecting when the input stream is “Kiss Me thru the Phone”, but it often incorrectly predicts that similar-sounding songs are also “Kiss Me thru the Phone”.

Complete the sentence: Our classifier has…

Answer: Option B: low precision and high recall.

Our classifier is good at identifying when the input stream is “Kiss Me thru the Phone”, i.e. it is good at identifying true positives amongst all positives. This means it has high recall.

Since our classifier makes many false positive predictions – in other words, it often incorrectly predicts “Kiss Me thru the Phone” when that’s not what the input stream is – it has many false positives, so its precision is low.

Thus, our classifier has low precision and high recall.



Problem 2

Suppose we want to use logistic regression to classify whether a person survived the sinking of the Titanic. The first 5 rows of our dataset are given below.

\begin{array}{|c|c|c|c|} \hline & \textbf{Age} & \textbf{Survived} & \textbf{Female} \\ \hline 0 & 22.0 & 0 & 0 \\ \hline 1 & 38.0 & 1 & 1 \\ \hline 2 & 26.0 & 1 & 1 \\ \hline 3 & 35.0 & 1 & 1 \\ \hline 4 & 35.0 & 0 & 0 \\ \hline \end{array}

Suppose after training our logistic regression model we get \vec{w}^* = \begin{bmatrix} -1.2 \\ -0.005 \\ 2.5 \end{bmatrix}, where -1.2 is an intercept term, -0.005 is the optimal parameter corresponding to passenger’s age, and 2.5 is the optimal parameter corresponding to sex (1 if female, 0 otherwise).


Problem 2.1

Consider Sı̄lānah Iskandar Nāsı̄f Abı̄ Dāghir Yazbak, a 20-year-old female. What chance did she have to survive the sinking of the Titanic according to our model? Give your answer as a probability in terms of \sigma. If there is not enough information, write “not enough information.”

Answer: P(y = 1 | \text{age} = 20, \text{female} = 1) = \sigma(1.2)

Our augmented feature vector is of the form \text{Aug}(\vec{x}) = \begin{bmatrix} 1 \\ 20 \\ 1 \end{bmatrix}. Then \vec{w}^* \cdot \text{Aug}(\vec x) = 1(-1.2) + 20(-0.005) + 1(2.5) = 1.2, so:

P(y = 1 | \vec{x}) = \sigma \left( \vec{w}^* \cdot \text{Aug}(\vec x) \right) = \boxed{\sigma (1.2)}
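
If you'd like to verify this numerically, here's a minimal sketch (the weight order matches the problem statement: intercept, age, female):

import numpy as np

def sigma(t):
    return 1 / (1 + np.exp(-t))

w_star = np.array([-1.2, -0.005, 2.5])    # [intercept, age, female]
x_aug = np.array([1, 20, 1])              # augmented features for a 20-year-old female

t = w_star @ x_aug                        # 1.2
print(sigma(t))                           # about 0.769, i.e. sigma(1.2)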


Problem 2.2

Sı̄lānah Iskandar Nāsı̄f Abı̄ Dāghir Yazbak actually survived. What is the cross-entropy loss for our prediction in the previous part?

Answer: -\log (\sigma (1.2))

Here y_i=1 and p_i = \sigma (1.2). The formula for cross entropy loss is:

L_\text{ce}(y_i, p_i) = -y_i\log (p_i) - (1 - y_i)\log (1 - p_i) = \boxed{-\log (\sigma (1.2))}
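
Continuing the sketch from the previous part, here's a quick numeric check of this loss:

import numpy as np

def sigma(t):
    return 1 / (1 + np.exp(-t))

y, p = 1, sigma(1.2)                             # true label and predicted probability
loss = -y * np.log(p) - (1 - y) * np.log(1 - p)
print(loss)                                      # about 0.263, i.e. -log(sigma(1.2))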


Problem 2.3

At what age would we predict that a female passenger is more likely to have survived the Titanic than not? In other words, at what age is the probability of survival for a female passenger greater than 0.5?

Hint: Since \sigma(0) = 0.5, we have that \sigma \left( \vec{w}^* \cdot \text{Aug}(\vec x) \right) = 0.5 \implies \vec{w}^* \cdot \text{Aug}(\vec x) = 0.

Answer: 260 years old

The probability that a female passenger of age a survives the Titanic is:

P(y = 1 | \text{age} = a, \text{female} = 1) = \sigma(-1.2 - 0.005 a + 2.5) = \sigma(1.3 - 0.005a)

In order for \sigma(1.3 - 0.005a) = 0.5, we need 1.3 - 0.005a = 0. This means that:

0.005a = 1.3 \implies a = \frac{1.3}{0.005} = 1.3 \cdot 200 = 260

So, a female passenger must be younger than 260 years old in order for us to predict that she is more likely to survive the Titanic than not (the age coefficient is negative, so the predicted survival probability decreases with age). Note that \text{age} = 260 can be interpreted as a decision boundary; since we’ve fixed a value for the \text{female} feature, there’s only one remaining feature, which is \text{age}.


Problem 2.4

Let m be the odds of a given non-female passenger’s survival according to our logistic regression model, i.e., if the passenger had an 80% chance of survival, m would be 4, since their odds of survival are \frac{0.8}{0.2} = 4.

It turns out we can compute f, the odds of survival for a female passenger of the same age, in terms of m. Give an expression for f in terms of m.

Answer: f = e^{2.5} m

Let p_m be the probability that the non-female passenger survives, and let p_f be the probability that the female passenger of the same age survives. Then, we have that:

p_m = \sigma(-1.2 - 0.005 \cdot \text{age} + 2.5 \cdot 0)

p_f = \sigma(-1.2 - 0.005 \cdot \text{age} + 2.5 \cdot 1)

Now, recall from Lecture 24 that:

  • If p_i is the probability of an event, then the odds of the event are \frac{p_i}{1 - p_i}.
  • If p_i = \sigma(t), then t = \sigma^{-1}(p_i) = \log \left( \frac{p_i}{1 - p_i} \right). In other words, the inverse of p_i = \sigma(t) is the log odds of p_i, i.e. \sigma^{-1}(p_i) = \log \left( \frac{p_i}{1 - p_i} \right).

What does this all have to do with the question? Well, we can take the two equations at the start of the solution and apply \sigma^{-1} to both sides, yielding:

\sigma^{-1}(p_m) = -1.2 - 0.005 \cdot \text{age} + 2.5 \cdot 0

\sigma^{-1}(p_f) = -1.2 - 0.005 \cdot \text{age} + 2.5 \cdot 1

But, \sigma^{-1}(p_m) = \log \left( \text{odds}(p_m) \right) = \log(m) (using the definition in the problem) and \sigma^{-1}(p_f) = \log \left( \text{odds}(p_f) \right) = \log(f), so we have that:

\log(m) = -1.2 - 0.005 \cdot \text{age} + 2.5 \cdot 0

\log(f) = -1.2 - 0.005 \cdot \text{age} + 2.5 \cdot 1

Finally, if we exponentiate both sides (that is, raise e to the power of each side), we’ll be able to write f directly in terms of m! Remember that e^{\log(m)} = m and e^{\log(f)} = f, assuming that we’re using the natural logarithm. Then:

m = e^{-1.2 - 0.005 \cdot \text{age} + 2.5 \cdot 0}

f = e^{-1.2 - 0.005 \cdot \text{age} + 2.5 \cdot 1}

So, f in terms of m is:

\frac{f}{m} = \frac{e^{-1.2 - 0.005 \cdot \text{age} + 2.5 \cdot 1}}{e^{-1.2 - 0.005 \cdot \text{age} + 2.5 \cdot 0}} = e^{2.5}

Or, in other words:

\boxed{f = e^{2.5}m}
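
To see this relationship concretely, here's a minimal sketch that computes both passengers' odds at an arbitrary age (30 is just a placeholder; the ratio doesn't depend on it):

import numpy as np

def sigma(t):
    return 1 / (1 + np.exp(-t))

def odds(p):
    return p / (1 - p)

age = 30                                       # arbitrary; the ratio below is the same for any age
p_m = sigma(-1.2 - 0.005 * age + 2.5 * 0)      # non-female passenger
p_f = sigma(-1.2 - 0.005 * age + 2.5 * 1)      # female passenger of the same age

print(odds(p_f) / odds(p_m))                   # about 12.18
print(np.exp(2.5))                             # also about 12.18, so f = e^{2.5} m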



Problem 3


Problem 3.1

For a given classifier, suppose the first 10 predictions of our classifier and 10 true observations are as follows: \begin{array}{|c|c|c|c|c|c|c|c|c|c|c|} \hline \textbf{Predictions} & 1 & 1 & 1 & 1 & 1 & 0 & 1 & 1 & 1 & 1 \\ \hline \textbf{True Label} & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\ \hline \end{array}

  1. What is the accuracy of our classifier on these 10 predictions?

  2. What is the precision on these 10 predictions?

  3. What is the recall on these 10 predictions?

Answer:

  1. \frac{7}{10}

  2. We have 6 true positives and 3 false positives, so the precision evaluates to: \frac{6}{6+3} = \frac{2}{3}

  3. From the solution to (2) above, we know there are 6 true positives. The number of false negatives is 0 (the only time we predicted 0, the true label actually was 0). Thus the recall is: \frac{6}{6+0} = 1
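
If you have scikit-learn handy, you can confirm all three numbers directly; a minimal sketch:

from sklearn.metrics import accuracy_score, precision_score, recall_score

pred = [1, 1, 1, 1, 1, 0, 1, 1, 1, 1]
true = [0, 1, 1, 1, 0, 0, 0, 1, 1, 1]

print(accuracy_score(true, pred))     # 0.7
print(precision_score(true, pred))    # 0.666...
print(recall_score(true, pred))       # 1.0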


Problem 3.2

Consider a model (e.g. the logistic regression model) that predicts p_i = P(y = 1 | \vec{x}_i), and then applies some threshold T to the outputted probability. That is, the model predicts either class 1, if p_i \geq T, or class 0, if p_i < T.

In general, if we increase the threshold T, which of the following can happen to our precision, recall, and accuracy? Select all that apply.

Answer:

All except “Recall can increase” are correct.

As we increase our classification threshold, the number of false positives decreases, but the number of false negatives (i.e. undetected points) increases. As a result, our precision typically increases (more of the points we say are positive will actually be positive), but our recall decreases (there will be more points that are actually positive that we don’t detect). However, in some cases precision can also decrease: this happens when increasing the threshold lowers the number of true positives without changing the number of false positives. As seen in lecture, accuracy may increase or decrease – there typically exists an optimal threshold that maximizes accuracy, and if we increase or decrease our threshold from that point, accuracy decreases.
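
One way to build intuition is to sweep the threshold over a small set of predicted probabilities and watch what happens to each metric. Here's a minimal sketch; the probabilities and labels below are made up purely for illustration:

import numpy as np

probs = np.array([0.1, 0.3, 0.4, 0.55, 0.6, 0.8, 0.9])    # hypothetical predicted probabilities
labels = np.array([0, 0, 1, 0, 1, 1, 1])                  # hypothetical true labels

for T in [0.2, 0.5, 0.7, 0.85]:
    pred = (probs >= T).astype(int)
    tp = int(np.sum((pred == 1) & (labels == 1)))
    fp = int(np.sum((pred == 1) & (labels == 0)))
    fn = int(np.sum((pred == 0) & (labels == 1)))
    precision = tp / (tp + fp) if tp + fp > 0 else float('nan')
    recall = tp / (tp + fn)
    accuracy = np.mean(pred == labels)
    print(f"T={T}: precision={precision:.2f}, recall={recall:.2f}, accuracy={accuracy:.2f}")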



Problem 4

We decide to build a classifier that takes in a state’s demographic information and predicts whether, in a given year:


Problem 4.1

The simplest possible classifier we could build is one that predicts the same label (1 or 0) every time, independent of all other features.

Consider the following statement:

If a > b, then the constant classifier that maximizes training accuracy predicts 1 every time; otherwise, it predicts 0 every time.

For which combination of a and b is the above statement not guaranteed to be true?

Note: Treat sat as our training set.

Option 1:

a = (sat['Math'] > sat['Verbal']).mean()
b = 0.5

Option 2:

a = (sat['Math'] - sat['Verbal']).mean()
b = 0

Option 3:

a = (sat['Math'] - sat['Verbal'] > 0).mean()
b = 0.5

Option 4:

a = ((sat['Math'] / sat['Verbal']) > 1).mean() - 0.5
b = 0

Answer: Option 2

Conceptually, we’re looking for a combination of a and b such that when a > b, it’s true that in more than 50% of states, the "Math" value is larger than the "Verbal" value. Let’s look at all four options through this lens:

  • Option 1: sat['Math'] > sat['Verbal'] is a Series of Boolean values, containing True for all states where the "Math" value is larger than the "Verbal" value and False for all other states. The mean of this Series, then, is the proportion of states satisfying this criterion, and since b is 0.5, a > b is True only when the condition above is True.
  • Option 3 is the same as Option 1 – note that x > y is equivalent to x - y > 0.
  • Option 4: sat['Math'] / sat['Verbal'] is a Series that contains values greater than 1 whenever a state’s "Math" value is larger than its "Verbal" value and less than or equal to 1 in all other cases. As in the other options that work, (sat['Math'] / sat['Verbal']) > 1 is a Boolean Series with True for all states with a larger "Math" value than "Verbal" value; a > b compares the proportion of True values in this Series to 0.5. (Here, p - 0.5 > 0 is the same as p > 0.5.)

Then, by process of elimination, Option 2 must be the correct option – that is, it must be the only option that doesn’t work. But why? sat['Math'] - sat['Verbal'] is a Series containing the difference between each state’s "Math" and "Verbal" values, and .mean() computes the mean of these differences. The issue is that here, we don’t care about how different each state’s "Math" and "Verbal" values are; rather, we just care about the proportion of states with a bigger "Math" value than "Verbal" value. It could be the case that 90% of states have a larger "Math" value than "Verbal" value, but one state has such a big "Verbal" value that it makes the mean difference between "Math" and "Verbal" scores negative. (A property you’ll learn about in future probability courses is that this is equal to the difference in the mean "Math" value for all states and the mean "Verbal" value for all states – this is called the “linearity of expectation” – but you don’t need to know that to answer this question.)
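
Here's a minimal sketch of such a counterexample, using a hypothetical sat DataFrame: four of the five states have a larger "Math" value, so the constant classifier that predicts 1 maximizes training accuracy, yet Option 2's a is negative:

import pandas as pd

# Hypothetical data: 4 of 5 states have Math > Verbal, but one state's huge
# Verbal score drags the mean difference below zero.
sat = pd.DataFrame({
    'Math':   [510, 520, 530, 540, 400],
    'Verbal': [500, 510, 520, 530, 700],
})

print((sat['Math'] > sat['Verbal']).mean())    # 0.8   -> Option 1's a, correctly above 0.5
print((sat['Math'] - sat['Verbal']).mean())    # -52.0 -> Option 2's a, misleadingly below 0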


Problem 4.2

Suppose we train a classifier, named Classifier 1, and it achieves an accuracy of \frac{5}{9} on our training set.

Typically, root mean squared error (RMSE) is used as a performance metric for regression models, but mathematically, nothing is stopping us from using it as a performance metric for classification models as well.

What is the RMSE of Classifier 1 on our training set? Give your answer as a simplified fraction.

Answer: \frac{2}{3}

An accuracy of \frac{5}{9} means that out of every 9 values, the model labels 5 correctly. By extension, it labels the remaining 4 out of 9 incorrectly.

Remember, RMSE is defined as

\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^n (y_i - H(x_i))^2}

where y_i represents the ith actual value and H(x_i) represents the ith prediction. Here, y_i is either 0 or 1 and H(x_i) is also either 0 or 1. We’re told that \frac{5}{9} of the time, y_i and H(x_i) are the same; in those cases, (y_i - H(x_i))^2 = 0^2 = 0. We’re also told that \frac{4}{9} of the time, y_i and H(x_i) are different; in those cases, (y_i - H(x_i))^2 = 1. So,

\text{RMSE} = \sqrt{\frac{5}{9} \cdot 0 + \frac{4}{9} \cdot 1} = \sqrt{\frac{4}{9}} = \frac{2}{3}
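
More generally, when both y_i and H(x_i) are 0 or 1, (y_i - H(x_i))^2 is 1 exactly when a prediction is wrong, so \text{RMSE} = \sqrt{1 - \text{accuracy}}. Here's a tiny check with hypothetical labels chosen to give an accuracy of \frac{5}{9}:

import numpy as np

y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 0, 1, 1, 1, 0])    # hypothetical predictions; 5 of 9 are correct

accuracy = np.mean(y_true == y_pred)              # 0.555...
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(accuracy, rmse, 2 / 3)                      # rmse matches 2/3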


Problem 4.3

While Classifier 1’s accuracy on our training set is \frac{5}{9}, its accuracy on our test set is \frac{1}{4}. Which of the following scenarios is most likely?

Answer: Option 2

Since the accuracy of Classifier 1 is much higher on the dataset used to train it than on the dataset it was tested on, it’s likely that Classifier 1 overfit to the training set because it was too complex. To fix the issue, we need to decrease its complexity, so that it focuses on learning the general structure of the data in the training set rather than the random noise in the training set.


For the remainder of this question, suppose we train another classifier, named Classifier 2, again on our training set. Its performance on the training set is described in the confusion matrix below. Note that the columns of the confusion matrix have been separately normalized so that each has a sum of 1.


Problem 4.4

Suppose conf is the DataFrame above. Which of the following evaluates to a Series of length 2 whose only unique value is the number 1?

Answer: Option 1

Note that the columns of conf sum to 1 – 0.9 + 0.1 = 1, and 0.4 + 0.6 = 1. To create a Series with just the value 1, then, we need to sum the columns of conf, which we can do using conf.sum(axis=0). conf.sum(axis=1) would sum the rows of conf.
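
Here's a minimal sketch, assuming conf stores predicted labels along the rows and actual labels along the columns, with the values from the matrix above:

import pandas as pd

# Column-normalized confusion matrix: rows = predicted label, columns = actual label.
conf = pd.DataFrame([[0.9, 0.4],
                     [0.1, 0.6]],
                    index=['Predicted 0', 'Predicted 1'],
                    columns=['Actually 0', 'Actually 1'])

print(conf.sum(axis=0))    # both entries are 1 -> a Series of length 2 whose only unique value is 1
print(conf.sum(axis=1))    # 1.3 and 0.7 -> row sums, not all 1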


Problem 4.5

Fill in the blank: the ___ of Classifier 2 is guaranteed to be 0.6.

Answer: recall

The number 0.6 appears in the bottom-right corner of conf. Since conf is column-normalized, the value 0.6 represents the proportion of values in the second column that were predicted to be 1. The second column contains values that were actually 1, so 0.6 is really the proportion of values that were actually 1 that were predicted to be 1, that is, \frac{\text{actually 1 and predicted 1}}{\text{actually 1}}. This is the definition of recall!

If you’d like to think in terms of true positives, etc., then remember that:

  • True Positives (TP) are values that were actually 1 and were predicted to be 1.
  • True Negatives (TN) are values that were actually 0 and were predicted to be 0.
  • False Positives (FP) are values that were actually 0 and were predicted to be 1.
  • False Negatives (FN) are values that were actually 1 and were predicted to be 0.

Recall is \frac{\text{TP}}{\text{TP} + \text{FN}}.


For your convenience, we show the column-normalized confusion matrix from the previous page below. You will need to use the specific numbers in this matrix when answering the following subpart.


Problem 4.6

Suppose a fraction \alpha of the labels in the training set are actually 1 and the remaining 1 - \alpha are actually 0. The accuracy of Classifier 2 is 0.65. What is the value of \alpha?

Hint: If you’re unsure on how to proceed, here are some guiding questions:

Answer: \frac{5}{6}

Here is one way to solve this problem:

accuracy = \frac{TP + TN}{TP + TN + FP + FN}

Given the values from the confusion matrix:

accuracy = \frac{0.6 \cdot \alpha + 0.9 \cdot (1 - \alpha)}{\alpha + (1 - \alpha)}
accuracy = \frac{0.6 \cdot \alpha + 0.9 - 0.9 \cdot \alpha}{1}
accuracy = 0.9 - 0.3 \cdot \alpha

Therefore:

0.65 = 0.9 - 0.3 \cdot \alpha
0.3 \cdot \alpha = 0.9 - 0.65
0.3 \cdot \alpha = 0.25
\alpha = \frac{0.25}{0.3}
\alpha = \frac{5}{6}
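
As a quick double-check, here's a minimal sketch using sympy to solve the same equation symbolically:

from sympy import Rational, solve, symbols

alpha = symbols('alpha')
accuracy = Rational(6, 10) * alpha + Rational(9, 10) * (1 - alpha)
print(solve(accuracy - Rational(65, 100), alpha))    # [5/6]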



Problem 5

Suppose Yutong builds a classifier that predicts whether or not a hotel provides free parking. The confusion matrix for her classifier, when evaluated on our training set, is given below.


Problem 5.1

What is the precision of Yutong’s classifier? Give your answer as a simplified fraction.

Answer: \frac{8}{13}

Precision is the proportion of predicted positives that actually were positives. So, given this confusion matrix, that value is \frac{8}{8 + 5}, or \frac{8}{13}.


Problem 5.2

Fill in the blanks: In order for Yutong’s classifier’s recall to be equal to its precision, __(i)__ must be equal to __(ii)__.

  1. What goes in blank (i)?
  2. What goes in blank (ii)?

Answer:

  1. B
  2. 5

We already know that the precision is \frac{8}{13}. Recall is the proportion of actual positives that were correctly classified as positive, which in this matrix is \frac{8}{B + 8}. So, in order for precision to equal recall, B must be 5.


Problem 5.3

Now, suppose both A and B are unknown. Fill in the blanks: In order for Yutong’s classifier’s recall to be equal to its accuracy, __(i)__ must be equal to __(ii)__.

  1. What goes in blank (i)?
  2. What goes in blank (ii)?

Hint: To verify your answer, pick an arbitrary value of A, like A = 10, and solve for the B that sets the model’s recall equal to its accuracy. Do the specific A and B you find satisfy your answer above?

Answer:

  1. A \cdot B
  2. 40

We can solve this problem by simply stating recall and accuracy in terms of the values in the confusion matrix. As we already found, recall is \frac{8}{B+8}. Accuracy is the sum of correct predictions over total number of predictions, or \frac{A + 8}{A + B + 13}. Then, we simply set these equal to each other, and solve.

\frac{8}{B+8} = \frac{A + 8}{A + B + 13}
8(A + B + 13) = (A + 8)(B + 8)
8A + 8B + 104 = AB + 8A + 8B + 64
104 = AB + 64
AB = 40
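
Following the hint: with A = 10, the condition AB = 40 gives B = 4, and indeed recall equals accuracy for that pair (a minimal sketch):

from fractions import Fraction

A, B = 10, 4                                  # A * B = 40, per the answer above
recall = Fraction(8, B + 8)                   # 8/12 = 2/3
accuracy = Fraction(A + 8, A + B + 13)        # 18/27 = 2/3
print(recall, accuracy, recall == accuracy)   # 2/3 2/3 True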



Problem 6

Suppose you fit four different models to predict whether someone has an income greater than $100,000 a year using their purchase history. You split the data into a training and test set and use 3-fold cross-validation. The table below shows all the calculated accuracies for each model (higher accuracy is better).


Problem 6.1

Which model has the lowest model bias?

Answer: Model D



Problem 6.2

Which model most severely underfits the data?

Answer: Model A



Problem 6.3

Which model most severely overfits the data?

Answer: Model D



Problem 6.4

Which model should you pick overall?

Answer: Model B



Problem 7

Consider three classifiers with the following confusion matrices:


Problem 7.1

Which model has the highest accuracy?

Answer: Model B

Accuracy is defined as \frac{\text{number of correct predictions}}{\text{number of total predictions}}, so we sum the (Yes, Yes) and (No, No) cells to get the number of correct predictions and divide that by the sum of all cells as the number of total predictions. We see that Model B has the highest accuracy of 0.9 with that formula. (Note that for simplicity, the confusion matrices are such that the sum of all values is 100 in all three cases.)


Problem 7.2

Which model has the highest precision?

Answer: Model C

Precision is defined as \frac{\text{number of correctly predicted yes values}}{\text{total number of yes predictions}}, so the number of correctly predicted yes values is the (Yes, Yes) cell, while the total number of yes predictions is the sum of the Yes column. We see that Model C has the highest precision, \frac{80}{85}, with that formula.


Problem 7.3

Which model has the highest recall?

Answer: Model B

Recall is defined as \frac{\text{number of correctly predicted yes values}}{\text{total number of values actually yes}}, so the number of correctly predicted yes values is the (Yes, Yes) cell, while the total number of values that are actually yes is the sum of the Yes row. We see that Model B has the highest recall, 1, with that formula.



Problem 8

Suppose you fit a logistic regression classifier. The classifier’s predictions on a test set of 5 points are shown below, along with the actual labels.

Recall that for logistic regression, we must also choose a threshold \tau to convert the predicted probabilities to predicted labels. For this question, assume that 0 < \tau < 1. Precision is undefined when the classifier doesn’t make any positive predictions (since \frac{0}{0} is undefined). For each question, show your work and draw a box around your final answer in the space provided. Each of your final answers should be a single number.


Problem 8.1

What is the lowest possible precision for any threshold \tau?

Answer: \frac{3}{5}

The lowest precision happens when \tau is less than 0.3. In this case, the classifier predicts that all points are 1, which gives a precision of \frac{3}{5}.



Problem 8.2

What is the lowest possible recall for any threshold \tau?

Answer: 0

The lowest recall happens when \tau is greater than 0.7. In this case, the classifier predicts that all points are 0, which gives a recall of 0.



Problem 8.3

What is the highest possible recall if the classifier achieves a precision of 1?

Answer: \frac{2}{3}

If precision is 1, the threshold must be greater than 0.4. Of these thresholds, the recall is greatest when the threshold is between 0.4 and 0.6. In this case, the recall is \frac{2}{3}.


