In [1]:
from lec_utils import *
Discussion Slides: Logistic Regression
Agenda 📆
- Logistic regression.
- The logistic function.
- Cross-entropy loss.
Logistic regression
- Logistic regression is a binary classification technique that predicts the probability that a data point belongs to class 1 given its feature vector, $\vec x_i$.
- If we're able to predict the probability that a point belongs to class 1, we can turn that probability into a classification by using a threshold (see the sketch after this list).
For example, if we set the threshold to 50% and our model estimates a 60% chance that a point belongs to class 1, we classify that point as class 1.
- Logistic regression is similar to linear regression in that it computes a linear combination of the input features (where the weights come from a parameter vector, $\vec w$) to output a number as a prediction.
- However, instead of predicting a real number in the range $(-\infty, \infty)$, logistic regression predicts a probability, which is in the range $[0, 1]$.
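- A minimal sketch of this predict-a-probability-then-threshold workflow is below (not from the original slides). It fits scikit-learn's LogisticRegression to a made-up, one-feature dataset, and assumes np is available from the lec_utils setup cell at the top.
In [ ]:
# Sketch only: made-up data with one feature and a binary label.
# scikit-learn's LogisticRegression is used here for illustration;
# it was not part of the original slides.
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])  # feature values
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                  # observed classes

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns [P(y = 0 | x), P(y = 1 | x)] for each point;
# keep the second column, the predicted probability of class 1.
probs = model.predict_proba(X)[:, 1]

# Classify as class 1 whenever the predicted probability is at least 50%.
predictions = (probs >= 0.5).astype(int)
probs, predictions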
The logistic function
To perform this transformation from a real number to a probability, we use the logistic function, which transforms numerical inputs to the interval $(0,1)$: $$\sigma(t) = \frac{1}{1 + e^{-t}} = \frac{1}{1 + \text{exp}(-t)}$$
As $t \to +\infty$, $\sigma(t) \to 1$.
As $t \to -\infty$, $\sigma(t) \to 0$.
When $t = 0$, $\sigma(0) = 0.5$.
In [2]:
# Plot the logistic function over the interval [-5, 5].
ts = np.linspace(-5, 5)
px.line(x=ts, y=1 / (1 + np.e ** (-ts)), title=r'$\sigma(x)$')
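- A quick numeric check of the three properties above (added for illustration; sigma below is just the formula from the previous slide, and np comes from the lec_utils setup cell):
In [ ]:
# Evaluate the logistic function at a large positive input, a large
# negative input, and 0.
def sigma(t):
    return 1 / (1 + np.exp(-t))

sigma(100), sigma(-100), sigma(0)  # ≈ 1, ≈ 0, exactly 0.5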
- In the one-feature case, where $P(y_i = 1 | \vec x_i) = \sigma(w_0 + w_1 x_i)$:
- If $w_1 > 0$, then $\sigma(w_0 + w_1 x_i)$ is an increasing function of $x_i$.
- If $w_1 < 0$, then $\sigma(w_0 + w_1 x_i)$ is a decreasing function of $x_i$.
In [3]:
# Since w_1 = -2 < 0, sigma(3 - 2x) is a decreasing function of x.
px.line(x=ts, y=1 / (1 + np.e ** -(3 - 2 * ts)), title=r'$\sigma(3 - 2x)$')
Cross-entropy loss
- As a reminder, the logistic regression model looks like:
$$P(y_i = 1 | \vec x_i) = \sigma \left( w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + \ldots + w_d x_i^{(d)} \right)$$
where $x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(d)}$ are the features of $\vec x_i$ and the weights $w_0, w_1, \ldots, w_d$ come from the parameter vector $\vec w$.
- To find optimal model parameters, we minimize average cross-entropy loss.
If $y_i$ is an observed value and $p_i$ is a predicted probability, then the cross-entropy loss of that prediction is:
$$L_\text{ce}(y_i, p_i) = \begin{cases} -\log(p_i) & \text{if } y_i = 1 \\ -\log(1 - p_i) & \text{if } y_i = 0 \end{cases} = -\left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right)$$
Average cross-entropy loss is the mean of this quantity over all training points.
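- The cell below (added for illustration, not from the original slides) computes average cross-entropy loss by hand on made-up labels and predicted probabilities, and checks the result against scikit-learn's log_loss.
In [ ]:
# Made-up observed labels y_i and predicted probabilities p_i.
from sklearn.metrics import log_loss

y_obs = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.6, 0.8])

# Average cross-entropy loss: the mean of
# -(y_i log(p_i) + (1 - y_i) log(1 - p_i)) over all points.
manual = -np.mean(y_obs * np.log(p_pred) + (1 - y_obs) * np.log(1 - p_pred))

manual, log_loss(y_obs, p_pred)  # the two values agree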
- Benefits of cross-entropy loss:
- The loss surface of average cross-entropy loss is convex, which makes it easier to minimize with gradient descent than a non-convex loss surface would be.
- Cross-entropy loss steeply penalizes incorrect probabilities, unlike squared loss.
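- To see the difference concretely (an added illustration, not from the original slides): for a point with $y_i = 1$, cross-entropy loss is $-\log(p_i)$, while squared loss is $(1 - p_i)^2$. As $p_i \to 0$, the former blows up, while the latter never exceeds 1.
In [ ]:
# For a point with y_i = 1, compare cross-entropy loss and squared loss
# as the predicted probability p_i shrinks toward 0 (a confident, but
# wrong, prediction).
ps = np.array([0.5, 0.1, 0.01, 0.001])

-np.log(ps), (1 - ps) ** 2  # cross-entropy grows without bound; squared loss stays below 1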