In [1]:
from lec_utils import *
Discussion Slides: Logistic Regression
Agenda 📆
- Logistic regression.
- The logistic function.
- Cross-entropy loss.
Logistic regression
- Logistic regression is a binary classification technique that predicts the probability that a data point belongs to class 1 given its feature vector, $\vec x_i$.
- If we're able to predict the probability that a point belongs to class 1, we can turn that probability into a classification by using a threshold (see the sketch after this list).
For example, if we set the threshold to 50% and our model estimates a 60% chance that a point belongs to class 1, we classify that point as class 1.
- Logistic regression is similar to linear regression in that it computes a linear combination of the input features (where the weights come from a parameter vector, $\vec w$) to output a number as a prediction.
- However, instead of predicting a real number in the range $(-\infty, \infty)$, logistic regression predicts a probability, which is in the range $[0, 1]$.
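- A minimal sketch of this predict-a-probability-then-threshold workflow is below (not from the original slides). It fits scikit-learn's LogisticRegression to a made-up, one-feature dataset, and assumes np is available from the lec_utils setup cell at the top.
In [ ]:
# Sketch only: made-up data with one feature and a binary label.
# scikit-learn's LogisticRegression is used here for illustration;
# it was not part of the original slides.
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])  # feature values
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                  # observed classes

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns [P(y = 0 | x), P(y = 1 | x)] for each point;
# keep the second column, the predicted probability of class 1.
probs = model.predict_proba(X)[:, 1]

# Classify as class 1 whenever the predicted probability is at least 50%.
predictions = (probs >= 0.5).astype(int)
probs, predictions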
The logistic function
To perform this transformation from a real number to a probability, we use the logistic function, which transforms numerical inputs to the interval $(0,1)$: $$\sigma(t) = \frac{1}{1 + e^{-t}} = \frac{1}{1 + \text{exp}(-t)}$$
As $t \to +\infty$, $\sigma(t) \to 1$.
As $t \to -\infty$, $\sigma(t) \to 0$.
When $t = 0$, $\sigma(0) = 0.5$.
In [2]:
# Plot the logistic function over the interval [-5, 5].
ts = np.linspace(-5, 5)
px.line(x=ts, y=1 / (1 + np.e ** (-ts)), title=r'$\sigma(x)$')
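- A quick numeric check of the three properties above (added for illustration; sigma below is just the formula from the previous slide, and np comes from the lec_utils setup cell):
In [ ]:
# Evaluate the logistic function at a large positive input, a large
# negative input, and 0.
def sigma(t):
    return 1 / (1 + np.exp(-t))

sigma(100), sigma(-100), sigma(0)  # ≈ 1, ≈ 0, exactly 0.5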
- In the one-feature case, where $P(y_i = 1 | \vec x_i) = \sigma(w_0 + w_1 x_i)$:
- If $w_1 > 0$, then $\sigma(w_0 + w_1 x_i)$ is an increasing function of $x_i$.
- If $w_1 < 0$, then $\sigma(w_0 + w_1 x_i)$ is a decreasing function of $x_i$.
In [3]:
# Since w_1 = -2 < 0, sigma(3 - 2x) is a decreasing function of x.
px.line(x=ts, y=1 / (1 + np.e ** -(3 - 2 * ts)), title=r'$\sigma(3 - 2x)$')
Cross-entropy loss
- As a reminder, the logistic regression model looks like:
$$P(y_i = 1 | \vec x_i) = \sigma \left( w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + \ldots + w_d x_i^{(d)} \right)$$
where $x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(d)}$ are the features of $\vec x_i$ and the weights $w_0, w_1, \ldots, w_d$ come from the parameter vector $\vec w$.
- To find optimal model parameters, we minimize average cross-entropy loss.
If $y_i$ is an observed value and $p_i$ is a predicted probability, then the cross-entropy loss of that prediction is:
$$L_\text{ce}(y_i, p_i) = \begin{cases} -\log(p_i) & \text{if } y_i = 1 \\ -\log(1 - p_i) & \text{if } y_i = 0 \end{cases} = -\left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right)$$
Average cross-entropy loss is the mean of this quantity over all training points.
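- The cell below (added for illustration, not from the original slides) computes average cross-entropy loss by hand on made-up labels and predicted probabilities, and checks the result against scikit-learn's log_loss.
In [ ]:
# Made-up observed labels y_i and predicted probabilities p_i.
from sklearn.metrics import log_loss

y_obs = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.6, 0.8])

# Average cross-entropy loss: the mean of
# -(y_i log(p_i) + (1 - y_i) log(1 - p_i)) over all points.
manual = -np.mean(y_obs * np.log(p_pred) + (1 - y_obs) * np.log(1 - p_pred))

manual, log_loss(y_obs, p_pred)  # the two values agree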
- Benefits of cross-entropy loss:
- The loss surface of average cross-entropy loss is convex, which makes it easier to minimize with gradient descent than a non-convex loss surface would be.
- Cross-entropy loss steeply penalizes incorrect probabilities, unlike squared loss.
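- To see the difference concretely (an added illustration, not from the original slides): for a point with $y_i = 1$, cross-entropy loss is $-\log(p_i)$, while squared loss is $(1 - p_i)^2$. As $p_i \to 0$, the former blows up, while the latter never exceeds 1.
In [ ]:
# For a point with y_i = 1, compare cross-entropy loss and squared loss
# as the predicted probability p_i shrinks toward 0 (a confident, but
# wrong, prediction).
ps = np.array([0.5, 0.1, 0.01, 0.001])

-np.log(ps), (1 - ps) ** 2  # cross-entropy grows without bound; squared loss stays below 1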