In [1]:
from lec_utils import *

Discussion Slides: Logistic Regression

Agenda 📆

  • Logistic regression.
  • The logistic function.
  • Cross-entropy loss.

Logistic regression

  • Logistic regression is a binary classification technique that predicts the probability that a data point belongs to class 1 given its feature vector, $\vec x_i$.
$$P(y_i = 1 | \vec{x}_i) = \sigma (w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + ... + w_d x_i^{(d)}) = \sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right)$$
Remember, in binary classification, each $y_i$ is either 0 or 1.
  • Once we can predict the probability that a point belongs to class 1, we can turn that probability into a classification by applying a threshold (see the sketch after this list).

For example, if we set the threshold to 50% and our model estimates a 60% chance that a point belongs to class 1, we classify that point as class 1.

  • Logistic regression is similar to linear regression in that it computes a linear combination of the input features (where the weights come from a parameter vector, $\vec w$) to output a number as a prediction.
  • However, instead of predicting a real number in the range $(-\infty, \infty)$, logistic regression predicts a probability, which lies in the interval $(0, 1)$.
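
As a concrete sketch of this predict-then-threshold pipeline, here's how it might look with scikit-learn's LogisticRegression. The toy dataset below is made up purely for illustration, and np is assumed to be available from the setup cell above.

In [ ]:
# A minimal sketch of the predict-probability-then-threshold pipeline.
# The toy data below is hypothetical, purely for illustration.
from sklearn.linear_model import LogisticRegression

X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])  # one feature per point
y_toy = np.array([0, 0, 1, 1])                  # binary labels

model = LogisticRegression()
model.fit(X_toy, y_toy)

probs = model.predict_proba(X_toy)[:, 1]  # P(y_i = 1 | x_i) for each point
(probs >= 0.5).astype(int)                # classify with a 50% threshold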

The logistic function

  • To perform this transformation from a real number to a probability, we use the logistic function, which transforms numerical inputs to the interval $(0,1)$: $$\sigma(t) = \frac{1}{1 + e^{-t}} = \frac{1}{1 + \text{exp}(-t)}$$

  • As $t \to +\infty$, $\sigma(t) \to 1$.

  • As $t \to -\infty$, $\sigma(t) \to 0$.

  • When $t = 0$, $\sigma(0) = 0.5$.

In [2]:
ts = np.linspace(-5, 5)
px.line(x=ts, y=1 / (1 + np.exp(-ts)), title=r'$\sigma(x)$')  # the logistic curve
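As a quick numeric check of the three properties above (a minimal sketch; np comes from the setup cell):

In [ ]:
# A quick numeric check: sigma approaches 0 for very negative inputs,
# equals 0.5 at 0, and approaches 1 for very positive inputs.
def sigma(t):
    return 1 / (1 + np.exp(-t))

sigma(np.array([-50, 0, 50]))  # approximately 0, exactly 0.5, approximately 1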
  • In the one-feature case, where $P(y_i = 1 | \vec x_i) = \sigma(w_0 + w_1 x_i)$:
    • If $w_1 > 0$, then $\sigma(w_0 + w_1 x_i)$ is an increasing function of $x_i$.
    • If $w_1 < 0$, then $\sigma(w_0 + w_1 x_i)$ is a decreasing function of $x_i$.
In [3]:
px.line(x=ts, y=1 / (1 + np.exp(-(3 - 2 * ts))), title=r'$\sigma(3 - 2x)$')  # decreasing, since $w_1 = -2 < 0$

Cross-entropy loss

  • As a reminder, the logistic regression model looks like:
$$P(y_i = 1 | \vec{x}_i) = \sigma (w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + ... + w_d x_i^{(d)}) = \sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right)$$
  • To find optimal model parameters, we minimize average cross-entropy loss.
    If $y_i$ is an observed value and $p_i$ is a predicted probability, then:
$$L_\text{ce}(y_i, p_i) = \begin{cases} - \log(p_i) & \text{if $y_i = 1$} \\ -\log(1 - p_i) & \text{if $y_i = 0$} \end{cases} = - \left( y_i \log p_i + (1 - y_i) \log (1 - p_i) \right)$$
  • Benefits of cross-entropy loss:
    • The loss surface of average cross-entropy loss is convex, which makes it easier to minimize with gradient descent than a non-convex loss surface.
    • Cross-entropy loss penalizes confident but incorrect predictions much more steeply than squared loss does, as the sketches below illustrate.
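
To make the piecewise definition concrete, the sketch below evaluates $L_\text{ce}$ on a few made-up $(y_i, p_i)$ pairs.

In [ ]:
# Cross-entropy loss on a few made-up (y_i, p_i) pairs (a minimal sketch).
def cross_entropy(y, p):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Small loss when p_i is close to y_i; large loss when it's far off.
cross_entropy(1, 0.9), cross_entropy(1, 0.1), cross_entropy(0, 0.1)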
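
And to see the steep-penalty point, compare squared loss with cross-entropy loss as a prediction for a point with $y_i = 1$ becomes more confidently wrong: squared loss is bounded by 1, while cross-entropy loss grows without bound. (Again, a minimal sketch with made-up probabilities.)

In [ ]:
# Squared loss is bounded by 1, but cross-entropy loss diverges as a
# prediction becomes confidently wrong (a minimal sketch).
p = np.array([0.5, 0.1, 0.01, 0.001])  # predicted P(y_i = 1) when y_i = 1
(1 - p) ** 2, -np.log(p)               # squared loss vs. cross-entropy loss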