In [1]:
from lec_utils import *
import lec20_util as util
Discussion Slides: Gradient Descent and Classification
Agenda 📆¶
- Gradient descent.
- Classification.
- Evaluating classification models.
Overview¶
- In order to find optimal model parameters, $w_0^*, w_1^*, ..., w_d^*$, our goal has been to minimize empirical risk.
- We've minimized empirical risk functions (like $R$ above) a few ways so far:
- Taking (partial) derivatives, setting them to 0, and solving.
- Using a linear algebra argument, which led to the normal equations, $X^TX \vec w^* = X^T \vec y$.
- Gradient descent is a method for minimizing functions computationally, when doing so by hand is difficult or impossible.
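As a refresher, the normal equations can be solved numerically. Below is a minimal sketch using numpy; the design matrix and targets are made up for illustration and are not from the slides.

```python
import numpy as np

# Toy design matrix with an intercept column, and targets that
# lie exactly on the line y = 1 + x (illustrative data).
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

# Solve the normal equations X^T X w* = X^T y directly,
# rather than inverting X^T X explicitly.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
```

Since the toy data lies exactly on $y = 1 + x$, `w_star` recovers the intercept and slope $(1, 1)$.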
Gradient descent¶
- Suppose we want to minimize a function $f$, which takes in a vector $\vec w$ as input and returns a scalar as output.
For example, $f$ could be (but doesn't need to be) some empirical risk function, like the one on the previous slide.
To do so:
- Start with an initial guess of the minimizing input, $\vec w^{(0)}$.
- Choose a learning rate, or step size, $\alpha$.
- Use the following update rule:
$$\vec w^{(t+1)} = \vec w^{(t)} - \alpha \nabla f(\vec w^{(t)})$$
- $\nabla f(\vec{w}^{(t)})$ is the gradient of $f$, evaluated at timestep $t$.
The gradient of a function is a vector containing all of its partial derivatives.
- We repeat this iterative process until convergence, i.e., until the changes between $\vec{w}^{(t+1)}$ and $\vec{w}^{(t)}$ are small (or equivalently, until $\nabla f(\vec w^{(t)})$ is close to $\vec 0$).
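The update rule above can be sketched in a few lines of Python. The objective $f(w) = (w - 3)^2$, the tolerance, and the iteration cap below are illustrative choices, not part of the slides.

```python
import numpy as np

def gradient_descent(grad, w0, alpha=0.01, tol=1e-8, max_iter=10_000):
    """Minimize a function via the update w^(t+1) = w^(t) - alpha * grad(w^(t))."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        w_next = w - alpha * grad(w)
        # Stop once the update (equivalently, the gradient) is near zero.
        if np.linalg.norm(w_next - w) < tol:
            return w_next
        w = w_next
    return w

# Example: f(w) = (w - 3)^2 has gradient 2(w - 3) and minimizer w* = 3.
grad_f = lambda w: 2 * (w - 3)
w_star = gradient_descent(grad_f, w0=0.0, alpha=0.1)
```

With this convex objective and a small enough learning rate, the iterates converge to the true minimizer, $w^* = 3$.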
In [2]:
util.minimizing_animation(w0=0, alpha=0.01)