We can apply different learning rates, which is the pace of adjustment to the weights. If f is a multi-variable function, we use the gradient of f instead of f'. We are subtracting the gradient at each iteration, because we want to move away from the gradient, toward the minimum.
STAT 4400: Statistical Machine Learning; Columbia University
Stochastic Gradient Descent
Gradient descent has to scan through the entire training set before taking a single step—a costly operation if m is large—stochastic gradient descent can start making progress right away. We repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only. Often, stochastic gradient descent gets θ “close” to the minimum much faster than gradient descent.
CS 229: Machine Learning; Stanford University