This blog post concerns our ICLR 2020 paper on a surprising discovery about the learning rate (LR), the most basic hyperparameter in deep learning.
As illustrated in many online blogs, setting the LR too small can slow down optimization, while setting it too large can make the iterates overshoot the region of low loss. The standard mathematical analysis of the right choice of LR relates it to the smoothness of the loss function.
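As a reminder of the textbook version of that analysis (not a result from our paper): if the loss $f$ is $\beta$-smooth, i.e. its gradient is $\beta$-Lipschitz, then a single gradient step with LR $\eta$ satisfies

$$ f(w - \eta \nabla f(w)) \;\le\; f(w) - \eta\Big(1 - \tfrac{\beta\eta}{2}\Big)\,\|\nabla f(w)\|^2, $$

so the loss is guaranteed to decrease whenever $\eta < 2/\beta$, with the best bound at $\eta = 1/\beta$. This is the classical sense in which the "right" LR is tied to smoothness.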
Many practitioners use a ‘step decay’ LR schedule, which systematically drops the LR after specific training epochs. One often hears the intuition (with some mathematical justification if one treats SGD as a random walk in the loss landscape) that large learning rates are useful in the initial (“exploration”) phase of training, whereas lower rates in later epochs allow a slow settling down to a local minimum in the landscape. Intriguingly, this intuition is called into question by the success of exotic learning rate schedules such as cosine (Loshchilov & Hutter, 2016) and triangular (Smith, 2015), which feature an oscillatory LR. These divergent approaches suggest that the LR, the most basic and intuitive hyperparameter in deep learning, has not revealed all its mysteries yet.
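To make the contrast concrete, here is a minimal sketch of these three schedules written as plain functions of the training step (or epoch). These are simplified illustrations, not the exact implementations from the cited papers; all parameter names and default values below are made up for the example.

```python
import math

def step_decay(step, base_lr=0.1, drop_every=30, factor=0.1):
    """'Step decay': multiply the LR by `factor` every `drop_every` steps."""
    return base_lr * factor ** (step // drop_every)

def cosine(step, total_steps, base_lr=0.1, min_lr=0.0):
    """Cosine schedule (Loshchilov & Hutter, 2016, simplified, no restarts):
    the LR follows half a cosine wave from base_lr down to min_lr."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

def triangular(step, cycle_len, min_lr=0.001, max_lr=0.1):
    """Triangular / cyclical schedule (Smith, 2015, simplified): the LR ramps
    linearly from min_lr up to max_lr and back down within each cycle."""
    position = (step % cycle_len) / cycle_len      # position in [0, 1) within the cycle
    distance_from_peak = abs(position - 0.5) * 2   # 1 at the cycle ends, 0 at the middle
    return min_lr + (max_lr - min_lr) * (1 - distance_from_peak)

# Example: LR a tenth of the way through a 100-step cosine schedule.
# cosine(10, total_steps=100)  ->  ~0.098
```

Step decay is monotone (the LR only ever goes down), whereas the cosine and triangular schedules let the LR rise again or oscillate, which is exactly what the "explore then settle" intuition would seem to discourage.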