I was going through Hinton's lectures and found a couple of things interesting enough to share: the momentum method and adaptive learning rates (gains) for individual weights.
In the momentum method, the weight change is the current velocity:
$$ \Delta w_{ij}(t) = v(t) = \alpha v(t-1) - \epsilon \frac{\partial E}{\partial w_{ij}}(t) = \alpha \Delta w_{ij}(t-1) - \epsilon \frac{\partial E}{\partial w_{ij}}(t)$$
where the velocity is $v(t) = \alpha v(t-1) - \epsilon \frac{\partial E}{\partial w_{ij}}(t)$ and the momentum $\alpha$ is slightly less than 1.
The momentum method builds up speed in directions with a gentle but consistent gradient. A common recipe is to start with a small momentum, $\alpha = 0.5$, and raise it later to $\alpha = 0.9$.
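Here is a minimal NumPy sketch of this update; the function name, the toy quadratic objective, and the concrete values of $\epsilon$ and $\alpha$ are mine, not from the lecture.

```python
import numpy as np

def momentum_step(w, v, grad, eps=0.01, alpha=0.9):
    """One momentum update: v(t) = alpha*v(t-1) - eps*dE/dw(t), then w += v."""
    v = alpha * v - eps * grad   # velocity: a decaying sum of past gradients
    w = w + v                    # the weight change is the current velocity
    return w, v

# Toy usage: minimize E(w) = 0.5*||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for t in range(100):
    alpha = 0.5 if t < 10 else 0.9   # small momentum early, larger once gradients settle
    w, v = momentum_step(w, v, grad=w, eps=0.1, alpha=alpha)
```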
The magnitudes of the gradients are often very different for different layers. The fan-in of a unit determines the size of the “overshoot” effect caused by simultaneously changing many of the incoming weights of a unit to correct the same error. So we can use a local adaptive gain $g_{ij}$ for each weight's gradient.
The update rule then becomes:
\[\Delta w_{ij} = - \epsilon g_{ij} \frac{\partial E}{\partial w_{ij}}\]
The gains are adjusted by an additive increment and a multiplicative decrement:
if $\left( \frac{\partial E}{\partial w_{ij}}(t-1) \cdot \frac{\partial E}{\partial w_{ij}}(t) \right) > 0$
then $g_{ij}(t) = g_{ij}(t-1) + 0.05$
else $g_{ij}(t) = g_{ij}(t-1) \times 0.95$
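A minimal NumPy sketch of the gain update together with the gained weight update; the function and variable names are mine, and I assume full-batch gradients held in arrays of the same shape as the weights.

```python
import numpy as np

def adaptive_gain_step(w, g, grad, prev_grad, eps=0.01,
                       gain_min=0.1, gain_max=10.0):
    """One update with a per-weight gain g_ij: additive increase when the
    gradient keeps its sign, multiplicative decrease when it flips."""
    agree = grad * prev_grad > 0             # did this weight's gradient keep its sign?
    g = np.where(agree, g + 0.05, g * 0.95)  # additive increment / multiplicative decrement
    g = np.clip(g, gain_min, gain_max)       # keep the gains within sensible bounds
    w = w - eps * g * grad                   # delta w_ij = -eps * g_ij * dE/dw_ij
    return w, g
```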
Other things to note are:
- $g_{ij}$ should be kept within some bounds, e.g. $[0.1, 10]$ or $[0.01, 100]$.
- Use full-batch learning or big mini-batches, so that sign changes in the gradient are not mostly due to sampling error.
- Adaptive learning rates can be combined with momentum: use the agreement in sign between the current gradient for a weight and the current velocity for that weight, as sketched below.
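A hedged sketch of that last point: this is my own way of wiring the two pieces above together, and I read "agreement in sign" as the descent direction $-\partial E/\partial w_{ij}$ having the same sign as the velocity; the lecture does not pin down the exact rule.

```python
import numpy as np

def gain_momentum_step(w, v, g, grad, eps=0.01, alpha=0.9,
                       gain_min=0.1, gain_max=10.0):
    """Per-weight gains combined with momentum: a gain grows while the descent
    direction implied by the gradient keeps matching the weight's velocity."""
    agree = (-grad) * v > 0                  # gradient pushes the weight the way it is already moving
    g = np.where(agree, g + 0.05, g * 0.95)  # same additive/multiplicative rule as before
    g = np.clip(g, gain_min, gain_max)
    v = alpha * v - eps * g * grad           # the gained gradient feeds the velocity
    w = w + v                                # the weight change is still the current velocity
    return w, v, g
```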