I was going through Hinton's lectures and found a couple of things interesting enough to share: the momentum method and adaptive learning rates (gains) for individual weights.
In the momentum method, the weight change is the current velocity:
$$ \Delta w_{ij}(t) = v(t) = \alpha v(t-1) - \epsilon \frac{\partial E}{\partial w_{ij}}(t) = \alpha \Delta w_{ij}(t-1) - \epsilon \frac{\partial E}{\partial w_{ij}}(t)$$
where the velocity is $v(t) = \alpha v(t-1) - \epsilon \frac{\partial E}{\partial w_{ij}}(t)$ and the momentum $\alpha$ is slightly less than 1.
The momentum method builds up speed in directions with a gentle but consistent gradient. A common recipe is to start with a small momentum, $\alpha = 0.5$, and raise it later to $\alpha = 0.9$.
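Here is a minimal NumPy sketch of this update; the function name, the toy quadratic objective, and the concrete values of $\epsilon$ and $\alpha$ are mine, not from the lecture.

```python
import numpy as np

def momentum_step(w, v, grad, eps=0.01, alpha=0.9):
    """One momentum update: v(t) = alpha*v(t-1) - eps*dE/dw(t), then w += v."""
    v = alpha * v - eps * grad   # velocity: a decaying sum of past gradients
    w = w + v                    # the weight change is the current velocity
    return w, v

# Toy usage: minimize E(w) = 0.5*||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for t in range(100):
    alpha = 0.5 if t < 10 else 0.9   # small momentum early, larger once gradients settle
    w, v = momentum_step(w, v, grad=w, eps=0.1, alpha=alpha)
```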
The magnitudes of the gradients are often very different for different layers. The fan-in of a unit determines the size of the “overshoot” effect caused by simultaneously changing many of the incoming weights of a unit to correct the same error. So we can use a local adaptive gain $g_{ij}$ for each weight's gradient.
The update rule then becomes:
\[\Delta w_{ij} = - \epsilon g_{ij} \frac{\partial E}{\partial w_{ij}}\]
The gains are adjusted by an additive increment and a multiplicative decrement:
if $\left( \frac{\partial E}{\partial w_{ij}}(t-1) \cdot \frac{\partial E}{\partial w_{ij}}(t) \right) > 0$
then $g_{ij}(t) = g_{ij}(t-1) + 0.05$
else $g_{ij}(t) = g_{ij}(t-1) \times 0.95$
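A minimal NumPy sketch of the gain update together with the gained weight update; the function and variable names are mine, and I assume full-batch gradients held in arrays of the same shape as the weights.

```python
import numpy as np

def adaptive_gain_step(w, g, grad, prev_grad, eps=0.01,
                       gain_min=0.1, gain_max=10.0):
    """One update with a per-weight gain g_ij: additive increase when the
    gradient keeps its sign, multiplicative decrease when it flips."""
    agree = grad * prev_grad > 0             # did this weight's gradient keep its sign?
    g = np.where(agree, g + 0.05, g * 0.95)  # additive increment / multiplicative decrement
    g = np.clip(g, gain_min, gain_max)       # keep the gains within sensible bounds
    w = w - eps * g * grad                   # delta w_ij = -eps * g_ij * dE/dw_ij
    return w, g
```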
Other things to note are:
- $g_{ij}$ should be kept within some bounds, e.g. $[0.1, 10]$ or $[0.01, 100]$.
- Use full-batch learning or big mini-batches, so that sign changes in the gradient are not mostly due to sampling error.
- Adaptive learning rates can be combined with momentum: use the agreement in sign between the current gradient for a weight and the current velocity for that weight, as sketched below.
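A hedged sketch of that last point: this is my own way of wiring the two pieces above together, and I read "agreement in sign" as the descent direction $-\partial E/\partial w_{ij}$ having the same sign as the velocity; the lecture does not pin down the exact rule.

```python
import numpy as np

def gain_momentum_step(w, v, g, grad, eps=0.01, alpha=0.9,
                       gain_min=0.1, gain_max=10.0):
    """Per-weight gains combined with momentum: a gain grows while the descent
    direction implied by the gradient keeps matching the weight's velocity."""
    agree = (-grad) * v > 0                  # gradient pushes the weight the way it is already moving
    g = np.where(agree, g + 0.05, g * 0.95)  # same additive/multiplicative rule as before
    g = np.clip(g, gain_min, gain_max)
    v = alpha * v - eps * g * grad           # the gained gradient feeds the velocity
    w = w + v                                # the weight change is still the current velocity
    return w, v, g
```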