I was going through Hinton's lectures and I found something interesting and wanted to share.

It is very common for the magnitude of the gradient to differ widely between layers. The fan-in of a unit determines the size of the “overshoot” effects caused by simultaneously changing many of the incoming weights of a unit to correct the same error. So we can use a local adaptive gain $g_{ij}$ for each weight's gradient.

So the update rule becomes:

\[\Delta w_{ij} = - \epsilon g_{ij} \frac{\partial E}{\partial w_{ij}}\]

The gains are adjusted by additive increase and multiplicative decrease:

if $\frac{\partial E}{\partial w_{ij}}(t-1) \cdot \frac{\partial E}{\partial w_{ij}}(t) > 0$

then $g_{ij}(t) = g_{ij}(t-1) + 0.05$

else $g_{ij}(t) = g_{ij}(t-1) \cdot 0.95$

Other things to note are:

- $g_{ij}$ should be kept within some bounds, such as $[0.1, 10]$ or $[0.01, 100]$.
- Use full-batch learning or large mini-batches, so that sign changes in the gradient are not mainly due to mini-batch sampling error.
- Adaptive learning rates can be combined with momentum: use the agreement in sign between the current gradient and the current velocity for that weight.
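To make the gain rule concrete, here is a minimal pure-Python sketch of the additive-increase / multiplicative-decrease update with clipping, followed by the gain-scaled weight step. The function names (`update_gains`, `gain_step`) and the default learning rate are my own choices for illustration, not something from the lectures.

```python
def update_gains(gains, prev_grads, grads, inc=0.05, dec=0.95, lo=0.1, hi=10.0):
    """Adjust one gain per weight from the signs of two successive gradients."""
    new_gains = []
    for g, g_prev, g_curr in zip(gains, prev_grads, grads):
        if g_prev * g_curr > 0:
            # Gradient kept its sign: increase the gain additively.
            g = g + inc
        else:
            # Sign flipped (or a gradient was zero): decrease multiplicatively.
            g = g * dec
        # Keep the gain inside the chosen bounds [lo, hi].
        new_gains.append(min(max(g, lo), hi))
    return new_gains


def gain_step(weights, gains, grads, lr=0.01):
    """Apply delta_w = -lr * gain * gradient to every weight."""
    return [w - lr * g * d for w, g, d in zip(weights, gains, grads)]
```

Because the decrease is multiplicative, a large gain collapses quickly once the gradient starts oscillating, while the increase stays slow and steady.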

With momentum, the weight change is the current velocity:

$$ \Delta w_{ij}(t) = v(t) = \alpha v(t-1) - \epsilon \frac{\partial E}{\partial w_{ij}}(t) = \alpha \Delta w_{ij}(t-1) - \epsilon \frac{\partial E}{\partial w_{ij}}(t)$$

Here $v(t)$ is the velocity and $\alpha$, the momentum coefficient, is slightly less than 1.

The momentum method builds up speed in directions with a gentle but consistent gradient. Start with a small momentum (e.g. $\alpha = 0.5$) and raise it later (e.g. to $\alpha = 0.9$).
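A small sketch of the momentum update on a toy problem may help here. The error function $E(w) = \frac{1}{2} w^2$, the step at which $\alpha$ switches, and all the names below are assumptions of mine, picked only to keep the example short.

```python
def momentum_step(w, v, grad, lr, alpha):
    """One update: v(t) = alpha * v(t-1) - lr * grad, then w moves by v(t)."""
    v_new = alpha * v - lr * grad
    return w + v_new, v_new


def alpha_schedule(step, switch_step=20, alpha_start=0.5, alpha_end=0.9):
    """Hypothetical schedule: small momentum early, larger momentum later."""
    return alpha_start if step < switch_step else alpha_end


def minimize_quadratic(w0=5.0, steps=200, lr=0.1):
    """Run momentum descent on the toy error E(w) = 0.5 * w**2, so dE/dw = w."""
    w, v = w0, 0.0
    for step in range(steps):
        grad = w
        w, v = momentum_step(w, v, grad, lr, alpha_schedule(step))
    return w
```

Calling `minimize_quadratic()` should drive `w` close to the minimum at 0. A hard switch from 0.5 to 0.9 is the simplest schedule to write down; raising $\alpha$ smoothly would work just as well in this sketch.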