
Feature scaling's effect on gradient descent

In Andrew Ng's machine learning class, he mentions that feature scaling makes gradient descent converge faster. Specifically:

> We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.

Why does this work?

Gradient descent uses one fixed learning rate for all of the $\theta$'s, so the rate has to be chosen small enough for the steepest direction of the cost surface. That steep direction belongs to the feature with the *largest* range: since $\partial J / \partial \theta_j$ is proportional to the feature values $x_j$, a large-range feature produces large gradient components, and too big a step along that axis makes gradient descent overshoot or diverge. But with a learning rate that small, the $\theta$ for the small-range feature, where the surface is nearly flat, takes ages to converge.
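The effect is easy to demonstrate numerically. The sketch below (my own illustration, not from the course; the feature ranges and data are made up) runs batch gradient descent for least squares on two features, one with range ~1 and one with range ~1000, using the largest safe learning rate in each case. Unscaled, the iteration budget runs out before the gradient gets small; after z-score scaling, it converges in a handful of steps:

```python
import numpy as np

def gradient_descent(X, y, lr, n_iters=10000, tol=1e-8):
    """Batch gradient descent for least squares.
    Returns (theta, number of iterations actually used)."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for i in range(n_iters):
        grad = X.T @ (X @ theta - y) / m
        if np.linalg.norm(grad) < tol:
            return theta, i
        theta -= lr * grad
    return theta, n_iters

rng = np.random.default_rng(0)
m = 200
x1 = rng.uniform(0, 1, m)       # small-range feature
x2 = rng.uniform(0, 1000, m)    # large-range feature
X_raw = np.column_stack([x1, x2])
y = 3 * x1 + 0.002 * x2 + rng.normal(0, 0.01, m)

# Unscaled: the safe learning rate is limited by the curvature from the
# large-range feature, so the small-range direction crawls.
lr_raw = 1.0 / np.linalg.eigvalsh(X_raw.T @ X_raw / m).max()
_, iters_raw = gradient_descent(X_raw, y, lr_raw)

# Z-score scaling equalizes the curvature across directions.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
lr_std = 1.0 / np.linalg.eigvalsh(X_std.T @ X_std / m).max()
_, iters_std = gradient_descent(X_std, y, lr_std)

print(iters_raw, iters_std)  # scaled version converges in far fewer iterations
```

The learning rates here are set to $1/\lambda_{\max}$ of the Hessian $X^\top X / m$, the standard safe step size for this quadratic objective; in practice you would tune the rate rather than compute eigenvalues.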

There is also a good explanation on Quora:

> Essentially, scaling the inputs (through mean normalization, or z-score) gives the error surface a more spherical shape, where it would otherwise be a very high curvature ellipse. Since gradient descent is curvature-ignorant, having an error surface with high curvature will mean that we take many steps which aren't necessarily in the optimal direction. When we scale the inputs, we reduce the curvature, which makes methods that ignore curvature (like gradient descent) work much better. When the error surface is circular (spherical), the gradient points right at the minimum, so learning is easy.
