Generally speaking, when features are unnormalized, gradient-based optimization algorithms (such as those used to train neural networks) have a very hard time moving the weight vectors toward a good solution.
- Let's say you have two features whose ranges differ widely.
- Geometrically, the data points are spread out and form an elongated ellipsoid (an oval), which stretches the contours of the loss surface.
- However, if the features are normalized (e.g., to zero mean and unit variance), they become comparable in scale; the point cloud is more concentrated, ideally close to a unit circle, and the covariance matrix becomes diagonal, or at least close to diagonal.
For gradient-based algorithms, normalization improves the convergence speed.
Here's what convergence looks like:
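A quick numerical sketch of this effect (with made-up data, not from the original): gradient descent on a least-squares problem with two features on very different scales, before and after standardization. With raw features, the step size must be tiny to avoid divergence, so convergence stalls; standardized features converge in a handful of steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two features on very different scales.
X = np.column_stack([rng.uniform(0, 1, 200),       # feature 1: range ~[0, 1]
                     rng.uniform(0, 1000, 200)])   # feature 2: range ~[0, 1000]
true_w = np.array([3.0, 0.002])
y = X @ true_w + rng.normal(0, 0.1, 200)

def gd_steps(X, y, lr, tol=1e-6, max_steps=100_000):
    """Run gradient descent on mean-squared error; return steps until
    the weight update is smaller than tol (or the step budget runs out)."""
    w = np.zeros(X.shape[1])
    for step in range(max_steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w_new = w - lr * grad
        if np.max(np.abs(w_new - w)) < tol:
            return step
        w = w_new
    return max_steps

# Standardize: zero mean, unit variance per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
y_ctr = y - y.mean()

# The raw features force a tiny learning rate (larger ones diverge).
raw_steps = gd_steps(X, y, lr=1e-7)
std_steps = gd_steps(X_std, y_ctr, lr=0.1)
print(raw_steps, std_steps)  # standardized converges in far fewer steps
```

The learning rates here were chosen by hand for this synthetic problem; the point is only the gap between the two step counts.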
Relationship to Batch-Normalization (BN):
- This is the same idea behind methods such as batch normalization, which normalizes the intermediate representations of the data inside a neural network.
- With BN, convergence speeds up because the normalized activations keep the optimization landscape better conditioned, so the gradients can reduce the error more effectively.
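The BN forward pass can be sketched in a few lines (a simplified training-time version; the learned scale `gamma` and shift `beta`, and the sample activations, are illustrative):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift.

    gamma and beta are learned parameters that let the network recover
    any scale/offset the normalization removed. This is only the
    training-time forward pass; inference uses running statistics.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(1)
# Hypothetical hidden-layer activations: batch of 64, 3 units, badly scaled.
h = rng.normal(loc=5.0, scale=20.0, size=(64, 3))
out = batch_norm(h, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0), out.std(axis=0))  # per-feature mean ≈ 0, std ≈ 1
```

With `gamma=1` and `beta=0` this reduces to plain standardization of the batch, which is exactly the feature-scaling idea above applied to intermediate layers.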
Learn more about feature scaling in Andrew Ng's video.