Generally speaking, in the *unnormalized* case, **gradient-based optimization** algorithms (such as those used to train neural networks) will have a very **hard time moving the weight vectors towards a good solution**.

**Imagine this:**

- Let's say you have two features whose ranges differ widely.
- Geometrically, the data points are spread around and form an ellipsoid (i.e., an oval).
- However, if the features are normalized, they will be more concentrated and, hopefully, form a **unit circle**, making the **covariance** diagonal, or at least close to diagonal (see the sketch below).
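
As a rough illustration, here is a minimal sketch (assuming NumPy; the feature names and numbers are made up) of how standardizing two features with very different ranges turns an elongated cloud into something close to a unit circle, with a covariance matrix close to the identity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent features on very different scales (hypothetical values):
# "age" spread around 40 +/- 10, "income" spread around 50,000 +/- 15,000.
age = rng.normal(40, 10, size=1000)
income = rng.normal(50_000, 15_000, size=1000)
X = np.column_stack([age, income])

# Raw data: variances of roughly 1e2 vs 2.25e8 -> a very elongated ellipsoid.
print(np.cov(X, rowvar=False))

# Standardize each feature to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Standardized data: covariance close to the identity -> roughly a unit circle.
print(np.cov(X_std, rowvar=False))
```

Note that standardization only rescales each feature; if the features are also strongly correlated, the off-diagonal entries remain, and a whitening step would be needed to make the covariance fully diagonal.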

For gradient-based algorithms, normalization improves the convergence speed.

Here's what convergence looks like:
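
As a concrete, if simplified, comparison, here is a minimal sketch (assuming NumPy; the data, learning rates, and the `gd_steps` helper are illustrative assumptions, not from any particular library) of full-batch gradient descent for linear regression on raw versus standardized features. The raw features force a much smaller learning rate and far more steps to converge:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(0, 1, n)           # small-scale feature
x2 = rng.normal(0, 100, n)         # large-scale feature
X = np.column_stack([x1, x2])
y = 3 * x1 + 0.05 * x2 + rng.normal(0, 0.1, n)

def gd_steps(X, y, lr, max_steps=100_000, tol=1e-6):
    """Full-batch gradient descent on mean squared error; returns the number
    of steps taken before the update size drops below tol (or max_steps)."""
    w = np.zeros(X.shape[1])
    for step in range(1, max_steps + 1):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w_new = w - lr * grad
        if np.linalg.norm(w_new - w) < tol:
            return step
        w = w_new
    return max_steps

# Standardize the features: zero mean, unit variance per column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# The learning rate for the raw features has to be tiny to stay stable,
# which makes progress along the small-scale direction painfully slow.
print("raw features:         ", gd_steps(X, y, lr=5e-5), "steps")
print("standardized features:", gd_steps(X_std, y, lr=1e-1), "steps")
```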

Relationship to **Batch-Normalization (BN)**:

- This is the same idea behind methods such as batch-normalizing the intermediate representations of data in neural networks.
- Using BN, the **convergence speed increases**, since keeping the intermediate activations well-scaled makes it easier for the gradients to reduce the error (see the sketch below).
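
As a rough sketch of what the BN transform does at training time (assuming NumPy; the layer sizes and the gamma/beta initialization are illustrative), each unit's activations are normalized over the batch and then rescaled and shifted by learnable parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# A batch of intermediate activations: 32 examples, 64 hidden units,
# deliberately not zero-mean / unit-variance.
h = rng.normal(5.0, 3.0, size=(32, 64))

gamma = np.ones(64)   # learnable per-unit scale, initialized to 1
beta = np.zeros(64)   # learnable per-unit shift, initialized to 0
eps = 1e-5            # small constant for numerical stability

mu = h.mean(axis=0)                       # per-unit mean over the batch
var = h.var(axis=0)                       # per-unit variance over the batch
h_hat = (h - mu) / np.sqrt(var + eps)     # normalized activations
out = gamma * h_hat + beta                # what gets passed to the next layer

print(out.mean(axis=0)[:3])   # ~0 for each unit
print(out.std(axis=0)[:3])    # ~1 for each unit
```

At inference time, BN replaces the per-batch mean and variance with running averages collected during training.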

Learn more about feature scaling in Andrew Ng's video.