A very good article explaining how momentum helps gradient descent during training: https://distill.pub/2017/momentum/
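
For a quick feel of the effect before reading, here is a minimal sketch of the momentum update (z = beta * z + gradient, then w = w - alpha * z) on a toy two-dimensional quadratic. The objective, the curvature values, the step size, and the helper names (`grad`, `loss`, `descend`) are illustrative assumptions, not taken from the article.

```python
# Toy objective (an assumption for illustration): f(w) = 0.5 * sum(lam_i * w_i^2),
# with one shallow direction (lam = 1) and one steep direction (lam = 100).

def grad(w, curvatures):
    """Gradient of the quadratic: lam_i * w_i in each coordinate."""
    return [lam * wi for lam, wi in zip(curvatures, w)]

def loss(w, curvatures):
    """Value of the quadratic objective at w."""
    return 0.5 * sum(lam * wi * wi for lam, wi in zip(curvatures, w))

def descend(w0, curvatures, alpha, beta, steps):
    """Gradient descent with momentum; beta = 0 reduces to plain gradient descent."""
    w = list(w0)
    z = [0.0] * len(w)  # running accumulation of past gradients
    for _ in range(steps):
        g = grad(w, curvatures)
        z = [beta * zi + gi for zi, gi in zip(z, g)]
        w = [wi - alpha * zi for wi, zi in zip(w, z)]
    return w

curvatures = [1.0, 100.0]
w0 = [1.0, 1.0]

plain = descend(w0, curvatures, alpha=0.01, beta=0.0, steps=100)
with_momentum = descend(w0, curvatures, alpha=0.01, beta=0.9, steps=100)

print("plain GD loss:   ", loss(plain, curvatures))
print("momentum loss:   ", loss(with_momentum, curvatures))
```

With the same step size, the plain run is still crawling along the shallow direction after 100 steps, while the momentum run ends up much closer to the minimum. The specific numbers depend entirely on the assumed curvatures and hyperparameters.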