Research · MarkTechPost
Stochastic Gradient Descent's (SGD's) Frequency Bias and How Adam Fixes It
The article explains how uneven token frequencies in language-model training can bias stochastic gradient descent toward parameters linked to common tokens, while rare tokens get updated less often. It also describes how Adam can reduce this frequency bias through adaptive learning rates.
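
To make the mechanism concrete, here is a minimal sketch in plain Python. It is a toy model, not the article's own experiment. It assumes one scalar weight per token, a quadratic loss `0.5 * (w - target)**2`, and a gradient scaled by the token's frequency as a stand-in for how often that token shows up in a batch. The function name `run`, the frequencies `0.99` and `0.01`, and all hyperparameters are illustrative choices.

```python
import math


def run(optimizer, freqs, steps=200, lr=0.05, target=1.0):
    """Train one scalar weight per token toward `target`.

    Assumption: each weight's gradient is multiplied by its token
    frequency, mimicking the smaller average gradient a rare token
    receives. `optimizer` is either "sgd" or "adam".
    """
    n = len(freqs)
    w = [0.0] * n
    m = [0.0] * n  # Adam first-moment estimate per weight
    v = [0.0] * n  # Adam second-moment estimate per weight
    beta1, beta2, eps = 0.9, 0.999, 1e-8  # common Adam defaults

    for t in range(1, steps + 1):
        for i, p in enumerate(freqs):
            g = p * (w[i] - target)  # frequency-scaled gradient
            if optimizer == "sgd":
                # Plain SGD: step size is proportional to the gradient,
                # so a rare token (small p) moves slowly.
                w[i] -= lr * g
            else:
                # Adam: divide by a running RMS of the gradient, which
                # cancels the frequency factor p in this toy model.
                m[i] = beta1 * m[i] + (1 - beta1) * g
                v[i] = beta2 * v[i] + (1 - beta2) * g * g
                m_hat = m[i] / (1 - beta1 ** t)
                v_hat = v[i] / (1 - beta2 ** t)
                w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


freqs = [0.99, 0.01]  # hypothetical common vs. rare token frequencies
sgd_w = run("sgd", freqs)
adam_w = run("adam", freqs)
print("SGD  (common, rare):", sgd_w)
print("Adam (common, rare):", adam_w)
```

Under these assumptions, the SGD weight for the rare token covers only about a tenth of the distance to the target in 200 steps, while the common-token weight has essentially converged. With Adam, the two weights follow nearly the same trajectory, because the per-parameter normalization removes the frequency scaling. Real training has sparse, noisy gradients, so the correction there is partial rather than exact, which matches the summary's wording that Adam reduces the bias.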