Research · MarkTechPost

Stochastic Gradient Descent (SGD’s) Frequency Bias and How Adam Fixes It

The article explains how uneven token frequencies in language-model training can bias stochastic gradient descent toward parameters linked to common tokens, while rare tokens get updated less often. It also describes how Adam can reduce this frequency bias through adaptive learning rates.
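
The effect summarized above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the article's own experiment: the two-parameter model, the fixed 1-in-20 token schedule, the quadratic loss, and every name in the code (`token_schedule`, `run_sgd`, `run_adam`, the hyperparameter values) are hypothetical choices made for illustration. Each token owns one scalar parameter that receives a gradient only on the steps where that token appears.

```python
import math


def token_schedule(num_steps, rare_every=20):
    """Yield 0 (frequent token) or 1 (rare token) for each training step.

    Assumption: the rare token appears on every `rare_every`-th step.
    """
    for step in range(1, num_steps + 1):
        yield 1 if step % rare_every == 0 else 0


def grad(weights, token, target=1.0):
    """Gradient of 0.5 * (w[token] - target)**2; zero for the unseen token."""
    g = [0.0, 0.0]
    g[token] = weights[token] - target
    return g


def run_sgd(num_steps=200, lr=1e-3):
    """Plain SGD: one global learning rate shared by both parameters."""
    w = [0.0, 0.0]
    for token in token_schedule(num_steps):
        g = grad(w, token)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


def run_adam(num_steps=200, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Dense Adam: each parameter's step is scaled by its own gradient history."""
    w = [0.0, 0.0]
    m = [0.0, 0.0]  # first-moment (mean) estimates
    v = [0.0, 0.0]  # second-moment (uncentered variance) estimates
    for t, token in enumerate(token_schedule(num_steps), start=1):
        g = grad(w, token)
        for i in range(2):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1 - beta2 ** t)
            w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


sgd_w = run_sgd()
adam_w = run_adam()

# Both parameters start at 0 and share the same target, so the ratio
# frequent / rare measures how unevenly the two tokens have progressed.
sgd_ratio = sgd_w[0] / sgd_w[1]
adam_ratio = adam_w[0] / adam_w[1]

print(f"SGD  frequent={sgd_w[0]:.4f} rare={sgd_w[1]:.4f} ratio={sgd_ratio:.1f}")
print(f"Adam frequent={adam_w[0]:.4f} rare={adam_w[1]:.4f} ratio={adam_ratio:.1f}")
```

In this toy setting, SGD moves the frequent token's parameter far more than the rare token's, roughly in proportion to how often each appears. Adam narrows that gap because a parameter with a small accumulated second moment takes larger normalized steps, though the gap does not vanish entirely here. Real language-model training involves high-dimensional embeddings, random sampling, and interacting parameters, so the size of the effect will differ.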
