Research · The Decoder ·

Microsoft Research's Lens proves detailed captions matter more than raw scale for training efficient image generators

Microsoft Research's Lens proves detailed captions matter more than raw scale for training efficient image generators

Microsoft Research introduced Lens, a 3.8 billion parameter text-to-image model that matches much larger systems on benchmarks while using much less training compute. The team says the result comes from 800 million detailed captions generated by GPT-4.1 rather than vague web alt-text, and the code and weights are open

Read the full story at The Decoder →