Research · The Decoder · 8 June 2026

Microsoft Research's Lens proves detailed captions matter more than raw scale for training efficient image generators

Microsoft Research introduced Lens, a 3.8-billion-parameter text-to-image model that matches much larger systems on benchmarks while using far less training compute. The team attributes the result to training on 800 million detailed captions generated by GPT-4.1 rather than on vague web alt-text. The code and weights are open.
