Research · The Decoder ·
Microsoft Research's Lens proves detailed captions matter more than raw scale for training efficient image generators
Microsoft Research introduced Lens, a 3.8 billion parameter text-to-image model that matches much larger systems on benchmarks while using much less training compute. The team says the result comes from 800 million detailed captions generated by GPT-4.1 rather than vague web alt-text, and the code and weights are open