Research · MarkTechPost ·

DFlash Speculative Decoding Drafts Whole Token Blocks in Parallel for Up to 15x Higher Throughput on NVIDIA Blackwell

UC San Diego's DFlash replaces autoregressive drafting with a lightweight block diffusion model for speculative decoding, drafting whole token blocks in one forward pass via KV injection. The paper reports up to 6.08x lossless speedup on Qwen3-8B, while NVIDIA reports up to 15x throughput on Blackwell at fixed interact
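As a rough illustration of the draft-then-verify loop behind block speculative decoding, here is a minimal Python sketch. It is not DFlash's implementation: it does not model block diffusion or KV injection, and `target_next`, `draft_block`, `speculative_decode`, and `greedy_decode` are hypothetical toy functions invented for this example. It only shows why the output can stay lossless: every drafted token is checked against the target model's greedy choice, and the first mismatch is replaced by the target's own token.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption for this sketch)


def target_next(ctx: List[int]) -> int:
    """Toy stand-in for the target model's greedy next token."""
    if len(ctx) % 7 == 0:
        return 0  # occasional "surprise" the drafter cannot predict
    return (ctx[-1] + 1) % VOCAB


def draft_block(ctx: List[int], k: int) -> List[int]:
    """Toy drafter: proposes k tokens at once.

    Each position depends only on the context and its own index,
    so all k tokens could be computed in parallel.
    """
    return [(ctx[-1] + i + 1) % VOCAB for i in range(k)]


def greedy_decode(prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target_next(out))
    return out


def speculative_decode(prompt: List[int], max_new: int, k: int) -> Tuple[List[int], int]:
    """Draft k tokens, verify against the target, keep the matching prefix."""
    out = list(prompt)
    verify_passes = 0
    while len(out) - len(prompt) < max_new:
        block = draft_block(out, k)
        # A real system would score all k positions in one target forward
        # pass; this sketch counts each verification round as one pass.
        verify_passes += 1
        accepted: List[int] = []
        for tok in block:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # target's correction ends the round
                break
            accepted.append(tok)
        else:
            # Whole block accepted: the target also yields one bonus token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + max_new], verify_passes


if __name__ == "__main__":
    tokens, passes = speculative_decode([3], max_new=20, k=4)
    assert tokens == greedy_decode([3], max_new=20)
    print(f"generated 20 tokens in {passes} verification passes")
```

In this toy setup the speedup comes entirely from how many drafted tokens the target accepts per verification pass; the drafter's cost is ignored.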
