Open Source · MarkTechPost

Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving

Together AI open-sourced OSCAR, an INT2 KV cache quantization method for long-context LLM serving. The approach uses attention-aware offline covariance to derive separate rotations for keys and values, aiming to cut KV memory use and improve decode speed while limiting accuracy loss on benchmark models.
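To make the general idea concrete, here is a minimal NumPy sketch of rotating key vectors with an orthogonal matrix derived from an offline covariance estimate and then quantizing them to 2 bits. It is not OSCAR's implementation: the plain (not attention-aware) covariance, the eigenvector-based rotation, the group size of 32, the random stand-in data, and the helper names `pca_rotation`, `quantize_int2`, and `dequantize_int2` are all assumptions chosen for illustration.

```python
import numpy as np

def pca_rotation(calib):
    """Hypothetical helper: orthogonal matrix from the eigenvectors of the
    calibration covariance (plain covariance, an assumption of this sketch)."""
    cov = np.cov(calib, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)
    return eigvecs  # columns are orthonormal

def quantize_int2(x, group_size=32):
    """Asymmetric 2-bit quantization (4 levels) per group of `group_size` values."""
    g = x.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    # Three steps span the four levels; guard against constant groups.
    scale = np.where(hi > lo, (hi - lo) / 3.0, 1.0)
    codes = np.clip(np.round((g - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def dequantize_int2(codes, scale, lo, shape):
    """Map 2-bit codes back to floats using the per-group scale and offset."""
    return (codes * scale + lo).reshape(shape)

rng = np.random.default_rng(0)
head_dim = 64
calib_keys = rng.standard_normal((1024, head_dim))  # stand-in calibration data
keys = rng.standard_normal((16, head_dim))          # stand-in cached keys

R = pca_rotation(calib_keys)
codes, scale, lo = quantize_int2(keys @ R)          # rotate, then quantize
keys_hat = dequantize_int2(codes, scale, lo, keys.shape) @ R.T  # undo rotation
```

Because the rotation is orthogonal, it can be undone exactly after dequantization, so the only error comes from the 2-bit rounding step. In this sketch a second rotation would be fit the same way for values, mirroring the separate key and value rotations the summary describes.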
