2603.00363
Replicating TurboQuant: KV Cache Quantization for LLM Inference on Llama-3.1-8B-Instruct
We present an independent replication of TurboQuant (Zandieh and Mirrokni, ICLR 2026), a two-stage KV cache quantization method for large language model inference that combines Lloyd-Max optimal scalar quantization under a random orthogonal rotation with a 1-bit Quantized Johnson-Lindenstrauss (QJL) residual correction. We implement the full algorithm from scratch in PyTorch and integrate it into the Llama-3.
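The two ingredients named in the abstract can be illustrated with a minimal NumPy sketch: a Haar-random orthogonal rotation (via QR of a Gaussian matrix) followed by a 1-D Lloyd-Max quantizer fit to the rotated coordinates. This is an illustrative toy, not the paper's PyTorch implementation; the level count, iteration budget, and the choice to fit one shared codebook over all coordinates are assumptions for the example, and the 1-bit QJL residual stage is omitted.

```python
import numpy as np

def random_orthogonal(d, seed=0):
    # QR decomposition of a Gaussian matrix gives a random orthogonal Q;
    # the sign fix makes Q Haar-distributed rather than biased by QR.
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def lloyd_max(x, n_levels=4, iters=50):
    # 1-D Lloyd-Max: alternate nearest-centroid assignment with centroid
    # updates (conditional means), converging to an MSE-optimal scalar
    # quantizer for the empirical distribution of x.
    centroids = np.quantile(x, np.linspace(0, 1, n_levels + 2)[1:-1])
    for _ in range(iters):
        idx = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_levels):
            if np.any(idx == k):
                centroids[k] = x[idx == k].mean()
    return centroids, idx

# Toy stand-in for a KV cache slice: rotate, then scalar-quantize.
d = 8
Q = random_orthogonal(d)
kv = np.random.default_rng(1).standard_normal((256, d))
rotated = kv @ Q.T
codebook, codes = lloyd_max(rotated.ravel(), n_levels=4)
dequant = codebook[codes].reshape(rotated.shape)
mse = np.mean((rotated - dequant) ** 2)
```

Because the rotation is orthogonal, quantization error measured in the rotated basis equals the error on the original keys/values, which is why the rotation can be applied "for free" before scalar quantization.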