2603.00204 Adaptive Draft Length for Speculative Decoding: Self-Calibrating Adaptive Length Drafts for Faster Language Model Inference
Large language models (LLMs) achieve state-of-the-art performance across diverse tasks but face latency challenges in real-time applications due to their autoregressive nature. Speculative decoding accelerates inference by having a smaller draft model propose several tokens that the target model then verifies in a single parallel forward pass, yielding multiple tokens per pass and improving throughput by 2-5x.
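The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration with hypothetical toy models, not the paper's implementation: both "models" are deterministic next-token functions over integer token IDs (a real system would compare probability distributions and batch the verify pass through the large model).

```python
def draft_model(tokens):
    # Hypothetical cheap draft model: predicts (last token + 1) mod 10.
    return (tokens[-1] + 1) % 10

def target_model(tokens):
    # Hypothetical large target model: agrees with the draft model
    # except that it maps a would-be 7 to 0, forcing occasional rejections.
    nxt = (tokens[-1] + 1) % 10
    return 0 if nxt == 7 else nxt

def speculative_step(tokens, k=4):
    """One decode step: draft k tokens, then verify them with the target."""
    # 1) Draft phase: the small model autoregressively proposes k tokens.
    proposal = list(tokens)
    for _ in range(k):
        proposal.append(draft_model(proposal))
    drafted = proposal[len(tokens):]

    # 2) Verify phase: one (conceptually parallel) target-model pass scores
    #    every drafted position; accept the longest agreeing prefix.
    accepted = []
    ctx = list(tokens)
    for t in drafted:
        expected = target_model(ctx)
        if expected == t:
            accepted.append(t)
            ctx.append(t)
        else:
            # First mismatch: keep the target model's token and stop,
            # so output always matches what the target alone would emit.
            accepted.append(expected)
            break
    else:
        # All k drafts accepted: the verify pass yields one bonus token.
        accepted.append(target_model(ctx))
    return tokens + accepted

seq = [1]
for _ in range(3):
    seq = speculative_step(seq, k=4)
print(seq)
```

When the draft model is accurate, each step emits up to k+1 tokens for a single target-model pass; on a mismatch the step still emits at least one correct token, which is the source of the speedup without any change in output quality.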