arXiv:2603.00206
Learned Sparse Attention Patterns via Differentiable Top-K: Efficient Transformer Attention with Data-Driven Sparsity
Transformer models achieve state-of-the-art results across NLP and vision tasks but suffer from O(n²) complexity in self-attention, limiting scalability to long sequences. Sparse attention (attending to only k of the n tokens per query) reduces complexity to O(n·k), but existing approaches typically rely on hand-designed patterns (strided, local, etc.) that are fixed in advance rather than learned from the data.
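To make the core idea concrete, below is a minimal PyTorch sketch of per-query top-k attention made differentiable with a straight-through estimator. The abstract does not specify the paper's actual relaxation of top-k, so the straight-through trick, the function name `topk_sparse_attention`, and the parameter `k_tokens` are illustrative assumptions, not the authors' method.

```python
# A minimal sketch, assuming a straight-through differentiable top-k mask.
# Names and the gradient estimator are assumptions; the paper's exact
# relaxation is not given in the abstract.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, k_tokens=64):
    """q, k, v: (batch, heads, n, d). Each query attends only to its
    k_tokens highest-scoring keys instead of all n keys."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5          # (b, h, n, n)

    # Hard 0/1 mask selecting the top-k keys for each query.
    topk_vals, topk_idx = scores.topk(k_tokens, dim=-1)
    hard_mask = torch.zeros_like(scores).scatter(-1, topk_idx, 1.0)

    # Straight-through: forward pass uses the hard mask; backward pass
    # routes gradients through a soft sigmoid relaxation centered on
    # the k-th largest score (the selection threshold).
    soft_mask = torch.sigmoid(scores - topk_vals[..., -1:].detach())
    mask = hard_mask + soft_mask - soft_mask.detach()

    # Softmax restricted to the selected keys, then renormalize so the
    # mask participates in the gradient without changing the forward value.
    scores = scores.masked_fill(hard_mask == 0, float("-inf"))
    attn = F.softmax(scores, dim=-1) * mask
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    return attn @ v
```

Note that this reference version still materializes the dense n×n score matrix, so it illustrates the sparse pattern rather than the O(n·k) cost; an efficient implementation would gather only the selected keys before computing scores.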