2603.00198 Entropy-Guided Dynamic Layer Pruning for Inference-Time Efficient Transformers
A novel approach that uses attention entropy to dynamically skip transformer layers during inference, achieving a 3.1x speedup.
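To make the gating mechanism concrete, here is a minimal PyTorch sketch of entropy-guided layer skipping. The specific rule (skip the next layer when the current layer's mean attention entropy falls below a threshold `tau`), the threshold value, and the names `attention_entropy` and `forward_with_skipping` are illustrative assumptions, not the paper's exact criterion.

```python
import torch

def attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Mean Shannon entropy of the attention rows.

    attn: (batch, heads, queries, keys); each row sums to 1.
    """
    eps = 1e-9  # avoid log(0) on exactly-zero attention weights
    return -(attn * (attn + eps).log()).sum(dim=-1).mean()

def forward_with_skipping(layers, x: torch.Tensor, tau: float = 2.0):
    """Run transformer layers, skipping the next layer whenever the
    current layer's attention entropy drops below `tau`.

    Assumption: a peaked (low-entropy) attention pattern is read as a
    signal that further token mixing adds little; each layer is assumed
    to return (hidden_states, attention_weights).
    """
    skip_next = False
    for layer in layers:
        if skip_next:
            skip_next = False
            continue  # residual stream passes through unchanged
        x, attn = layer(x)
        skip_next = bool(attention_entropy(attn) < tau)
    return x
```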
We present SparseWorldMed, a clinical episode world model that replaces O(N²) full attention with data-dependent TopK sparse attention, reducing complexity to O(NK). Clinical timelines are inherently sparse: patients remain stable for extended periods, punctuated by rapid deterioration events that require context spanning distant time steps.
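A minimal sketch of per-query TopK sparse attention follows, assuming standard (batch, heads, seq, dim) tensors. For clarity it still forms the dense N x N score matrix; a faithful O(NK) implementation would score and gather only the K selected keys per query. The function name `topk_sparse_attention` and the default `top_k=32` are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k: int = 32):
    """Each query attends only to its top_k highest-scoring keys.

    q, k, v: (batch, heads, seq_len, head_dim).
    Note: this sketch computes the full score matrix for clarity; the
    O(NK) cost claimed by the paper requires restricting scoring itself
    to K data-dependent candidates per query.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5       # (b, h, N, N)
    top_k = min(top_k, scores.size(-1))
    top_vals, top_idx = scores.topk(top_k, dim=-1)  # keep K keys per query
    masked = torch.full_like(scores, float("-inf"))
    masked.scatter_(-1, top_idx, top_vals)          # -inf off the TopK set
    probs = F.softmax(masked, dim=-1)               # exact zeros off-TopK
    return probs @ v                                # (b, h, N, head_dim)

# Usage: 2 episodes, 4 heads, 512 time steps, 64-dim heads (hypothetical)
q = torch.randn(2, 4, 512, 64)
out = topk_sparse_attention(q, q.clone(), q.clone(), top_k=32)
```

Because the -inf mask zeroes all non-selected keys after the softmax, each query's output mixes exactly K value vectors, which is the sparsity pattern the O(NK) claim relies on.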