2604.00597 Distilling Bidirectional Embedding Teachers into Streaming-Compatible Causal Students
Text embedding applications increasingly require real-time streaming updates—from conversational agents to recommendation systems processing continuous user interactions. While bidirectional attention models achieve superior embedding quality, they break key-value cache compatibility, requiring full sequence recomputation for each update.
2604.00595 Anisotropic Spectral Error Dressing for Calibrated Ensemble Weather Forecasts
Data-driven weather models achieve remarkable deterministic skill but lack native uncertainty quantification. Existing post-processing methods that convert deterministic forecasts into probabilistic ensembles typically assume isotropic error structure, ignoring directional patterns in forecast errors.
2604.00594 Time-Varying Mutual Information Decoding for Mitigating Visual Forgetting in Vision-Language Models
Long chain-of-thought (CoT) reasoning has substantially improved vision-language model (VLM) performance on complex visual tasks. However, extended generation causes visual forgetting, where models progressively lose dependence on image content and increasingly rely on language priors, leading to hallucinations.
2604.00593 Definition Unit Tests Improve LLM Convention Adherence
Large language models often know multiple valid conventions for mathematical notation but default to the wrong one when a specific convention is required. We introduce Definition Unit Tests (DUT), a prompting method that improves convention adherence by prepending discriminative checks—simple verification questions that test whether the model correctly interprets the specified convention—before the main problem.
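The prompt-assembly step described above can be sketched as follows. The template wording, function name, and example checks are my assumptions; the paper's exact prompt format may differ.

```python
def build_dut_prompt(convention, checks, problem):
    """Assemble a DUT-style prompt: discriminative convention checks
    (definition unit tests) are prepended before the main problem."""
    lines = [f"Convention in force: {convention}", ""]
    lines.append("Definition unit tests (answer before solving):")
    lines += [f"  {i}. {q}" for i, q in enumerate(checks, 1)]
    lines += ["", "Main problem:", problem]
    return "\n".join(lines)

prompt = build_dut_prompt(
    "log denotes the natural logarithm",
    ["Under this convention, what is log(e)?"],
    "Compute log(e^3) + 2.",
)
```

The checks are discriminative in the sense that a model misinterpreting the convention (e.g. reading log as base 10) would answer them incorrectly, surfacing the misreading before it contaminates the main solution.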
2604.00592 Syntax Constraints Are Not Enough: Semantic Errors Dominate Diffusion LM Tool-Calling Failures
Diffusion language models have emerged as a promising alternative to autoregressive generation, yet they significantly underperform on structured output tasks such as tool calling. A common hypothesis attributes this gap to formatting failures that could be addressed through constrained decoding.
2604.00591 Deep-Layer Attention Pruning for Vision-Language Models
Visual token pruning is essential for efficient vision-language model inference, yet existing attention-based methods either fail catastrophically on spatially sensitive tasks or require offline calibration data. We present a simple solution: use attention from deeper layers.
2604.00590 FCBoost: Static Frequency-Aware Channel Selection for 2-Bit KV Cache Quantization
KV cache quantization enables long-context inference in large language models but degrades accuracy at aggressive 2-bit precision. Recent methods like Kitty recover accuracy by dynamically boosting outlier channels to higher precision, but this requires per-page magnitude computation and metadata overhead.
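The contrast between dynamic boosting and a static boosted-channel set can be sketched as mixed-precision per-channel quantization. The static-selection criterion itself (FCBoost's frequency-aware scoring) is not reproduced here; the function name and min-max quantizer are my assumptions.

```python
import numpy as np

def quantize_kv_with_boosted_channels(K, boost_idx, bits=2):
    """Mixed-precision per-channel quantization: most channels are rounded
    to 2-bit levels, while a statically chosen set of boosted (outlier)
    channels is kept at full precision, avoiding per-page recomputation."""
    out = np.empty_like(K)
    levels = 2 ** bits
    for c in range(K.shape[1]):
        col = K[:, c]
        if c in boost_idx:
            out[:, c] = col                       # boosted: full precision
            continue
        lo, hi = float(col.min()), float(col.max())
        scale = (hi - lo) / (levels - 1) or 1.0   # avoid zero scale
        q = np.round((col - lo) / scale)          # 2-bit codes in [0, 3]
        out[:, c] = q * scale + lo                # dequantized values
    return out
```

Because the boosted set is fixed ahead of time, no per-page magnitude statistics or per-page metadata are needed at inference, which is the overhead the abstract attributes to dynamic schemes like Kitty.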
2604.00589 Custom Forward-Backward VJPs for DFA-Guided Diffusion Language Models: An Empirical Study
DFA-guided diffusion language models enable constrained text generation by steering denoising with gradients of DFA acceptance probability. However, the DFA dynamic programming computation accounts for 57–59% of each guided step, creating a significant bottleneck.
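The dynamic program identified as the bottleneck can be sketched as a forward pass over DFA states. The factorized per-position token distribution and all names below are my assumptions, not the paper's implementation.

```python
def dfa_acceptance_prob(token_dists, delta, start, accepting):
    """Forward DP over DFA states: state_prob[s] is the probability the
    automaton is in state s after consuming t sampled tokens, assuming an
    independent token distribution per position."""
    state_prob = {start: 1.0}
    for dist in token_dists:
        nxt = {}
        for s, p in state_prob.items():
            for tok, q in dist.items():
                s2 = delta[(s, tok)]
                nxt[s2] = nxt.get(s2, 0.0) + p * q
        state_prob = nxt
    return sum(p for s, p in state_prob.items() if s in accepting)

# Toy DFA accepting strings that end in 'a' (state 1 = "last token was a").
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}
dists = [{"a": 0.5, "b": 0.5}, {"a": 0.7, "b": 0.3}]
```

Each guided denoising step repeats a pass like this (and differentiates through it), which is why the abstract reports it dominating 57–59% of per-step cost and motivates custom forward-backward VJPs.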
2604.00588 TemplateLeak: A Template-Disjoint Evaluation Audit of CommonForms Form Field Detection
Template overlap between training and test splits is a persistent concern in document understanding benchmarks, as models may memorize specific form layouts rather than learning generalizable detection capabilities. We present TemplateLeak, an audit framework that uses MinHash/LSH clustering to identify template overlap and applies document-level permutation testing to assess statistical significance.
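The MinHash side of the audit can be sketched as below; the LSH banding step is omitted, and character 4-gram shingling with md5 hashing is my choice, not necessarily TemplateLeak's.

```python
import hashlib

def char_shingles(text, k=4):
    """Character k-gram shingle set (k = 4 is an arbitrary choice here)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingles, num_hashes=64):
    """One signature component per seed: the minimum hash of any shingle.
    The fraction of equal components across two documents estimates their
    Jaccard similarity, flagging near-duplicate templates cheaply."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Clustering documents whose estimated similarity exceeds a threshold, and then checking whether train/test splits respect cluster boundaries, is the overlap-detection pattern the abstract describes.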
2604.00587 Budget-Distilled ES-SSM: Cross-Budget Knowledge Distillation for Elastic Spectral State Space Models
Elastic Spectral State Space Models (ES-SSM) enable runtime budget adaptation through ordered spectral truncation, allowing a single model to operate at any spectral budget K by using only the first K channels. However, ES-SSM suffers from severe accuracy degradation at low budgets, limiting practical deployment.
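Ordered spectral truncation as described above can be illustrated with a toy rank-K projection; this is a sketch of the truncation mechanism only, not the ES-SSM layer itself, and all names are mine.

```python
import numpy as np

def truncated_spectral_output(x, filters, K):
    """Ordered spectral truncation: use only the first K of C ordered
    channel filters, projecting the input onto them and reconstructing
    from the K channel responses. K = C recovers the full model."""
    used = filters[:K]        # (K, d) rows, ordered by importance
    coeffs = used @ x         # K channel responses
    return used.T @ coeffs    # reconstruction from the first K channels

filters = np.eye(4)           # orthonormal "spectral" channels for the demo
x = np.array([1.0, 2.0, 3.0, 4.0])
```

The accuracy gap the abstract targets is visible even in this toy: small K discards the trailing channels entirely, which is what cross-budget distillation is meant to compensate for.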
2604.00586 Counterfactual Gate Supervision Does Not Fix Gating Credit Assignment in Engram-Style Conditional Memory
Engram-style conditional memory augments transformers with hash-indexed n-gram embeddings and learned gating, but prior work has identified a critical training pathology: gates become systematically miscalibrated, preferring high-frequency “hot” memory slots even when low-frequency “cold” positions achieve lower loss. We propose Counterfactual Gate Supervision (CGS), which computes per-token counterfactual loss differences under forced gate settings and uses this signal to supervise gate activations via an auxiliary loss.
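The counterfactual signal CGS computes can be sketched as below. The sigmoid squashing into soft targets is my choice for illustration; the paper's auxiliary-loss form may differ.

```python
import math

def counterfactual_gate_targets(loss_gate_on, loss_gate_off):
    """CGS-style supervision signal: for each token, compare the loss with
    the memory gate forced open vs. forced closed. The soft target exceeds
    0.5 exactly when opening the gate lowers the loss for that token."""
    targets = []
    for on, off in zip(loss_gate_on, loss_gate_off):
        delta = off - on                      # > 0: opening the gate helps
        targets.append(1.0 / (1.0 + math.exp(-delta)))
    return targets
```

An auxiliary loss would then push the learned gate activation toward these targets, which is precisely the supervision whose effectiveness the paper's negative result calls into question.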
2604.00585 Delta-Prefill Switching: Adaptive Routing for Speculative Decoding in Multi-Turn LLM Serving
Multi-turn LLM applications with prefix caching are increasingly common in production deployments. Speculative decoding accelerates inference by using a draft model to propose tokens verified in parallel, but its serialization requirement creates a severe bottleneck under concurrent multi-tenant load.
2604.00584 Innovation Saturation Does Not Robustify Kalman-Filtered Importance Ratios in LLM Reinforcement Learning
Kalman Policy Optimization (KPO) applies causal Kalman filtering to smooth importance sampling ratios in LLM reinforcement learning, but its performance is sensitive to the process-to-measurement noise ratio Q/V: weak smoothing (large Q/V) degrades accuracy by 11.79 percentage points on MATH-500.
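The causal smoothing KPO applies can be sketched as a scalar Kalman filter over the ratio sequence. Unit measurement noise V = 1 and the initialization below are my assumptions.

```python
def kalman_smooth_ratios(ratios, q_over_v):
    """Causal scalar Kalman filter over importance-sampling ratios.
    Large Q/V (weak smoothing) tracks the raw ratios closely; small Q/V
    (strong smoothing) trusts the running state estimate instead."""
    V = 1.0
    Q = q_over_v * V
    x, P = ratios[0], 1.0          # initial state estimate and variance
    smoothed = [x]
    for z in ratios[1:]:
        P = P + Q                  # predict: state variance grows by Q
        K = P / (P + V)            # Kalman gain
        x = x + K * (z - x)        # update toward the observed ratio
        P = (1.0 - K) * P
        smoothed.append(x)
    return smoothed
```

The sensitivity the abstract reports corresponds to the gain K: as Q/V grows, K approaches 1 and the filter degenerates to passing raw ratios through, i.e. no smoothing at all.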
2604.00582 Evidence-Grounded Constraint Schemas Do Not Improve Medical LLM Guardrails on LiveMedBench
Medical LLMs must respect patient-specific constraints—allergies, drug interactions, pregnancy status—to provide safe advice. We evaluate evidence-grounded constraint schemas as guardrails, comparing structured JSON schema extraction against plain-text checklist extraction and a single-pass baseline.
2604.00581 Answerability-Gain Rewards for Evidence-Label-Free GRU-Mem Gating: An Empirical Investigation
Recurrent memory agents process long documents efficiently by maintaining compact textual memory states, with GRU-style gating mechanisms controlling memory updates and early exit decisions. However, training these gates typically requires expensive evidence-position labels that are unavailable for realistic long-context QA datasets.
2604.00580 RefSwap: Counterfactual Reference-Swap Verification for Robust LLM Verifiers
Reference-based verifiers are critical components of reinforcement learning with verifiable rewards (RLVR), providing reward signals by comparing model responses against ground-truth answers. However, these verifiers are vulnerable to “master-key” attacks—trivial responses like single tokens or short phrases that achieve 25–29% false positive rates without containing any actual answer.
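The counterfactual check suggested by the title can be sketched as follows; the decision rule and the toy base verifier are my assumptions, meant only to show how swapped references expose master-key responses.

```python
def refswap_accept(verify, response, true_ref, swapped_refs):
    """Counterfactual reference-swap check: a genuine answer should match
    its own reference but not unrelated swapped-in references. A response
    that also passes under swapped references behaves like a master key
    and is rejected. `verify(response, ref) -> bool` is any base verifier."""
    if not verify(response, true_ref):
        return False                       # fails even its own reference
    return not any(verify(response, r) for r in swapped_refs)

# Toy lenient base verifier (an assumption for the demo): substring match,
# plus a deliberately buggy rule that very short responses always "pass".
verify = lambda resp, ref: ref in resp or len(resp) < 4

genuine = refswap_accept(verify, "The answer is 42", "42", ["17", "99"])
master_key = refswap_accept(verify, "ok", "42", ["17", "99"])
```

The master-key response passes the base verifier against every reference, so the swap test rejects it, while the genuine answer is reference-specific and survives.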
2604.00579 Risk-Controlled Early Exit for Diffusion Language Models
Diffusion language models (DLLMs) enable parallel text generation but require hundreds of diffusion steps, making inference slow. Early exit strategies can reduce computation by terminating tokens when predictions stabilize, but existing methods use fixed thresholds without formal quality guarantees.
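The gap between fixed thresholds and a risk-controlled one can be sketched with a simple calibration routine; this is a simplified sketch in the spirit of distribution-free risk control, not the paper's exact procedure, and all names are mine.

```python
def calibrate_exit_threshold(confidences, errors, alpha):
    """Risk-controlled early-exit threshold: the smallest confidence
    threshold t such that calibration tokens exiting early (confidence
    >= t) have empirical error rate at most alpha. Lower t lets more
    tokens exit early, so the smallest admissible t saves the most
    diffusion steps within the risk budget."""
    for t in sorted(set(confidences)):
        exited = [e for c, e in zip(confidences, errors) if c >= t]
        if exited and sum(exited) / len(exited) <= alpha:
            return t
    return float("inf")                    # no threshold meets the budget
```

A fixed threshold, by contrast, offers no such guarantee: the same value can be far too aggressive on one task distribution and needlessly conservative on another.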
2604.00578 The Repetition Advantage in Long-CoT SFT is a Termination Effect
Recent work shows that in long chain-of-thought (CoT) supervised fine-tuning (SFT), training for many epochs on a small dataset substantially outperforms single-epoch training on a larger dataset—a counterintuitive “repetition advantage.” We investigate whether this advantage reflects improved reasoning or merely better output termination behavior.
2604.00577 Copy-Then-Inpaint: Improving Temporal Consistency in Multi-Step GUI Generation via Selective Region Editing
Multi-step GUI trajectory generation is essential for training autonomous GUI agents, but current generative models suffer from temporal drift—visual inconsistencies that compound across steps. Existing approaches regenerate entire frames at each step, ignoring that most GUI actions only modify small regions.