Classical information-theoretic generalization bounds based on mutual information between the training set and the learned hypothesis are notoriously loose, often exceeding trivial bounds by orders of magnitude. We show that replacing mutual information I(S;W) with conditional mutual information I(W;Z_i|Z_{-i}), the information the hypothesis retains about each individual training example given the rest, tightens bounds by three orders of magnitude on standard benchmarks.
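For reference, the conditional mutual information named above can be written out using its standard textbook definition. This is a sketch of the quantity only, not the paper's bound. It assumes S = (Z_1, ..., Z_n) is the training set, Z_{-i} denotes all examples except Z_i, and that the densities and entropies involved are well defined.

```latex
% Standard definition of conditional mutual information (notation assumed, see lead-in)
I(W; Z_i \mid Z_{-i})
  = \mathbb{E}\!\left[ \log \frac{p(W, Z_i \mid Z_{-i})}{p(W \mid Z_{-i})\, p(Z_i \mid Z_{-i})} \right]
  = H(W \mid Z_{-i}) - H(W \mid S),
\qquad S = (Z_1, \dots, Z_n).
```

The second form reads as the reduction in uncertainty about W when the single example Z_i is revealed in addition to the rest of the training set.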
We analyze sparse attention patterns in autoregressive language models across 8 architectures ranging from 125M to 70B parameters. Using a novel attention topology metric based on persistent homology, we discover that attention heads in layers 12 and beyond converge to masks that align with document structure elements (paragraphs, sections, lists) with 0.
This paper investigates why synthetic control methods fail when pre-treatment fit is below r² = 0.85 and presents a placebo-based calibration.
Diffusion models have achieved state-of-the-art image generation quality as measured by FID and IS scores. However, we demonstrate that these metrics mask a critical failure mode: anatomically implausible human hands.
Continual learning methods are universally evaluated under a discrete task-boundary assumption, where distribution shifts occur instantaneously between clearly delineated tasks. We argue this assumption is ecologically invalid and demonstrate that five leading continual learning methods (EWC, SI, PackNet, ER, DER++) fail catastrophically when task boundaries are gradual.
We empirically characterize how inference-time compute scales with task performance for agentic AI workloads. Across 14 agentic benchmarks spanning web navigation, code generation with tool use, and multi-step reasoning, we find that performance follows a power law with exponent 0.
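A power-law relationship of this kind is commonly estimated by ordinary least squares in log-log space. The following is a minimal sketch of that generic technique, not the authors' procedure. The function name `fit_power_law` is hypothetical, and the data are synthetic values generated from an arbitrary exponent chosen only for illustration, not the exponent reported in the study.

```python
import math


def fit_power_law(compute, performance):
    """Fit performance ~ prefactor * compute**exponent via least squares on logs.

    Hypothetical helper for illustration; inputs must be strictly positive.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(p) for p in performance]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope of the regression line in log-log space is the power-law exponent.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx
    # Intercept in log space maps back to the multiplicative prefactor.
    prefactor = math.exp(mean_y - exponent * mean_x)
    return prefactor, exponent


# Synthetic, noise-free data from an assumed law: performance = 2 * compute**0.5.
# These numbers are illustrative only and are not taken from the study.
compute = [1.0, 2.0, 4.0, 8.0, 16.0]
performance = [2.0 * c ** 0.5 for c in compute]

prefactor_hat, exponent_hat = fit_power_law(compute, performance)
print(f"prefactor ~ {prefactor_hat:.3f}, exponent ~ {exponent_hat:.3f}")
```

With real benchmark measurements the fit would be noisy, so a confidence interval on the exponent (for example from bootstrap resampling) would be a sensible addition.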
This paper investigates the relationship between morphology and pretraining through controlled experiments on 23 diverse datasets totaling 26,178 samples. We propose a novel methodology that achieves 9.
This study presents a comprehensive quantitative analysis of blocking events and their relationship to subseasonal prediction, drawing on multiple decades of observational data and high-resolution numerical simulations. We develop a novel statistical framework combining wavelet decomposition, Granger causality testing, and bootstrapped trend analysis to establish robust quantitative findings.
We present a systematic empirical study examining vision transformers across 16 benchmarks and 36,025 evaluation instances. Our analysis reveals that attention plays a more critical role than previously recognized, achieving 0.
We conduct the largest study to date on supply chains, analyzing 27,437 instances across 18 datasets spanning multiple domains. Our key finding is that ML security accounts for 25.
We conduct the largest study to date on genetic programming, analyzing 20,335 instances across 22 datasets spanning multiple domains. Our key finding is that symbolic regression accounts for 32.
This paper investigates the relationship between intrinsic motivation and exploration through controlled experiments on 26 diverse datasets totaling 10,885 samples. We propose a novel methodology that achieves 31.
We present a systematic empirical study examining gradient dynamics across 26 benchmarks and 46,591 evaluation instances. Our analysis reveals that phase transitions play a more critical role than previously recognized, achieving 0.
This study presents a comprehensive quantitative analysis of volcanic eruptions and their relationship to repose intervals, drawing on multiple decades of observational data and high-resolution numerical simulations. We develop a novel statistical framework combining wavelet decomposition, Granger causality testing, and bootstrapped trend analysis to establish robust quantitative findings.
This paper investigates the relationship between curriculum learning and data geometry through controlled experiments on 12 diverse datasets totaling 46,152 samples. We propose a novel methodology that achieves 29.
This study presents a comprehensive quantitative analysis of Arctic amplification and its relationship to the jet stream, drawing on multiple decades of observational data and high-resolution numerical simulations. We develop a novel statistical framework combining wavelet decomposition, Granger causality testing, and bootstrapped trend analysis to establish robust quantitative findings.
This study presents a comprehensive quantitative analysis of ocean deoxygenation and its relationship to deep ocean oxygen, drawing on multiple decades of observational data and high-resolution numerical simulations. We develop a novel statistical framework combining wavelet decomposition, Granger causality testing, and bootstrapped trend analysis to establish robust quantitative findings.
We conduct the largest study to date on data pruning, analyzing 48,128 instances across 23 datasets spanning multiple domains. Our key finding is that influence functions account for 32.
This study presents a comprehensive quantitative analysis of Saharan dust and its relationship to Amazon phosphorus, drawing on multiple decades of observational data and high-resolution numerical simulations. We develop a novel statistical framework combining wavelet decomposition, Granger causality testing, and bootstrapped trend analysis to establish robust quantitative findings.
This paper investigates the relationship between spot instances and preemption through controlled experiments on 19 diverse datasets totaling 20,748 samples. We propose a novel methodology that achieves 22.