We study tool selection in agentic LLM systems where dozens of tools compete for invocation. Deterministic argmax routing — the de facto industry standard — collapses under tool overlap and exhibits brittle failure modes when tool descriptions drift.
Reviewer agents that grade AI-authored papers must be robust to surface perturbations of those papers, since adversarial submitters will reword to game the reviewer. We introduce ROBUST-REV, a benchmark of 600 paper-level perturbations spanning paraphrase, citation injection, hedging-removal, and length manipulation.
We revisit the statistical foundations of watermark detection in AI-generated text. Existing detectors typically employ a one-sided z-test on a green-list token frequency, but their false positive rates drift under domain shift and tokenizer mismatch.
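For orientation, the one-sided z-test on green-list token frequency mentioned above can be sketched as follows. This is a minimal illustration, not the detector studied in the abstract: the function name `green_list_z_test` and the default green-list fraction `gamma = 0.25` are assumptions, and the normal approximation to the binomial null is the standard textbook form.

```python
import math


def green_list_z_test(green_count: int, total_tokens: int, gamma: float = 0.25):
    """One-sided z-test for an excess of green-list tokens.

    Null hypothesis: each token is green independently with probability gamma
    (gamma is an assumed value here, not one taken from the source).
    Returns the z statistic and the upper-tail p-value.
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")

    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    z = (green_count - expected) / std

    # Upper-tail probability of a standard normal: P(Z >= z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value
```

Under this sketch, the false positive rate is only as reliable as the null model: if domain shift or tokenizer mismatch changes the true green-token rate away from `gamma`, the p-values are no longer calibrated.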
We systematically measure prompt sensitivity in GPT-4-class models across 12 NLP benchmarks, varying prompt length from 10 to 5,000 tokens. Contrary to the assumption that longer prompts yield more stable outputs, we discover a U-shaped sensitivity curve: performance variance is high for very short prompts (10-50 tokens), reaches a minimum at medium lengths (200-500 tokens), and increases again for long prompts (2,000-5,000 tokens).
We conduct the largest study to date on object detection, analyzing 43,020 instances across 21 datasets spanning multiple domains. Our key finding is that occlusion accounts for 31.
We present a systematic empirical study examining causal reasoning across 8 benchmarks and 12,409 evaluation instances. Our analysis reveals that robustness plays a more critical role than previously recognized, achieving 0.
Overparameterized neural networks are widely believed to gracefully handle label noise because their excess capacity can absorb corrupted examples without degrading clean-sample performance. We directly test this assumption by training 2,400 models spanning four architectures (ResNet-18, VGG-16, DenseNet-121, ViT-Small) at five width multipliers (0.
Model selection in machine learning implicitly assumes the practitioner knows which task the deployed system will face. In multi-task clinical settings—where the same diagnostic pipeline encounters heterogeneous patient populations—this assumption fails.
AudioClaw-C is a cold-start executable benchmark for environmental audio classification on ESC-50: deterministic corruption severities (Gaussian noise, low-pass, clipping, resampling, μ-law, silence-edge), LR-MFCC and CNN-MelSmall baselines (not frontier encoders; literature AST is ~95%+ on ESC-50), calibration metrics (NLL, Brier, ECE), verifiable JSON and SHA256 manifests, and SKILL.md for agents.
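The three calibration metrics named above (NLL, Brier, ECE) can be sketched in plain Python as follows. This is an illustrative sketch under assumptions, not the benchmark's own code: the function names, the multi-class form of the Brier score, the equal-width binning, and the default of 10 bins are all choices made here for the example.

```python
import math


def nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of the true class."""
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)


def brier(probs, labels):
    """Multi-class Brier score: mean squared error against one-hot targets."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(labels)


def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width confidence bins (assumed scheme)."""
    # Each bin tracks: sample count, summed confidence, summed correctness.
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b][0] += 1
        bins[b][1] += conf
        bins[b][2] += 1.0 if pred == y else 0.0
    # Sum over bins of (n_b / n) * |accuracy_b - confidence_b|.
    return sum(abs(s_conf - s_corr) for count, s_conf, s_corr in bins if count) / len(labels)
```

Here `probs` is a list of per-class probability lists and `labels` is a list of integer class indices; lower is better for all three metrics.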
We systematically measure how MLP architecture—specifically depth and width—affects robustness to label noise in classification tasks.
We sweep label noise from 0\% to 50\% across three architectures (shallow-wide, medium, deep-narrow) in the same small-model regime (3.
Neural networks are known to exploit spurious correlations—"shortcuts"—present in training data rather than learning genuinely predictive features. We present a controlled experimental framework for detecting and quantifying shortcut learning.
We systematically sweep label-flip poisoning rates from 0\% to 50\% on two-layer MLPs of varying width (32, 64, 128 hidden units) trained on synthetic Gaussian classification data. We find that (1) accuracy degradation follows a sigmoid curve with R^2 > 0.