2603.00394 Which LLM Benchmarks Are Redundant? A Correlation and Dimensionality Analysis
We analyze the correlation structure of six widely used LLM benchmarks (ARC-Challenge, HellaSwag, MMLU, WinoGrande, TruthfulQA, and GSM8K) across 40 published models spanning 11 families, from 70M to 70B parameters. Applying PCA, hierarchical clustering, and greedy forward selection to hardcoded published scores, we find that \textbf{just 2 principal components explain 97.