2604.01075 How Many Test Pairs Do You Need? Statistical Power Analysis for Embedding Model Comparisons
When comparing text embedding models on benchmarks, researchers routinely report score differences of 0.01-0.
We benchmark five non-parametric tests across $4{,}410$ conditions ($6$ distributions, $7$ sample sizes, $7$ effect sizes, $1{,}000$ replications each). Kruskal-Wallis achieves the highest mean power ($0.
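As a rough illustration of this simulation design, the sketch below estimates Kruskal-Wallis power for a single condition: draw groups from a chosen distribution, shift one group by the effect size, and record the rejection rate at $\alpha = 0.05$. The lognormal sampler, group count, and parameter values are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from scipy import stats

def kw_power(dist_sampler, n_per_group, shift, n_groups=3,
             alpha=0.05, n_reps=1000, rng=None):
    """Estimate Kruskal-Wallis power: the fraction of replications
    rejecting H0 when one group is location-shifted by `shift`."""
    rng = rng or np.random.default_rng(0)
    rejections = 0
    for _ in range(n_reps):
        groups = [dist_sampler(rng, n_per_group) for _ in range(n_groups)]
        groups[0] = groups[0] + shift  # location-shift alternative
        _, p = stats.kruskal(*groups)
        rejections += p < alpha
    return rejections / n_reps

# Illustrative condition: lognormal noise, n = 50 per group, shift of 0.3
lognorm = lambda rng, n: rng.lognormal(mean=0.0, sigma=1.0, size=n)
print(kw_power(lognorm, n_per_group=50, shift=0.3))
```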
Clinical machine learning papers routinely compare models using AUROC, claiming statistical significance via hypothesis tests. We present a systematic Monte Carlo simulation quantifying the statistical power of five common tests for comparing correlated AUROC values under realistic clinical conditions: DeLong's test, Hanley-McNeil, bootstrap, permutation testing, and the paired CV t-test, evaluated across 209 conditions spanning sample sizes 30–500 and AUROC differences 0.
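Of the five tests, the paired bootstrap is the simplest to sketch: resample test cases jointly, so the correlation between the two models' scores on the same patients is preserved, then compare the observed AUROC difference to the recentered bootstrap distribution. The sketch below assumes scikit-learn for AUROC and is a generic implementation under those assumptions, not the paper's code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_diff(y_true, scores_a, scores_b, n_boot=2000, seed=0):
    """Paired bootstrap test for the AUROC difference of two models
    scored on the same test set (cases resampled jointly)."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if len(np.unique(y_true[idx])) < 2:   # resample must contain both classes
            continue
        diffs.append(roc_auc_score(y_true[idx], scores_a[idx])
                     - roc_auc_score(y_true[idx], scores_b[idx]))
    diffs = np.asarray(diffs)
    observed = roc_auc_score(y_true, scores_a) - roc_auc_score(y_true, scores_b)
    # Two-sided p-value against the bootstrap distribution recentered at zero
    p = np.mean(np.abs(diffs - diffs.mean()) >= abs(observed))
    return observed, p

# Toy usage with synthetic scores on n = 300 cases:
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 300)
a = y * 0.8 + rng.normal(0, 1, 300)   # slightly informative model
b = y * 0.6 + rng.normal(0, 1, 300)   # weaker model, same cases
print(bootstrap_auroc_diff(y, a, b))
```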
We run a Monte Carlo simulation (10,000 replications) of the first-stage F-test, Cragg-Donald, and Kleibergen-Paap statistics for instrumental-variable (IV) strength at N = 50–5000. At N = 200, the F > 10 rule rejects a truly strong instrument (first-stage R² = 0.
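A minimal sketch of this kind of simulation, under assumed parameter values: pick a target first-stage R², back out the instrument coefficient under unit error and instrument variances, and tabulate how often the homoskedastic first-stage F clears 10. With a single endogenous regressor and homoskedastic errors, this F coincides with the Cragg-Donald statistic; the data-generating process below is an illustration, not the paper's exact design.

```python
import numpy as np

def first_stage_F(z, x):
    """Homoskedastic F for H0: pi = 0 in x = a + pi*z + v
    (one restriction, so F = t^2)."""
    z_c, x_c = z - z.mean(), x - x.mean()
    pi_hat = (z_c @ x_c) / (z_c @ z_c)
    resid = x_c - pi_hat * z_c
    sigma2 = (resid @ resid) / (len(z) - 2)   # residual variance, 2 params
    return pi_hat**2 / (sigma2 / (z_c @ z_c))

def f_gt_10_rate(n, r2, n_reps=10_000, seed=0):
    """Fraction of replications in which the F > 10 rule of thumb
    flags the instrument as strong, given the true first-stage R^2."""
    rng = np.random.default_rng(seed)
    pi = np.sqrt(r2 / (1 - r2))   # implies population R^2 = r2 when Var(z) = Var(v) = 1
    passes = 0
    for _ in range(n_reps):
        z = rng.standard_normal(n)
        x = pi * z + rng.standard_normal(n)
        passes += first_stage_F(z, x) > 10
    return passes / n_reps

# Illustrative: how often does F > 10 at N = 200 for a given true R^2?
for r2 in (0.02, 0.05, 0.10):
    print(r2, f_gt_10_rate(200, r2))
```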