2604.01481 The Hedging Gap: Why Neither Bi-Encoders Nor Cross-Encoders Can Distinguish Certainty from Speculation
Neural retrieval models have transformed information retrieval, yet their ability to distinguish factual assertions from hedged speculation remains largely unexamined. We present the first systematic evaluation of hedging sensitivity across eight neural retrieval models spanning two architectural families: four bi-encoder embedding models and four cross-encoder rerankers.
2604.01480 Out-of-Vocabulary Robustness in Sentence Embeddings: How Embedding Models Differ on Unknown Entities
We investigate the sensitivity of four BERT-based sentence embedding models to out-of-vocabulary (OOV) entity replacements. Despite sharing an identical WordPiece tokenizer with 30,522 subword vocabulary entries, the models exhibit dramatically different OOV robustness: raw cosine similarity degradation ranges from a mean of 0.
2604.01479 Do Embedding Models Agree? Measuring Inter-Model Consistency in Semantic Similarity Judgments
Cosine similarity scores from sentence embedding models are widely treated as objective measures of semantic relatedness, yet different models can produce substantially different scores for the same sentence pair due to differential anisotropy and scale compression. We evaluate four widely-deployed embedding models (MiniLM-L6, BGE-large, Nomic-embed-v1.
2604.01478 The Entity Swap Paradox: Evidence That Mean-Pooled Sentence Embeddings Are Bag-of-Words Models
Sentence embeddings produced by transformer-based models are widely assumed to capture deep semantic meaning, including the roles and relationships between entities. We present the Entity Swap Paradox: an empirical demonstration that mean-pooled sentence embeddings cannot distinguish sentences that differ only in entity ordering.
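The order-invariance claim at the heart of this abstract can be seen directly in a toy setting. The sketch below uses randomly generated static token vectors (an assumption for illustration; real models produce context-dependent token vectors, so the invariance is exact only in this simplified setting) to show that mean pooling yields identical sentence vectors for entity-swapped sentences:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy static embeddings: one fixed vector per token type (a simplification;
# contextual models would perturb these vectors by position and context).
vocab = {"alice": rng.normal(size=8),
         "paid": rng.normal(size=8),
         "bob": rng.normal(size=8)}

def mean_pool(tokens):
    """Mean-pooled sentence vector: the average of the token vectors."""
    return np.mean([vocab[t] for t in tokens], axis=0)

v1 = mean_pool(["alice", "paid", "bob"])
v2 = mean_pool(["bob", "paid", "alice"])  # entities swapped

# Averaging discards order, so the two sentence vectors coincide.
print(np.allclose(v1, v2))  # True
```

Because the mean is permutation-invariant, any model whose sentence vector is a pure average of position-independent token vectors is a bag-of-words model in this exact sense.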
2604.01477 The Hidden Variable in Semantic Search: How Instruction Prefixes Shift Embedding Similarity by Up to 0.20 Points
Retrieval-augmented generation (RAG) systems depend on embedding models to measure semantic similarity, yet practitioners routinely copy prompt templates (instruction prefixes) from model cards without testing how sensitive their retrieval pipeline is to this choice. We systematically evaluate 10 prompt templates across 100 diverse sentence pairs on two architecturally distinct embedding models: all-MiniLM-L6-v2 (a model trained without instruction prefixes) and BGE-large-en-v1.
2604.01099 A Taxonomy of Failure: What Six Categories of Semantic Error Reveal About the State of Text Embeddings
Text embeddings underpin modern retrieval-augmented generation (RAG), semantic search, and document deduplication systems. Despite their ubiquity, systematic evaluations of where and why embeddings fail remain fragmented.
2604.01094 Minimax Regret Model Selection: When the Best Model for Any Task Is Never the Best Model for Every Task
Model selection in machine learning implicitly assumes the practitioner knows which task the deployed system will face. In multi-task clinical settings—where the same diagnostic pipeline encounters heterogeneous patient populations—this assumption fails.
2604.01082 The Reranking Tax: Quantifying When Cross-Encoder Reranking Justifies Its Computational Cost
Two-stage retrieval pipelines — bi-encoder retrieval followed by cross-encoder reranking — have become the standard architecture for high-quality neural information retrieval. Yet the computational cost of cross-encoder reranking is rarely quantified against the quality improvements it delivers.
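The "reranking tax" has a simple back-of-envelope structure: per-query cost is the bi-encoder retrieval cost plus one cross-encoder forward pass per reranked candidate. The sketch below is a hypothetical cost model with made-up latency numbers, not measurements from the paper:

```python
def pipeline_cost_ms(k_rerank, bi_ms_per_query, cross_ms_per_pair):
    """Approximate query latency of a two-stage pipeline: one bi-encoder
    retrieval pass (flat per-query cost) plus k_rerank cross-encoder
    forward passes. All inputs are illustrative assumptions."""
    return bi_ms_per_query + k_rerank * cross_ms_per_pair

# Hypothetical numbers: 15 ms bi-encoder retrieval, 8 ms per cross-encoder pair.
retrieval_only = pipeline_cost_ms(0, 15.0, 8.0)    # 15.0 ms
with_rerank = pipeline_cost_ms(50, 15.0, 8.0)      # 415.0 ms
print(with_rerank / retrieval_only)
```

Under these assumed numbers, reranking 50 candidates multiplies latency by roughly 28x, which is the kind of ratio the quality improvement must justify.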
2604.01080 Beyond Accuracy: A Testing Framework for Semantic Retrieval Systems in High-Stakes Domains
Semantic retrieval systems powered by embedding models are increasingly deployed in high-stakes domains including healthcare, law, and finance. While existing benchmarks such as MTEB and BEIR measure aggregate retrieval performance, they fail to expose critical failure modes that can lead to dangerous errors in production.
2604.01075 How Many Test Pairs Do You Need? Statistical Power Analysis for Embedding Model Comparisons
When comparing text embedding models on benchmarks, researchers routinely report score differences of 0.01-0.
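The question in the title can be approached with a standard normal-approximation power calculation for a paired comparison: given the standard deviation of per-pair score differences, how many pairs are needed to detect a given mean difference? The sketch below is a generic paired z-test approximation, not necessarily the paper's own method, and the example inputs are assumptions:

```python
from math import ceil
from statistics import NormalDist

def pairs_needed(delta, sd_diff, alpha=0.05, power=0.80):
    """Approximate number of test pairs for a two-sided paired z-test to
    detect a mean score difference `delta`, given the standard deviation
    of per-pair score differences (normal-approximation sketch)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(((z_alpha + z_power) * sd_diff / delta) ** 2)

# Hypothetical scenario: detect a 0.01 mean score difference when
# per-pair differences have standard deviation 0.05.
print(pairs_needed(0.01, 0.05))  # 197
```

The quadratic dependence on sd_diff / delta is the key takeaway: halving the detectable difference quadruples the number of test pairs required.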
2604.01023 Tokenizer Fingerprints: How Subword Segmentation Shapes Embedding Similarity
We investigate how subword tokenization shapes embedding similarity through two complementary experiments. First, we compare three major tokenization algorithms (WordPiece, BPE, SentencePiece) and show that BPE produces the most compact OOV representations (mean 3.
2604.00991 Statistical Power of AUROC Comparison Tests in Clinical Machine Learning: A Practical Reference from Monte Carlo Simulation
We present a systematic Monte Carlo simulation quantifying the statistical power of five common tests for comparing correlated AUROC values under realistic clinical conditions. Evaluating DeLong's test, Hanley-McNeil, bootstrap, permutation testing, and paired CV t-tests across 209 conditions (sample sizes 30-500, AUROC differences 0.
2604.00990 The Power Crisis in Clinical AUROC Comparison: A Systematic Evaluation of Statistical Tests for Discriminative Performance
Clinical machine learning papers routinely compare models using AUROC, claiming statistical significance via hypothesis tests. We conducted a comprehensive Monte Carlo simulation evaluating five statistical tests for AUROC comparison—DeLong's test, Hanley-McNeil, bootstrap, permutation, and CV t-test—across 209 conditions spanning sample sizes 30–500, AUROC differences 0.
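The Monte Carlo recipe behind power estimates like these is: repeatedly simulate data under a chosen effect size, run the test, and count rejections. The sketch below illustrates the skeleton using the Hanley-McNeil variance formula and a z-test, but it is a simplification of the paper's setting: it treats the two AUROCs as independent (no correlation term) and uses made-up effect sizes:

```python
import math
import random
from statistics import NormalDist

def auroc(pos, neg):
    """Empirical AUROC via the rank (Mann-Whitney) formulation."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def hanley_mcneil_var(a, n_pos, n_neg):
    """Hanley-McNeil variance approximation for a single AUROC."""
    q1 = a / (2 - a)
    q2 = 2 * a * a / (1 + a)
    return (a * (1 - a) + (n_pos - 1) * (q1 - a * a)
            + (n_neg - 1) * (q2 - a * a)) / (n_pos * n_neg)

def mc_power(d1, d2, n=60, sims=200, alpha=0.05, seed=0):
    """Fraction of simulated datasets on which an independence-assuming
    Hanley-McNeil z-test rejects at level alpha. d1, d2 are the two models'
    standardized class separations (assumed Gaussian scores)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(sims):
        neg = [rng.gauss(0, 1) for _ in range(n)]
        pos1 = [rng.gauss(d1, 1) for _ in range(n)]
        pos2 = [rng.gauss(d2, 1) for _ in range(n)]
        a1, a2 = auroc(pos1, neg), auroc(pos2, neg)
        se = math.sqrt(hanley_mcneil_var(a1, n, n)
                       + hanley_mcneil_var(a2, n, n))
        if se > 0 and abs(a1 - a2) / se > z_crit:
            rejections += 1
    return rejections / sims

print(mc_power(0.2, 1.2, n=40, sims=100))  # large AUROC gap: high power
print(mc_power(0.5, 0.5, n=40, sims=100))  # null: rejection rate near alpha
```

Swapping in DeLong's covariance estimate, a bootstrap, or a permutation test at the rejection step is how a study like this compares the five tests on equal footing.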
2604.00987 Robust Ensemble of Blood Transcriptomic Sepsis Signatures via Trimmed Aggregation: A Minimax-Optimal Default for Unknown Clinical Tasks
When the clinical task is unknown a priori, which blood transcriptomic sepsis signature should a clinician deploy? Using nine published signature families across six cross-cohort generalization tasks (2,096 samples, 24 cohorts, SUBSPACE dataset), we show that no individual signature dominates.
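Trimmed aggregation is robust averaging: per sample, drop the most extreme signature scores and average the rest, so no single badly transferring signature can dominate. The sketch below is an illustrative implementation on made-up numbers; the paper's exact trimming fraction and score normalization may differ:

```python
import numpy as np

def trimmed_aggregate(scores, trim=1):
    """Per-sample trimmed mean across signatures.

    scores: array of shape (n_signatures, n_samples), each row assumed
    already normalized so signatures are on a comparable scale.
    Drops the `trim` lowest and `trim` highest scores per sample, then
    averages the remainder."""
    s = np.sort(np.asarray(scores, dtype=float), axis=0)
    return s[trim:s.shape[0] - trim].mean(axis=0)

# Three samples scored by five hypothetical signatures; the first
# signature is an outlier on the first sample.
scores = np.array([
    [0.90, 0.20, 0.5],
    [0.10, 0.30, 0.5],
    [0.20, 0.25, 0.5],
    [0.15, 0.35, 0.5],
    [0.20, 0.30, 0.5],
])
print(trimmed_aggregate(scores, trim=1))
```

The outlying 0.90 on the first sample is discarded before averaging, which is exactly the minimax-flavored behavior the abstract argues for: the ensemble hedges against whichever signature fails on the task at hand.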
2604.00986 When Cosine Similarity Lies: Systematic Failure Modes and Mechanisms in Production Embedding Models
Embedding models underpin modern retrieval-augmented generation (RAG), semantic search, and recommendation systems. We present a systematic evaluation of six failure modes across five widely-deployed bi-encoder embedding models and four cross-encoder models using 286 manually-crafted adversarial sentence pairs and 85 control pairs (371 pairs total).
2604.00985 Do Cross-Encoders Fix What Cosine Similarity Breaks? A Systematic Evaluation of Cross-Encoder Robustness to Compositional Semantic Failures
Bi-encoder embedding models systematically fail on compositional semantic tasks including negation detection, entity swap recognition, numerical sensitivity, temporal ordering, and quantifier interpretation. Cross-encoders, which process sentence pairs jointly through full cross-attention, represent the standard architectural remedy.