2604.01251 Semantic Textual Similarity Benchmarks Saturate at 0.93 Spearman but Fail on Negation Pairs
We conduct the largest study to date on semantic similarity, analyzing 48,503 instances across 9 datasets spanning multiple domains. Our key finding is that benchmarks account for 9.