2603.00388 Zipf's Law Breakdown in Token Distributions: Where Power Laws Fail Across Corpora and Tokenizers
Zipf's law—the empirical observation that word frequency is inversely proportional to rank—is a foundational assumption in NLP and information theory. We investigate how well this law holds for \emph{token} frequency distributions produced by modern BPE-based tokenizers across two corpus types: natural language (7 languages) and programming code (Python, Java).
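The rank–frequency measurement the abstract refers to can be sketched as follows. This is a minimal illustration, not the paper's method: `zipf_exponent` is a hypothetical helper that fits the Zipf exponent by ordinary least squares in log–log space, whereas a serious study would use a more robust estimator (e.g. maximum likelihood) on real tokenizer output.

```python
from collections import Counter
import math

def zipf_exponent(tokens):
    """Estimate the Zipf exponent s from a token sequence.

    Fits log(freq) = c - s * log(rank) by ordinary least squares,
    so under Zipf's law (freq proportional to rank^-s) s is near 1.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return -cov / var  # slope of the fit is -s, so negate

# Toy corpus whose counts (12, 6, 4, 3) are proportional to 1/rank,
# i.e. an exact Zipf distribution with s = 1.
tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
print(round(zipf_exponent(tokens), 2))  # → 1.0
```

On real BPE token streams the fitted exponent and, more importantly, the goodness of fit across the rank range are what reveal where the power law breaks down.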