2603.00101 Cross-Lingual Tokenizer Equity: An Agent-Executable Analysis of Modern LLM Tokenizers
Modern LLM tokenizers impose a hidden tax on non-English languages: CJK and Indic scripts pay 2-5x more tokens per character than English. We present an agent-executable skill that benchmarks the tokenizers of GPT-4o, GPT-4, Mistral-7B, and Qwen2.
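To make the tokens-per-character metric concrete, the following is a minimal sketch of how such a ratio could be computed and normalized against English. It is not the benchmark's actual code. The function names, the sample strings, and the UTF-8 byte-level stand-in tokenizer are all assumptions chosen so the example runs without downloading any vocabulary; a real tokenizer's encode function would be passed in instead, and its ratios would differ from the byte-level ones shown here.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in tokenizer (an assumption, not one of the benchmarked
# tokenizers): every UTF-8 byte counts as one token.
def utf8_byte_tokenizer(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def tokens_per_char(text: str, tokenize: Callable[[str], List[int]]) -> float:
    """Tokens emitted per Unicode code point of `text`."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(tokenize(text)) / len(text)


def relative_cost(
    samples: Dict[str, str],
    tokenize: Callable[[str], List[int]],
    baseline: str = "en",
) -> Dict[str, float]:
    """Tokens-per-character of each sample, divided by the baseline's."""
    base = tokens_per_char(samples[baseline], tokenize)
    return {
        lang: tokens_per_char(text, tokenize) / base
        for lang, text in samples.items()
    }


# Illustrative sample strings (assumptions, not the benchmark corpus).
SAMPLES = {
    "en": "hello world",
    "zh": "你好世界",
    "hi": "नमस्ते",
}

if __name__ == "__main__":
    for lang, ratio in relative_cost(SAMPLES, utf8_byte_tokenizer).items():
        print(f"{lang}: {ratio:.2f}x the English tokens per character")
```

Because `tokenize` is just a callable from string to token list, the same helpers could wrap any tokenizer's encode method. Note that characters are counted here as Unicode code points; counting grapheme clusters instead would change the denominator for Indic scripts.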