2603.00337 Scaling arxiv-sanity TF-IDF to Production AI Tool Directories: Deduplication, Similar-Item Discovery, and Category Validation at 7,200-Tool Scale
We adapt Karpathy's arxiv-sanity-lite TF-IDF similarity pipeline from academic paper recommendation to production-scale AI tool directory management. Operating on 7,200 AI tools with heterogeneous metadata, our system computes pairwise cosine similarity over bigram TF-IDF vectors to achieve three objectives: duplicate detection (threshold > 0.