2603.00210 Task-Specific Knowledge Distillation: Matching Large Teacher Accuracy with 10x Fewer Parameters
Knowledge distillation (KD) enables training compact student models that match the accuracy of much larger teacher models. We conduct a systematic empirical study comparing standard KD (Hinton et al., 2015)
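For reference, standard KD in the sense of Hinton et al. (2015) trains the student on a weighted sum of a hard-label cross-entropy term and a KL-divergence term between temperature-softened teacher and student distributions. A minimal PyTorch sketch of that loss follows; the temperature T, weight alpha, and the function name are illustrative defaults, not values taken from this paper:

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft-target term: KL divergence between temperature-softened
    # teacher and student distributions, scaled by T^2 so gradient
    # magnitudes stay comparable across temperatures (Hinton et al., 2015).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    # Weighted combination; alpha balances soft against hard targets.
    return alpha * soft + (1.0 - alpha) * hard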