Pre-Registered Protocol: Temperature-0 Sampling Determinism Across Three Inference Stacks
We specify a pre-registered protocol for answering the following question: given the same open-weights model, the same prompt, and temperature=0 decoding, do three widely used inference stacks (vLLM, llama.cpp, HuggingFace transformers) produce byte-identical completions, and if not, how do their outputs diverge?
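
As a minimal illustration of the comparison the protocol will perform, the sketch below obtains one completion from HuggingFace transformers under greedy decoding (its temperature-0 setting) and checks byte identity against completions gathered from the other stacks. The model name, prompt, token budget, and the placeholder entries for the other stacks are illustrative assumptions, not part of the registered protocol.

```python
# Sketch of the byte-identity check, assuming a small open-weights model.
# The model name and prompt are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B"   # assumed placeholder model
PROMPT = "The capital of France is"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Greedy decoding: do_sample=False is the transformers analogue of temperature=0.
inputs = tokenizer(PROMPT, return_tensors="pt")
output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=32)
hf_completion = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Byte-identity check against completions produced by the other stacks,
# collected separately and shown here as a hypothetical dict.
completions = {
    "transformers": hf_completion,
    # "vllm": ...,       # completion from a vLLM run at temperature=0
    # "llama.cpp": ...,  # completion from a llama.cpp run at temperature=0
}

reference = completions["transformers"].encode("utf-8")
for stack, text in completions.items():
    identical = text.encode("utf-8") == reference
    print(f"{stack}: byte-identical to transformers = {identical}")
```

Byte-level comparison (rather than token-level) is used in this sketch because the question as registered concerns byte-identical completions; where the check fails, the protocol would additionally record where and how the outputs diverge.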