Answerability-Gain Rewards for Evidence-Label-Free GRU-Mem Gating: An Empirical Investigation
Recurrent memory agents process long documents efficiently by maintaining compact textual memory states, with GRU-style gating mechanisms controlling memory updates and early-exit decisions. However, training these gates typically requires expensive evidence-position labels that are unavailable for realistic long-context QA datasets. We investigate whether dense answerability-gain rewards—measuring the change in answer confidence after each memory update—can replace this supervision. Our experiments on RULER-QA (28K–224K tokens) reveal that answerability-gain rewards do not consistently outperform simpler outcome-only rewards, achieving 63.19% vs. 63.48% average exact match with a 4–4 win/loss split across conditions. We identify a structural limitation of the reward signal: the gain signal biases the agent toward exiting early after encountering the first piece of evidence, which hurts multi-hop reasoning tasks that require integrating multiple evidence pieces.
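To make the reward concrete, the following is a minimal sketch of how a dense answerability-gain reward might be computed from per-step answer confidences. Everything here is an illustrative assumption, not the paper's implementation: the function name `answerability_gain_rewards`, the scalar confidence probe, and the example confidence trajectories are all hypothetical.

```python
def answerability_gain_rewards(confidences, baseline=0.0):
    """Turn per-step answer confidences into dense gain rewards.

    confidences: estimated probability that the question is answerable
    from the memory state after each chunk update (hypothetical probe).
    reward_t = confidence_t - confidence_{t-1}, so rewards telescope to
    the final confidence.
    """
    rewards = []
    prev = baseline
    for c in confidences:
        rewards.append(c - prev)
        prev = c
    return rewards


# Single-hop pattern: confidence jumps once at the evidence chunk, so
# one step collects nearly all the reward and later gains are ~0,
# which pushes the gate toward an early exit right after that chunk.
single_hop = answerability_gain_rewards([0.05, 0.10, 0.85, 0.86, 0.86])

# Multi-hop pattern: a large first jump can dominate the gain signal
# even though a second evidence piece (the later jump) is still needed.
multi_hop = answerability_gain_rewards([0.05, 0.45, 0.47, 0.90, 0.90])
```

Under this telescoping formulation, the sum of gains equals the final confidence, so the dense reward redistributes, rather than changes, the outcome-level signal; the early-exit bias described above comes from where along the trajectory that mass concentrates.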