Improving consistency in retrieval-augmented systems

An RL approach that leverages multiple rollouts across a set of paraphrased queries to assign group similarity rewards. (NeurIPS)

Retrieval-Augmented Generation (RAG) systems enhance factual accuracy by grounding language model outputs in relevant documents retrieved from external corpora. However, they often exhibit inconsistencies across semantically equivalent or paraphrased inputs, undermining user trust and reliability, particularly in high-stakes applications. These inconsistencies stem from two primary sources: (1) variation in the retriever, which may return different document sets for similar queries, and (2) stochasticity in the generator, which can produce divergent outputs even under identical retrieval. Despite its importance, output consistency in RAG systems remains underexplored. In this work, we present a systematic framework for measuring and improving RAG output consistency. We introduce a rigorous evaluation protocol that quantifies both retriever-level consistency (via document set overlap) and generator-level consistency (via output similarity across paraphrased queries), using metrics such as lexical agreement and LLM-based judgments. To improve consistency, we propose a reinforcement learning approach based on the Group Relative Policy Optimization (GRPO) algorithm. Specifically, we exploit GRPO's multiple rollouts per query to compute group similarity rewards that capture consistency across paraphrased queries. Empirical results on multiple QA datasets demonstrate that our method significantly improves output consistency without compromising factual accuracy, offering a scalable and effective solution to a critical reliability challenge in RAG systems.
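
To make the evaluation protocol concrete, the sketch below shows one way the two consistency measures could be computed. It is an illustrative reading of the abstract, not the authors' implementation: Jaccard overlap stands in for "document set overlap," token-level F1 stands in for "lexical agreement" (an LLM judge could replace it for semantic agreement), and all function names are assumptions.

```python
import itertools
from collections import Counter

def jaccard_overlap(docs_a, docs_b):
    """Jaccard overlap between two retrieved document-ID sets."""
    a, b = set(docs_a), set(docs_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def retriever_consistency(retrieved_sets):
    """Retriever-level consistency: mean pairwise Jaccard overlap across
    the document sets retrieved for paraphrases of a single query."""
    if len(retrieved_sets) < 2:
        return 1.0
    pairs = list(itertools.combinations(retrieved_sets, 2))
    return sum(jaccard_overlap(a, b) for a, b in pairs) / len(pairs)

def token_f1(answer_a, answer_b):
    """Token-level F1 between two answers: a simple lexical-agreement proxy."""
    ta, tb = answer_a.lower().split(), answer_b.lower().split()
    common = sum((Counter(ta) & Counter(tb)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(ta), common / len(tb)
    return 2 * precision * recall / (precision + recall)

def generator_consistency(answers):
    """Generator-level consistency: mean pairwise lexical agreement across
    the answers generated for paraphrases of a single query."""
    if len(answers) < 2:
        return 1.0
    pairs = list(itertools.combinations(answers, 2))
    return sum(token_f1(a, b) for a, b in pairs) / len(pairs)

# Example: two paraphrases share 2 of 4 distinct documents -> overlap 0.5.
print(retriever_consistency([["d1", "d2", "d3"], ["d2", "d3", "d4"]]))
print(generator_consistency(["paris is the capital", "the capital is paris"]))
```

Averaging over all pairs makes both scores invariant to the order of the paraphrases and easy to aggregate across queries.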
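On the training side, a minimal sketch of how group similarity rewards might be assigned over rollouts and converted into GRPO-style advantages is shown below. It assumes the standard GRPO recipe (group-mean-centered, standard-deviation-scaled rewards); the similarity function `sim` (e.g., the token-level F1 above or an LLM-judge score) and the omission of a task-accuracy reward term are simplifying assumptions, not details from the paper.

```python
import statistics

def group_similarity_rewards(rollouts, sim):
    """Reward each rollout by its mean similarity to every other rollout in
    the group, where the group pools rollouts sampled across paraphrases
    of the same underlying query."""
    n = len(rollouts)
    if n < 2:
        return [0.0] * n
    return [
        sum(sim(rollouts[i], rollouts[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style group-normalized advantages: center each reward by the
    group mean and scale by the group standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Under this shaping, rollouts that agree with the rest of their group receive positive advantages and divergent rollouts receive negative ones, pushing the policy toward answers that are stable across paraphrases.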