RAFFLES: reasoning-based attribution of faults
An evaluation architecture that incorporates reasoning and iterative refinement. (EACL)
Current evaluation strategies for multi-component systems are predominantly one-dimensional, assessing only end-to-end performance, and static, ignoring contextual state and changing environments. To evaluate the consistency, stability, and performance degradation of multi-turn systems, it is crucial to pinpoint where and why failures occur. To address this, we propose a novel, automated framework for fine-grained fault attribution that identifies not only which component fails but also the specific failure mode at each step. Our system evaluates multi-step pipelines, such as those in Retrieval-Augmented Generation and tool-using agents, and reports state-of-the-art performance in identifying step-level failures. We provide a scalable alternative to labor-intensive manual analysis and establish a new framework for multi-turn evaluation as long-horizon, autonomous tasks become more prevalent.