
MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations

Sara Rosenthal, Yannis Katsis, Vraj Shah, Lihong He, Lucian Popa, Marina Danilevsky · Feb 26, 2026 · Citations: 0

Abstract

We present MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval-augmented generation (RAG), a popular use of large language models. The benchmark comprises 666 tasks containing over 2,800 conversation turns across 6 domains, with accompanying corpora. Our experiments show that retrieval and generation models continue to struggle on conversations with UNanswerable, UNderspecified, and NONstandalone questions and UNclear responses. The benchmark is available at https://github.com/IBM/mt-rag-benchmark.
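As a rough illustration of how one might inspect the released tasks, here is a minimal Python sketch that tallies conversation turns by question category. The file name (`conversations.json`) and field names (`turns`, `question_type`) are assumptions for illustration, not the repository's documented schema; consult the GitHub repository above for the actual data layout.

```python
# Minimal sketch of inspecting the benchmark's multi-turn structure.
# NOTE: "conversations.json", "turns", and "question_type" are
# hypothetical names -- check https://github.com/IBM/mt-rag-benchmark
# for the real schema before using this.
import json
from collections import Counter

with open("conversations.json", encoding="utf-8") as f:
    conversations = json.load(f)

turn_types = Counter()
for conv in conversations:
    for turn in conv.get("turns", []):
        # Tally turns by question category (e.g., unanswerable,
        # underspecified, non-standalone) -- hypothetical labels.
        turn_types[turn.get("question_type", "unknown")] += 1

print(f"{len(conversations)} conversations loaded")
for qtype, count in turn_types.most_common():
    print(f"{qtype}: {count}")
```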

Human Data Lens

  • Uses human feedback: No
  • Feedback types: None
  • Rater population: Unknown
  • Unit of annotation: Unknown
  • Expertise required: General

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: None
  • Quality controls: Not reported
  • Confidence: 0.40
  • Flags: low_signal, possible_false_positive

Research Summary

Contribution Summary

  • We present MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval-augmented generation, a popular use of large language models.
  • We release a benchmark of 666 tasks containing over 2,800 conversation turns across 6 domains with accompanying corpora.
  • Our experiments show that retrieval and generation models continue to struggle on conversations with UNanswerable, UNderspecified, and NONstandalone questions and UNclear responses.

Why It Matters For Eval

  • MTRAG-UN targets conversation phenomena where retrieval and generation models continue to struggle: unanswerable, underspecified, and non-standalone questions, and unclear responses.
  • Its 666 tasks and accompanying corpora across 6 domains support targeted, reproducible evaluation of multi-turn RAG systems on these failure modes.
