
Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems

Raad Khraishi, Iman Zafar, Katie Myles, Greig A Cowan · Mar 3, 2026 · Citations: 0

Abstract

Deployed multi-turn LLM systems routinely switch models mid-interaction due to upgrades, cross-provider routing, and fallbacks. Such handoffs create a context mismatch: the model generating later turns must condition on a dialogue prefix authored by a different model, potentially inducing silent performance drift. We introduce a switch-matrix benchmark that measures this effect by running a prefix model for early turns and a suffix model for the final turn, and comparing against the no-switch baseline using paired episode-level bootstrap confidence intervals. Across the CoQA conversational QA and Multi-IF benchmarks, even a single-turn handoff yields prevalent, statistically significant directional effects and may swing outcomes by -8 to +13 percentage points in Multi-IF strict success rate and ±4 absolute F1 on CoQA, comparable to the no-switch gap between common model tiers (e.g., GPT-5-nano vs. GPT-5-mini). We further find systematic compatibility patterns: some suffix models degrade under nearly any non-self dialogue history, while others improve under nearly any foreign prefix. To enable compressed handoff-risk monitoring, we decompose switch-induced drift into per-model prefix-influence and suffix-susceptibility terms, accounting for ~70% of variance across benchmarks. These results position handoff robustness as an operational reliability dimension that single-model benchmarks miss, motivating explicit monitoring and handoff-aware mitigation in multi-turn systems.
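
The paired, episode-level bootstrap is the statistical core of the benchmark: each episode is scored under both the handoff and the no-switch condition, and whole-episode deltas are resampled to form a confidence interval per switch-matrix cell. A minimal sketch of that comparison, assuming per-episode score arrays are already in hand (the function name, resample count, and example scores are illustrative, not taken from the paper's harness):

```python
import numpy as np

def paired_bootstrap_ci(switch_scores, baseline_scores,
                        n_resamples=10_000, alpha=0.05, seed=0):
    """Bootstrap CI for the mean per-episode delta (switch - baseline).

    Both arrays hold one score per episode (e.g. CoQA F1 or Multi-IF
    strict success), aligned so index i is the same episode under both
    conditions -- the pairing is what makes the test episode-level.
    """
    deltas = np.asarray(switch_scores, float) - np.asarray(baseline_scores, float)
    rng = np.random.default_rng(seed)
    n = len(deltas)
    # Resample whole episodes with replacement; average each resample.
    idx = rng.integers(0, n, size=(n_resamples, n))
    means = deltas[idx].mean(axis=1)
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return deltas.mean(), (lo, hi)

# A switch-matrix cell is flagged as significant if its 95% CI excludes 0.
mean_delta, (lo, hi) = paired_bootstrap_ci(
    switch_scores=[0.71, 0.64, 0.80], baseline_scores=[0.75, 0.66, 0.74])
print(f"mean delta = {mean_delta:+.3f}, 95% CI = ({lo:+.3f}, {hi:+.3f})")
```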

HFEPX Relevance Assessment

This paper appears adjacent to HFEPX scope (human-feedback/eval), but does not show strong direct protocol evidence in metadata/abstract.

Eval-Fit Score

0/100 • Low

Treat as adjacent context, not a core eval-method reference.

Human Feedback Signal

Not explicit in abstract metadata

Evaluation Signal

Detected

HFEPX Fit

Adjacent candidate

Human Data Lens

  • Uses human feedback: No
  • Feedback types: None
  • Rater population: Unknown
  • Unit of annotation: Trajectory
  • Expertise required: General
  • Extraction source: Persisted extraction

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: None
  • Quality controls: Not reported
  • Confidence: 0.35
  • Flags: low_signal, possible_false_positive

Protocol And Measurement Signals

Benchmarks / Datasets

Two benchmarks are named in the abstract, CoQA (conversational QA) and Multi-IF, though automated extraction did not capture them.

Reported Metrics

F1, success rate

Research Brief

Deterministic synthesis

We introduce a switch-matrix benchmark that measures this effect by running a prefix model for early turns and a suffix model for the final turn, and comparing against the no-switch baseline using paired episode-level bootstrap confidence intervals. HFEPX signals include Automatic Metrics with confidence 0.35. Updated from the current HFEPX corpus.

Generated Mar 4, 2026, 4:23 PM · Grounded in abstract + metadata only

Key Takeaways

  • We introduce a switch-matrix benchmark that measures this effect by running a prefix model for early turns and a suffix model for the final turn, and comparing against the no-switch baseline using paired episode-level bootstrap confidence intervals.
  • Across the CoQA conversational QA and Multi-IF benchmarks, even a single-turn handoff yields prevalent, statistically significant directional effects and may swing outcomes by -8 to +13 percentage points in Multi-IF strict success rate and ±4 absolute F1 on CoQA.

Researcher Actions

  • Treat this as method context, then pivot to protocol-specific HFEPX hubs.
  • Identify benchmark choices from full text before operationalizing conclusions.
  • Validate metric comparability (F1 vs. strict success rate); see the metric sketch below.
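
On the comparability point above: CoQA-style answers are conventionally scored with token-overlap F1, which awards partial credit, while Multi-IF strict success is all-or-nothing over every instruction check in a turn, so deltas on the two scales are not directly interchangeable. A minimal sketch of the two metric shapes, using standard formulations rather than code extracted from this paper:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD/CoQA-style token-overlap F1: partial credit for overlap."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def strict_success(checks: list[bool]) -> bool:
    """Multi-IF-style strict success: every instruction check must pass."""
    return all(checks)

print(token_f1("the red fox", "a red fox"))  # partial credit: ~0.67
print(strict_success([True, True, False]))   # one failed check -> False
```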

Caveats

  • Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
  • Low-signal flag detected: protocol relevance may be indirect.

Research Summary

Contribution Summary

  • We introduce a switch-matrix benchmark that measures this effect by running a prefix model for early turns and a suffix model for the final turn, and comparing against the no-switch baseline using paired episode-level bootstrap confidence intervals.
  • Across the CoQA conversational QA and Multi-IF benchmarks, even a single-turn handoff yields prevalent, statistically significant directional effects and may swing outcomes by -8 to +13 percentage points in Multi-IF strict success rate and ±4 absolute F1 on CoQA, comparable to the no-switch gap between common model tiers (e.g., GPT-5-nano vs. GPT-5-mini).
  • To enable compressed handoff-risk monitoring, we decompose switch-induced drift into per-model prefix-influence and suffix-susceptibility terms, accounting for ~70% of variance across benchmarks; see the decomposition sketch below.
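
The abstract does not spell out how the two per-model terms are fit. One natural reading is an additive decomposition of the switch-matrix deltas, Δ(prefix, suffix) ≈ α_prefix + β_suffix, estimated by least squares, with variance explained measuring how well two per-model vectors compress the full matrix. A sketch under that assumption (model names and matrix values are invented for illustration):

```python
import numpy as np

# Hypothetical switch-matrix deltas: rows = prefix model, columns = suffix
# model, entries = score change vs. the suffix model's no-switch baseline.
models = ["model-a", "model-b", "model-c"]
delta = np.array([[ 0.00, -0.05,  0.08],
                  [ 0.03,  0.00, -0.02],
                  [-0.06,  0.04,  0.00]])

P, S = delta.shape
# Design matrix for delta[p, s] ~ alpha[p] + beta[s] (one dummy per model).
# alpha/beta are identifiable only up to a constant shift; lstsq returns
# the minimum-norm solution, which is fine for a variance-explained check.
X = np.zeros((P * S, P + S))
for p in range(P):
    for s in range(S):
        X[p * S + s, p] = 1.0        # prefix-influence term alpha[p]
        X[p * S + s, P + s] = 1.0    # suffix-susceptibility term beta[s]
y = delta.ravel()

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef
r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
print("prefix influence:", dict(zip(models, coef[:P].round(3))))
print("suffix susceptibility:", dict(zip(models, coef[P:].round(3))))
print(f"variance explained: {r2:.0%}")
```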

Why It Matters For Eval

  • The switch-matrix protocol, a paired episode-level bootstrap comparison of every (prefix, suffix) handoff against its no-switch baseline, is directly reusable as an evaluation design for measuring handoff-induced drift.
  • The reported effect sizes (-8 to +13 percentage points in Multi-IF strict success rate, ±4 absolute F1 on CoQA) rival the no-switch gap between common model tiers, positioning handoff robustness as an operational reliability dimension that single-model benchmarks miss.

Researcher Checklist

  • Gap: Human feedback protocol is explicit

    No explicit human feedback protocol detected.

  • Pass: Evaluation mode is explicit

    Detected: Automatic Metrics

  • Gap: Quality control reporting is present

    No calibration/adjudication/IAA control explicitly detected.

  • Pass: Benchmark or dataset anchors are present

    Named in the abstract: CoQA, Multi-IF.

  • Pass: Metric reporting is present

Detected: F1, success rate

Category-Adjacent Papers (Broader Context)

These papers are nearby in arXiv category and useful for broader context, but not necessarily protocol-matched to this paper.
