Skip to content
← Back to explorer

AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhilash Shankarampeta, Zimeng Huang, Wentao Ni, Yuandong Tian, Jishen Zhao · Feb 26, 2026 · Citations: 0

Abstract

Large Language Models (LLMs) are deployed as autonomous agents in increasingly complex applications, where enabling long-horizon memory is critical for achieving strong performance. However, a significant gap exists between practical applications and current evaluation standards for agent memory: existing benchmarks primarily focus on dialogue-centric, human-agent interactions. In reality, agent memory consists of a continuous stream of agent-environment interactions that are primarily composed of machine-generated representations. To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length), which evaluates long-horizon memory for LLMs in real agentic applications. It features two key components: (1) a set of real-world agentic trajectories across representative agentic applications, paired with expert-curated QA, and (2) a set of synthetic agentic trajectories that scale to arbitrary horizons, paired with rule-based QA. Our comprehensive study shows that existing memory systems underperform on AMA-Bench primarily because they lack causality and objective information and are constrained by the lossy nature of similarity-based retrieval employed by many memory systems. To address these limitations, we propose AMA-Agent, an effective memory system featuring a causality graph and tool-augmented retrieval. Our results demonstrate that AMA-Agent achieves 57.22% average accuracy on AMA-Bench, surpassing the strongest memory system baselines by 11.16%.

HFEPX Relevance Assessment

This paper has direct human-feedback and/or evaluation protocol signal and is likely useful for eval pipeline design.

Eval-Fit Score

25/100 • Low

Treat as adjacent context, not a core eval-method reference.

Human Feedback Signal

Not explicit in abstract metadata

Evaluation Signal

Detected

HFEPX Fit

High-confidence candidate

Human Data Lens

  • Uses human feedback: No
  • Feedback types: None
  • Rater population: Domain Experts
  • Unit of annotation: Unknown
  • Expertise required: General
  • Extraction source: Persisted extraction

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: Long Horizon
  • Quality controls: Not reported
  • Confidence: 0.55
  • Flags: ambiguous

Protocol And Measurement Signals

Benchmarks / Datasets

Ama-Bench

Reported Metrics

accuracy

Research Brief

Deterministic synthesis

To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length), which evaluates long-horizon memory for LLMs in real agentic applications. HFEPX signals include Automatic Metrics, Long Horizon with confidence 0.55. Updated from current HFEPX corpus.

Generated Mar 3, 2026, 6:46 PM · Grounded in abstract + metadata only

Key Takeaways

  • To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length), which evaluates long-horizon memory for LLMs in real agentic applications.
  • To address these limitations, we propose AMA-Agent, an effective memory system featuring a causality graph and tool-augmented retrieval.

Researcher Actions

  • Treat this as method context, then pivot to protocol-specific HFEPX hubs.
  • Cross-check benchmark overlap: Ama-Bench.
  • Validate metric comparability (accuracy).

Caveats

  • Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
  • Extraction confidence is probabilistic and should be validated for critical decisions.

Research Summary

Contribution Summary

  • To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length), which evaluates long-horizon memory for LLMs in real agentic applications.
  • To address these limitations, we propose AMA-Agent, an effective memory system featuring a causality graph and tool-augmented retrieval.
  • Our results demonstrate that AMA-Agent achieves 57.22% average accuracy on AMA-Bench, surpassing the strongest memory system baselines by 11.16%.

Why It Matters For Eval

  • To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length), which evaluates long-horizon memory for LLMs in real agentic applications.
  • To address these limitations, we propose AMA-Agent, an effective memory system featuring a causality graph and tool-augmented retrieval.

Researcher Checklist

  • Gap: Human feedback protocol is explicit

    No explicit human feedback protocol detected.

  • Pass: Evaluation mode is explicit

    Detected: Automatic Metrics

  • Gap: Quality control reporting appears

    No calibration/adjudication/IAA control explicitly detected.

  • Pass: Benchmark or dataset anchors are present

    Detected: Ama-Bench

  • Pass: Metric reporting is present

    Detected: accuracy

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.