Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin · Feb 18, 2026

Citations: 0
Tags: Pairwise Preference · Automatic Metrics · Web Browsing · General
  • Existing evaluations of agents with memory typically assess memorization and action in isolation.
  • To capture the interdependent multi-session setting, we introduce MemoryArena, a unified evaluation gym for benchmarking agent memory in multi-session Memory-Agent-Environment loops.
