Skip to content

OpenTrain Research Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 6 Search mode: keyword RSS
AITutor-EvalKit: Exploring the Capabilities of AI Tutors

Numaan Naeem, Kaushal Kumar Maurya, Kseniia Petukhova, Ekaterina Kochmar · Dec 3, 2025

Citations: 0
Demonstrations Automatic Metrics General
  • We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors, provides software for demonstration and evaluation, as well as model inspection and data visualization.
Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models

Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma · Feb 21, 2026

Citations: 0
Pairwise Preference Human Eval MathMedicine
  • We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight
  • Across diverse reasoning and diagnostic benchmarks (GSM8K, CRUXEval, MBPP, AIME, CorrectBench, and TruthfulQA) using Llama-3 and Qwen-3 (8B), explicit regulatory structuring substantially improves error diagnosis and yields a threefold incr
Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026

Citations: 0
Expert Verification Automatic Metrics Multi Agent Coding
  • Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models.
  • To address this, we introduce Team-of-Thoughts, a novel MAS architecture that leverages the complementary capabilities of heterogeneous agents via an orchestrator-tool paradigm.
What If We Allocate Test-Time Compute Adaptively?

Ahsan Bilal, Ahmed Mohsin, Muhammad Umer, Ali Subhan, Hassan Rizwan, Ayesha Mohsin · Feb 1, 2026

Citations: 0
Automatic Metrics Long Horizon Math
  • For each problem, the agent runs multiple inference iterations.
  • Across datasets, our dynamic, PRM-guided approach consistently outperforms direct test-time scaling, yielding large gains on MATH-500 and several-fold improvements on harder benchmarks such as AIME24 and AMO-Bench.
Automatic MetricsSimulation Env General
  • Additional evaluation on an earlier exam sample revealed that the writings have become more complex over a 7-10-year period, while accuracy still reached 0.8 with some feature sets.
  • The results have been implemented in the writing evaluation module of an Estonian open-source language learning environment.

Protocol Hubs

Daily Archives