HFEPX Archive Slice

HFEPX Quarterly Archive: 2025-Q1

Updated from current HFEPX corpus (Apr 12, 2026). 147 papers are grouped in this quarterly page.

Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Calibration. Frequently cited benchmark: AlpacaEval 2.0. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 30, 2025.

Papers: 147 · Last published: Mar 30, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High.

Analysis blocks are computed from the loaded sample (60 of 147 papers).

High-Signal Coverage

100.0%

60 / 60 papers are not low-signal flagged.

Benchmark Anchors

16.7%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

40.0%

Papers with reported metric mentions in extraction output.

1 paper reports explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.

Why This Time Slice Matters

14.3% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 32.7% of papers in this hub.
AlpacaEval 2.0 is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (2% of papers).
Raters are mostly domain experts, and annotation is most often pairwise; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Measuring AI Ability to Complete Long Software Tasks
Mar 18, 2025 · Citations: 0 · Score: 5.4
Eval: Automatic Metrics · Metrics: Success rate

No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding
Mar 7, 2025 · Citations: 0 · Score: 5.4
Eval: LLM-as-Judge · Metrics: Agreement, Cost

A Scalable Framework for Evaluating Health Language Models
Mar 30, 2025 · Citations: 0 · Score: 4.9
Eval: Automatic Metrics · Metrics: Accuracy, Agreement

More Bang for the Buck: Process Reward Modeling with Entropy-Driven Uncertainty
Mar 28, 2025 · Citations: 0 · Score: 4.4
Eval: Automatic Metrics · Metrics: Accuracy

Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
Mar 16, 2025 · Citations: 0 · Score: 4.4
Eval: Automatic Metrics · Metrics: Cost, Coherence

MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation
Mar 23, 2025 · Citations: 0 · Score: 3.9
Eval: Automatic Metrics · Metrics: Accuracy

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Published	Eval Modes	Benchmarks	Metrics	Quality Controls
Measuring AI Ability to Complete Long Software Tasks	Mar 18, 2025	Automatic Metrics	RE-Bench	Success rate	Not reported
No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding	Mar 7, 2025	LLM-as-Judge	MT-Bench, BFF-Bench	Agreement, Cost	Not reported
A Scalable Framework for Evaluating Health Language Models	Mar 30, 2025	Automatic Metrics	Not reported	Accuracy, Agreement	Inter Annotator Agreement Reported
More Bang for the Buck: Process Reward Modeling with Entropy-Driven Uncertainty	Mar 28, 2025	Automatic Metrics	ProcessBench	Accuracy	Not reported
Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models	Mar 16, 2025	Automatic Metrics	MATH 500, GSM8K	Cost, Coherence	Not reported
MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation	Mar 23, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
What Makes a Reward Model a Good Teacher? An Optimization Perspective	Mar 19, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics	Mar 27, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
ELM: A Hybrid Ensemble of Language Models for Automated Tumor Group Classification in Population-Based Cancer Registries	Mar 24, 2025	Automatic Metrics	Not reported	Accuracy, F1	Not reported
EconEvals: Benchmarks and Litmus Tests for Economic Decision-Making by LLM Agents	Mar 24, 2025	Automatic Metrics	Not reported	Coherence	Not reported
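
The triage advice to prioritize papers with both benchmark and metric anchors can be applied directly to the protocol matrix. The sketch below is a minimal illustration, not the site's own ranking logic: the `MatrixRow` type and `has_both_anchors` helper are hypothetical names, the rows are transcribed by hand from the matrix above, and benchmark labels are written in their usual spelling.

```python
from dataclasses import dataclass

NOT_REPORTED = "Not reported"


@dataclass(frozen=True)
class MatrixRow:
    """One row of the protocol matrix (hypothetical structure for illustration)."""
    paper: str
    benchmarks: str
    metrics: str


# Rows transcribed from the protocol matrix on this page.
ROWS = [
    MatrixRow("Measuring AI Ability to Complete Long Software Tasks", "RE-Bench", "Success rate"),
    MatrixRow("No Free Labels", "MT-Bench, BFF-Bench", "Agreement, Cost"),
    MatrixRow("A Scalable Framework for Evaluating Health Language Models", NOT_REPORTED, "Accuracy, Agreement"),
    MatrixRow("More Bang for the Buck", "ProcessBench", "Accuracy"),
    MatrixRow("Towards Hierarchical Multi-Step Reward Models", "MATH 500, GSM8K", "Cost, Coherence"),
    MatrixRow("MedPlan", NOT_REPORTED, "Accuracy"),
    MatrixRow("What Makes a Reward Model a Good Teacher?", NOT_REPORTED, "Accuracy"),
    MatrixRow("GateLens", NOT_REPORTED, "Accuracy"),
    MatrixRow("ELM", NOT_REPORTED, "Accuracy, F1"),
    MatrixRow("EconEvals", NOT_REPORTED, "Coherence"),
]


def has_both_anchors(row: MatrixRow) -> bool:
    """True when the row names at least one benchmark and at least one metric."""
    return row.benchmarks != NOT_REPORTED and row.metrics != NOT_REPORTED


anchored = [row.paper for row in ROWS if has_both_anchors(row)]
for title in anchored:
    print(title)
```

Under these assumptions, four of the ten matrix rows carry both anchors; the remaining six report metrics without a named benchmark.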

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback
Coverage is a replication risk (14.3% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (2.7% vs 30% target).

Gap: Papers naming benchmarks/datasets
Coverage is a replication risk (4.1% vs 35% target).

Gap: Papers naming evaluation metrics
Coverage is a replication risk (17.7% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (10.9% vs 35% target).

Gap: Papers with known annotation unit
Coverage is a replication risk (7.5% vs 35% target).
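
To see which gaps are largest, the checklist figures can be ranked by how far each coverage value falls short of its target. This is a small sketch using only the percentages listed in the checklist; the `shortfall` helper and the shortened labels are illustrative assumptions, and the page does not say how it weighs one gap against another.

```python
# (label, observed coverage %, target %) copied from the checklist on this page.
CHECKLIST = [
    ("explicit human feedback", 14.3, 45.0),
    ("quality controls", 2.7, 30.0),
    ("benchmarks/datasets named", 4.1, 35.0),
    ("evaluation metrics named", 17.7, 35.0),
    ("known rater population", 10.9, 35.0),
    ("known annotation unit", 7.5, 35.0),
]


def shortfall(observed: float, target: float) -> float:
    """Percentage points by which coverage misses the target (0.0 if the target is met)."""
    return max(0.0, round(target - observed, 1))


# Largest shortfall first.
ranked = sorted(CHECKLIST, key=lambda item: shortfall(item[1], item[2]), reverse=True)

for label, observed, target in ranked:
    gap = shortfall(observed, target)
    print(f"{label}: {observed}% vs {target}% target, short by {gap} points")
```

Ranking by absolute shortfall is only one possible choice; a relative measure (shortfall divided by target) would order the rows differently.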

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 2.7% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (10.9% coverage).
Annotation unit is under-specified (7.5% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (AlpacaEval 2.0 vs BFF-Bench) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.

Recommended Queries

LLM-as-Judge Protocols
Benchmark Slice: AlpacaEval 2.0
Metric Slice: accuracy
IAA-Reported Evaluations
Recent High-Signal Papers

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (48)
Simulation Env (7)
LLM-as-Judge (5)

Top Metrics

Accuracy (16)
Cost (6)
Agreement (2)
Success rate (2)

Top Benchmarks

AlpacaEval 2.0 (1)
BFF-Bench (1)
GSM8K (1)
LMSYS Chatbot Arena (1)

Quality Controls

Calibration (3)
Inter Annotator Agreement Reported (1)
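
The counts in this snapshot can be related to the percentages quoted elsewhere on the page. The sketch below assumes shares are taken over the full 147-paper slice; that denominator is an assumption (the page does not state its formula, and the analysis blocks use a 60-paper sample), although it does reproduce the 32.7% figure for automatic metrics and the 2% figure for calibration. The `share` and `implied_count` helpers are hypothetical names.

```python
TOTAL_PAPERS = 147  # slice size stated on this page; assumed to be the denominator


def share(count: int, total: int = TOTAL_PAPERS) -> float:
    """Percentage of the slice represented by `count` papers, to one decimal place."""
    return round(100 * count / total, 1)


def implied_count(percent: float, total: int = TOTAL_PAPERS) -> int:
    """Approximate number of papers behind a quoted percentage."""
    return round(percent / 100 * total)


print(share(48))            # Automatic Metrics (48 papers)
print(share(3))             # Calibration (3 papers)
print(implied_count(14.3))  # papers behind the 14.3% human-feedback figure
print(implied_count(2.7))   # papers behind the 2.7% quality-control figure
```

Converting in both directions is a quick consistency check when comparing this slice against other periods that may use a different sample size.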

Papers In This Archive Slice

A Scalable Framework for Evaluating Health Language Models
Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow · Mar 30, 2025 · Citations: 0

Rubric Rating · Expert Verification

As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety.
EventWeave: A Dynamic Framework for Capturing Core and Supporting Events in Dialogue Systems
Zhengyi Zhao, Shubo Zhang, Yiming Du, Bin Liang, Baojun Wang · Mar 29, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
More Bang for the Buck: Process Reward Modeling with Entropy-Driven Uncertainty
Lang Cao, Renhong Chen, Yingtian Zou, Chao Peng, Huacong Xu · Mar 28, 2025 · Citations: 0

Unlike previous Process Reward Models (PRMs) that rely on static partitioning and human labeling, EDU-PRM automatically anchors step boundaries at tokens with high predictive entropy, effectively capturing intrinsic logical transitions and…
Boosting Large Language Models with Mask Fine-Tuning
Mingyuan Zhang, Yue Bai, Huan Wang, Yizhou Wang, Qihua Dong · Mar 27, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics
Arsham Gholamzadeh Khoee, Shuai Wang, Robert Feldt, Dhasarathy Parthasarathy, Yinan Yu · Mar 27, 2025 · Citations: 0

Multi Agent

Ensuring reliable data-driven decisions is crucial in domains where analytical accuracy directly impacts safety, compliance, or operational outcomes.
Lean Formalization of Generalization Error Bound by Rademacher Complexity and Dudley's Entropy Integral
Sho Sonoda, Kazumi Kasaura, Yuma Mizuno, Kei Tsukamoto, Naoto Onda · Mar 25, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ELM: A Hybrid Ensemble of Language Models for Automated Tumor Group Classification in Population-Based Cancer Registries
Lovedeep Gondara, Jonathan Simkin, Shebnum Devji, Gregory Arbour, Raymond Ng · Mar 24, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Minimum Volume Conformal Sets for Multivariate Regression
Sacha Braun, Liviu Aolaritei, Michael I. Jordan, Francis Bach · Mar 24, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
EconEvals: Benchmarks and Litmus Tests for Economic Decision-Making by LLM Agents
Sara Fish, Julia Shephard, Minkai Li, Ran I. Shorrer, Yannai A. Gonczarowski · Mar 24, 2025 · Citations: 0

We develop evaluation methods for measuring the economic decision-making capabilities and tendencies of LLMs.
Enhancing Multi-Label Emotion Analysis and Corresponding Intensities for Ethiopian Languages
Tadesse Destaw Belay, Dawit Ketema Gete, Abinew Ali Ayele, Olga Kolesnikova, Iqra Ameer · Mar 24, 2025 · Citations: 0

Developing and integrating emotion-understanding models are essential for a wide range of human-computer interaction tasks, including customer feedback analysis, marketing research, and social media monitoring.
Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment
Ruoxi Cheng, Haoxuan Ma, Weixin Wang, Ranjie Duan, Jiexi Liu · Mar 23, 2025 · Citations: 0

Pairwise Preference · Demonstrations

Existing techniques are either reward-based (training a reward model on preference pairs and optimizing with reinforcement learning) or reward-free (directly fine-tuning on ranked outputs).
FedSKD: Aggregation-free Model-heterogeneous Federated Learning via Multi-dimensional Similarity Knowledge Distillation for Medical Image Classification
Ziqiao Weng, Weidong Cai, Bo Zhou · Mar 23, 2025 · Citations: 0

Extensive evaluations on fMRI-based autism spectrum disorder diagnosis and skin lesion classification demonstrate that FedSKD outperforms state-of-the-art heterogeneous and homogeneous FL baselines, achieving superior personalization…
MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation
Hsin-Ling Hsu, Cong-Tinh Dao, Luning Wang, Zitao Shuai, Thao Nguyen Minh Phan · Mar 23, 2025 · Citations: 0

Expert Verification

Comprehensive evaluation demonstrates that our method significantly outperforms baseline approaches in both assessment accuracy and treatment plan quality.
Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning
Zhaowei Liu, Xin Guo, Zhi Yang, Fangqi Lou, Lingfeng Zeng · Mar 20, 2025 · Citations: 0

First, we construct Fin-R1-Data, a high-quality financial dataset consisting of 60,091 chain-of-thought (CoT) samples, distilled and filtered from multiple authoritative benchmarks to ensure consistency and reliability.
Imitating AI agents increase diversity in homogeneous information environments but can reduce it in heterogeneous ones
Emil Bakkensen Johansen, Oliver Baumann · Mar 20, 2025 · Citations: 0

Recent developments in large language models (LLMs) have facilitated autonomous AI agents capable of imitating human-generated content, raising fundamental questions about how AI may reshape democratic information environments such as news.
More Women, Same Stereotypes: Unpacking the Gender Bias Paradox in Large Language Models
Evan Chen, Run-Jun Zhan, Yan-Bai Lin, Hung-Hsuan Chen · Mar 20, 2025 · Citations: 0

This study introduces a novel evaluation framework to uncover gender biases in LLMs: using free-form storytelling to surface biases embedded within the models.
What Makes a Reward Model a Good Teacher? An Optimization Perspective
Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee · Mar 19, 2025 · Citations: 0

The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model.
EmoGRACE: Aspect-based emotion analysis for social media data
Christina Zorenböhmer, Sebastian Schmidt, Bernd Resch · Mar 19, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
KINESIS: Motion Imitation for Human Musculoskeletal Locomotion
Merkourios Simos, Alberto Silvio Chiappa, Alexander Mathis · Mar 18, 2025 · Citations: 0
Measuring AI Ability to Complete Long Software Tasks
Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia · Mar 18, 2025 · Citations: 0

Expert Verification · Tool Use

Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
Learning Over Dirty Data with Minimal Repairs
Cheng Zhen, Prayoga, Nischal Aryal, Arash Termehchy, Garrett Biwer · Mar 18, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
OSCAR: Online Soft Compression And Reranking
Maxime Louis, Thibault Formal, Hervé Dejean, Stéphane Clinchant · Mar 17, 2025 · Citations: 0
ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs
Pengcheng Wen, Jiaming Ji, Chi-Min Chan, Juntao Dai, Donghai Hong · Mar 17, 2025 · Citations: 0

Through extensive evaluation across different model sizes (3B-32B parameters), we have two key findings: (1) smaller models (<30B parameters) can benefit from most of structured thinking patterns, while larger models (32B) with structured…
KARL: Knowledge-Aware Reasoning and Reinforcement Learning for Knowledge-Intensive Visual Grounding
Xinyu Ma, Ziyang Ding, Zhicong Luo, Chi Chen, Zonghao Guo · Mar 17, 2025 · Citations: 0

To facilitate systematic evaluation, we introduce KVG-Bench, a benchmark spanning 10 domains with 1.3K curated test cases covering 531 images and 882 entities.
Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
Teng Wang, Zhangyi Jiang, Zhenqi He, Shenyang Tong, Wenhan Yang · Mar 16, 2025 · Citations: 0

Long Horizon

Empirical results on the PRM800K dataset show that HRM, together with HNC, provides more stable and reliable evaluations than PRM.
HyConEx: Hypernetwork classifier with counterfactual explanations for tabular data
Patryk Marszałek, Kamil Książek, Oleksii Furman, Ulvi Movsum-zada, Przemysław Spurek · Mar 16, 2025 · Citations: 0
A Survey on the Optimization of Large Language Model-based Agents
Shangheng Du, Jiabao Zhao, Jinxin Shi, Zhentao Xie, Xin Jiang · Mar 16, 2025 · Citations: 0

Long Horizon

With the rapid development of Large Language Models (LLMs), LLM-based agents have been widely adopted in various fields, becoming essential for autonomous decision-making and interactive tasks.
Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios
Zhi Chen, Wei Ma, Lingxiao Jiang · Mar 16, 2025 · Citations: 0
Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes
Zhanliang Wang, Da Wu, Quan Nguyen, Kai Wang · Mar 15, 2025 · Citations: 0

These studies typically use Human Phenotype Ontology (HPO) terms to prompt foundation models like GPT and LLaMA to predict candidate genes.
Interpretable Deep Learning Framework for Improved Disease Classification in Medical Imaging
Jutika Borah, Hidam Kumarjit Singh · Mar 14, 2025 · Citations: 0

The framework is evaluated on four medical imaging benchmark datasets: chest X-rays of COVID-19, Tuberculosis, Pneumonia, and retinal Optical Coherence Tomography (OCT) images.
Implicit Bias-Like Patterns in Reasoning Models
Messi H. J. Lee, Calvin K. Lai · Mar 14, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Unicorn: A Universal and Collaborative Reinforcement Learning Approach Towards Generalizable Network-Wide Traffic Signal Control
Yifeng Zhang, Yilin Liu, Ping Gong, Peizhuo Li, Mingfeng Fan · Mar 14, 2025 · Citations: 0
Reasoning-Grounded Natural Language Explanations for Language Models
Vojtech Cahlik, Rodrigo Alves, Pavel Kordik · Mar 14, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?
Yuhang Liu, Dong Gong, Yichao Cai, Erdun Gao, Zhen Zhang · Mar 12, 2025 · Citations: 0
PlainQAFact: Retrieval-augmented Factual Consistency Evaluation Metric for Biomedical Plain Language Summarization
Zhiwen You, Yue Guo · Mar 11, 2025 · Citations: 0

Existing automatic factual consistency evaluation methods, such as entailment- and question-answering (QA) -based, struggle with plain language summarization (PLS) due to elaborative explanation phenomenon, which introduces external content…
Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges
Xiaoxiao Liu, Qingying Xiao, Bingquan Zhang, Junying Chen, Xiangyi Feng · Mar 11, 2025 · Citations: 0

However, there is a lack of standardized evaluation criteria to assess their effectiveness, particularly in dynamic, interactive scenarios.
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye · Mar 9, 2025 · Citations: 0
Green Prompting: Characterizing Prompt-driven Energy Costs of LLM Inference
Marta Adamska, Daria Smirnova, Hamid Nasiri, Zhengxin Yu, Peter Garraghan · Mar 9, 2025 · Citations: 0

Web Browsing

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models
Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang · Mar 9, 2025 · Citations: 0

Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-11% improvements across MATH500, AIME24, and GPQA_diamond benchmarks.
Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs
Zara Siddique, Irtaza Khalid, Liam D. Turner, Luis Espinosa-Anke · Mar 7, 2025 · Citations: 0

When optimized on the BBQ dataset, our individually tuned steering vectors achieve average improvements of 12.8% on BBQ, 8.3% on CLEAR-Bias, and 1% on StereoSet, and show improvements over prompting and Self-Debias in all cases, and…
Frequency Autoregressive Image Generation with Continuous Tokens
Hu Yu, Hao Luo, Hangjie Yuan, Yu Rong, Jie Huang · Mar 7, 2025 · Citations: 0

However, due to the huge modality gap, image autoregressive models may require a systematic reevaluation from two perspectives: tokenizer format and regression direction.
No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding
Michael Krumdick, Charles Lovering, Varshini Reddy, Seth Ebner, Chris Tanner · Mar 7, 2025 · Citations: 0

Pairwise Preference

To address this gap, we introduce the Business and Finance Fundamentals Benchmark (BFF-Bench), a dataset of 160 challenging questions and long-form responses authored by financial professionals.
Collaborative Evaluation of Deepfake Text with Deliberation-Enhancing Dialogue Systems
Jooyoung Lee, Xiaochen Zhu, Georgi Karadzhov, Tom Stafford, Andreas Vlachos · Mar 6, 2025 · Citations: 0

The proliferation of generative models has presented significant challenges in distinguishing authentic human-authored content from deepfake content.
VQEL: Enabling Self-Play in Emergent Language Games via Agent-Internal Vector Quantization
Mohammad Mahdi Samiei Paqaleh, Mehdi Jamalkhah, Mahdieh Soleymani Baghshah · Mar 6, 2025 · Citations: 0

Emergent Language (EL) focuses on the emergence of communication among artificial agents.
Training-free Adjustable Polynomial Graph Filtering for Ultra-fast Multimodal Recommendation
Yu-Seung Roh, Joo-Young Kim, Jin-Duk Park, Won-Yong Shin · Mar 6, 2025 · Citations: 0
Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling
Yan Li, Zhenyu Zhang, Zhengang Wang, Pengfei Chen, Pengfei Zheng · Mar 6, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions
Emmy Liu, Amanda Bertsch, Lintang Sutawika, Lindia Tjuatja, Patrick Fernandes · Mar 5, 2025 · Citations: 0
LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation
Jude Khouja, Lingyi Yang, Karolina Korgul, Simeon Hellsten, Vlad A. Neacsu · Mar 4, 2025 · Citations: 0

We introduce LINGOLY-TOO, a challenging reasoning benchmark of 1,203 questions and a total of 6,995 sub-questions that counters these shortcuts by applying expert-designed obfuscations to Linguistics Olympiad problems.
Wikipedia in the Era of LLMs: Evolution and Risks
Siming Huang, Yuliang Xu, Mingmeng Geng, Yao Wan, Dongping Chen · Mar 4, 2025 · Citations: 0

If the machine translation benchmark based on Wikipedia is influenced by LLMs, the scores of the models may become inflated, and the comparative results among models could shift.
Rewarding Doubt: A Reinforcement Learning Approach to Calibrated Confidence Expression of Large Language Models
David Bani-Harouni, Chantal Pellegrini, Paul Stangel, Ege Özsoy, Kamilia Zaripova · Mar 4, 2025 · Citations: 0
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
Tin Nguyen, Logan Bolton, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen · Mar 3, 2025 · Citations: 0

A response mixed of factual and non-factual statements poses a challenge for humans to verify and accurately base their decisions on.
SEM-CTRL: Semantically Controlled Decoding
Mohammad Albinhassan, Pranava Madhyastha, Alessandra Russo · Mar 3, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
LLM-Advisor: An LLM Benchmark for Cost-efficient Path Planning across Multiple Terrains
Ling Xiao, Toshihiko Yamasaki · Mar 3, 2025 · Citations: 0

Web Browsing

We further introduce two datasets, MultiTerraPath and RUGD_v2, for systematic evaluation of cost-efficient path planning.
Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
Hanjiang Hu, Alexander Robey, Changliu Liu · Feb 28, 2025 · Citations: 0

Red Team

To address this challenge, we propose a safety steering framework grounded in safe control theory, ensuring invariant safety in multi-turn dialogues.
Causality Is Key to Understand and Balance Multiple Goals in Trustworthy ML and Foundation Models
Ruta Binkyte, Ivaxi Sheth, Zhijing Jin, Mohammad Havaei, Bernhard Schölkopf · Feb 28, 2025 · Citations: 0
Prediction of Item Difficulty for Reading Comprehension Items by Creation of Annotated Item Repository
Radhika Kapoor, Sang T. Truong, Nick Haber, Maria Araceli Ruiz-Primo, Benjamin W. Domingue · Feb 28, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
HaLoRA: Hardware-aware Low-Rank Adaptation for Large Language Models Based on Hybrid Compute-in-Memory Architecture
Taiqiang Wu, Chenchen Ding, Wenyong Zhou, Yuxin Cheng, Xincheng Feng · Feb 27, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Stay Focused: Problem Drift in Multi-Agent Debate
Jonas Becker, Lars Benedikt Kaesberg, Andreas Stephan, Jan Philip Wahle, Terry Ruas · Feb 26, 2025 · Citations: 0

Multi Agent

Multi-agent debate - multiple instances of large language models discussing problems in turn-based interaction - has shown promise for solving knowledge and reasoning tasks.
The Mighty ToRR: A Benchmark for Table Reasoning and Robustness
Shir Ashury-Tahan, Yifan Mai, Rajmohan C, Ariel Gera, Yotam Perlitz · Feb 26, 2025 · Citations: 0

To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks.
Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning
Hongyi Cai, Jie Li, Mohammad Mahdinur Rahman, Wenzhen Dong · Feb 26, 2025 · Citations: 0

Experimental evaluation demonstrates that models fine-tuned on LCG-filtered subsets of 6K samples achieve superior performance compared to existing methods, with substantial improvements on MT-bench and consistent gains across comprehensive…
