HFEPX Archive Slice

HFEPX Quarterly Archive: 2026-Q1

Updated from current HFEPX corpus (Apr 12, 2026). 3,878 papers are grouped in this quarterly archive page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: DROP. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Mar 31, 2026.

Papers: 3,878 · Last published: Mar 31, 2026

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High.

Analysis blocks are computed from the loaded sample (60 of 3,878 papers).

High-Signal Coverage

100.0%

60 / 60 papers are not low-signal flagged.

Benchmark Anchors

10.0%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

53.3%

Papers with reported metric mentions in extraction output.

6 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.


Why This Time Slice Matters

9.2% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 24.1% of papers in this hub.
DROP is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

The most common quality-control signal is rater calibration (1.7% of papers).
Raters are mostly domain experts, and annotation is commonly trajectory-level; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
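
One common way to quantify judge-human agreement is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a minimal illustration, assuming paired categorical labels from one human rater and one LLM judge over the same items. The function name and the example labels are hypothetical and are not taken from any paper in this slice.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance overlap implied by each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for illustration only.
human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "fail", "fail", "fail"]
kappa = cohen_kappa(human_labels, judge_labels)
print(f"kappa = {kappa:.2f}")  # prints kappa = 0.50
```

Computing the same statistic per period would give one possible drift signal, since a falling kappa suggests the judge and human raters are diverging.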

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
Mar 31, 2026 · Citations: 0 · Score: 7.0
Eval: Human Eval · Metrics: Kappa, Agreement

Asymmetric Actor-Critic for Multi-turn LLM Agents
Mar 31, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: Task success

FGR-ColBERT: Identifying Fine-Grained Relevance Tokens During Retrieval
Mar 31, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: F1, Recall

Reward-Based Online LLM Routing via NeuralUCB
Mar 31, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: Cost, Inference cost

M-MiniGPT4: Multilingual VLLM Alignment via Translated Data
Mar 31, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: Accuracy

An Isotropic Approach to Efficient Uncertainty Quantification with Gradient Norms
Mar 31, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: Recall, AUROC

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Date	Eval Modes	Benchmarks	Metrics	Quality Controls
LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias	Mar 31, 2026	Human Eval	Not reported	Kappa, Agreement	Inter Annotator Agreement Reported, Adjudication
Asymmetric Actor-Critic for Multi-turn LLM Agents	Mar 31, 2026	Automatic Metrics	UserBench	Task success	Not reported
FGR-ColBERT: Identifying Fine-Grained Relevance Tokens During Retrieval	Mar 31, 2026	Automatic Metrics	MS MARCO	F1, Recall	Not reported
Reward-Based Online LLM Routing via NeuralUCB	Mar 31, 2026	Automatic Metrics	RouterBench	Cost, Inference cost	Not reported
M-MiniGPT4: Multilingual VLLM Alignment via Translated Data	Mar 31, 2026	Automatic Metrics	MMMU	Accuracy	Not reported
An Isotropic Approach to Efficient Uncertainty Quantification with Gradient Norms	Mar 31, 2026	Automatic Metrics	TriviaQA, TruthfulQA	Recall, AUROC	Not reported
Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study	Mar 31, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
Do Language Models Know When They'll Refuse? Probing Introspective Awareness of Safety Boundaries	Mar 31, 2026	Automatic Metrics	Not reported	Accuracy	Calibration
Learning Diagnostic Reasoning for Decision Support in Toxicology	Mar 31, 2026	Automatic Metrics	Not reported	F1, F1 micro	Not reported
When Can We Trust LLM Graders? Calibrating Confidence for Automated Assessment	Mar 31, 2026	Automatic Metrics	Not reported	Accuracy, Calibration error	Calibration

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback
Coverage is a replication risk (9.2% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (3% vs 30% target).

Gap: Papers naming benchmarks/datasets
Coverage is a replication risk (7.7% vs 35% target).

Moderate: Papers naming evaluation metrics
Coverage is usable but incomplete (24.4% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (7.2% vs 35% target).

Gap: Papers with known annotation unit
Coverage is a replication risk (8.4% vs 35% target).
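
The Gap and Moderate labels in this checklist are consistent with a simple ratio of coverage to target, although the page does not state its actual rule. The sketch below is an assumption made for illustration: the 0.5 cutoff, the "Met" label, and the function name are hypothetical. Only the coverage and target percentages come from the checklist.

```python
def classify_coverage(coverage_pct, target_pct, moderate_ratio=0.5):
    """Label a checklist item by how close coverage is to its target.

    The thresholds are assumptions, not the page's documented rule.
    """
    ratio = coverage_pct / target_pct
    if ratio >= 1.0:
        return "Met"  # hypothetical label; no item on this page reaches its target
    if ratio >= moderate_ratio:
        return "Moderate"
    return "Gap"


# (coverage %, target %) pairs as listed in the checklist.
checklist = {
    "explicit human feedback": (9.2, 45),
    "quality controls": (3, 30),
    "benchmarks/datasets": (7.7, 35),
    "evaluation metrics": (24.4, 35),
    "known rater population": (7.2, 35),
    "known annotation unit": (8.4, 35),
}

labels = {item: classify_coverage(cov, tgt) for item, (cov, tgt) in checklist.items()}
for item, label in labels.items():
    print(f"{label}: {item}")
```

Under this assumed cutoff, only the evaluation-metrics item is labeled Moderate, which matches the labels shown in the checklist.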

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (7.2% coverage).
Annotation unit is under-specified (8.4% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (DROP vs GSM8K) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
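
Stratifying by benchmark before comparing methods can be sketched as a group-then-aggregate step. The example below is a minimal illustration, assuming each extracted record carries a benchmark name and an accuracy value. The field names, the function name, and every record value are hypothetical placeholders and are not results from any paper in this slice.

```python
from collections import defaultdict
from statistics import mean


def stratified_means(records, stratum_key="benchmark", value_key="accuracy"):
    """Average a metric within each stratum instead of pooling all records."""
    groups = defaultdict(list)
    for record in records:
        groups[record[stratum_key]].append(record[value_key])
    return {stratum: mean(values) for stratum, values in groups.items()}


# Placeholder records for illustration only (not real results).
records = [
    {"benchmark": "DROP", "method": "A", "accuracy": 0.5},
    {"benchmark": "DROP", "method": "B", "accuracy": 0.75},
    {"benchmark": "GSM8K", "method": "A", "accuracy": 0.25},
]

per_benchmark = stratified_means(records)
print(per_benchmark)
```

Comparing methods within each stratum avoids mixing scores from benchmarks with different difficulty levels into one pooled average.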

Recommended Queries

Judge vs Human Agreement
Benchmark Slice: DROP
Metric Slice: accuracy
IAA-Reported Evaluations
Recent High-Signal Papers

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (933)
Simulation Env (109)
LLM As Judge (63)
Human Eval (52)
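
The counts above can be converted into shares of the full slice as a quick arithmetic check. The sketch below assumes the denominator is the 3,878 papers reported for this page; the variable and function names are hypothetical.

```python
TOTAL_PAPERS = 3878  # total papers reported for this archive slice

# Evaluation-mode counts as listed on this page.
eval_mode_counts = {
    "Automatic Metrics": 933,
    "Simulation Env": 109,
    "LLM As Judge": 63,
    "Human Eval": 52,
}


def share_pct(count, total):
    """Percentage share of the slice, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {mode: share_pct(count, TOTAL_PAPERS) for mode, count in eval_mode_counts.items()}
for mode, pct in shares.items():
    print(f"{mode}: {pct}%")
```

Under that assumption, the Automatic Metrics share works out to 24.1%, which agrees with the figure quoted earlier on this page.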

Top Metrics

Accuracy (430)
Cost (186)
Precision (81)
Latency (75)

Top Benchmarks

DROP (18)
GSM8K (11)
MMLU (11)
SWE Bench (11)

Quality Controls

Calibration (67)
Inter Annotator Agreement Reported (28)
Adjudication (20)
Gold Questions (10)

Papers In This Archive Slice

Large Language Models in the Abuse Detection Pipeline
Suraj Kath, Sanket Badhe, Preet Shah, Ashwin Sampathkumar, Shivani Gupta · Mar 31, 2026 · Citations: 0

Large Language Models introduce new capabilities for contextual reasoning, policy interpretation, explanation generation, and cross-modal understanding, enabling them to support multiple stages of modern safety systems.
Asymmetric Actor-Critic for Multi-turn LLM Agents
Shuli Jiang, Zhaoyang Zhang, Yi Zhang, Shuo Yang, Wei Xia · Mar 31, 2026 · Citations: 0

Long Horizon

In many real-world applications, agents must succeed in one-shot settings where retries are impossible.
Frege in the Flesh: Biolinguistics and the Neural Enforcement of Syntactic Structures
Elliot Murphy · Mar 31, 2026 · Citations: 0

Biolinguistics is the interdisciplinary scientific study of the biological foundations, evolution, and genetic basis of human language.
Hybrid Energy-Based Models for Physical AI: Provably Stable Identification of Port-Hamiltonian Dynamics
Simone Betteti, Luca Laurenti · Mar 31, 2026 · Citations: 0
Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study
Zaifu Zhan, Mengyuan Cui, Rui Zhang · Mar 31, 2026 · Citations: 0

Critique Edit

Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning; meanwhile,…
LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
Filip J. Kucia, Anirban Chakraborty, Anna Wróblewska · Mar 31, 2026 · Citations: 0

Rubric Rating

We present a systematic evaluation of instruction-tuned LLMs across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, and DREsS) that cover both holistic and analytic scoring.
REM-CTX: Automated Peer Review via Reinforcement Learning with Auxiliary Context
Pawin Taechoyotin, Daniel E. Acuna · Mar 31, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
FGR-ColBERT: Identifying Fine-Grained Relevance Tokens During Retrieval
Antonín Jarolím, Martin Fajčík · Mar 31, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
A Taxonomy of Programming Languages for Code Generation
Nishat Raihan, Christian Newman, Marcos Zampieri · Mar 31, 2026 · Citations: 0

Our results provide a principled framework for dataset curation and tier-aware evaluation of multilingual LLMs.
Do Language Models Know When They'll Refuse? Probing Introspective Awareness of Safety Boundaries
Tanay Gondil · Mar 31, 2026 · Citations: 0

Using signal detection theory (SDT), we find that all models exhibit high introspective sensitivity (d' = 2.4-3.5), but sensitivity drops substantially at safety boundaries.
Do LLMs Know What Is Private Internally? Probing and Steering Contextual Privacy Norms in Large Language Model Representations
Haoran Wang, Li Xiong, Kai Shu · Mar 31, 2026 · Citations: 0

Large language models (LLMs) are increasingly deployed in high-stakes settings, yet they frequently violate contextual privacy by disclosing private information in situations where humans would exercise discretion.
Polish phonology and morphology through the lens of distributional semantics
Paula Orzechowska, R. Harald Baayen · Mar 31, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving
Annette Taberner-Miller · Mar 31, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Oblivion: Self-Adaptive Agentic Memory Control through Decay-Driven Activation
Ashish Rana, Chia-Chien Hung, Qumeng Sun, Julian Martin Kunkel, Carolin Lawrence · Mar 31, 2026 · Citations: 0

Long Horizon

Human memory adapts through selective forgetting: experiences become less accessible over time but can be reactivated by reinforcement or contextual cues.
Hierarchical Chain-of-Thought Prompting: Enhancing LLM Reasoning Performance and Efficiency
Xingshuai Huang, Derek Li, Bahareh Nikpour, Parsa Omidi · Mar 31, 2026 · Citations: 0

Long Horizon

Extensive evaluations across diverse LLMs and mathematical reasoning benchmarks show that Hi-CoT consistently improves average accuracy by 6.2% (up to 61.4% on certain models and tasks) while reducing reasoning trace length by 13.9%…
Hierarchical Pre-Training of Vision Encoders with Large Language Models
Eugene Lee, Ting-Yu Chang, Jui-Huang Tsai, Jiajie Diao, Chen-Yi Lee · Mar 31, 2026 · Citations: 0

Empirical evaluations demonstrate that HIVE achieves superior performance not only in image classification but also on various vision-language tasks, outperforming self-attention-based methods in benchmarks such as MME, GQA, OK-VQA, and…
One Panel Does Not Fit All: Case-Adaptive Multi-Agent Deliberation for Clinical Prediction
Yuxing Lu, Yushuhong Lin, Jason Zhang · Mar 31, 2026 · Citations: 0

Multi Agent

Existing single-agent strategies sample from one role-conditioned distribution, and multi-agent frameworks use fixed roles with flat majority voting, discarding the diagnostic signal in disagreement.
Reward-Based Online LLM Routing via NeuralUCB
Ming-Hua Tsai, Phat Tran · Mar 31, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Covertly improving intelligibility with data-driven adaptations of speech timing
Paige Tuttösí, Angelica Lim, H. Henny Yeung, Yue Wang, Jean-Julien Aucouturier · Mar 31, 2026 · Citations: 0

Human talkers often address listeners with language-comprehension challenges, such as hard-of-hearing or non-native adults, by globally slowing down their speech.
Cognitive Friction: A Decision-Theoretic Framework for Bounded Deliberation in Tool-Using Agents
Davide Di Gioia · Mar 31, 2026 · Citations: 0

Tool Use

Autonomous tool-using agents in networked environments must decide which information source to query and when to stop querying and act.
ContextClaim: A Context-Driven Paradigm for Verifiable Claim Detection
Yufeng Li, Rrubaa Panchendrarajan, Arkaitz Zubiaga · Mar 31, 2026 · Citations: 0

Through component analysis, human evaluation, and error analysis, we further examine when and why the retrieved context contributes to more reliable verifiability judgments.
Tracking Equivalent Mechanistic Interpretations Across Neural Networks
Alan Sun, Mariya Toneva · Mar 31, 2026 · Citations: 0

Our framework lays a foundation for the development of more rigorous evaluation methods of MI and automated, generalizable interpretation discovery methods.
Enhancing Structural Mapping with LLM-derived Abstractions for Analogical Reasoning in Narratives
Mohammadhossein Khojasteh, Yifan Jiang, Stefano De Giorgis, Frank van Harmelen, Filip Ilievski · Mar 31, 2026 · Citations: 0

Analogical reasoning is a key driver of human generalization in problem-solving and argumentation.
Structural Feature Engineering for Generative Engine Optimization: How Content Structure Shapes Citation Behavior
Junwei Yu, Mufeng Yang, Yepeng Ding, Hiroyuki Sato · Mar 31, 2026 · Citations: 0

Web Browsing

Experimental evaluation across six mainstream generative engines demonstrates consistent improvements in citation rate (17.3 percent) and subjective quality (18.5 percent), validating the effectiveness and generalizability of the proposed…
Physiological and Semantic Patterns in Medical Teams Using an Intelligent Tutoring System
Xiaoshan Huang, Conrad Borchers, Jiayi Zhang, Susanne P. Lajoie · Mar 31, 2026 · Citations: 0

This research advances human-centered AI by demonstrating how biological signals can be fused with dialogues to understand critical moments in problem solving.
Four Generations of Quantum Biomedical Sensors
Xin Jin, Priyam Srivastava, Ronghe Wang, Yuqing Li, Jonathan Beaumariage · Mar 31, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Rewrite the News: Tracing Editorial Reuse Across News Agencies
Soveatin Kuntur, Nina Smirnova, Anna Wroblewska, Philipp Mayr, Sebastijan Razboršek Maček · Mar 31, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Learning to Play Blackjack: A Curriculum Learning Perspective
Amirreza Alasti, Efe Erdal, Yücel Celik, Theresa Eimer · Mar 31, 2026 · Citations: 0

We propose a novel framework that uses a Large Language Model (LLM) to dynamically generate a curriculum over available actions, enabling the agent to incorporate each action individually.
Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization
Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mushfiqur Rahman, Niloy Kumar Mondal, Md. Mehedi Hasan Shawon · Mar 31, 2026 · Citations: 0

Through controlled ablations on MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance.
FLEURS-Kobani: Extending the FLEURS Dataset for Northern Kurdish
Daban Q. Jaff, Mohammad Mohammadamini · Mar 31, 2026 · Citations: 0

FLEURS offers n-way parallel speech for 100+ languages, but Northern Kurdish is not one of them, which limits benchmarking for automatic speech recognition and speech translation tasks in this language.
Towards Empowering Consumers through Sentence-level Readability Scoring in German ESG Reports
Benjamin Josef Schüßler, Jakob Prange · Mar 31, 2026 · Citations: 0

We apply various readability scoring methods and evaluate them regarding their prediction error and correlation with human rankings.
SNEAK: Evaluating Strategic Communication and Information Leakage in Large Language Models
Adar Avsian, Larry Heck · Mar 31, 2026 · Citations: 0

Multi Agent

We introduce SNEAK (Secret-aware Natural language Evaluation for Adversarial Knowledge), a benchmark for evaluating selective information sharing in language models.
Owl-AuraID 1.0: An Intelligent System for Autonomous Scientific Instrumentation and Scientific Data Analysis
Han Deng, Anqi Zou, Hanling Zhang, Ben Fei, Chengyu Zhang · Mar 31, 2026 · Citations: 0

We present Owl-AuraID, a software-hardware collaborative embodied agent system that adopts a GUI-native paradigm to operate instruments through the same interfaces as human experts.
ENEIDE: A High Quality Silver Standard Dataset for Named Entity Recognition and Linking in Historical Italian
Cristian Santini, Sebastian Barzaghi, Paolo Sernani, Emanuele Frontoni, Laura Melosi · Mar 31, 2026 · Citations: 0

The dataset's diachronic coverage spanning two centuries makes it particularly suitable for temporal entity disambiguation and cross-domain evaluation.
Reasoning-Driven Synthetic Data Generation and Evaluation
Tim R. Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, Hamza Harkous · Mar 31, 2026 · Citations: 0

Filling these gaps with human annotators is prohibitively expensive, error-prone, and time-consuming, leading model builders to increasingly consider synthetic data as a scalable alternative.
Terminal Agents Suffice for Enterprise Automation
Patrice Bechard, Orlando Marquez Ayala, Emily Chen, Jordan Skelton, Sagar Davasam · Mar 31, 2026 · Citations: 0

There has been growing interest in building agents that can interact with digital platforms to execute meaningful enterprise tasks autonomously.
Training-Free Dynamic Upcycling of Expert Language Models
Eros Fanì, Oğuzhan Ersoy · Mar 31, 2026 · Citations: 0

Expert Verification

To address these issues, we introduce Dynamic Upcycling MoE (DUME), a novel approach that reuses dense experts trained on different domains to construct a unified MoE model.
A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models
Lixin Xiu, Xufang Luo, Hideki Nakayama · Mar 31, 2026 · Citations: 0

Together, these contributions provide a quantitative lens beyond accuracy-only evaluation and offer insights for analyzing and designing the next generation of LVLMs.
Near-Miss: Latent Policy Failure Detection in Agentic Workflows
Ella Rabinovich, David Boaz, Naama Zwerdling, Ateret Anaby-Tavor · Mar 31, 2026 · Citations: 0

In this work, we introduce a novel metric for detecting latent policy failures in agent conversation traces.
Agenda-based Narrative Extraction: Steering Pathfinding Algorithms with Large Language Models
Brian Felipe Keith-Norambuena, Carolina Inés Rojas-Córdova, Claudio Juvenal Meneses-Villegas, Elizabeth Johanna Lam-Esquenazi, Angélica María Flores-Bustos · Mar 31, 2026 · Citations: 0

We evaluated our approach on a news article corpus using LLM judges with Claude Opus 4.5 and GPT 5.1, measuring both coherence and agenda alignment across 64 endpoint pairs and 6 agendas.
Semantic Interaction for Narrative Map Sensemaking: An Insight-based Evaluation
Brian Felipe Keith-Norambuena, Fausto German, Eric Krokos, Sarah Joseph, Chris North · Mar 31, 2026 · Citations: 0

While SI frameworks for narrative extraction have been proposed, empirical evaluations of their effectiveness remain limited.
Convergent Representations of Linguistic Constructions in Human and Artificial Neural Systems
Pegah Ramezani, Thomas Kinfe, Andreas Maier, Achim Schilling, Patrick Krauss · Mar 31, 2026 · Citations: 0

Pairwise Preference

The present study tests these predictions in human neural activity using electroencephalography (EEG).
Learning Diagnostic Reasoning for Decision Support in Toxicology
Nico Oberländer, David Bani-Harouni, Tobias Zellner, Nassir Navab, Florian Eyer · Mar 31, 2026 · Citations: 0

Expert Verification

To address this, we present DeToxR (Decision-support for Toxicology with Reasoning), the first adaptation of Reinforcement Learning (RL) to emergency toxicology.
When Can We Trust LLM Graders? Calibrating Confidence for Automated Assessment
Robinson Ferrer, Damla Turgut, Zhongzhou Chen, Shashank Sonkar · Mar 31, 2026 · Citations: 0

This enables selective automation where high-confidence predictions are processed automatically while uncertain cases are flagged for human review.
FlowPIE: Test-Time Scientific Idea Evolution with Flow-Guided Literature Exploration
Qiyao Wang, Hongbo Wang, Longze Chen, Zhihao Yang, Guhong Chen · Mar 31, 2026 · Citations: 0

Extensive evaluations demonstrate that FlowPIE consistently produces ideas with higher novelty, feasibility and diversity compared to strong LLM-based and agent-based frameworks, while enabling reward scaling during test time.
Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models
Linda Zeng, Steven Y. Feng, Michael C. Frank · Mar 31, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Can LLM Agents Identify Spoken Dialects like a Linguist?
Tobias Bystrich, Lukas Hamm, Maria Hassan, Lea Fischbach, Lucie Flek · Mar 31, 2026 · Citations: 0

In this work, we explore the ability of large language models (LLMs) as agents in understanding the dialects and whether they can show comparable performance to models such as HuBERT in dialect classification.
Baby Scale: Investigating Models Trained on Individual Children's Language Input
Steven Y. Feng, Alvin W. M. Tan, Michael C. Frank · Mar 31, 2026 · Citations: 0

Modern language models (LMs) must be trained on many orders of magnitude more words of training data than human children receive before they begin to produce useful behavior.
Impact of enriched meaning representations for language generation in dialogue tasks: A comprehensive exploration of the relevance of tasks, corpora and metrics
Alain Vázquez, Maria Inés Torres · Mar 31, 2026 · Citations: 0

In addition, among these semantic metrics, those trained with human ratings can detect omissions and other subtle semantic issues that embedding-based metrics often miss.
LLM Probe: Evaluating LLMs for Low-Resource Languages
Hailay Kidu Teklehaymanot, Gebrearegawi Gebremariam, Wolfgang Nejdl · Mar 31, 2026 · Citations: 0

Despite rapid advances in large language models (LLMs), their linguistic abilities in low-resource and morphologically rich languages are still not well understood due to limited annotated resources and the absence of standardized…
Distilling Human-Aligned Privacy Sensitivity Assessment from Large Language Models
Gabriel Loiseau, Damien Sileo, Damien Riquet, Maxime Meyer, Marc Tommasi · Mar 31, 2026 · Citations: 0

Accurate privacy evaluation of textual data remains a critical challenge in privacy-preserving natural language processing.
Metriplector: From Field Theory to Neural Architecture
Dan Oprisa, Peter Toth · Mar 31, 2026 · Citations: 0
MemFactory: Unified Inference & Training Framework for Agent Memory
Ziliang Guo, Ziheng Li, Bo Tang, Feiyu Xiong, Zhiyu Li · Mar 31, 2026 · Citations: 0

To address this gap, we present MemFactory, the first unified, highly modular training and inference framework specifically designed for memory-augmented agents.
Calibrated Confidence Expression for Radiology Report Generation
David Bani-Harouni, Chantal Pellegrini, Julian Lüers, Su Hwan Kim, Markus Baalmann · Mar 31, 2026 · Citations: 0

Expert Verification

In a clinical evaluation we show that ConRad's report level scores are well aligned with clinicians' judgment.
M-MiniGPT4: Multilingual VLLM Alignment via Translated Data
Seung Hun Han, Youssef Mohamed, Mohamed Elhoseiny · Mar 31, 2026 · Citations: 0

M-MiniGPT4 achieves 36% accuracy on the multilingual MMMU benchmark, outperforming state-of-the-art models in the same weight class, including foundation models released after the majority of this work was completed.
An Isotropic Approach to Efficient Uncertainty Quantification with Gradient Norms
Nils Grünefeld, Jes Frellsen, Christian Hardmeier · Mar 31, 2026 · Citations: 0

We then use the estimates to investigate when each uncertainty type carries useful signal for predicting answer correctness in question answering with large language models, revealing a benchmark-dependent divergence: the combined estimate…
Authorship Impersonation via LLM Prompting does not Evade Authorship Verification Methods
Baoyi Zeng, Andrea Nini · Mar 31, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
CounselReflect: A Toolkit for Auditing Mental-Health Dialogues
Yahan Li, Chaohao Du, Zeyang Li, Christopher Chun Kuizon, Shupeng Cheng · Mar 31, 2026 · Citations: 0

Rubric Rating · Expert Verification · Web Browsing

The system integrates two families of evaluation signals: (i) 12 model-based metrics produced by task-specific predictors, and (ii) rubric-based metrics that extend coverage via a literature-derived library (69 metrics) and user-defined…
PRISM: PRIor from corpus Statistics for topic Modeling
Tal Ishon, Yoav Goldberg, Uri Shaham · Mar 31, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Security in LLM-as-a-Judge: A Comprehensive SoK
Aiman Al Masoud, Antony Anju, Marco Arazzi, Mert Cihangiroglu, Vignesh Kumar Kembu · Mar 31, 2026 · Citations: 0
