HFEPX Archive Slice

HFEPX Daily Papers for 2026-05-28

Daily archive slice for 2026-05-28 from the HFEPX corpus. Generated from the HFEPX corpus as of 2026-06-01; covers 60 papers from 2026-05-28.

Papers: 60 · Last published: May 28, 2026

Researcher Quick Triage

Use this archive page for time-slice monitoring: what changed in evaluation methods, metrics, and protocol quality during this period. Quality band: High.

High-Signal Coverage

100.0%

60 of 60 papers are not flagged as low-signal.

Benchmark Anchors

13.3%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

30.0%

Papers with reported metric mentions in extraction output.
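
The three coverage figures above follow from counts reported on this page: 60 of 60 papers not flagged as low-signal, 8 of 60 with benchmark mentions, and 18 of 60 with metric mentions. A minimal Python sketch of that arithmetic is shown here; the function name `coverage_pct` and the dictionary keys are illustrative assumptions, not part of any HFEPX tooling.

```python
def coverage_pct(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * count / total, 1)


TOTAL_PAPERS = 60

# Counts as reported on this page (keys are hypothetical labels).
ANCHOR_COUNTS = {
    "high_signal": 60,
    "benchmark": 8,
    "metric": 18,
}

coverage = {name: coverage_pct(n, TOTAL_PAPERS) for name, n in ANCHOR_COUNTS.items()}

for name, pct in coverage.items():
    print(f"{name}: {pct:.1f}%")
```

The same helper can be reused for any other "N of 60 papers" figure on the page.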

3 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: use this slice for trend comparison. Review the top papers first, then validate shifts in the protocol matrix.


Why This Time Slice Matters

Use this archive slice to monitor protocol drift and shifts in evaluation methods for 2026-05-28.

Protocol Takeaways For This Period

Evaluation modes for this slice cluster around automatic_metrics (16 papers) and simulation_env (4 papers).

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Recovering Diversity Without Losing Alignment: A DPO Recipe for Post-Trained LLMs
May 28, 2026 · Citations: 0 · Score: 7.5

Eval: Automatic Metrics · Metrics: Success rate, Jailbreak success rate
Latent Performance Profiling of Large Language Models
May 28, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Accuracy
Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents
May 28, 2026 · Citations: 0 · Score: 6.5

Eval: Simulation Env · Metrics: Task success
MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings
May 28, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy
Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection
May 28, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy, Recall
SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge?
May 28, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy, Spearman

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Date	Eval Modes	Benchmarks	Metrics	Quality Controls
Recovering Diversity Without Losing Alignment: A DPO Recipe for Post-Trained LLMs	May 28, 2026	Automatic Metrics	MT Bench, LMSYS Chatbot Arena	Success rate, Jailbreak success rate	Not reported
Latent Performance Profiling of Large Language Models	May 28, 2026	Automatic Metrics	MMLU, MMLU Pro	Accuracy	Not reported
Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents	May 28, 2026	Simulation Env	WebArena	Task success	Not reported
MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings	May 28, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection	May 28, 2026	Automatic Metrics	Not reported	Accuracy, Recall	Calibration
SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge?	May 28, 2026	Automatic Metrics	Not reported	Accuracy, Spearman	Not reported
DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation	May 28, 2026	Human Eval	DirectorBench	Not reported	Not reported
Conformal Certification of Reasoning Trace Prefixes	May 28, 2026	Automatic Metrics	Not reported	Accuracy, AUROC	Calibration
Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation	May 28, 2026	Automatic Metrics	Not reported	Success rate, Latency	Not reported
SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations	May 28, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
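
The triage advice on this page is to prioritize papers that carry both a benchmark anchor and a metric anchor. A rough Python sketch of that filter over the protocol matrix rows is given below. The tuple layout, the shortened titles, and the helper name `has_both_anchors` are assumptions made for illustration; only the benchmark and metric cell values are copied from the matrix.

```python
NOT_REPORTED = "Not reported"

# (short title, benchmarks, metrics) from the protocol matrix above.
# Titles are shortened to the part before the colon or question mark.
ROWS = [
    ("Recovering Diversity Without Losing Alignment",
     "MT Bench, LMSYS Chatbot Arena", "Success rate, Jailbreak success rate"),
    ("Latent Performance Profiling of Large Language Models",
     "MMLU, MMLU Pro", "Accuracy"),
    ("Does The Way You Plan Matter", "WebArena", "Task success"),
    ("MedCase-Structured", NOT_REPORTED, "Accuracy"),
    ("Token-Level Generalization in LoRA Adapter Backdoors",
     NOT_REPORTED, "Accuracy, Recall"),
    ("SEAL", NOT_REPORTED, "Accuracy, Spearman"),
    ("DirectorBench", "DirectorBench", NOT_REPORTED),
    ("Conformal Certification of Reasoning Trace Prefixes",
     NOT_REPORTED, "Accuracy, AUROC"),
    ("Audio Jailbreaks in Large Audio-Language Models",
     NOT_REPORTED, "Success rate, Latency"),
    ("SchGen", NOT_REPORTED, "Accuracy"),
]


def has_both_anchors(benchmarks: str, metrics: str) -> bool:
    """True when both the benchmark and the metric cells are populated."""
    return benchmarks != NOT_REPORTED and metrics != NOT_REPORTED


priority = [title for title, b, m in ROWS if has_both_anchors(b, m)]

for title in priority:
    print(title)
```

Under these assumptions, only the first three rows of the matrix pass the filter; DirectorBench names a benchmark but reports no metrics.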

Researcher Workflow (Detailed)

Checklist

Moderate: Human feedback

Human feedback is present in 17 of 60 papers.
Gap: Quality controls

Quality controls are present in 3 of 60 papers.
Gap: Benchmarks

Benchmarks are present in 8 of 60 papers.
Moderate: Metrics

Metrics are present in 18 of 60 papers.
Gap: Known rater population

Known rater population is present in 4 of 60 papers.
Gap: Known annotation unit

Known annotation unit is present in 10 of 60 papers.
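
The page does not state how the "Moderate" and "Gap" labels are assigned. As a hedged illustration only, the sketch below assumes a single cut-off of 25% of papers. That value is a guess that happens to reproduce the six labels shown in this checklist, and the page may use other bands (for example a stronger tier) that do not appear here.

```python
TOTAL_PAPERS = 60

# ASSUMPTION: the real cut-off is not given on the page; 0.25 is a guess
# that is consistent with the six labels in this checklist.
GAP_THRESHOLD = 0.25

# Counts copied from the checklist above.
CHECKLIST = {
    "Human feedback": 17,
    "Quality controls": 3,
    "Benchmarks": 8,
    "Metrics": 18,
    "Known rater population": 4,
    "Known annotation unit": 10,
}


def band(count: int, total: int, threshold: float = GAP_THRESHOLD) -> str:
    """Label a checklist item by the share of papers that report it."""
    return "Moderate" if count / total >= threshold else "Gap"


labels = {name: band(n, TOTAL_PAPERS) for name, n in CHECKLIST.items()}

for name, label in labels.items():
    print(f"{label}: {name} ({CHECKLIST[name]} of {TOTAL_PAPERS})")
```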

Known Gaps

Quality controls are present in 3 of 60 papers.
Benchmarks are present in 8 of 60 papers.
Known rater population is present in 4 of 60 papers.

Suggested Next Analyses

Compare 2026-05-28 against neighboring archive slices to flag protocol drift.
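
One simple way to flag protocol drift is to compare each evaluation mode's share of papers between two slices. The sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the key `human_eval` is assumed by analogy with the `automatic_metrics` and `simulation_env` identifiers used on this page, and only this slice's counts are filled in.

```python
from typing import Dict


def mode_shares(counts: Dict[str, int], total_papers: int) -> Dict[str, float]:
    """Convert per-mode paper counts into shares of the slice."""
    if total_papers <= 0:
        raise ValueError("total_papers must be positive")
    return {mode: n / total_papers for mode, n in counts.items()}


def share_shift(current: Dict[str, float], previous: Dict[str, float]) -> Dict[str, float]:
    """Per-mode change in share (current minus previous); missing modes count as 0."""
    modes = sorted(set(current) | set(previous))
    return {m: current.get(m, 0.0) - previous.get(m, 0.0) for m in modes}


# Evaluation-mode counts for this slice, as listed on this page.
SLICE_2026_05_28 = {"automatic_metrics": 16, "simulation_env": 4, "human_eval": 2}

current_shares = mode_shares(SLICE_2026_05_28, 60)
for mode, share in current_shares.items():
    print(f"{mode}: {share:.1%}")
```

No neighboring-slice counts appear on this page, so none are supplied here. To compare, build a second dictionary from the adjacent archive page, pass it through `mode_shares`, and call `share_shift(current_shares, neighbor_shares)`.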

Recommended Queries

Browse all HFEPX daily archives

Known Limitations

This synthetic archive page is generated on demand from extraction data because no cached payload was available for 2026-05-28.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (16)
Simulation Env (4)
Human Eval (2)

Top Metrics

Accuracy (8)
Cost (2)
F1 (2)
Jailbreak success rate (2)

Top Benchmarks

IFEval (2)
MMLU (2)
MMLU Pro (2)
Arena Hard (1)

Quality Controls

Calibration (2)
Adjudication (1)

Papers In This Archive Slice

LLMSurgeon: Diagnosing Data Mixture of Large Language Models
Yaxin Luo, Jiacheng Cui, Xiaohan Zhao, Xinyi Shang, Jiacheng Liu · May 28, 2026 · Citations: 0

To evaluate, we introduce LLMScan, a recipe-verifiable evaluation suite built from open-source LLMs with transparent pretraining mixtures.
SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations
Qinpei Luo, Ruichun Ma, Xinyu Zhang, Lili Qiu · May 28, 2026 · Citations: 0

We further construct a large-scale dataset of PCB schematics paired with user prompts via a human-agent collaborative pipeline that converts open-source hardware designs into our representation.
Unlocking the Working Memory of Large Language Models for Latent Reasoning
Lukas Aichberger, Sepp Hochreiter · May 28, 2026 · Citations: 0

In contrast, human cognition can use working memory to hold and manipulate information internally without the need to externalize intermediate thoughts.
Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents
Anany Kotawala · May 28, 2026 · Citations: 0

Multi-component LLM agents assemble probabilistic claims from components that each see only part of a joint problem; the composition can violate basic probability axioms even when every component is locally coherent.
Demystifying Data Organization for Enhanced LLM Training
Yalun Dai, Yangyu Huang, Tongshen Yang, Yonghan Wang, Xin Zhang · May 28, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
COMPOSE: Composing Future Theorems from Citations and Formal Structure
David Busbib, Michael Werman · May 28, 2026 · Citations: 0

To support this setting, we construct a dataset of 108K paired scientific-formal graph examples from arXiv and Mathlib, together with a benchmark of 47K future papers from 2024-2025.
Reasoning with Sampling: Cutting at Decision Points
Felix Zhou, Anay Mehrotra, Quanquan C. Liu · May 28, 2026 · Citations: 0

Across MATH500, HumanEval, GPQA Diamond, and AIME26, our method consistently improves over baselines and RL-trained models.
On Language Generation in the Limit with Bounded Memory
Jon Kleinberg, Anay Mehrotra, Amin Saberi, Grigoris Velegkas · May 28, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Resolution Diagnostics for Paired LLM Evaluation
Anany Kotawala · May 28, 2026 · Citations: 0

Pairwise Preference

Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9…
MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings
Valentina Bui Muti, Eugénie Dulout, Ziquan Fu · May 28, 2026 · Citations: 0

Expert Verification

We introduce a pipeline for generating clinically realistic HL7 FHIR R4 bundles from unstructured text, enabling controllable evaluation of clinical decision support systems.
Self-Trained Verification for Training- and Test-Time Self-Improvement
Chen Henry Wu, Aditi Raghunathan · May 28, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie · May 28, 2026 · Citations: 0

Demonstrations · Long Horizon

Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data,…
Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection
Yutong Wang, Xuebo Liu, Derek F. Wong, Zhilin Li, Rongqing Jiang · May 28, 2026 · Citations: 0

Pairwise Preference

To address this, we propose a human-like long document translation agent called Loong, which leverages a 3E memory module (Essence-Exemplar-Entity) to store summaries, sentence pairs, and entity records as historical context.
LLUMI: Improving LLM Writing Assistance for Mental Health Support with Online Community Feedback
Jiwon Kim, Maya Ajit, Sherry Gong, Soorya Ram Shimgekar, Dong Whi Yoo · May 28, 2026 · Citations: 0

Pairwise Preference

Large language models (LLMs) show promise in generating supportive responses for mental health queries, but improving their usefulness, empathy, and safety often requires substantial compute, expert input, and labeled data.
LoMo: Local Modality Substitution for Deeper Vision-Language Fusion
Feng Han, Zhixiong Zhang, Zheming Liang, Yibin Wang, Jiaqi Wang · May 28, 2026 · Citations: 0

Pairwise Preference

Such data bias leads VLMs to exhibit distinct preferences for information acquisition across different modalities.
How LoRA Remembers? A Parametric Memory Law for LLM Finetuning
Ziwen Xu, Haiwen Hong, Linsong Yu, Benglei Cui, Longtao Huang · May 28, 2026 · Citations: 0

While Low-Rank Adaptation (LoRA) is widely used for such memory updates, existing studies mainly rely on qualitative downstream evaluations, leaving the quantitative capacity limits and underlying dynamics of exact parametric memory largely…
VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents
Amrita Mazumdar, Seonwook Park, Rajarshi Roy, Nikhil Srihari, Shengze Wang · May 28, 2026 · Citations: 0

Rubric Rating

In this work, we present VideoFDB, the first benchmark to evaluate full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents.
Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models
Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang, Yifan Zhu · May 28, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Knowing What to Solve Before How: Preplan Empowered LLM Mathematical Reasoning
Shaojie Wang, Liang Zhang · May 28, 2026 · Citations: 0

Experiments across four backbones and five mathematical reasoning benchmarks show that PPC achieves the best results on 39 of 40 metrics, improving maj@16 and pass@16 by +2.23 and +3.06 over the strongest baseline without introducing…
CommunityFact: A Dynamic, Multilingual, Multi-domain Benchmark for Misinformation Detection in the Wild
Sahajpreet Singh, Insyirah Mujtahid, Min-Yen Kan, Kokil Jaidka · May 28, 2026 · Citations: 0

Misinformation verification increasingly occurs in public, fast-moving, and multilingual online settings, where static benchmarks provide an incomplete measure of model reliability.
GRASP: Plan-Guided Graph Retrieval with Adaptive Fusion and Reranking on Semi-Structured Knowledge Bases
Yicheng Tao, Yiqun Wang, Xiangchen Song, Xin Luo, Kai Liu · May 28, 2026 · Citations: 0

GRASP substantially advances the state of the art on every metric across the three STaRK benchmarks, lifting average Hit@1 from 62.0 to 73.9.
Do Language Models Track Entities Across State Changes?
Zilu Tang, Qiao Zhao, Gabriel Franco, Derry Wijaya, Aaron Mueller · May 28, 2026 · Citations: 0

Behavioral results inform mechanistic hypotheses, and insights from mechanistic analyses help build stronger behavioral evaluations by predicting failure modes missing from existing evaluations.
How's it going? Reinforcement learning in language models recruits a functional welfare axis
Andy Q Han, David J. Chalmers, Pavel Izmailov · May 28, 2026 · Citations: 0

Demonstrations

We present evidence that RL recruits a pre-existing representation of functional welfare: an estimate of how well or badly the system is doing, relative to its goals.
When Should Models Change Their Minds? Contextual Belief Management in Large Language Models
Haoming Xu, Weihong Xu, Zongrui Li, Mengru Wang, Yunzhi Yao · May 28, 2026 · Citations: 0

Long Horizon

To make CBM measurable, we introduce BeliefTrack, a closed-world benchmark spanning Rule Discovery and Circuit Diagnosis, where a finite belief space and symbolic verifiers enable exact turn-level evaluation.
GRUFF: LLM Pronoun Fidelity, Reasoning, and Biases in German
Fabian Mewes, Anne Lauscher, Vagrant Gautam · May 28, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
A Dual-Path Architecture for Scaling Compute and Capacity in LLMs
Markus Frey, Behzad Shomali, Joachim Koehler, Mehdi Ali · May 28, 2026 · Citations: 0

We show that across two FLOP budgets, our dual-path model surpasses iso-FLOP matched models on language modeling and downstream evaluations, while using fewer parameters than the baseline at matched FLOPs.
Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection
Travis Lelle · May 28, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Do Proactive Agents Really Need an LLM to Decide When to Wake and What to Anchor?
Xiaoze Liu, Ruowang Zhang, Amir H. Abdi, Michel Galley, Zhikai Chen · May 28, 2026 · Citations: 0

Proactive agents read user activity as text and call an LLM on every event to decide whether to act.
CorPipe at CRAC 2026: Empty Nodes and Cross-Lingual Transfer in Multilingual Coreference Resolution
Milan Straka · May 28, 2026 · Citations: 0

Furthermore, we perform a series of ablation experiments with different model sizes, empty node prediction methods, and cross-lingual zero-shot evaluation.
CCS: Clinical Consensus Selection for Radiology Report Generation
Xi Zhang, Yingshu Li, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho · May 28, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding
Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele, Federico Tombari · May 28, 2026 · Citations: 0

Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy…
Dial HEALTHDIAL for Advice: A Multilingual and Multi-Parallel Spoken Dialogue Dataset for Knowledge-Grounded Information Seeking
Songbo Hu, Yinhong Liu, Ej Zhou, Evgeniia Razumovskaia, Xiaobin Wang · May 28, 2026 · Citations: 0

We report benchmark results across key dialogue tasks, which reveal consistent performance disparities across languages, even among high-resource ones.
SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge?
Jiamin Chen, Yidi Wu, Qiexiang Wang, Qianben Chen, Yuchen Li · May 28, 2026 · Citations: 0

Pairwise Preference

Therefore, we present Seeded Elimination with Adaptive LLM-as-a-Meta-Judge, a self-improving evaluation protocol for extracting latent ranking signal from saturated benchmarks.
DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation
Jiamin Chen, Qianben Chen, Jiawen Zhang, Yidi Wu, Yuchen Li · May 28, 2026 · Citations: 0

Pairwise Preference · Multi Agent

However, evaluating such videos remains challenging, since existing benchmarks largely focus on local visual quality, short-horizon temporal consistency, or generic prompt alignment, and provide limited diagnosis of workflow failures and…
Conformal Certification of Reasoning Trace Prefixes
Matt Y. Cheung, Ashok Veeraraghavan, Hanjie Chen, Guha Balakrishnan · May 28, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Adaptive Targeted Dynamic Chunking for Tokenization-Free Hierarchical Model
Thang Dang, Akira Nakagawa, Kenichi Kobayashi, Koichi Shirahata · May 28, 2026 · Citations: 0

Evaluations conducted on the FineWeb-Edu 100B dataset demonstrate that hierarchical models equipped with ATDC achieve competitive Bits-Per-Byte (BPB) performance compared to conventional baselines operating at both byte and token levels.
UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering
Yingdong Shi, Ruiming Zhang, Changming Li, Zhiyu Yang, Kaixing Zhang · May 28, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?
Weihan Peng, Chenxu Zhang, Qianao Wang, Yuling Shi, Heng Lian · May 28, 2026 · Citations: 0

While LLM agents have demonstrated remarkable task-oriented abilities such as planning, reasoning, and action, few works have treated them as complete human personalities where emotional dimensions hold equal importance.
REPOT: Recoverable Program-of-Thought via Checkpoint Repair
Parsa Mazaheri · May 28, 2026 · Citations: 0

Long Horizon

On Derail-550, our controlled recovery benchmark, every condition with access to checkpoint information clears >=30% on GPT-medium and >=70% on Gemini, vs <=3.1% for error-only feedback, showing that checkpoint information, not the…
Who Am I? History-Aware Profiles for Student Simulation in Tutoring Dialogues
Zhangqi Duan, Shuyan Huang, Alexander Scarlatos, Jaewook Lee, Simon Woodhead · May 28, 2026 · Citations: 0

A key part of developing large language model (LLM)-powered, automated tutoring tools is student simulation, i.e., using LLMs to role-play as students, which can facilitate tutor model evaluation and training.
Token Inflation: How Dishonest Providers Can Overcharge for Large Language Model Usage
Shahinul Hoque, Jinghuai Zhang, Jinyuan Sun, Fnu Suya · May 28, 2026 · Citations: 0

Red Team

We show that this kind of billing is hard to audit by design: providers hide the model, the tokenizer, and the execution to protect their IP, mitigate jailbreaks, and preserve user privacy, which means an auditor can only inspect proofs the…
Teaching Values to Machines: Simulating Human-Like Behavior in LLMs
Asaf Yehudai, Naama Rozen, Ariel Gera · May 28, 2026 · Citations: 0

Large Language Models (LLMs) demonstrate a remarkable capacity to adopt different personas and roles; however, it remains unclear whether they can manifest behavior that adheres to a coherent, human-like value structure.
Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation
Bo-Han Feng, Yu-Hsuan Li Liang, Chien-Feng Liu, You-Hsuan Chang, Yun-Nung Chen · May 28, 2026 · Citations: 0

Red Team

Existing work studies these risks under heterogeneous threat models and evaluation protocols, making it difficult to compare attack practicality or defense utility.
Give it Space! Explicit Disentangling of Positional and Semantic Representations in Encoders
Pierre-Antoine Lequeu, Camille Barboule, Benjamin Piwowarski · May 28, 2026 · Citations: 0

The disentangled approach preserves positional encoding, which improves linguistic representation on 49 of the 65 linguistic phenomena of the Flash-Holmes probing benchmark.
Recovering Diversity Without Losing Alignment: A DPO Recipe for Post-Trained LLMs
Vinay Samuel, Yapei Chang, Mohit Iyyer · May 28, 2026 · Citations: 0

Pairwise Preference

For each prompt, REDIPO samples responses from both base and instruct models, rewrites base-model responses with the instruct model, filters candidates for safety and instruction-following quality, and builds preference pairs that favor…
Latent Performance Profiling of Large Language Models
Tanmoy Chakraborty, Ayan Sengupta, Suparna Bhattacharya, Partha Pratim Chakrabarti, Amlan Chakrabarti · May 28, 2026 · Citations: 0

Large language models (LLMs) frequently achieve impressive scores on standardized benchmarks, yet accuracy alone offers a limited view of their capabilities.
Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation
M. Ali Bayram, Banu Diri, Savaş Yıldırım · May 28, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
MIC: Maximizing Informational Capacity in Adaptive Representations via Isotropic Subspace Alignment
Dang Hong Nguyen, Nhi Ngoc-Yen Nguyen, Huy-Hieu Pham · May 28, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Causal Interventions on Continuous Variables: A Case Study on Verb Bias in Steering Vectors for In-Context Learning
Zhenghao Herbert Zhou, R. Thomas McCoy, Robert Frank · May 28, 2026 · Citations: 0

Pairwise Preference

We show that verb bias is causally represented in steering vectors extracted from large language models: counterfactual edits to verb bias systematically shift downstream structural preferences.
MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization
Anisha Saha, Varsha Suresh, Teodora Kamova, Sophia Wiedmann, Timothy Hospedales · May 28, 2026 · Citations: 0

Our findings suggest that reasoning-oriented reward optimization offers a promising direction towards building multimodal systems that generalize beyond benchmark-specific shortcuts.
Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents
Alejandra Zambrano, Sara Vera Marjanovic, Imene Kerboua, Xing Han Lù, Leila Kosseim · May 28, 2026 · Citations: 0

To address this, we introduce PlanAhead, a static planner-executor framework that evaluates the impact of plan representation in agent performance.
ExCAM: Explainable Cultural Awareness Metrics
Christoph Leiter, Haiyue Song, Hour Kaing, Jin Tei, Hideki Tanaka · May 28, 2026 · Citations: 0

To address this gap, we introduce ExCAM, an Explainable Cultural Awareness Metric, which is, to our knowledge, the first dedicated evaluation metric that identifies, rates and explains cultural errors in instruction-output pairs.
Internal Representation, Not Clinical Knowledge: Where Apparent LLM Triage Failures Originate
David Fraile Navarro, Berardino Como, Jialei Sheng, Soundariya Ananthan, Shlomo Berkovsky · May 28, 2026 · Citations: 0

Patient-voiced clinical-triage benchmarks report high under-triage rates for consumer LLMs for constrained multiple-choice output, yet the same cases score differently with free-text.
CRITIC-R1: Learning Structured Critics for Retrieval-Augmented Generation
Wenhan Xiao, Ziwei Zhang, Chuanyue Yu, Xingcheng Fu, Qingyun Sun · May 28, 2026 · Citations: 0

Critique Edit

To learn these capabilities, we design two reward functions: Conservative Judgement Alignment (CJA) first encourages calibrated high-level judgements while mitigating the over-aggressive phenomenon, whereas Diagnostic Quality Alignment…
Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation
Chenghao Zhang, Guanting Dong, Yufan Liu, Tong Zhao, Zhicheng Dou · May 28, 2026 · Citations: 0

Tool Use

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long-form reports.
MELD: Mel-Spectrogram-Based Speech Language Modeling with Discrete Latent Variables
Sung-Lin Yeh, Wei Zhou, Gil Keren, Duc Le, Zhong Meng · May 28, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
EvoRubric: Self-Evolving Rubric-Driven RL for Open-Ended Generation
Xin Guan, Xiaomeng Hu, Shen Huang, Zhenyi Wang, Bo Zhang · May 28, 2026 · Citations: 0

Rubric Rating

Current rubric-based RL methods mitigate this by employing explicit criteria; however, they rely heavily on static, human-annotated rubrics that inevitably cause policy lag, or expensive external proprietary models for dynamic updates.
Towards Localized and Disentangled Knowledge Editing for Multimodal Large Language Models
Leijiang Gu, Zhen Zeng, Feng Li, Xinjian Gao, Zenglin Shi · May 28, 2026 · Citations: 0

Extensive experiments across various benchmarks and MLLMs demonstrate that LDKE achieves superior performance in propagating edits to related contexts while maintaining high locality.
PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing
Krzysztof Żurawicki, Julia Farganus, Arkadiusz Gaweł, Mateusz Bystroński, Tomasz Jan Kajdanowicz · May 28, 2026 · Citations: 0

Yet, it remains unknown whether LLMs engage with scientific manuscripts in the same manner as human reviewers, or whether they merely produce review-looking text.
Data filtering methods for training language models
Egor Shevchenko, Elena Bruches · May 28, 2026 · Citations: 0

Label errors, present even in widely used benchmarks, introduce noise into training data and reduce model generalization.
