HFEPX Archive Slice

HFEPX Daily Archive: 2026-04-07

Updated from current HFEPX corpus (Apr 9, 2026). 83 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 9, 2026). 83 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Calibration. Frequently cited benchmark: Insightbench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Apr 7, 2026.

Papers: 83 Last published: Apr 7, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

Analysis blocks are computed from the loaded sample (60 of 83 papers).

High-Signal Coverage

100.0%

60 / 60 papers are not low-signal flagged.

Benchmark Anchors

13.3%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

48.3%

Papers with reported metric mentions in extraction output.

3 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.

Get this digest every Friday →

Why This Time Slice Matters

15.7% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 41% of papers in this hub.
Insightbench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (7.2% of papers).
Rater context is mostly domain experts, and annotation is commonly ranking annotation; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Paper-Reviewer Matching
Apr 7, 2026 · Citations: 0 · Score: 7.5

Eval: Automatic Metrics · Metrics: Recall
A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models
Apr 7, 2026 · Citations: 0 · Score: 7.0

Eval: Automatic Metrics · Metrics: F1, Agreement
DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling
Apr 7, 2026 · Citations: 0 · Score: 6.5

Eval: Human Eval · Metrics: Recall
STDec: Spatio-Temporal Stability Guided Decoding for dLLMs
Apr 7, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Throughput
LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces
Apr 7, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Accuracy
Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts
Apr 7, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Accuracy, Cost

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Paper-Reviewer Matching Apr 7, 2026	Automatic Metrics	Scirepeval	Recall	Not reported
A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models Apr 7, 2026	Automatic Metrics	Not reported	F1, Agreement	Calibration, Adjudication
DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling Apr 7, 2026	Human Eval	Insightbench	Recall	Not reported
STDec: Spatio-Temporal Stability Guided Decoding for dLLMs Apr 7, 2026	Automatic Metrics	MBPP+	Throughput	Not reported
LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces Apr 7, 2026	Automatic Metrics	DROP, Halueval	Accuracy	Not reported
Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts Apr 7, 2026	Automatic Metrics	Needle In A Haystack, IFEval	Accuracy, Cost	Not reported
State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation Apr 7, 2026	Automatic Metrics	Not reported	Cost	Not reported
Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning Apr 7, 2026	Automatic Metrics	Not reported	Accuracy, Cost	Not reported
MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control Apr 7, 2026	Automatic Metrics	Not reported	Latency	Not reported
Fine-tuning Whisper for Pashto ASR: strategies and scale Apr 7, 2026	Automatic Metrics	Not reported	Wer, Jailbreak success rate	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (15.7% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (9.6% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (4.8% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (14.5% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (10.8% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (15.7% vs 35% target).

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 9.6% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (10.8% coverage).
Annotation unit is under-specified (15.7% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (Insightbench vs Ludobench) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.

Recommended Queries

Judge vs Human Agreement Benchmark Slice: Insightbench Metric Slice: accuracy IAA-Reported Evaluations Recent High-Signal Papers

Known Limitations

Only 9.6% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (10.8% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (34)
Simulation Env (4)
Llm As Judge (3)
Human Eval (1)

Top Metrics

Accuracy (3)
Cost (3)
Recall (3)
F1 (2)

Top Benchmarks

Insightbench (1)
Ludobench (1)
Scirepeval (1)
SQuAD (1)

Quality Controls

Calibration (6)
Adjudication (2)
Gold Questions (1)
Inter Annotator Agreement Reported (1)

Papers In This Archive Slice

Fine-tuning Whisper for Pashto ASR: strategies and scale
Hanif Rahman · Apr 7, 2026 · Citations: 0

Fine-tuned checkpoints and evaluation scripts are released on HuggingFace.
MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts
Weiyue Li, Ruizhi Qian, Yi Li, Yongce Li, Yunfan Long · Apr 7, 2026 · Citations: 0

As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and LLM-as-a-judge.
Transformer See, Transformer Do: Copying as an Intermediate Step in Learning Analogical Reasoning
Philipp Hellwig, Willem Zuidema, Claire E. Stevenson, Martha Lewis · Apr 7, 2026 · Citations: 0

Analogical reasoning is a hallmark of human intelligence, enabling us to solve new problems by transferring knowledge from one situation to another.
Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR
Thibault Bañeras-Roux, Sergio Burdisso, Esaú Villatoro-Tello, Dairazalia Sánchez-Cortés, Shiran Liu · Apr 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs
Zhipin Wang, Christoph Leiter, Christian Frey, Mohamed Hesham Ibrahim Abdalla, Josif Grabocka · Apr 7, 2026 · Citations: 0

Yet existing evaluations of cultural values in language models are almost entirely text-only, making it unclear whether models can ground culture-conditioned judgments when response options are visualized.
DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling
Shicheng Liu, Yucheng Jiang, Sajid Farook, Camila Nicollier Sanchez, David Fernando Castro Pena · Apr 7, 2026 · Citations: 0

Long Horizon

Deep research with Large Language Model (LLM) agents is emerging as a powerful paradigm for multi-step information discovery, synthesis, and analysis.
Multi-objective Evolutionary Merging Enables Efficient Reasoning Models
Mario Iacobelli, Adrian Robert Minut, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli · Apr 7, 2026 · Citations: 0

Comprehensive experiments across 1.5B, 7B, and 14B parameter scales on six mathematical reasoning benchmarks demonstrate that Evo-L2S can reduce the length of generated reasoning traces by over 50% while preserving, or even improving, the…
Context-Aware Dialectal Arabic Machine Translation with Interactive Region and Register Selection
Afroza Nowshin, Prithweeraj Acharjee Porag, Haziq Jeelani, Fayeq Jeelani Syed · Apr 7, 2026 · Citations: 0

Through a combination of automatic evaluation and qualitative analysis, we observe an apparent accuracy-fidelity trade-off: high-resource baselines such as NLLB (No Language Left Behind) achieve higher aggregate BLEU scores (13.75) by…
Learning to Interrupt in Language-based Multi-agent Communication
Danqing Wang, Da Yin, Ruta Desai, Lei Li, Asli Celikyilmaz · Apr 7, 2026 · Citations: 0

Multi Agent

Motivated by this, we propose an interruptible communication framework that allows the agent who is listening to interrupt the current speaker.
The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning
Yi Xu, Philipp Jettkant, Laura Ruis · Apr 7, 2026 · Citations: 0

Long Horizon

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Team Fusion@ SU@ BC8 SympTEMIST track: transformer-based approach for symptom recognition and linking
Georgi Grazhdanski, Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva · Apr 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don't
Jonathan Nemitz, Carsten Eickhoff, Junyi Jessy Li, Kyle Mahowald, Michal Golovanevsky · Apr 7, 2026 · Citations: 0

To study this, we introduce the Graded Color Attribution (GCA) dataset, a controlled benchmark designed to elicit decision rules and evaluate participant faithfulness to these rules.
State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation
Navan Preet Singh, Anurag Garikipati, Ahmed Abulkhair, Jyani Akshay Jagdishbhai, Atul Yaduvanshi · Apr 7, 2026 · Citations: 0

Demonstrations

Arabic-DeepSeek-R1 achieves the highest average score across the seven-benchmark OALL suite while establishing SOTA or near-SOTA, including dominant results on grammar-focused MadinahQA (surpassing both GPT-5.1 and the OALL leader by…
Attention Flows: Tracing LLM Conceptual Engagement via Story Summaries
Rebecca M. M. Hicke, Sil Hamilton, David Mimno, Ross Deans Kristensen-McLachlan · Apr 7, 2026 · Citations: 0

When human authors of summaries compress a story, they reveal what they consider narratively important.
Say Something Else: Rethinking Contextual Privacy as Information Sufficiency
Yunze Xiao, Wenkai Li, Xiaoyuan Wu, Ningshan Ma, Yueqi Song · Apr 7, 2026 · Citations: 0

LLM agents increasingly draft messages on behalf of users, yet users routinely overshare sensitive information and disagree on what counts as private.
FMI@SU ToxHabits: Evaluating LLMs Performance on Toxic Habit Extraction in Spanish Clinical Texts
Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva · Apr 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ART: Attention Replacement Technique to Improve Factuality in LLMs
Ziqin Luo, Yihao Quan, Xiaofeng Zhang, Xiaosong Yuan, Chen Shen · Apr 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning
Navan Preet Singh, Xiaokun Wang, Anurag Garikipati, Madalina Ciobanu, Qingqing Mao · Apr 7, 2026 · Citations: 0

Expert Verification

These models remarkably achieve high enough accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark to establish new state-of-the-art (SOTA) results across the interactive Pedagogy Benchmark Leaderboard and surpass significantly…
The Illusion of Superposition? A Principled Analysis of Latent Thinking in Language Models
Michael Rizvi-Martel, Guillaume Rabusseau, Marius Mosbach · Apr 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
A Severity-Based Curriculum Learning Strategy for Arabic Medical Text Generation
Ahmed Alansary, Molham Mohamed, Ali Hamdi · Apr 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
In-Context Learning in Speech Language Models: Analyzing the Role of Acoustic Features, Linguistic Structure, and Induction Heads
Charlotte Pouw, Hosein Mohebbi, Afra Alishahi, Willem Zuidema · Apr 7, 2026 · Citations: 0

Demonstrations

In-Context Learning (ICL) has been extensively studied in text-only Language Models, but remains largely unexplored in the speech domain.
Severity-Aware Weighted Loss for Arabic Medical Text Generation
Ahmed Alansary, Molham Mohamed, Ali Hamdi · Apr 7, 2026 · Citations: 0

Experiments are conducted using the MAQA dataset, which provides Arabic medical complaints and trusted human responses.
STDec: Spatio-Temporal Stability Guided Decoding for dLLMs
Yuzhe Chen, Jiale Cao, Xuyang Liu, Jin Xie, Aiping Yang · Apr 7, 2026 · Citations: 0

Across textual reasoning and multimodal understanding benchmarks, STDec substantially improves throughput while maintaining comparable task performance score.
Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework
Komal Kumar, Aman Chadha, Salman Khan, Fahad Shahbaz Khan, Hisham Cholakkal · Apr 7, 2026 · Citations: 0

Multi Agent

Recent advances in multi-agent large language models (LLMs) have demonstrated strong potential for understanding user intent and are being trained to utilize various tools.
In-Place Test-Time Training
Guhao Feng, Shengjie Luo, Kai Hua, Ge Zhang, Di He · Apr 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control
Yuchi Wang, Haiyang Yu, Weikang Bian, Jiefeng Long, Xiao Liang · Apr 7, 2026 · Citations: 0

Pairwise Preference

Experiments on the MMEB-V2 benchmark demonstrate that our model achieves a score of 71.2 with only 4B parameters, establishing a new state-of-the-art while significantly reducing reasoning overhead and inference latency.
Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement
Qimin Zhong, Hao Liao, Haiming Qin, Mingyang Zhou, Rui Mao · Apr 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Exclusive Unlearning
Mutsumi Sasaki, Kouta Nakayama, Yusuke Miyao, Yohei Oseki, Masaru Isonuma · Apr 7, 2026 · Citations: 0

Red Team

We demonstrate that through Exclusive Unlearning, it is possible to obtain a model that ensures safety against a wide range of inputs, including jailbreaks, while maintaining the ability to respond to diverse instructions related to…
ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
Wang Yang, Chaoda Song, Xinpeng Li, Debargha Ganguly, Chuang Ma · Apr 7, 2026 · Citations: 0

Existing Agent benchmarks suffer from two critical limitations: high environment interaction overhead (up to 41\% of total evaluation time) and imbalanced task horizon and difficulty distributions that make aggregate scores unreliable.
JUÁ -- A Benchmark for Information Retrieval in Brazilian Legal Text Collections
Jayr Pereira, Leandro Fernandes, Erick de Brito, Roberto Lotufo, Luiz Bonifacio · Apr 7, 2026 · Citations: 0

We present JUÁ, a public benchmark for Brazilian legal retrieval designed to support more reproducible and comparable evaluation across heterogeneous legal collections.
Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives
Changgeon Ko, Jisu Shin, Hoyun Song, Huije Lee, Eui Jun Hwang · Apr 7, 2026 · Citations: 0

Multi Agent

Large language model (LLM) agents are increasingly acting as human delegates in multi-agent environments, where a representative agent integrates diverse peer perspectives to make a final decision.
LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces
Olexander Mazurets, Olexander Barmak, Leonid Bedratyuk, Iurii Krak · Apr 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Stories of Your Life as Others: A Round-Trip Evaluation of LLM-Generated Life Stories Conditioned on Rich Psychometric Profiles
Ben Wigler, Maria Tsfasman, Tiffany Matej Hrkalovic · Apr 7, 2026 · Citations: 0

Personality traits are richly encoded in natural language, and large language models (LLMs) trained on human text can simulate personality when conditioned on persona descriptions.
Short Data, Long Context: Distilling Positional Knowledge in Transformers
Patrick Huber, Ernie Chang, Chinnadhurai Sankar, Rylan Conway, Igor Fedorov · Apr 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection
Hongxu Zhou · Apr 7, 2026 · Citations: 0

Critique Edit

While structured feedback can mitigate this issue, existing approaches often rely on externally trained critics or symbolic tools, reducing agent autonomy.
A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models
Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan · Apr 7, 2026 · Citations: 0

Expert Verification

Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale.
BiMind: A Dual-Head Reasoning Model with Attention-Geometry Adapter for Incorrect Information Detection
Zhongxing Zhang, Emily K. Vraga, Jisu Huh, Jaideep Srivastava · Apr 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis
Michael Cuccarese · Apr 7, 2026 · Citations: 0

Demonstrations

This paper presents epistemic blinding in the context of an agentic system that uses large language models to reason across multiple biological datasets for drug target prioritization.
Disentangling MLP Neuron Weights in Vocabulary Space
Asaf Avrahamy, Yoav Gur-Arieh, Mor Geva · Apr 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
The Model Agreed, But Didn't Learn: Diagnosing Surface Compliance in Large Language Models
Xiaojie Gu, Ziying Huang, Weicong Hong, Jian Xie, Renze Lou · Apr 7, 2026 · Citations: 0

However, while recent editors demonstrate high success rates on standard benchmarks, it remains questionable whether current evaluation frameworks that rely on assessing output under specific prompting conditions can reliably authenticate…
Arch: An AI-Native Hardware Description Language for Register-Transfer Clocked Hardware Design
Shuqing Zhao · Apr 7, 2026 · Citations: 0

We present case studies of an 8-way set-associative L1 data cache and a synthesizable PG021-compatible AXI DMA controller (with Yosys and OpenSTA results on Sky130), and compare Arch to SystemVerilog, VHDL, Chisel, Bluespec, and other…
Is CLIP Cross-Eyed? Revealing and Mitigating Center Bias in the CLIP Family
Oscar Chew, Hsiao-Ying Huang, Kunal Jain, Tai-I Chen, Khoa D Doan · Apr 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosures
Fan Zhang, Mingzi Song, Rania Elbadry, Yankai Chen, Shaobo Wang · Apr 7, 2026 · Citations: 0

We present FinReporting, an agentic workflow for localized cross-jurisdiction financial reporting.
Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration
Yi Yuan, Xuhong Wang, Shanzhe Lei · Apr 7, 2026 · Citations: 0

As agent-based systems continue to evolve, deep research agents are capable of automatically generating research-style reports across diverse domains.
BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs
Abbas Ghaddar, Ivan Kobyzev, Boxing Chen, Yufei Cui · Apr 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
"I See What You Did There": Can Large Vision-Language Models Understand Multimodal Puns?
Naen Xu, Jiayi Sheng, Changjiang Li, Chunyi Zhou, Yuyuan Li · Apr 7, 2026 · Citations: 0

Although Vision-Language Models (VLMs) are widely used in multimodal understanding and generation, their ability to understand puns has not been systematically studied due to a scarcity of rigorous benchmarks.
The UNDO Flip-Flop: A Controlled Probe for Reversible Semantic State Management in State Space Model
Hongxu Zhou · Apr 7, 2026 · Citations: 0

Existing benchmarks probe either monotonic state tracking, as in the standard Flip-Flop task, or structural nesting, as in the Dyck languages, but neither isolates reversible semantic state retrieval.
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
Michael Krumdick, Varshini Reddy, Shivani Chaudhary, William Day, Maarij Ahmed · Apr 7, 2026 · Citations: 0

Rubric Rating Long Horizon

To address this, we introduce FrontierFinance, a long-horizon benchmark of 25 complex financial modeling tasks across five core finance models, requiring an average of over 18 hours of skilled human labor per task to complete.
FRENCH-YMCA: A FRENCH Corpus meeting the language needs of Youth, froM Children to Adolescents
Cherifa Ben Khelil, Jean-Yves Antoine, Anaïs Halftermeyer, Frédéric Rayar, Mathieu Thebaud · Apr 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Mechanistic Circuit-Based Knowledge Editing in Large Language Models
Tianyi Zhao, Yinhan He, Wendy Zheng, Chen Chen · Apr 7, 2026 · Citations: 0

Long Horizon

Extensive experiments on the MQuAKE-3K benchmark demonstrate the effectiveness of the proposed method for multi-hop reasoning in knowledge editing.
Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts
Fatih Uenal · Apr 7, 2026 · Citations: 0

Self-graded D7 scores (73-94%) exceed externally judged D8 security scores (20-61%) by a wide margin, though these dimensions use non-comparable scoring regimes.
Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models
Xiangming Gu, Soham De, Larisa Markeeva, Petar Veličković, Razvan Pascanu · Apr 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Paper-Reviewer Matching
Yicheng Pan, Zhiyuan Ning, Ludi Wang, Yi Du · Apr 7, 2026 · Citations: 0

Rubric Rating

To address this gap, we propose P2R, a training-free framework that shifts from implicit paper-to-paper matching to explicit profile-based matching.
LoRM: Learning the Language of Rotating Machinery for Self-Supervised Condition Monitoring
Xiao Qin, Xingyi Song, Tong Liu, Hatim Laalej, Zepeng Liu · Apr 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Evaluating Learner Representations for Differentiation Prior to Instructional Outcomes
Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong, Ashok K. Goel · Apr 7, 2026 · Citations: 0

Pairwise Preference

We introduce distinctiveness, a representation-level measure that evaluates how each learner differs from others in the cohort using pairwise distances, without requiring clustering, labels, or task-specific evaluation.
AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning
Yuanfu Sun, Kang Li, Dongzhe Fan, Jiajin Liu, Qiaoyu Tan · Apr 7, 2026 · Citations: 0

Tool Use

To bridge this gap, we introduce Agentic Graph Learning (AGL), a paradigm that reframes graph learning as an interleaved process of topology-aware navigation and LLM-based inference.
"OK Aura, Be Fair With Me": Demographics-Agnostic Training for Bias Mitigation in Wake-up Word Detection
Fernando López, Paula Delgado-Santos, Pablo Gómez, David Solans, Jordi Luque · Apr 7, 2026 · Citations: 0

We utilize the OK Aura database for our experiments, employing a training methodology that excludes demographic labels, which are reserved for evaluation purposes.
CLEAR: Cross-Lingual Enhancement in Alignment via Reverse-training
Seungyoon Lee, Minhyuk Kim, Seongtae Hong, Youngjoon Jang, Dongsuk Oh · Apr 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering
Yingjian Zhu, Xinming Wang, Kun Ding, Ying Wang, Bin Fan · Apr 7, 2026 · Citations: 0

Rather than serving merely as answer generators, we assign VLMs two specialized agents: a Refiner and an Inspector.
Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation
Abdullah Mazhar, Het Riteshkumar Shah, Aseem Srivastava, Smriti Joshi, Md Shad Akhtar · Apr 7, 2026 · Citations: 0

The increasing use of large language models in mental health applications calls for principled evaluation frameworks that assess alignment with psychotherapeutic best practices beyond surface-level fluency.

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote