HFEPX Archive Slice

HFEPX Daily Archive: 2026-03-02

Updated from current HFEPX corpus (Mar 8, 2026). 77 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Mar 8, 2026). 77 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: AIME. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 2, 2026.

Papers: 77 Last published: Mar 2, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

Analysis blocks are computed from the loaded sample (60 of 77 papers).

High-Signal Coverage

100.0%

60 / 60 papers are not low-signal flagged.

Benchmark Anchors

10.0%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

35.0%

Papers with reported metric mentions in extraction output.

2 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.

Why This Time Slice Matters

11.7% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 23.4% of papers in this hub.
AIME is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (1.3% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Beyond the Resumé: A Rubric-Aware Automatic Interview System for Information Elicitation
Mar 2, 2026 · Citations: 0 · Score: 7.0

Eval: Automatic Metrics · Metrics: Cost
According to Me: Long-Term Personalized Referential Memory QA
Mar 2, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Accuracy, Recall
CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
Mar 2, 2026 · Citations: 0 · Score: 6.5

Eval: Llm As Judge · Metrics: Cost
LLM-as-an-Annotator: Training Lightweight Models with LLM-Annotated Examples for Aspect Sentiment Tuple Prediction
Mar 2, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: F1
Building a Strong Instruction Language Model for a Less-Resourced Language
Mar 2, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Win rate
From Variance to Invariance: Qualitative Content Analysis for Narrative Graph Annotation
Mar 2, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Agreement

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Beyond the Resumé: A Rubric-Aware Automatic Interview System for Information Elicitation Mar 2, 2026	Automatic Metrics	Not reported	Cost	Calibration
According to Me: Long-Term Personalized Referential Memory QA Mar 2, 2026	Automatic Metrics	Atm Bench	Accuracy, Recall	Not reported
CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation Mar 2, 2026	Llm As Judge	MT Bench	Cost	Not reported
LLM-as-an-Annotator: Training Lightweight Models with LLM-Annotated Examples for Aspect Sentiment Tuple Prediction Mar 2, 2026	Automatic Metrics	Semeval	F1	Not reported
Building a Strong Instruction Language Model for a Less-Resourced Language Mar 2, 2026	Automatic Metrics	LMSYS Chatbot Arena, Slovenian Llm Eval	Win rate	Not reported
From Variance to Invariance: Qualitative Content Analysis for Narrative Graph Annotation Mar 2, 2026	Automatic Metrics	Not reported	Agreement	Inter Annotator Agreement Reported
Surgical Post-Training: Cutting Errors, Keeping Knowledge Mar 2, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
GenDB: The Next Generation of Query Processing -- Synthesized, Not Engineered Mar 2, 2026	Automatic Metrics	Not reported	Cost	Not reported
MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning Mar 2, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
PonderLM-3: Adaptive Token-Wise Pondering with Differentiable Masking Mar 2, 2026	Automatic Metrics	Not reported	Perplexity, Cost	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (11.7% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (2.6% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (5.2% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (19.5% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (11.7% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (15.6% vs 35% target).

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 2.6% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (11.7% coverage).
Annotation unit is under-specified (15.6% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (AIME vs BBH) before comparing methods.
Track metric sensitivity by reporting both accuracy and latency.

Recommended Queries

Judge vs Human Agreement Benchmark Slice: AIME Metric Slice: accuracy IAA-Reported Evaluations Recent High-Signal Papers

Known Limitations

Only 2.6% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (11.7% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (18)
Llm As Judge (3)
Human Eval (1)
Simulation Env (1)

Top Metrics

Accuracy (5)
Latency (4)
Cost (3)
Jailbreak success rate (3)

Top Benchmarks

AIME (1)
BBH (1)
Healthbench (1)
LongBench (1)

Quality Controls

Calibration (1)
Inter Annotator Agreement Reported (1)

Papers In This Archive Slice

GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialectal ASR
Pouya Mehralian, Melissa Farasyn, Anne Breitbarth, Anne-Sophie Ghyselen, Hugo Van hamme · Mar 2, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
A Directed Graph Model and Experimental Framework for Design and Study of Time-Dependent Text Visualisation
Songhai Fan, Simon Angus, Tim Dwyer, Ying Yang, Sarah Goodwin · Mar 2, 2026 · Citations: 0

Exponential growth in the quantity of digital news, social media, and other textual sources makes it difficult for humans to keep up with rapidly evolving narratives about world events.
RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks
Alexandra Diaconu, Mădălina Vînaga, Bogdan Alexe · Mar 2, 2026 · Citations: 0
Detecting AI-Generated Essays in Writing Assessment: Responsible Use and Generalizability Across LLMs
Jiangang Hao · Mar 2, 2026 · Citations: 0
Characterizing Memorization in Diffusion Language Models: Generalized Extraction and Sampling Effects
Xiaoyu Luo, Wenrui Yu, Qiongxiu Li, Johannes Bjerva · Mar 2, 2026 · Citations: 0
Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training
Valentin Lacombe, Valentin Quesnel, Damien Sileo · Mar 2, 2026 · Citations: 0
Tool Verification for Test-Time Reinforcement Learning
Ruotong Liao, Nikolai Röhrich, Xiaohan Wang, Yuhui Zhang, Yasaman Samadzadeh · Mar 2, 2026 · Citations: 0
Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale
Hao Li, Chunjiang Mu, Jianhao Chen, Siyue Ren, Zhiyao Cui · Mar 2, 2026 · Citations: 0

Pairwise Preference

The rapid proliferation of Claude agent skills has raised the central question of how to effectively leverage, manage, and scale the agent skill ecosystem.
Scaling Retrieval Augmented Generation with RAG Fusion: Lessons from an Industry Deployment
Luigi Medrano, Arush Verma, Mukul Chhabra · Mar 2, 2026 · Citations: 0
Zero- and Few-Shot Named-Entity Recognition: Case Study and Dataset in the Crime Domain (CrimeNER)
Miguel Lopez-Duran, Julian Fierrez, Aythami Morales, Daniel DeAlcala, Gonzalo Mancera · Mar 2, 2026 · Citations: 0
LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards
Guanzheng Chen, Michael Qizhe Shieh, Lidong Bing · Mar 2, 2026 · Citations: 0
LLMs as Strategic Actors: Behavioral Alignment, Risk Calibration, and Argumentation Framing in Geopolitical Simulations
Veronika Solopova, Viktoria Skorik, Maksym Tereshchenko, Alina Haidun, Ostap Vykhopen · Mar 2, 2026 · Citations: 0
Recursive Models for Long-Horizon Reasoning
Chenxiao Yang, Nathan Srebro, Zhiyuan Li · Mar 2, 2026 · Citations: 0
Recursive Think-Answer Process for LLMs and VLMs
Byung-Kwan Lee, Youngchae Chee, Yong Man Ro · Mar 2, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
OmniRet: Efficient and High-Fidelity Omni Modality Retrieval
Chuong Huynh, Manh Luong, Abhinav Shrivastava · Mar 2, 2026 · Citations: 0
ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels
Xiang Zheng, Han Li, Wenjie Luo, Weiqi Zhai, Yiyuan Li · Mar 2, 2026 · Citations: 0

Rubric Rating

However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows.
Learning from Synthetic Data Improves Multi-hop Reasoning
Anmol Kabra, Yilun Yin, Albert Gong, Kamilė Stankevičiūtė, Dongyoung Go · Mar 2, 2026 · Citations: 0
Modeling Grammatical Hypothesis Testing in Young Learners: A Sequence-Based Learning Analytics Study of Morphosyntactic Reasoning in an Interactive Game
Thierry Geoffre, Trystan Geoffre · Mar 2, 2026 · Citations: 0

Critique Edit

Analyzing 597 gameplay sessions (9,783 actions) from 100 students aged 8-11 in authentic classroom settings, we introduce Hamming distance to quantify proximity to valid grammatical solutions and examine convergence patterns across…
What Exactly do Children Receive in Language Acquisition? A Case Study on CHILDES with Automated Detection of Filler-Gap Dependencies
Zhenghao Herbert Zhou, William Dai, Maya Viswanathan, Simon Charlow, R. Thomas McCoy · Mar 2, 2026 · Citations: 0
GenDB: The Next Generation of Query Processing -- Synthesized, Not Engineered
Jiale Lao, Immanuel Trummer · Mar 2, 2026 · Citations: 0

Multi Agent

As a proof of concept, we present GenDB, an LLM-powered agentic system that generates instance-optimized and customized query execution code tailored to specific data, workloads, and hardware resources.
Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning
Guilhem Fouilhé, Rebecca Eifler, Antonin Poché, Sylvie Thiébaux, Nicholas Asher · Mar 2, 2026 · Citations: 0

Pairwise Preference Multi Agent

When automating plan generation for a real-world sequential decision problem, the goal is often not to replace the human planner, but to facilitate an iterative reasoning and elicitation process, where the human's role is to guide the AI…
EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training
Aleksei Dorkin, Taido Purason, Emil Kalbaliyev, Hele-Andra Kuulmets, Marii Ojastu · Mar 2, 2026 · Citations: 0

Pairwise Preference

We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior.
Learning to Read Where to Look: Disease-Aware Vision-Language Pretraining for 3D CT
Simon Ging, Philipp Arnold, Sebastian Walter, Hani Alnahas, Hannah Bast · Mar 2, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning
Jiachun Li, Shaoping Huang, Zhuoran Jin, Chenlong Zhang, Pengfei Cao · Mar 2, 2026 · Citations: 0

Despite their promise, MLLMs' reasoning abilities across different scenarios in real life remain largely unexplored and lack standardized benchmarks for evaluation.
PonderLM-3: Adaptive Token-Wise Pondering with Differentiable Masking
He Li, Feichen Song, Boyi Zeng, Shixiang Song, Zhiqin John Xu · Mar 2, 2026 · Citations: 0

On downstream benchmarks, PonderLM-3 attains comparable performance to fixed-step PonderLM-2 under the same maximum number of additional computation steps, while using fewer inference FLOPs in practice.
According to Me: Long-Term Personalized Referential Memory QA
Jingbiao Mei, Jinghong Chen, Guangyu Yang, Xinyu Hou, Margaret Li · Mar 2, 2026 · Citations: 0

However, existing Long-term Memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience.
CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production
Yixin Nie, Lin Guan, Zhongyao Ma, Anchit Gupta, Yipin Zhou · Mar 2, 2026 · Citations: 0

We detail the CharacterFlywheel process which integrates data curation, reward modeling to estimate and interpolate the landscape of engagement metrics, supervised fine-tuning (SFT), reinforcement learning (RL), and both offline and online…
AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations
Cheng Jiayang, Dongyu Ru, Lin Qiu, Yiyang Li, Xuezhi Cao · Mar 2, 2026 · Citations: 0

Long Horizon

Long-horizon interactions between users and LLM-based assistants necessitate effective memory management, yet current approaches face challenges in training and evaluation of memory.
Semantic Similarity is a Spurious Measure of Comic Understanding: Lessons Learned from Hallucinations in a Benchmarking Experiment
Christopher Driggers-Ellis, Nachiketh Tibrewal, Rohit Bogulla, Harsh Khanna, Sangpil Youm · Mar 2, 2026 · Citations: 0

In this work, we present a preliminary benchmark of VLM performance on comic interpretation tasks.
When Numbers Tell Half the Story: Human-Metric Alignment in Topic Model Evaluation
Thibault Prouteau, Francis Lareau, Nicolas Dugué, Jean-Charles Lamirel, Christophe Malaterre · Mar 2, 2026 · Citations: 0

Existing methods often rely on automated metrics like topic coherence and diversity, which may not fully align with human judgment.
From Variance to Invariance: Qualitative Content Analysis for Narrative Graph Annotation
Junbo Huang, Max Weinig, Ulrich Fritsche, Ricardo Usbeck · Mar 2, 2026 · Citations: 0

To evaluate annotation quality, we employed a 6\times3 factorial experimental design to examine the effects of narrative representation (six levels) and distance metric type (three levels) on inter-annotator agreement (Krippendorrf's α),…
AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth
Shixiang Song, He Li, Zitong Wang, Boyi Zeng, Feichen Song · Mar 2, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Demonstrating ViviDoc: Generating Interactive Documents through Human-Agent Collaboration
Yinghao Tang, Yupeng Xie, Yingchaojie Feng, Tingfeng Lan, Wei Chen · Mar 2, 2026 · Citations: 0

Expert Verification Multi Agent

Recent LLM-based agents can automate content creation, but naively applying them yields uncontrollable and unverifiable outputs.
FLANS at SemEval-2026 Task 7: RAG with Open-Sourced Smaller LLMs for Everyday Knowledge Across Diverse Languages and Cultures
Liliia Bogdanova, Shiran Sun, Lifeng Han, Natalia Amat Lefort, Flor Miriam Plaza-del-Arco · Mar 2, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Efficient RLVR Training via Weighted Mutual Information Data Selection
Xinyu Zhou, Boyu Zhu, Haotian Zhang, Huiming Wang, Zhijiang Guo · Mar 2, 2026 · Citations: 0

Extensive experiments demonstrate that InSight consistently achieves state-of-the-art performance and improves training efficiency, including a +1.41 average gain on Planning & Mathmatics benchmarks, +1.01 improvement on general reasoning,…
KDFlow: A User-Friendly and Efficient Knowledge Distillation Framework for Large Language Models
Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen · Mar 2, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Sovereign AI-based Public Services are Viable and Affordable
António Branco, Luís Gomes, Rodrigo Santos, Eduardo Santos, João Silva · Mar 2, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
Ziyi Zhu, Olivier Tieleman, Alexey Bukhtiyarov, Jinghong Chen · Mar 2, 2026 · Citations: 0

LLM-as-judge evaluation has become standard practice for open-ended model assessment; however, judges exhibit systematic biases that cannot be eliminated by increasing the number of scenarios or generations.
Let the Agent Search: Autonomous Exploration Beats Rigid Workflows in Temporal Question Answering
Xufei Lv, Jiahui Yang, Yifu Gao, Linbo Qiao, Houde Liu · Mar 2, 2026 · Citations: 0

Building on this insight, we propose AT2QA, an autonomous, training-free agent for temporal question answering that iteratively interacts with the temporal knowledge graph via a general search tool for dynamic retrieval.
OpenAutoNLU: Open Source AutoML Library for NLU
Grigory Arshinov, Aleksandr Boriskin, Sergey Senichev, Ayaz Zaripov, Daria Galimzianova · Mar 2, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
PleaSQLarify: Visual Pragmatic Repair for Natural Language Database Querying
Robin Shing Moon Chan, Rita Sevastjanova, Mennatallah El-Assady · Mar 2, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ALTER: Asymmetric LoRA for Token-Entropy-Guided Unlearning of LLMs
Xunlei Chen, Jinyu Guo, Yuang Li, Zhaokun Wang, Yi Gong · Mar 2, 2026 · Citations: 0

ALTER achieves SOTA performance on TOFU, WMDP, and MUSE benchmarks with over 95% forget quality and shows minimal side effects through preserving foundational tokens.
Semantic Novelty Trajectories in 80,000 Books: A Cross-Corpus Embedding Analysis
Fred Zimmerman · Mar 2, 2026 · Citations: 0

Long Horizon

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
nchellwig at SemEval-2026 Task 3: Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis using Large Language Models
Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff · Mar 2, 2026 · Citations: 0

Evaluation across 6 languages and 8 language--domain combinations demonstrates that self-consistency with 15 executions yields statistically significant improvements over single-inference prompting, with our system (leveraging Gemma 3)…
LLM-as-an-Annotator: Training Lightweight Models with LLM-Annotated Examples for Aspect Sentiment Tuple Prediction
Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff · Mar 2, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
FreeAct: Freeing Activations for LLM Quantization
Xiaohao Liu, Xiaobo Xia, Manyi Zhang, Ji-Fu Li, Xianzhi Yu · Mar 2, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Beyond the Resumé: A Rubric-Aware Automatic Interview System for Information Elicitation
Harry Stuart, Masahiro Kaneko, Timothy Baldwin · Mar 2, 2026 · Citations: 0

Rubric Rating

Effective hiring is integral to the success of an organisation, but it is very challenging to find the most suitable candidates because expert evaluation (e.g.\ interviews conducted by a technical manager) are expensive to deploy at scale.
AnnoABSA: A Web-Based Annotation Tool for Aspect-Based Sentiment Analysis with Retrieval-Augmented Suggestions
Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff · Mar 2, 2026 · Citations: 0

Alongside manual annotation, AnnoABSA provides optional Large Language Model (LLM)-based retrieval-augmented generation (RAG) suggestions that offer context-aware assistance in a human-in-the-loop approach, keeping the human annotator in…
Bootstrapping Embeddings for Low Resource Languages
Merve Basoz, Andrew Horne, Mattia Opper · Mar 2, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
TopoCurate:Modeling Interaction Topology for Tool-Use Agent Training
Jinluan Yang, Yuxin Liu, Zhengyu Chen, Chengcheng Han, Yueqing Sun · Mar 2, 2026 · Citations: 0

Training tool-use agents typically relies on outcome-based filtering: Supervised Fine-Tuning (SFT) on successful trajectories and Reinforcement Learning (RL) on pass-rate-selected tasks.
Legal RAG Bench: an end-to-end benchmark for legal RAG
Abdur-Rahman Butler, Umar Butler · Mar 2, 2026 · Citations: 0

We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal RAG systems.
Building a Strong Instruction Language Model for a Less-Resourced Language
Domen Vreš, Tjaša Arčon, Timotej Petrič, Dario Vajda, Marko Robnik-Šikonja · Mar 2, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
QIME: Constructing Interpretable Medical Text Embeddings via Ontology-Grounded Questions
Yixuan Tang, Zhenghong Lin, Yandong Sun, Wynne Hsu, Mong Li Lee · Mar 2, 2026 · Citations: 0

Experiments across biomedical semantic similarity, clustering, and retrieval benchmarks show that QIME consistently outperforms prior interpretable embedding methods and substantially narrows the gap to strong black-box biomedical encoders,…
Surgical Post-Training: Cutting Errors, Keeping Knowledge
Wenye Lin, Kai Han · Mar 2, 2026 · Citations: 0

Pairwise Preference

While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover--and validate both theoretically and empirically--an overlooked yet critical mechanism: the implicit regularization inherent in Direct…
Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations
Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Shuliang Liu · Mar 2, 2026 · Citations: 0

Extensive experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains across numerous benchmarks and base models.
LexChronos: An Agentic Framework for Structured Event Timeline Extraction in Indian Jurisprudence
Anka Chandrahas Tummepalli, Preethu Rose Anish · Mar 2, 2026 · Citations: 0

We propose LexChronos, an agentic framework that iteratively extracts structured event timelines from Supreme Court of India judgments.
Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning
Jiebin Zhang, Zhenghan Yu, Liang Wang, Nan Yang, Eugene J. Yu · Mar 2, 2026 · Citations: 0

We conducted extensive evaluations on five diverse LLMs and four distinct tasks.
Measuring What VLMs Don't Say: Validation Metrics Hide Clinical Terminology Erasure in Radiology Report Generation
Aditya Parikh, Aasa Feragen, Sneha Das, Stella Frank · Mar 2, 2026 · Citations: 0

This paper investigates a critical blind spot in current model evaluation: the use of decoding strategies that lead to high aggregate token-overlap scores despite succumbing to template collapse, in which models generate only repetitive,…
More Data, Fewer Diacritics: Scaling Arabic TTS
Ahmed Musleh, Yifan Zhang, Kareem Darwish · Mar 2, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Markovian ODE-guided scoring can assess the quality of offline reasoning traces in language models
Arghodeep Nandi, Ojasva Saxena, Tanmoy Chakraborty · Mar 2, 2026 · Citations: 0

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote