- Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition
Minxue Tang, Yangyang Yu, Aolin Ding, Maziyar Baran Pouyan, Taha Belkhouja, Yujia Bao · Feb 22, 2026
Recognizing implicit visual and textual patterns is essential in many real-world applications of modern AI.
- PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification
Isun Chehreh, Ebrahim Ansari · Feb 22, 2026
Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification.
- Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations
Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore · Feb 22, 2026
Long Horizon
Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows.
- Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering
Maryam Amirizaniani, Alireza Salemi, Hamed Zamani · Feb 22, 2026
Pairwise Preference Long Horizon
Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context.
- Retrieval Augmented Enhanced Dual Co-Attention Framework for Target Aware Multimodal Bengali Hateful Meme Detection
Raihan Tanvir, Md. Golam Rabiul Alam · Feb 22, 2026
Hateful content on social media increasingly appears as multimodal memes that combine images and text to convey harmful narratives.
- Next Reply Prediction X Dataset: Linguistic Discrepancies in Naively Generated Content
Simon Münker, Nils Schwager, Kai Kugler, Michael Heseltine, Achim Rettinger · Feb 22, 2026
The increasing use of Large Language Models (LLMs) as proxies for human participants in social science research presents a promising, yet methodologically risky, paradigm shift.
- TurkicNLP: An NLP Toolkit for Turkic Languages
Sherzod Hakimov · Feb 22, 2026
Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources.
- Reasoning Capabilities of Large Language Models. Lessons Learned from General Game Playing
Maciej Świechowski, Adam Żychowski, Jacek Mańdziuk · Feb 22, 2026
The main results indicate that three of the evaluated models generally perform well across most experimental settings, with performance degradation observed as the evaluation horizon increases (i.e., with a higher number of game steps).
- Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM
Francesca Bianco, Derek Shiller · Feb 22, 2026
This work supports a more evidence-driven (a) debate on AI sentience and welfare, and (b) governance when setting policy, auditing standards, and safety safeguards.
- Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs
Wenqiu Tang, Zhen Wan, Takahiro Komamizu, Ichiro Ide · Feb 22, 2026
Personality control in Role-Playing Agents (RPAs) is commonly achieved via training-free methods that inject persona descriptions and memory through prompts or retrieval-augmented generation, or via supervised fine-tuning (SFT) on persona-s
- VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval
Diogo Glória-Silva, David Semedo, João Magalhães · Feb 22, 2026
Long Horizon
Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90\% accuracy on plan-aware VQA.
- A Dataset for Named Entity Recognition and Relation Extraction from Art-historical Image Descriptions
Stefanie Schneider, Miriam Göldl, Julian Stalter, Ricarda Vollmer · Feb 22, 2026
The dataset is released as UIMA XMI Common Analysis Structure (CAS) files with accompanying images and bibliographic metadata, and can be used to benchmark and fine-tune NER and RE systems, including zero- and few-shot setups with Large Lan
- AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG
Qijie You, Wenkai Yu, Wentao Zhang · Feb 22, 2026
Long Horizon
With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction.
- How Do LLMs Encode Scientific Quality? An Empirical Study Using Monosemantic Features from Sparse Autoencoders
Michael McCoubrey, Angelo Salatino, Francesco Osborne, Enrico Motta · Feb 22, 2026
In recent years, there has been a growing use of generative AI, and large language models (LLMs) in particular, to support both the assessment and generation of scientific work.
- Astra: Activation-Space Tail-Eigenvector Low-Rank Adaptation of Large Language Models
Kainan Liu, Yong Zhang, Ning Cheng, Yun Zhu, Yanmeng Wang · Feb 22, 2026
Extensive experiments across natural language understanding (NLU) and natural language generation (NLG) tasks demonstrate that Astra consistently outperforms existing PEFT baselines across 16 benchmarks and even surpasses full fine-tuning (
- Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models
Seong Hah Cho, Junyi Li, Anna Leshinskaya · Feb 22, 2026
Among the characteristics of value representation in humans is that they distinguish among values of different kinds.
- TriTopic: Tri-Modal Graph-Based Topic Modeling with Iterative Refinement and Archetypes
Roman Egger · Feb 22, 2026
In benchmarks across 20 Newsgroups, BBC News, AG News, and Arxiv, TriTopic achieves the highest NMI on every dataset (mean NMI 0.575 vs.
- Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer
Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng · Feb 22, 2026
Long Horizon
Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities.
- IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning
Yinhan He, Yaochen Zhu, Mingjia Shi, Wendy Zheng, Lin Su · Feb 22, 2026
Extensive empirical evaluations demonstrate that information-aware advantage shaping is a powerful and general direction for token-efficient post-training.
- Uncovering Context Reliance in Unstructured Knowledge Editing
Zisheng Zhou, Mengqi Zhang, Shiguang Wu, Xiaotian Ye, Chi Zhang · Feb 22, 2026
Evaluations show that COIN reduces Context Reliance by 45.2% and outperforms strong baselines by 23.6% in editing success rate, highlighting the vital role of mitigating Context Reliance for robust editing.
- Learning to Detect Language Model Training Data via Active Reconstruction
Junjie Oscar Yin, John X. Morris, Vitaly Shmatikov, Sewon Min, Hannaneh Hajishirzi · Feb 22, 2026
Detecting LLM training data is generally framed as a membership inference attack (MIA) problem.
- Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks
Wilson Y. Lee · Feb 22, 2026
Long Horizon
Why do language agents fail on tasks they are capable of solving?
- Benchmark Test-Time Scaling of General LLM Agents
Xiaochuan Li, Ryan Ming, Pranav Setlur, Abhijay Paladugu, Andy Tang · Feb 22, 2026
LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests.
- Whisper: Courtside Edition: Enhancing ASR Performance Through LLM-Driven Context Generation
Yonathan Ron, Shiri Gilboa, Tammuz Dubnov · Feb 21, 2026
Multi Agent
We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining.
- Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language
Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov · Feb 21, 2026
Pairwise Preference
One annotator pair achieved almost perfect agreement ($\kappa = 0.8743$; $93.8\%$ raw agreement), exceeding several benchmarks reported in English sarcasm research.
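The agreement figure above is Cohen's kappa, which discounts raw agreement by the agreement expected from chance. A minimal sketch of the standard formula, on a hypothetical two-annotator confusion matrix (the counts below are illustrative, not from the dataset):

```python
def cohens_kappa(confusion):
    """Cohen's kappa from a square agreement matrix
    (rows = annotator A's labels, cols = annotator B's labels)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of items on the diagonal.
    p_o = sum(confusion[i][i] for i in range(n)) / total
    # Chance agreement: product of each annotator's marginal label rates.
    p_e = sum(
        (sum(confusion[i]) / total) * (sum(row[i] for row in confusion) / total)
        for i in range(n)
    )
    return (p_o - p_e) / (1 - p_e)

# Toy example: 100 items labelled sarcastic / not sarcastic by two annotators.
m = [[40, 5], [5, 50]]
print(round(cohens_kappa(m), 3))  # -> 0.798
```

With 90% raw agreement and fairly balanced labels, kappa lands near 0.8, which is why a kappa of 0.87 alongside 93.8% raw agreement is described as almost perfect.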
- MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Elastic LLMs
Dongwei Wang, Jinhee Kim, Seokho Han, Denis Gudovskiy, Yohei Nakata · Feb 21, 2026
Changing runtime complexity on cloud and edge devices necessitates elastic large language model (LLM) deployment, where an LLM can be inferred with various quantization precisions based on available computational resources.
- Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning
Abhinaba Basu · Feb 21, 2026
Personal AI agents incur substantial cost via repeated LLM calls.
- DeepInnovator: Triggering the Innovative Capabilities of LLMs
Tianyu Fan, Fengji Zhang, Yuxiang Zheng, Bei Chen, Xinyao Niu · Feb 21, 2026
The application of Large Language Models (LLMs) in accelerating scientific discovery has garnered increasing attention, with a key focus on constructing research agents endowed with innovative capability, i.e., the ability to autonomously g
- AAVGen: Precision Engineering of Adeno-associated Viral Capsids for Renal Selective Targeting
Mohammadreza Ghaffarzadeh-Esfahani, Yousof Gheisari · Feb 21, 2026
Adeno-associated viruses (AAVs) are promising vectors for gene therapy, but their native serotypes face limitations in tissue tropism, immune evasion, and production efficiency.
- TRUE: A Trustworthy Unified Explanation Framework for Large Language Model Reasoning
Yujiao Yang · Feb 21, 2026
Extensive experiments across multiple reasoning benchmarks demonstrate that the proposed framework provides multi-level, verifiable explanations, including executable reasoning structures for individual instances, feasible-region representa
- [b]=[d]-[t]+[p]: Self-supervised Speech Models Discover Phonological Vector Arithmetic
Kwanghee Choi, Eunjung Yeo, Cheol Jun Cho, David Harwath, David R. Mortensen · Feb 21, 2026
Self-supervised speech models (S3Ms) are known to encode rich phonetic information, yet how this information is structured remains underexplored.
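The title's equation, [b] = [d] - [t] + [p], can be illustrated with idealised binary phonological feature vectors. The vectors below are a hypothetical stand-in for learned S3M representations, not the paper's actual embeddings: subtracting [t] from [d] isolates voicing, and adding [p] contributes bilabial place, recovering [b]:

```python
import numpy as np

# Hypothetical feature vectors (voiced, bilabial, alveolar, stop) --
# an idealised proxy for self-supervised speech representations.
phones = {
    "b": np.array([1, 1, 0, 1]),  # voiced bilabial stop
    "d": np.array([1, 0, 1, 1]),  # voiced alveolar stop
    "t": np.array([0, 0, 1, 1]),  # voiceless alveolar stop
    "p": np.array([0, 1, 0, 1]),  # voiceless bilabial stop
}

# [d] - [t] + [p]: swap alveolar place for bilabial while keeping voicing.
result = phones["d"] - phones["t"] + phones["p"]
nearest = min(phones, key=lambda k: float(np.linalg.norm(phones[k] - result)))
print(nearest)  # -> b
```

In this toy feature space the arithmetic is exact; the paper's finding is that a soft version of the same regularity emerges in learned speech embeddings.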
- Hyperbolic Busemann Neural Networks
Ziheng Chen, Bernhard Schölkopf, Nicu Sebe · Feb 21, 2026
Hyperbolic spaces provide a natural geometry for representing hierarchical and tree-structured data due to their exponential volume growth.
- EvalSense: A Framework for Domain-Specific LLM (Meta-)Evaluation
Adam Dejl, Jonathan Pearson · Feb 21, 2026
Robust and comprehensive evaluation of large language models (LLMs) is essential for identifying effective LLM system configurations and mitigating risks associated with deploying LLMs in sensitive domains.
- Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models
Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma · Feb 21, 2026
Pairwise Preference
We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight
- BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models
Thura Aung, Jann Railey Montalan, Jian Gang Ngui, Peerat Limkonchotiwat · Feb 21, 2026
We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG).
- MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu · Feb 21, 2026
Red Team
Defending LLMs against adversarial jailbreak attacks remains an open challenge.
- ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models
Anas Alhumud, Abdulaziz Alhammadi, Muhammad Badruddin Khan · Feb 21, 2026
We present ArabicNumBench, a comprehensive benchmark for evaluating large language models on Arabic number reading tasks across Eastern Arabic-Indic numerals (0-9 in Arabic script) and Western Arabic numerals (0-9).
- The Convergence of Schema-Guided Dialogue Systems and the Model Context Protocol
Andreas Schlapbach · Feb 21, 2026
This paper establishes a fundamental convergence: Schema-Guided Dialogue (SGD) and the Model Context Protocol (MCP) represent two manifestations of a unified paradigm for deterministic, auditable LLM-agent interaction.
- Rethinking Retrieval-Augmented Generation as a Cooperative Decision-Making Problem
Lichang Song, Ting Long, Yi Chang · Feb 21, 2026
Multi Agent
To overcome this limitation, we reformulate RAG as a cooperative multi-agent decision-making problem and propose Cooperative Retrieval-Augmented Generation (CoRAG), a framework in which the reranker and the generator act as peer decision-ma
- ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models
Zefang Liu, Chenyang Zhu, Sangwoo Cho, Shi-Xiong Zhang · Feb 21, 2026
Experimental results across diverse benchmarks demonstrate that ReHear effectively mitigates error propagation, consistently outperforming both supervised and pseudo-labeling baselines.
- Watermarking LLM Agent Trajectories
Wenlong Meng, Chen Gong, Terry Yue Zhuo, Fan Zhang, Kecen Li · Feb 21, 2026
Long Horizon
LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering.
- Semantic Substrate Theory: An Operator-Theoretic Framework for Geometric Semantic Drift
Stephen Russell · Feb 21, 2026
Long Horizon
Most semantic drift studies report multiple signals (e.g., embedding displacement, neighbor changes, distributional divergence, and recursive trajectory instability) without a shared explanatory theory that relates them.
- Contradiction to Consensus: Dual Perspective, Multi Source Retrieval Based Claim Verification with Source Level Disagreement using LLM
Md Badsha Biswas, Ozlem Uzuner · Feb 21, 2026
Through extensive evaluation on four benchmark datasets with five LLMs, we show that knowledge aggregation not only improves claim verification but also reveals differences in source-specific reasoning.
- From Trial by Fire To Sleep Like a Baby: A Lexicon of Anxiety Associations for 20k English Multiword Expressions
Saif M. Mohammad · Feb 21, 2026
Anxiety is the unease about a possible future negative outcome.
- Spilled Energy in Large Language Models
Adrian Robert Minut, Hazem Dewidar, Iacopo Masi · Feb 21, 2026
Evaluated on nine benchmarks across state-of-the-art LLMs (including LLaMA, Mistral, and Gemma) and on synthetic algebraic operations (Qwen3), our approach demonstrates robust, competitive hallucination detection and cross-task generalizati
- PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation
Nina Hosseini-Kivanani · Feb 20, 2026
Multimodal models struggle with idiomatic expressions due to their non-compositional meanings, a challenge amplified in multilingual settings.
- DP-RFT: Learning to Generate Synthetic Text via Differentially Private Reinforcement Fine-Tuning
Fangyuan Xu, Sihao Chen, Zinan Lin, Taiwei Shi, Sydney Graham · Feb 20, 2026
Differentially private (DP) synthetic data generation plays a pivotal role in developing large language models (LLMs) on private data, where data owners cannot provide eyes-on access to individual examples.
- Diagnosing LLM Reranker Behavior Under Fixed Evidence Pools
Baris Arat, Emre Sefer · Feb 20, 2026
Standard reranking evaluations study how a reranker orders candidates returned by an upstream retriever.
- Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026
Real-time guardrails require evaluation that is accurate, cheap, and fast -- yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation.
- Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications
Zhiqin Qian, Ryan Diaz, Sangwon Seo, Vaibhav Unhelkar · Feb 20, 2026
Pairwise Preference Long Horizon
When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed.
- VIRAASAT: Traversing Novel Paths for Indian Cultural Reasoning
Harshul Raj Surana, Arijit Maji, Aryan Vats, Akash Ghosh, Sriparna Saha · Feb 20, 2026
Existing cultural benchmarks are (i) manually crafted, (ii) limited to single-hop questions testing factual recall, and (iii) prohibitively costly to scale, leaving this deficiency largely unmeasured.
- RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering
Deniz Qian, Hung-Ting Chen, Eunsol Choi · Feb 20, 2026
Our method outperforms baselines, including agentic search approaches, achieving at least 10% relative and 3% absolute gain in complete recall percentage on a multi-answer retrieval dataset (QAMPARI).
- SPQ: An Ensemble Technique for Large Language Model Compression
Jiamin Yao, Eren Gultepe · Feb 20, 2026
Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K.
- Subgroups of $U(d)$ Induce Natural RNN and Transformer Architectures
Joshua Nunley · Feb 20, 2026
This paper presents a direct framework for sequence models with hidden states on closed subgroups of U(d).
- Validating Political Position Predictions of Arguments
Jordan Robinson, Angus R. Williams, Katie Atkinson, Anthony G. Cohn · Feb 20, 2026
Pairwise Preference
Real-world knowledge representation often requires capturing subjective, continuous attributes -- such as political positions -- that conflict with pairwise validation, the widely accepted gold standard for human evaluation.
- Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System
Pavithra PM Nair, Preethu Rose Anish · Feb 20, 2026
Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B.
- On the "Induction Bias" in Sequence Models
M. Reza Ebrahimi, Michaël Defferrard, Sunny Panchal, Roland Memisevic · Feb 20, 2026
Together, these results demonstrate that state tracking remains a fundamental challenge for transformers, even when training and evaluation distributions match.
- Predicting Contextual Informativeness for Vocabulary Learning using Deep Learning
Tao Wu, Adam Kapelner · Feb 20, 2026
In summary, we demonstrate that a modern embedding model built on a neural network architecture, when guided by human supervision, yields a large, low-cost supply of near-perfect contexts for teaching vocabulary across a variety of target words.
- PsihoRo: Depression and Anxiety Romanian Text Corpus
Alexandra Ciobotaru, Ana-Maria Bucur, Liviu P. Dinu · Feb 20, 2026
Psychological corpora in NLP are collections of texts used to analyze human psychology, emotions, and mental health.
- VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean
Yutong Xin, Qiaochu Chen, Greg Durrett, Işil Dillig · Feb 20, 2026
However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries.