HFEPX Archive Slice

HFEPX Daily Archive: 2026-02-19

Updated from current HFEPX corpus (Apr 12, 2026). 61 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 61 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Adjudication. Frequently cited benchmark: Bankmathbench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 19, 2026.

Papers: 61 Last published: Feb 19, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

Analysis blocks are computed from the loaded sample (60 of 61 papers).

High-Signal Coverage

100.0%

60 / 60 papers are not low-signal flagged.

Benchmark Anchors

13.3%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

43.3%

Papers with reported metric mentions in extraction output.

1 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.

Get this digest every Friday →

Why This Time Slice Matters

11.5% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 36.1% of papers in this hub.
Bankmathbench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is adjudication (1.6% of papers).
Rater context is mostly domain experts, and annotation is commonly ranking annotation; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

ABCD: All Biases Come Disguised
Feb 19, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy
RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering
Feb 19, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Recall
BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios
Feb 19, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy
Modeling Distinct Human Interaction in Web Agents
Feb 19, 2026 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Accuracy
What Makes a Good Doctor Response? A Study on Text-Based Telemedicine
Feb 19, 2026 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Accuracy
Mind the Style: Impact of Communication Style on Human-Chatbot Interaction
Feb 19, 2026 · Citations: 0 · Score: 4.5

Eval: Automatic Metrics · Metrics: Task success

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
ABCD: All Biases Come Disguised Feb 19, 2026	Automatic Metrics	DROP	Accuracy	Not reported
RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering Feb 19, 2026	Automatic Metrics	PopQA	Recall	Not reported
BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios Feb 19, 2026	Automatic Metrics	Bankmathbench	Accuracy	Not reported
Modeling Distinct Human Interaction in Web Agents Feb 19, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
What Makes a Good Doctor Response? A Study on Text-Based Telemedicine Feb 19, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
Mind the Style: Impact of Communication Style on Human-Chatbot Interaction Feb 19, 2026	Automatic Metrics	Not reported	Task success	Not reported
CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts Feb 19, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
What Language is This? Ask Your Tokenizer Feb 19, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR$\rightarrow$LLM Pipelines? Feb 19, 2026	Automatic Metrics	Not reported	Accuracy, Jailbreak success rate	Not reported
KLong: Training LLM Agent for Extremely Long-horizon Tasks Feb 19, 2026	Not reported	SWE Bench, MLE Bench	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (11.5% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (1.6% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (3.3% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (9.8% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (13.1% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (9.8% vs 35% target).

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 1.6% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (13.1% coverage).
Annotation unit is under-specified (9.8% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (Bankmathbench vs MLE-Bench) before comparing methods.
Track metric sensitivity by reporting both accuracy and task success.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries

Judge vs Human Agreement Benchmark Slice: Bankmathbench Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 1.6% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (13.1% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (22)
Human Eval (1)
Llm As Judge (1)
Simulation Env (1)

Top Metrics

Accuracy (5)
Task success (1)

Top Benchmarks

Bankmathbench (1)
MLE Bench (1)
Paperbench (1)
SWE Bench (1)

Quality Controls

Adjudication (1)

Papers In This Archive Slice

Understanding Unreliability of Steering Vectors in Language Models: Geometric Predictors and the Limits of Linear Approximations
Joschka Braun · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ADAPT: Hybrid Prompt Optimization for LLM Feature Visualization
João N. Cardoso, Arlindo L. Oliveira, Bruno Martins · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Mind the Style: Impact of Communication Style on Human-Chatbot Interaction
Erik Derner, Dalibor Kučera, Aditya Gulati, Ayoub Bagheri, Nuria Oliver · Feb 19, 2026 · Citations: 0

Web Browsing

Conversational agents increasingly mediate everyday digital interactions, yet the effects of their communication style on user experience and task success remain unclear.
On the scaling relationship between cloze probabilities and language model next-token prediction
Cassandra L. Jacobs, Morgan Grobol · Feb 19, 2026 · Citations: 0

While even the best models under-allocate probability mass to human responses, larger models assign higher-quality estimates of next tokens and their likelihood of production in cloze data because they are less sensitive to lexical…
TFL: Targeted Bit-Flip Attack on Large Language Model
Jingkai Guo, Chaitali Chakrabarti, Deliang Fan · Feb 19, 2026 · Citations: 0

Large language models (LLMs) are increasingly deployed in safety and security critical applications, raising concerns about their robustness to model parameter fault injection attacks.
Neural Synchrony Between Socially Interacting Language Models
Zhining Zhang, Wentao Zhu, Chi Han, Yizhou Wang, Heng Ji · Feb 19, 2026 · Citations: 0

Neuroscience has uncovered a fundamental mechanism of our social nature: human brain activity becomes synchronized with others in many social contexts involving interaction.
QueryPlot: Generating Geological Evidence Layers using Natural Language Queries for Mineral Exploration
Meng Ye, Xiao Lin, Georgina Lukoczki, Graham W. Lederer, Yi Yao · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Sink-Aware Pruning for Diffusion Language Models
Aidar Myrzakhan, Tianyi Li, Bowei Guo, Shengkun Tang, Zhiqiang Shen · Feb 19, 2026 · Citations: 0

Long Horizon

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts
Juri Opitz, Corina Raclé, Emanuela Boros, Andrianos Michail, Matteo Romanello · Feb 19, 2026 · Citations: 0

HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts.
What Language is This? Ask Your Tokenizer
Clara Meister, Ahmetcan Yavuz, Pietro Lesci, Tiago Pimentel · Feb 19, 2026 · Citations: 0

Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it facilitates corpus curation, training data analysis, and cross-lingual evaluation of large language models.
Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking
Iskar Deng, Nathalia Xu, Shane Steinert-Threlkeld · Feb 19, 2026 · Citations: 0

Pairwise Preference

Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order.
Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting
Xiaohan Zhao, Zhaoyi Li, Yaxin Luo, Jiacheng Cui, Zhiqiang Shen · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Unmasking the Factual-Conceptual Gap in Persian Language Models
Alireza Sakhaeirad, Ali Ma'manpoosh, Arshia Hemmat · Feb 19, 2026 · Citations: 0

While emerging Persian NLP benchmarks have expanded into pragmatics and politeness, they rarely distinguish between memorized cultural facts and the ability to reason about implicit social norms.
The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR$\rightarrow$LLM Pipelines?
Jayadev Billa · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Modeling Distinct Human Interaction in Web Agents
Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou · Feb 19, 2026 · Citations: 0

Pairwise Preference Web Browsing

In this work, we introduce the task of modeling human intervention to support collaborative web task execution.
KLong: Training LLM Agent for Extremely Long-horizon Tasks
Yue Liu, Yingwei Ma, Yibo Miao, Yanhao Li, Yuchong Xie · Feb 19, 2026 · Citations: 0

Rubric Rating Long Horizon

Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics.
Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
Jyotin Goel, Souvik Maji, Pratik Mazumder · Feb 19, 2026 · Citations: 0

Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates.
Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability
Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar · Feb 19, 2026 · Citations: 0

Multi Agent

In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other.
Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding Problems
Zhangqi Duan, Arnav Kankaria, Dhruv Kartik, Andrew Lan · Feb 19, 2026 · Citations: 0

Human evaluation further demonstrates substantial agreement between LLM and expert annotations.
The Anxiety of Influence: Bloom Filters in Transformer Attention Heads
Peter Balogh · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Bridging the Domain Divide: Supervised vs. Zero-Shot Clinical Section Segmentation from MIMIC-III to Obstetrics
Baris Karacan, Barbara Di Eugenio, Patrick Thornton · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection
Yichen Lu, Siwei Nie, Minlong Lu, Xudong Yang, Xiaobo Zhang · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
What Do LLMs Associate with Your Name? A Human-Centered Black-Box Audit of Personal Data
Dimitri Staufer, Kirsten Morehouse · Feb 19, 2026 · Citations: 0

Large language models (LLMs), and conversational agents based on them, are exposed to personal data (PD) during pre-training and during user interactions.
Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian
Pietro Ferrazzi, Mattia Franzin, Alberto Lavelli, Bernardo Magnini · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Auditing Reciprocal Sentiment Alignment: Inversion Risk, Dialect Representation and Intent Misalignment in Transformers
Nusrat Jahan Lia, Shubhashis Roy Dipta · Feb 19, 2026 · Citations: 0

The core theme of bidirectional alignment is ensuring that AI systems accurately understand human intent and that humans can trust AI behavior.
PEACE 2.0: Grounded Explanations and Counter-Speech for Combating Hate Expressions
Greta Damo, Stéphane Petiot, Elena Cabrio, Serena Villata · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Entropy-Based Data Selection for Language Models
Hongming Li, Yang Liu, Chao Huang · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ABCD: All Biases Come Disguised
Mateusz Nowak, Xavier Cadet, Peter Chin · Feb 19, 2026 · Citations: 0

We propose a simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels and prompts the LLM to use the whole answer presented.
AIDG: Evaluating Asymmetry Between Information Extraction and Containment in Multi-Turn Dialogue
Adib Sakhawat, Fardeen Sadab, Rakin Shahriar · Feb 19, 2026 · Citations: 0

Evaluating the strategic reasoning capabilities of Large Language Models (LLMs) requires moving beyond static benchmarks to dynamic, multi-turn interactions.
Fine-Grained Uncertainty Quantification for Long-Form Language Model Outputs: A Comparative Study
Dylan Bouchard, Mohit Singh Chauhan, Viren Bajaj, David Skarbrevik · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF++ and BLEU Metrics
Sanjeev Kumar, Preethi Jyothi, Pushpak Bhattacharyya · Feb 19, 2026 · Citations: 0

This work presents a comparative analysis of BLEU, an n-gram-based metric, and ChrF++, a character-based metric, for MT evaluation in ELRL settings.
Diverse Word Choices, Same Reference: Annotating Lexically-Rich Cross-Document Coreference
Anastasia Zhukova, Felix Hamborg, Karsten Donnay, Norman Meuschke, Bela Gipp · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
DAVE: A Policy-Enforcing LLM Spokesperson for Secure Multi-Document Data Sharing
René Brinkhege, Prahlad Menon · Feb 19, 2026 · Citations: 0

We therefore outline an evaluation methodology to assess security, utility, and performance trade-offs under benign and adversarial querying as a basis for future empirical work on systematically governed LLM access to multi-party data…
The Role of the Availability Heuristic in Multiple-Choice Answering Behaviour
Leonidas Zotos, Hedderik van Rijn, Malvina Nissim · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering
Yiming Zhang, Siyue Zhang, Junbo Zhao, Chen Zhao · Feb 19, 2026 · Citations: 0

We evaluate RPDR on two long-tail retrieval benchmarks, PopQA and EntityQuestion, demonstrating substantial improvements over existing retrievers like BM25 and Contriver, especially on extremely long-tail categories.
WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval
Michael Dinzinger, Laura Caspari, Ali Salman, Irvin Topi, Jelena Mitrović · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Bayesian Optimality of In-Context Learning with Selective State Spaces
Di Zhang, Jiaqi Xing · Feb 19, 2026 · Citations: 0

Experiments on synthetic LG-SSM tasks and a character-level Markov benchmark confirm selective SSMs converge faster to Bayes-optimal risk, show superior sample efficiency with longer contexts in structured-noise settings, and track latent…
Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation
Bogdan Kostić, Conor Fallon, Julian Risch, Alexander Löser · Feb 19, 2026 · Citations: 0

The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison.
ArXiv-to-Model: A Practical Study of Scientific LM Training
Anuj Gupta · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Representation Collapse in Machine Translation Through the Lens of Angular Dispersion
Evgeniia Tokarchuk, Maya K. Nachesa, Sergey Troshin, Vlad Niculae · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Towards Cross-lingual Values Assessment: A Consensus-Pluralism Perspective
Yukun Chen, Xinyu Zhang, Jialong Tang, Yu Wan, Baosong Yang · Feb 19, 2026 · Citations: 0

While large language models (LLMs) have become pivotal to content safety, current evaluation paradigms primarily focus on detecting explicit harms (e.g., violence or hate speech), neglecting the subtler value dimensions conveyed in digital…
Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study
Kensuke Okada, Yui Furukawa, Kyosuke Bunji · Feb 19, 2026 · Citations: 0

Rubric Rating

Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments.
Mechanistic Interpretability of Cognitive Complexity in LLMs via Linear Probing using Bloom's Taxonomy
Bianca Raimondi, Maurizio Gabbrielli · Feb 19, 2026 · Citations: 0

The black-box nature of Large Language Models necessitates novel evaluation frameworks that transcend surface-level performance metrics.
From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwan's Humanities and Social Sciences
Yi-Chih Huang · Feb 19, 2026 · Citations: 0

Demonstrations

Generative AI is reshaping knowledge work, yet existing research focuses predominantly on software engineering and the natural sciences, with limited methodological exploration for the humanities and social sciences.
What Makes a Good Doctor Response? A Study on Text-Based Telemedicine
Adrian Cosma, Cosmin Dumitrache, Emilian Radoi · Feb 19, 2026 · Citations: 0

Expert Verification

As platforms increasingly rely on patient ratings and feedback, clinicians face growing pressure to maintain satisfaction scores, even though these evaluations often reflect communication quality more than clinical accuracy.
Continual uncertainty learning
Heisei Yonezawa, Ansei Yonezawa, Itsuro Kajiwara · Feb 19, 2026 · Citations: 0
The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI
Dusan Bosnjakovic · Feb 19, 2026 · Citations: 0

Multi Agent

As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral…
Projective Psychological Assessment of Large Multimodal Models Using Thematic Apperception Tests
Anton Dzega, Aviad Elyashar, Ortal Slobodin, Odeya Cohen, Rami Puzis · Feb 19, 2026 · Citations: 0

Their interpretations are highly consistent with those of human experts.
BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios
Yunseung Lee, Subin Kim, Youngjun Kwak, Jaegul Choo · Feb 19, 2026 · Citations: 0

Long Horizon

However, such errors have rarely been captured by existing benchmarks.
Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression
Akira Sakai, Yuma Ichikawa · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning
Hussein S. Al-Olimat, Ahmad Alshareef · Feb 19, 2026 · Citations: 0

We introduce ALPS (Arabic Linguistic & Pragmatic Suite), a native, expert-curated diagnostic challenge set probing Deep Semantics and Pragmatics, capabilities that complement specialized large-scale benchmarks.
RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models
Yunseok Han, Yejoon Lee, Jaeyoung Do · Feb 19, 2026 · Citations: 0

To operationalize this, we present RFEval, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output-level counterfactual interventions.
Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data
Deepak Uniyal, Md Abul Bashar, Richi Nayak · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Large Language Models Persuade Without Planning Theory of Mind
Jared Moore, Rasmus Overmark, Ned Cooper, Beba Cibralic, Nick Haber · Feb 19, 2026 · Citations: 0

Long Horizon

A growing body of work attempts to evaluate the theory of mind (ToM) abilities of humans and large language models (LLMs) using static, non-interactive question-and-answer benchmarks.
ReIn: Conversational Error Recovery with Reasoning Inception
Takyoung Kim, Jinseok Nam, Chandrayee Basu, Xing Fan, Chengyuan Ma · Feb 19, 2026 · Citations: 0

Conversational agents powered by large language models (LLMs) with tool integration achieve strong performance on fixed task-oriented dialogue datasets but remain vulnerable to unanticipated, user-induced errors.
Arcee Trinity Large Technical Report
Varun Singh, Lucas Krauss, Sami Jaghouar, Matej Sirovatka, Charles Goddard · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History
Serin Kim, Sangam Lee, Dongha Lee · Feb 19, 2026 · Citations: 0

Pairwise Preference

Large language models have advanced web agents, yet current agents lack personalization capabilities.
Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases
Zhao Tan, Yiji Zhao, Shiyu Wang, Chang Xu, Yuxuan Liang · Feb 19, 2026 · Citations: 0

To enable effective evaluation, we introduce NLQTSBench, the first large-scale benchmark designed for NLQ over TSDB-scale histories.
Exploring LLMs for User Story Extraction from Mockups
Diego Firmenich, Leandro Antonelli, Bruno Pazos, Fabricio Lozada, Leonardo Morales · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Characterizing the Predictive Impact of Modalities with Supervised Latent-Variable Modeling
Divyam Madaan, Sumit Chopra, Kyunghyun Cho · Feb 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote