HFEPX Archive Slice

HFEPX Daily Archive: 2026-02-18

Updated from current HFEPX corpus (Apr 12, 2026). 58 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 58 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: LiveCodeBench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 18, 2026.

Papers: 58 Last published: Feb 18, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

High-Signal Coverage

100.0%

58 / 58 papers are not low-signal flagged.

Benchmark Anchors

15.5%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

43.1%

Papers with reported metric mentions in extraction output.

4 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.

Get this digest every Friday →

Why This Time Slice Matters

10.3% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 36.2% of papers in this hub.
LiveCodeBench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (5.2% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Feb 18, 2026 · Citations: 0 · Score: 7.0

Eval: Automatic Metrics · Metrics: Accuracy
MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
Feb 18, 2026 · Citations: 0 · Score: 7.0

Eval: Automatic Metrics · Metrics: Recall
BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization
Feb 18, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy, Recall
References Improve LLM Alignment in Non-Verifiable Domains
Feb 18, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy
IndicEval: A Bilingual Indian Educational Evaluation Framework for Large Language Models
Feb 18, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy
Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution
Feb 18, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy, Faithfulness

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling Feb 18, 2026	Automatic Metrics	LiveCodeBench	Accuracy	Calibration
MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks Feb 18, 2026	Automatic Metrics	Memoryarena	Recall	Not reported
BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization Feb 18, 2026	Automatic Metrics	Banglasummeval	Accuracy, Recall	Not reported
References Improve LLM Alignment in Non-Verifiable Domains Feb 18, 2026	Automatic Metrics	LMSYS Chatbot Arena, AlpacaEval	Accuracy	Not reported
IndicEval: A Bilingual Indian Educational Evaluation Framework for Large Language Models Feb 18, 2026	Automatic Metrics	Indiceval	Accuracy	Not reported
Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution Feb 18, 2026	Automatic Metrics	BIG Bench, Zebralogicbench	Accuracy, Faithfulness	Not reported
Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification Feb 18, 2026	Automatic Metrics	Not reported	Agreement, Cost	Inter Annotator Agreement Reported
Discrete Stochastic Localization for Non-autoregressive Generation Feb 18, 2026	Automatic Metrics	Not reported	Latency	Calibration
Reinforced Fast Weights with Next-Sequence Prediction Feb 18, 2026	Not reported	LongBench, Needle In A Haystack	Context length, Coherence	Not reported
Meenz bleibt Meenz, but Large Language Models Do Not Speak Its Dialect Feb 18, 2026	Automatic Metrics	Not reported	Accuracy	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (10.3% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (6.9% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (3.4% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (6.9% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (5.2% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (8.6% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 6.9% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (5.2% coverage).
Annotation unit is under-specified (8.6% coverage).

Suggested Next Analyses

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.
Stratify by benchmark (LiveCodeBench vs Memoryarena) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.

Recommended Queries

Human Eval Protocols Benchmark Slice: LiveCodeBench Metric Slice: accuracy IAA-Reported Evaluations Recent High-Signal Papers

Known Limitations

Only 6.9% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (5.2% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (21)
Simulation Env (4)
Human Eval (1)

Top Metrics

Accuracy (2)
Cost (1)
Inference cost (1)
Latency (1)

Top Benchmarks

LiveCodeBench (1)
Memoryarena (1)

Quality Controls

Calibration (3)
Inter Annotator Agreement Reported (1)

Papers In This Archive Slice

Eigenmood Space: Uncertainty-Aware Spectral Graph Analysis of Psychological Patterns in Classical Persian Poetry
Kourosh Shahnazari, Seyed Moein Ayyoubzadeh, Mohammadali Keshtparvar · Feb 18, 2026 · Citations: 0

The resulting framework supports scalable, auditable digital-humanities analysis while preserving interpretive caution by propagating uncertainty from verse-level evidence to poet-level inference.
When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English
Hasan Can Biyik, Libby Barak, Jing Peng, Anna Feldman · Feb 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders
Ofer Meshi, Krisztian Balog, Sally Goldman, Avi Caciularu, Guy Tennenholtz · Feb 18, 2026 · Citations: 0

We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap.
A Reversible Semantics for Janus
Ivan Lanese, Germán Vidal · Feb 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation
Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani · Feb 18, 2026 · Citations: 0

Multi Agent

MALLVI presents a Multi Agent Large Language and Vision framework that enables closed-loop feedback driven robotic manipulation.
SimToolReal: An Object-Centric Policy for Zero-Shot Dexterous Tool Manipulation
Kushal Kedia, Tyler Ga Wei Lum, Jeannette Bohg, C. Karen Liu · Feb 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Meenz bleibt Meenz, but Large Language Models Do Not Speak Its Dialect
Minh Duc Bui, Manuel Mager, Peter Herbert Kann, Katharina von der Wense · Feb 18, 2026 · Citations: 0

We introduce a digital dictionary-an NLP-ready dataset derived from an existing resource (Schramm, 1966)-to support researchers in modeling and benchmarking the language.
BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization
Ahmed Rafid, Rumman Adib, Fariya Ahmed, Ajwad Abrar, Mohammed Saidul Islam · Feb 18, 2026 · Citations: 0

However, most existing evaluation metrics overlook Bangla, a widely spoken yet under-resourced language, and often depend on reference summaries.
Training Large Reasoning Models Efficiently via Progressive Thought Encoding
Zeliang Zhang, Xiaodong Liu, Hao Cheng, Hao Sun, Chenliang Xu · Feb 18, 2026 · Citations: 0

Experiments on three models, including Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, and DeepSeek-R1-Distill-Llama-8B, on six widely used challenging mathematical benchmarks show consistent gains: our method achieves +19.3% improvement over…
Claim Automation using Large Language Model
Zhengda Mo, Zhiyu Quan, Eli O'Donohue, Kaiwen Zhong · Feb 18, 2026 · Citations: 0

We assess this module using a multi-dimensional evaluation framework that combines automated semantic similarity metrics with human evaluation, enabling a rigorous examination of both practical utility and predictive accuracy.
IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages
Priyaranjan Pattnayak, Sanchari Chowdhuri · Feb 18, 2026 · Citations: 0

Red Team

Safety alignment of large language models (LLMs) is mostly evaluated in English and contract-bound, leaving multilingual vulnerabilities understudied.
Hybrid-Gym: Training Coding Agents to Generalize Across Tasks
Yiqing Xie, Emmy Liu, Gaokai Zhang, Nachiket Kotalwar, Shubham Gandhi · Feb 18, 2026 · Citations: 0

When assessing the quality of coding agents, predominant benchmarks focus on solving single issues on GitHub, such as SWE-Bench.
Flow Map Language Models: One-step Language Modeling via Continuous Denoising
Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang · Feb 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark
Charalampos Mastrokostas, Nikolaos Giarelis, Nikos Karacapilidis · Feb 18, 2026 · Citations: 0

In this study, we address this research gap in Greek QA by contributing: (i) DemosQA, a novel dataset, which is constructed using social media user questions and community-reviewed answers to better capture the Greek social and cultural…
References Improve LLM Alignment in Non-Verifiable Domains
Kejian Shi, Yixin Liu, Peifeng Wang, Alexander R. Fabbri, Shafiq Joty · Feb 18, 2026 · Citations: 0

Through comprehensive experiments, we show that a reference-guided approach substantially improves the accuracy of less capable LLM-judges using references from frontier models; stronger LLM-judges can also be enhanced by high-quality…
Better Think Thrice: Learning to Reason Causally with Double Counterfactual Consistency
Victoria Lin, Xinnuo Xu, Rachel Lawrence, Risa Ueno, Amit Sharma · Feb 18, 2026 · Citations: 0

Despite their strong performance on reasoning benchmarks, large language models (LLMs) have proven brittle when presented with counterfactual questions, suggesting weaknesses in their causal reasoning ability.
Omitted Variable Bias in Language Models Under Distribution Shift
Victoria Lin, Louis-Philippe Morency, Eli Ben-Michael · Feb 18, 2026 · Citations: 0

Importantly, we identify that the resulting omitted variable bias from unobserved variables can compromise both evaluation and optimization in language models.
Reinforced Fast Weights with Next-Sequence Prediction
Hee Seung Hwang, Xindi Wu, Sanghyuk Chun, Olga Russakovsky · Feb 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents
Wenxuan Ding, Nicholas Tomlin, Greg Durrett · Feb 18, 2026 · Citations: 0

Each problem has latent environment state that can be reasoned about via a prior which is passed to the LLM agent.
Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens
Potsawee Manakul, Woody Haosheng Gan, Martijn Bartelds, Guangzhi Sun, William Held · Feb 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment
Yuyan Bu, Xiaohao Liu, ZhaoXing Ren, Yaodong Yang, Juntao Dai · Feb 18, 2026 · Citations: 0

Pairwise Preference

The widespread deployment of large language models (LLMs) across linguistic communities necessitates reliable multilingual safety alignment.
Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval
Subrit Dikshit · Feb 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
AREG: Adversarial Resource Extraction Game for Evaluating Persuasion and Resistance in Large Language Models
Adib Sakhawat, Fardeen Sadab · Feb 18, 2026 · Citations: 0

We introduce the Adversarial Resource Extraction Game (AREG), a benchmark that operationalizes persuasion and resistance as a multi-turn, zero-sum negotiation over financial resources.
Who can we trust? LLM-as-a-jury for Comparative Assessment
Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill · Feb 18, 2026 · Citations: 0

Pairwise Preference

Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment often using pairwise comparative judgements.
ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models
Antoine Chaffin, Luca Arnaboldi, Amélie Chatelain, Florent Krzakala · Feb 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models
Melkamu Abay Mersha, Jugal Kalita · Feb 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
CitiLink-Summ: Summarization of Discussion Subjects in European Portuguese Municipal Meeting Minutes
Miguel Marques, Ana Luísa Fernandes, Ana Filipa Pacheco, Rute Rebouças, Inês Cantante · Feb 18, 2026 · Citations: 0

A major bottleneck is the scarcity of datasets containing high-quality, manually crafted summaries, which limits the development and evaluation of effective summarization models for this domain.
Creating a digital poet
Vered Tohar, Tsahi Hayat, Amir Leshem · Feb 18, 2026 · Citations: 0

Long Horizon

In a blinded authorship test with 50 humanities students and graduates (three AI poems and three poems by well-known poets each), judgments were at chance: human poems were labeled human 54% of the time and AI poems 52%, with 95% confidence…
Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset
Zhuqian Zhou, Kirk Vanacore, Bakhtawar Ahtisham, Jinsook Lee, Doug Pietrzak · Feb 18, 2026 · Citations: 0

To address this challenge, we investigate the "numeric ambiguity" problem and introduce MathEd-PII, the first benchmark dataset for PII detection in math tutoring dialogues, created through a human-in-the-loop LLM workflow that audits…
Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification
Taja Kuzman Pungeršek, Peter Rupnik, Daniela Širinić, Nikola Ljubešić · Feb 18, 2026 · Citations: 0

Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans, and the resulting model outperforms existing CAP classifiers trained on manually-annotated but out-of-domain data.
Optimizing Soft Prompt Tuning via Structural Evolution
Zhenzhen Huang, Chaoning Zhang, Haoyu Bian, Songbo Zhang, Chi-lok Andy Tai · Feb 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
From Growing to Looping: A Unified View of Iterative Computation in LLMs
Ferdinand Kapl, Emmanouil Angelis, Kaitlin Maile, Johannes von Oswald, Stefan Bauer · Feb 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Learning to Learn from Language Feedback with Social Meta-Learning
Jonathan Cook, Diego Antognini, Martin Klissarov, Claudiu Musat, Edward Grefenstette · Feb 18, 2026 · Citations: 0

They are rarely proactive in soliciting this feedback, even when faced with ambiguity, which can make their dialogues feel static, one-sided, and lacking the adaptive qualities of human conversation.
Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026 · Citations: 0

Multi Agent

Existing Multi-Agent Systems (MAS) typically rely on homogeneous model configurations, failing to exploit the diverse expertise inherent in different post-trained architectures.
Training Models on Dialects of Translationese Shows How Lexical Diversity and Source-Target Syntactic Similarity Shape Learning
Jenny Kunz · Feb 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
IndicEval: A Bilingual Indian Educational Evaluation Framework for Large Language Models
Saurabh Bharti, Gaurav Azad, Abhinaw Jagtap, Nachiket Tapas · Feb 18, 2026 · Citations: 0

The rapid advancement of large language models (LLMs) necessitates evaluation frameworks that reflect real-world academic rigor and multilingual complexity.
TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers
Ido Levy, Eilam Shapira, Yinon Goldshtein, Avi Yaeli, Nir Mashkif · Feb 18, 2026 · Citations: 0

Long Horizon

We propose TabAgent, a framework for replacing generative decision components in closed-set selection tasks with a compact textual-tabular classifier trained on execution traces.
Verifiable Semantics for Agent-to-Agent Communication
Philipp Schoenegger, Matt Carlson, Chris Schneider, Chris Daly · Feb 18, 2026 · Citations: 0

Multi Agent

Multiagent AI systems require consistent communication, but we lack methods to verify that agents share the same understanding of the terms used.
Label-Consistent Data Generation for Aspect-Based Sentiment Analysis Using LLM Agents
Mohammad H. A. Monfared, Lucie Flek, Akbar Karimi · Feb 18, 2026 · Citations: 0

We propose an agentic data augmentation method for Aspect-Based Sentiment Analysis (ABSA) that uses iterative generation and verification to produce high quality synthetic training examples.
Variable-Length Semantic IDs for Recommender Systems
Kirill Khrylchenko · Feb 18, 2026 · Citations: 0

In parallel, the emergent communication literature studies how agents develop discrete communication protocols, often producing variable-length messages in which frequent concepts receive shorter descriptions.
AI-Driven Structure Refinement of X-ray Diffraction
Bin Cao, Qian Zhang, Zhenjie Feng, Taolue Zhang, Jiaqiang Huang · Feb 18, 2026 · Citations: 0

We benchmark WPEM on standard reference patterns (PbSO_4 and Tb_2BaCoO_5), where it yields lower R_p/R_{wp} than widely used packages (FullProf and TOPAS) under matched refinement conditions.
Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents
Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut · Feb 18, 2026 · Citations: 0

Red Team

LLM-based agents execute real-world workflows via tools and memory.
MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen · Feb 18, 2026 · Citations: 0

Pairwise Preference Web Browsing

Existing evaluations of agents with memory typically assess memorization and action in isolation.
PREFER: An Ontology for the PREcision FERmentation Community
Txell Amigó, Shawn Zheng Kai Tan, Angel Luu Phanthanourak, Sebastian Schulz, Pasquale D. Colaianni · Feb 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
MultiCW: A Large-Scale Balanced Benchmark Dataset for Training Robust Check-Worthiness Detection Models
Martin Hyben, Sebastian Kula, Jan Cegin, Jakub Simko, Ivan Srba · Feb 18, 2026 · Citations: 0

We introduce the Multi-Check-Worthy (MultiCW) dataset, a balanced multilingual benchmark for check-worthy claim detection spanning 16 languages, 7 topical domains, and 2 writing styles.
Aladdin-FTI @ AMIYA Three Wishes for Arabic NLP: Fidelity, Diglossia, and Multidialectal Generation
Jonathan Mutal, Perla Al Almaoui, Simon Hengchen, Pierrette Bouillon · Feb 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Lyapunov Spectral Analysis of Speech Embedding Trajectories in Psychosis
Jelena Vasic, Branislav Andjelic, Ana Mancic, Dusica Filipovic Djurdjevic, Ljiljana Mihic · Feb 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Are LLMs Ready to Replace Bangla Annotators?
Md. Najib Hasan, Touseef Hasan, Souvika Sarkar · Feb 18, 2026 · Citations: 0

Large Language Models (LLMs) are increasingly used as automated annotators to scale dataset creation, yet their reliability as unbiased annotators--especially for low-resource and identity-sensitive settings--remains poorly understood.
Long-Tail Knowledge in Large Language Models: Taxonomy, Mechanisms, Interventions and Implications
Sanket Badhe, Deep Shah, Nehal Kathrotia · Feb 18, 2026 · Citations: 0

We further examine how existing evaluation practices obscure tail behavior and complicate accountability for rare but consequential failures.
The Validity of Coreference-based Evaluations of Natural Language Understanding
Ian Porada · Feb 18, 2026 · Citations: 0

In this thesis, I refine our understanding as to what conclusions we can reach from coreference-based evaluations by expanding existing evaluation practices and considering the extent to which evaluation results are either converging or…
ModalImmune: Immunity Driven Unlearning via Self Destructive Training
Rong Fu, Jia Yee Tan, Zijian Zhang, Ziming Wang, Zhaolu Kang · Feb 18, 2026 · Citations: 0

Empirical evaluation on standard multimodal benchmarks demonstrates that ModalImmune improves resilience to modality removal and corruption while retaining convergence stability and reconstruction capacity.
Beyond Learning: A Training-Free Alternative to Model Adaptation
Namkyung Yoon, Kyeonghyun Yoo, Wooyong Jung, Sanghong Kim, Hwangnam Kim · Feb 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Learning Personalized Agents from Human Feedback
Kaiqu Liang, Julia Kruk, Shengyi Qian, Xianjun Yang, Shengjie Bi · Feb 18, 2026 · Citations: 0

Pairwise Preference

We introduce Personalized Agents from Human Feedback (PAHF), a framework for continual personalization in which agents learn online from live interaction using explicit per-user memory.
Discrete Stochastic Localization for Non-autoregressive Generation
Yunshu Wu, Jiayi Cheng, Partha Thakuria, Rob Brekelmans, Evangelos E. Papalexakis · Feb 18, 2026 · Citations: 0

On OpenWebText, DSL fine-tuning yields large MAUVE gains at low step budgets, surpassing the MDLM+ReMDM baseline with \(\sim\)4\times fewer denoiser evaluations, and matches autoregressive quality at high budgets.
LLMs Exhibit Significantly Lower Uncertainty in Creative Writing Than Professional Writers
Peiqi Sui · Feb 18, 2026 · Citations: 0

We formalize this tension by quantifying the "uncertainty gap" between human-authored stories and model-generated continuations.
Emotion Collider: Dual Hyperbolic Mirror Manifolds for Sentiment Recovery via Anti Emotion Reflection
Rong Fu, Ziming Wang, Shuo Yin, Haiyun Wei, Kun Liu · Feb 18, 2026 · Citations: 0

Emotional expression underpins natural communication and effective human-computer interaction.
Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution
Nithin Sivakumaran, Shoubin Yu, Hyunji Lee, Yue Zhang, Ali Payani · Feb 18, 2026 · Citations: 0

On multiple reasoning benchmarks (BIG-Bench Extra Hard, MuSR, ZebraLogicBench, and FOLIO), REMUL consistently and substantially improves three measures of faithfulness -- hint attribution, early answering area over the curve (AOC), and…
Missing-by-Design: Certifiable Modality Deletion for Revocable Multimodal Sentiment Analysis
Rong Fu, Ziming Wang, Chunlei Meng, Jiaxuan Lu, Jiekai Wu · Feb 18, 2026 · Citations: 0

Experiments on benchmark datasets show that MBD achieves strong predictive performance under incomplete inputs and delivers a practical privacy-utility trade-off, positioning surgical unlearning as an efficient alternative to full…

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now