- Updating Parametric Knowledge with Context Distillation Retains Post-Training Capabilities
Shankar Padmanabhan, Mustafa Omer Gul, Tanya Goyal · Feb 17, 2026
Post-training endows pretrained LLMs with a variety of desirable skills, including instruction-following and reasoning.
- Why Any-Order Autoregressive Models Need Two-Stream Attention: A Structural-Semantic Tradeoff
Patrick Pynadath, Ruqi Zhang · Feb 17, 2026
Any-order autoregressive models (AO-ARMs) offer a promising path toward efficient masked diffusion by enabling native key-value caching, but competitive performance has so far required two-stream attention, typically motivated as a means of…
- Language Statistics and False Belief Reasoning: Evidence from 41 Open-Weight LMs
Sean Trott, Samuel Taylor, Cameron Jones, James A. Michaelov, Pamela D. Rivière · Feb 17, 2026
Research on mental state reasoning in language models (LMs) has the potential to inform theories of human social cognition--such as the theory that mental state reasoning emerges in part from language exposure--and our understanding of LMs.
- Surgical Activation Steering via Generative Causal Mediation
Aruna Sankaranarayanan, Amir Zur, Atticus Geiger, Dylan Hadfield-Menell · Feb 17, 2026
Where should we intervene in a language model (LM) to control behaviors that are diffused across many tokens of a long-form response?
- CLAA: Cross-Layer Attention Aggregation for Accelerating LLM Prefill
Bradley McDanel, Steven Li, Harshit Khaitan · Feb 17, 2026
This oracle reveals that existing heuristics exhibit high variance across layers: rankings can degrade sharply at specific layers, a failure mode invisible to end-to-end benchmarks.
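The cross-layer aggregation the title suggests can be sketched as averaging per-layer token-importance scores, so that a single mis-ranking layer cannot dominate the overall ranking. This is a toy illustration under that assumption, not the paper's actual method:

```python
def aggregate_importance(per_layer_scores: list[list[float]]) -> list[float]:
    """Mean token-importance across layers; damps per-layer ranking noise."""
    n_layers = len(per_layer_scores)
    n_tokens = len(per_layer_scores[0])
    return [sum(layer[i] for layer in per_layer_scores) / n_layers
            for i in range(n_tokens)]

# Three layers score four prompt tokens; the third layer mis-ranks token 0
scores = [
    [0.9, 0.1, 0.5, 0.2],
    [0.8, 0.2, 0.4, 0.1],
    [0.1, 0.3, 0.6, 0.2],
]
agg = aggregate_importance(scores)
print(agg)  # token 0 still ranks highest after aggregation
```

The point of the sketch is the variance claim in the snippet: a ranking heuristic that is good in most layers but degrades sharply in one layer recovers under cross-layer averaging.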
- Multi-Objective Alignment of Language Models for Personalized Psychotherapy
Mehrab Beikzadeh, Yasaman Asadollah Salmanpour, Ashima Suvarna, Sriram Sankararaman, Matteo Malgaroli · Feb 17, 2026
While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety.
- Evidence-Grounded Subspecialty Reasoning: Evaluating a Curated Clinical Intelligence Layer on the 2025 Endocrinology Board-Style Examination
Amir Hosseinian, MohammadReza Zare Shahneh, Umer Mansoor, Gilbert Szeto, Kirill Karlin · Feb 17, 2026
Results: Mirror achieved 87.5% accuracy (105/120; 95% CI: 80.4-92.3%), exceeding a human reference of 62.3% and frontier LLMs including GPT-5.2 (74.6%), GPT-5 (74.0%), and Gemini-3-Pro (69.8%).
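The reported interval is consistent with a Wilson score interval on 105/120. A minimal pure-Python check (assuming z = 1.96 for 95% confidence) reproduces the quoted 80.4-92.3% bounds:

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k successes out of n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

lo, hi = wilson_ci(105, 120)
print(f"{lo:.1%} - {hi:.1%}")  # 80.4% - 92.3%
```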
- A Curious Class of Adpositional Multiword Expressions in Korean
Junghyun Min, Na-Rae Han, Jena D. Hwang, Nathan Schneider · Feb 17, 2026
Multiword expressions (MWEs) have been widely studied in cross-lingual annotation frameworks such as PARSEME.
- MAEB: Massive Audio Embedding Benchmark
Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha · Feb 17, 2026
We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages.
- Anatomy of Capability Emergence: Scale-Invariant Representation Collapse and Top-Down Reorganization in Neural Networks
Jayadev Billa · Feb 17, 2026
Capability emergence during neural network training remains mechanistically opaque.
- DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting
Md Mofijul Islam, Md Sirajus Salekin, Nivedha Balakrishnan, Vincil C. Bishop, Niharika Jain · Feb 17, 2026
We present the first comprehensive benchmark dataset, DocSplit, along with novel evaluation metrics for assessing the document packet splitting capabilities of large language models.
- Avey-B
Devang Acharya, Mohammad Hammoud · Feb 17, 2026
Results show that this reformulated architecture compares favorably to four widely used Transformer-based encoders, consistently outperforming them on standard token-classification and information-retrieval benchmarks while scaling more…
- Intent Laundering: AI Safety Datasets Are Not What They Seem
Shahriar Golchin, Marc Wetter · Feb 17, 2026
We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice.
- Enhancing Building Semantics Preservation in AI Model Training with Large Language Model Encodings
Suhyung Jang, Ghang Lee, Jaekun Lee, Hyunjun Lee · Feb 17, 2026
Accurate representation of building semantics, encompassing both generic object types and specific subtypes, is essential for effective AI model training in the architecture, engineering, construction, and operation (AECO) industry.
- *-PLUIE: Personalisable metric with Llm Used for Improved Evaluation
Quentin Lemesle, Léane Jourdan, Daisy Munson, Pierre Alain, Jonathan Chevelu · Feb 17, 2026
Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods.
- ViTaB-A: Evaluating Multimodal Large Language Models on Visual Table Attribution
Yahia Alqurnawi, Preetom Biswas, Anmol Rao, Tejas Anvekar, Chitta Baral · Feb 17, 2026
Multimodal Large Language Models (mLLMs) are often used to answer questions over structured data such as tables represented in Markdown, JSON, or images.
- GLM-5: from Vibe Coding to Agentic Engineering
GLM-5-Team, Aohan Zeng, Xin Lv, Zhenyu Hou · Feb 17, 2026
We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering.
- ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models
Manav Nitin Kapadnis, Lawanya Baghel, Atharva Naik, Carolyn Rosé · Feb 17, 2026
In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences.
- Beyond Binary Classification: Detecting Fine-Grained Sexism in Social Media Videos
Laura De Grazia, Danae Sánchez Villegas, Desmond Elliott, Mireia Farrús, Mariona Taulé · Feb 17, 2026
Our findings indicate that multimodal LLMs perform competitively with human annotators in identifying nuanced forms of sexism; however, they struggle to capture co-occurring sexist types when these are conveyed through visual cues.
- Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac
Chahan Vidal-Gorène, Bastien Kindt, Florian Cafiero · Feb 17, 2026
Using a novel benchmark comprising aligned training and out-of-domain test corpora, we evaluate the performance of foundation models across lemmatization and POS-tagging, and compare them with PIE, a task-specific RNN baseline.
- Causal Effect Estimation with Latent Textual Treatments
Omri Feldman, Amar Venugopal, Jann Spiess, Amir Feder · Feb 17, 2026
Understanding the causal effects of text on downstream outcomes is a central task in many applications.
- Recursive Concept Evolution for Compositional Reasoning in Large Language Models
Sarim Chaudhry · Feb 17, 2026
Large language models achieve strong performance on many complex reasoning tasks, yet their accuracy degrades sharply on benchmarks that require compositional reasoning, including ARC-AGI-2, GPQA, MATH, BBH, and HLE.
- Rethinking Metrics for Lexical Semantic Change Detection
Roksana Goworek, Haim Dubossarsky · Feb 17, 2026
Lexical semantic change detection (LSCD) increasingly relies on contextualised language model embeddings, yet most approaches still quantify change using a small set of semantic change metrics, primarily Average Pairwise Distance (APD) and…
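APD, as typically defined in LSCD work, is the mean cosine distance over all cross-period pairs of contextual usage embeddings for a target word. A minimal pure-Python sketch with toy 2-D vectors (not this paper's setup):

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1 - dot / norm

def apd(embs_t1, embs_t2):
    """Average Pairwise Distance between usage embeddings from two periods."""
    pairs = [(u, v) for u in embs_t1 for v in embs_t2]
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

# Orthogonal usage clusters across periods -> maximal change signal
print(apd([(1.0, 0.0)], [(0.0, 1.0)]))  # 1.0
# Identical usages -> no change signal
print(apd([(1.0, 0.0)], [(1.0, 0.0)]))  # 0.0
```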
- Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU
Rehana Mahfuz, Yinyi Guo, Erik Visser, Phanidhar Chinchili · Feb 17, 2026
Real-time conversational assistants for procedural tasks often depend on video input, which can be computationally expensive and compromise user privacy.
- A Content-Based Framework for Cybersecurity Refusal Decisions in Large Language Models
Noa Linder, Meirav Segal, Omer Antverg, Gil Gekker, Tomer Fichman · Feb 17, 2026
Large language models and LLM-based agents are increasingly used for cybersecurity tasks that are inherently dual-use.
- Revisiting Northrop Frye's Four Myths Theory with Large Language Models
Edirlei Soares de Lima, Marco A. Casanova, Antonio L. Furtado · Feb 17, 2026
Northrop Frye's theory of four fundamental narrative genres (comedy, romance, tragedy, satire) has profoundly influenced literary criticism, yet computational approaches to his framework have focused primarily on narrative patterns rather than…
- LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models
Ahmed Khaled Khamis, Hesham Ali · Feb 17, 2026
Despite the advances in neural text-to-speech (TTS), many Arabic dialectal varieties remain marginally addressed, with most resources concentrated on Modern Standard Arabic (MSA) and Gulf dialects, leaving Egyptian Arabic -- the most widely…
- STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
Shiqi Liu, Zeyu He, Guojian Zhan, Letian Tao, Zhilong Zheng · Feb 17, 2026
Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 7.13% ($\rho_{\mathrm{T}}$=1.0, top-p=1.0) and…
- Clinically Inspired Symptom-Guided Depression Detection from Emotion-Aware Speech Representations
Chaithra Nerella, Chiranjeevi Yarra · Feb 17, 2026
Depression manifests through a diverse set of symptoms such as sleep disturbance, loss of interest, and concentration difficulties.
- Beyond Static Pipelines: Learning Dynamic Workflows for Text-to-SQL
Yihan Wang, Peiyu Liu, Runyu Chen, Wei Xu · Feb 17, 2026
Experiments on widely-used Text-to-SQL benchmarks demonstrate that dynamic workflow construction consistently outperforms the best static workflow methods, with especially pronounced gains on complex and out-of-distribution queries.
- RUVA: Personalized Transparent On-Device Graph Reasoning
Gabriele Conte, Alessio Mattiace, Gianni Carmosino, Potito Aghilar, Giovanni Servedio · Feb 17, 2026
We propose Ruva, the first "Glass Box" architecture designed for Human-in-the-Loop Memory Curation.
- jina-embeddings-v5-text: Task-Targeted Embedding Distillation
Mohammad Kalim Akram, Saba Sturua, Nastia Havriushenko, Quentin Herreros, Michael Günther · Feb 17, 2026
Benchmark scores for the resulting models, jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano, exceed or match the state-of-the-art for models of similar size.
- Perspectives - Interactive Document Clustering in the Discourse Analysis Tool Suite
Tim Fischer, Chris Biemann · Feb 17, 2026
This paper introduces Perspectives, an interactive extension of the Discourse Analysis Tool Suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections.
- ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling
Nicol Visser, Simon Malan, Danel Slabbert, Herman Kamper · Feb 17, 2026
ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks.
- ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns
Ziyu Zhao, Tong Zhu, Zhi Zhang, Tiantian Fan, Jinluan Yang · Feb 17, 2026
Mixture-of-Experts (MoE) effectively scales model capacity while preserving computational efficiency through sparse expert activation.
- DependencyAI: Detecting AI Generated Text through Dependency Parsing
Sara Ahmed, Tracy Hammond · Feb 17, 2026
To increase interpretability, we analyze feature importance to reveal syntactic structures that distinguish AI-generated from human-written text.
- Fine-Refine: Iterative Fine-grained Refinement for Mitigating Dialogue Hallucination
Xiangyan Chen, Yujian Gan, Matthew Purver · Feb 17, 2026
The tendency for hallucination in current large language models (LLMs) negatively impacts dialogue systems.
- LuxMT Technical Report
Nils Rehlinger · Feb 17, 2026
To assess translation performance, we construct a novel benchmark covering the LB-FR and LB-EN directions using human-translated data from Luci, a tourist magazine about Luxembourg.
- Towards Expectation Detection in Language: A Case Study on Treatment Expectations in Reddit
Aswathy Velutharambath, Amelie Wührl · Feb 17, 2026
Patients' expectations towards their treatment have a substantial effect on the treatments' success.
- In Agents We Trust, but Who Do Agents Trust? Latent Source Preferences Steer LLM Generations
Mohammad Aflah Khan, Mahsa Amani, Soumi Das, Bishwamittra Ghosh, Qinyuan Wu · Feb 17, 2026
Agents based on Large Language Models (LLMs) are increasingly being deployed as interfaces to information on online platforms.
- TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models
Chansung Park, Juyong Jiang, Fan Wang, Sayak Paul, Jiasi Shen · Feb 17, 2026
TAROT systematically constructs, for each problem, a four-tier test suite (basic, intermediate, complex, edge), providing a controlled difficulty landscape for curriculum design and evaluation.
- Measuring Social Integration Through Participation: Categorizing Organizations and Leisure Activities in the Displaced Karelians Interview Archive using LLMs
Joonatan Laato, Veera Schroderus, Jenna Kanerva, Jenni Kauppi, Virpi Lummaa · Feb 17, 2026
We annotate a gold-standard set to allow for a reliable evaluation, and then test whether large language models can apply the same schema at scale.
- World-Model-Augmented Web Agents with Action Correction
Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li · Feb 17, 2026
Web agents based on large language models have demonstrated promising capability in automating web tasks.
- The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He · Feb 17, 2026
Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and information…
- Making Large Language Models Speak Tulu: Structured Prompting for an Extremely Low-Resource Language
Prathamesh Devadiga, Paras Chopra · Feb 17, 2026
Can large language models converse in languages virtually absent from their training data?
- Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework
Mengze Hong, Chen Jason Zhang, Zichang Guo, Hanlin Gu, Di Jiang · Feb 17, 2026
Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability.
- Far Out: Evaluating Language Models on Slang in Australian and Indian English
Deniz Kaya Dilsiz, Dipankar Srirag, Aditya Joshi · Feb 17, 2026
We present a comprehensive evaluation of slang awareness in Indian English (en-IN) and Australian English (en-AU) across seven state-of-the-art language models.
- NeuroSymActive: Differentiable Neural-Symbolic Reasoning with Active Exploration for Knowledge Graph Question Answering
Rong Fu, Yang Li, Zeyu Zhang, Jiekai Wu, Yaohua Liu · Feb 17, 2026
Empirical results on standard KGQA benchmarks show that NeuroSymActive attains strong answer accuracy while reducing the number of expensive graph lookups and model calls compared to common retrieval-augmented baselines.
- Discovering Implicit Large Language Model Alignment Objectives
Edward Chen, Sanmi Koyejo, Carlos Guestrin · Feb 17, 2026
To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives.
- Prescriptive Scaling Reveals the Evolution of Language Model Capabilities
Hanlin Zhang, Jikai Jin, Vasilis Syrgkanis, Sham Kakade · Feb 17, 2026
Using large-scale observational evaluations, with 5k observational and 2k newly sampled data points on model performance, we estimate capability boundaries (high conditional quantiles of benchmark scores as a function of log pre-training FLOPs) via…
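A crude version of such a capability boundary is an empirical high quantile of benchmark scores within bins of log-FLOPs. The sketch below uses synthetic points and a naive per-bin quantile; the paper's estimator is presumably more sophisticated:

```python
from collections import defaultdict

def boundary_by_bin(points, q=0.9, bin_width=1.0):
    """points: (log10_flops, score) pairs; returns per-bin empirical q-quantile."""
    bins = defaultdict(list)
    for logf, score in points:
        bins[int(logf // bin_width)].append(score)
    out = {}
    for b, scores in sorted(bins.items()):
        scores.sort()
        idx = min(len(scores) - 1, int(q * len(scores)))
        out[b * bin_width] = scores[idx]
    return out

# Synthetic: scores rise with compute; the boundary tracks the top of each bin
pts = [(22.1, 0.30), (22.6, 0.45), (23.2, 0.55), (23.8, 0.70), (23.9, 0.62)]
print(boundary_by_bin(pts))
```

Tracking a high conditional quantile rather than the mean captures what the *best* models at a given compute budget achieve, which is what "capability boundary" suggests.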
- Mnemis: Dual-Route Retrieval on Hierarchical Graphs for Long-Term LLM Memory
Zihao Tang, Xin Yu, Ziyu Xiao, Zengxuan Wen, Zelin Li · Feb 17, 2026
Mnemis achieves state-of-the-art performance across all compared methods on long-term memory benchmarks, scoring 93.9 on LoCoMo and 91.6 on LongMemEval-S using GPT-4.1-mini.
- Extracting Consumer Insight from Text: A Large Language Model Approach to Emotion and Evaluation Measurement
Stephan Ludwig, Peter J. Danaher, Xiaohao Yang, Yu-Ting Lin, Ehsan Abedin · Feb 17, 2026
Accurately measuring consumer emotions and evaluations from unstructured text remains a core challenge for marketing research and practice.
- The Information Geometry of Softmax: Probing and Steering
Kiho Park, Todd Nief, Yo Joong Choe, Victor Veitch · Feb 17, 2026
This paper concerns the question of how AI systems encode semantic structure into the geometric structure of their representation spaces.
- FrameRef: A Framing Dataset and Simulation Testbed for Modeling Bounded Rational Information Health
Victor De Lima, Jiqun Liu, Grace Hui Yang · Feb 17, 2026
Within this framework, we construct framing-sensitive agent personas by fine-tuning language models with framing-conditioned loss attenuation, inducing targeted biases while preserving overall task competence.