HFEPX Archive Slice

HFEPX Daily Archive: 2026-02-16

Updated from current HFEPX corpus (Apr 12, 2026). 63 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 63 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Freeform. Frequent quality control: Calibration. Frequently cited benchmark: Innoeval. Common metric signal: bleu. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 16, 2026.

Papers: 63 Last published: Feb 16, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

Analysis blocks are computed from the loaded sample (60 of 63 papers).

High-Signal Coverage

100.0%

60 / 60 papers are not low-signal flagged.

Benchmark Anchors

13.3%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

45.0%

Papers with reported metric mentions in extraction output.

3 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.

Get this digest every Friday →

Why This Time Slice Matters

11.1% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 31.7% of papers in this hub.
Innoeval is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (3.2% of papers).
Rater context is mostly domain experts, and annotation is commonly Freeform; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Scaling Beyond Masked Diffusion Language Models
Feb 16, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Perplexity
LLMStructBench: Benchmarking Large Language Model Structured Data Extraction
Feb 16, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy
Evolutionary System Prompt Learning for Reinforcement Learning in LLMs
Feb 16, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Success rate
HyperRAG: Reasoning N-ary Facts over Hypergraphs for Retrieval Augmented Generation
Feb 16, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy, Mrr
Robust Bias Evaluation with FilBBQ: A Filipino Bias Benchmark for Question-Answering Language Models
Feb 16, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy
Counterfactual Fairness Evaluation of LLM-Based Contact Center Agent Quality Assurance System
Feb 16, 2026 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Accuracy

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Scaling Beyond Masked Diffusion Language Models Feb 16, 2026	Automatic Metrics	GSM8K	Perplexity	Not reported
LLMStructBench: Benchmarking Large Language Model Structured Data Extraction Feb 16, 2026	Automatic Metrics	Llmstructbench	Accuracy	Not reported
Evolutionary System Prompt Learning for Reinforcement Learning in LLMs Feb 16, 2026	Automatic Metrics	AIME	Success rate	Not reported
HyperRAG: Reasoning N-ary Facts over Hypergraphs for Retrieval Augmented Generation Feb 16, 2026	Automatic Metrics	HotpotQA	Accuracy, Mrr	Not reported
Robust Bias Evaluation with FilBBQ: A Filipino Bias Benchmark for Question-Answering Language Models Feb 16, 2026	Automatic Metrics	BBQ	Accuracy	Not reported
Counterfactual Fairness Evaluation of LLM-Based Contact Center Agent Quality Assurance System Feb 16, 2026	Automatic Metrics	Not reported	Accuracy	Calibration
Feature Recalibration Based Olfactory-Visual Multimodal Model for Enhanced Rice Deterioration Detection Feb 16, 2026	Automatic Metrics	Not reported	Accuracy	Calibration
TruthStance: An Annotated Dataset of Conversations on Truth Social Feb 16, 2026	Automatic Metrics	Not reported	Agreement	Inter Annotator Agreement Reported
Weight space Detection of Backdoors in LoRA Adapters Feb 16, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
Seeing to Generalize: How Visual Data Corrects Binding Shortcuts Feb 16, 2026	Automatic Metrics	Not reported	Accuracy	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (11.1% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (6.3% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (1.6% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (3.2% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (12.7% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (6.3% vs 35% target).

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 6.3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (12.7% coverage).
Annotation unit is under-specified (6.3% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Track metric sensitivity by reporting both bleu and f1.

Recommended Queries

Judge vs Human Agreement Benchmark Slice: Innoeval Metric Slice: bleu IAA-Reported Evaluations Recent High-Signal Papers

Known Limitations

Only 6.3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (12.7% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (20)
Llm As Judge (2)
Human Eval (1)

Top Metrics

Bleu (1)
F1 (1)
Rouge (1)

Top Benchmarks

Innoeval (1)

Quality Controls

Calibration (2)
Adjudication (1)
Inter Annotator Agreement Reported (1)

Papers In This Archive Slice

Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection
Hengshuai Yao, Xing Chen, Ahmed Murtadha, Guan Wang · Feb 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
How to Train Your Long-Context Visual Document Model
Austin Veselka · Feb 16, 2026 · Citations: 0

Pairwise Preference

We systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive LC evaluations and ablations to bridge this gap, and achieve state-of-the-art…
Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems
Mason Nakamura, Abhinav Kumar, Saswat Das, Sahar Abdelnabi, Saaduddin Mahmud · Feb 16, 2026 · Citations: 0

Multi Agent

Multi-agent systems, where LLM agents communicate through free-form language, enable sophisticated coordination for solving complex cooperative tasks.
OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction
Skyler Hallinan, Thejas Venkatesh, Xiang Ren, Sai Praneeth Karimireddy, Ashwin Paranjape · Feb 16, 2026 · Citations: 0

Tool Use

Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks.
Weight space Detection of Backdoors in LoRA Adapters
David Puertolas Merenciano, Ekaterina Vasyagina, Kevin Zhu, Javier Ferrando, Maheep Chaudhary · Feb 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
AIC CTU@AVerImaTeC: dual-retriever RAG for image-text fact checking
Herbert Ullrich, Jan Drchal · Feb 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ScrapeGraphAI-100k: A Large-Scale Dataset for LLM-Based Web Information Extraction
William Brach, Francesco Zuppichini, Marco Vinciguerra, Lorenzo Padoan · Feb 16, 2026 · Citations: 0

ScrapeGraphAI-100k enables fine-tuning small models, benchmarking structured extraction, and studying schema induction for web IR indexing, and is publicly available on HuggingFace.
Seeing to Generalize: How Visual Data Corrects Binding Shortcuts
Nicolas Buzeta, Felipe del Rio, Cristian Hinostroza, Denis Parra, Hans Lobel · Feb 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Protecting Language Models Against Unauthorized Distillation through Trace Rewriting
Xinhang Ma, William Yeoh, Ning Zhang, Yevgeniy Vorobeychik · Feb 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
CGRA-DeBERTa Concept Guided Residual Augmentation Transformer for Theologically Islamic Understanding
Tahir Hussain, Saddam Hussain Khan · Feb 16, 2026 · Citations: 0

The qualitative evaluation noted better extraction and discrimination and theological precision.
Symmetry in language statistics shapes the geometry of model representations
Dhruva Karkada, Daniel J. Korchinski, Andres Nava, Matthieu Wyart, Yasaman Bahri · Feb 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Scaling Beyond Masked Diffusion Language Models
Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu · Feb 16, 2026 · Citations: 0

Among discrete diffusion approaches, Masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks.
Text Style Transfer with Parameter-efficient LLM Finetuning and Round-trip Translation
Ruoxi Liu, Philipp Koehn · Feb 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Cold-Start Personalization via Training-Free Priors from Structured World Models
Avinandan Bose, Shuyue Stella Li, Faeze Brahman, Pang Wei Koh, Simon Shaolei Du · Feb 16, 2026 · Citations: 0

Pairwise Preference

Cold-start personalization requires inferring user preferences through interaction when no user-specific historical data is available.
Learning User Interests via Reasoning and Distillation for Cross-Domain News Recommendation
Mengdan Zhu, Yufan Zhao, Tao Di, Yulan Yan, Liang Zhao · Feb 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Counterfactual Fairness Evaluation of LLM-Based Contact Center Agent Quality Assurance System
Kawin Mayilvaghanan, Siddhant Gupta, Ayush Kumar · Feb 16, 2026 · Citations: 0

Large Language Models (LLMs) are increasingly deployed in contact-center Quality Assurance (QA) to automate agent performance evaluation and coaching feedback.
Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition
Varun Nathan, Shreyas Guha, Ayush Kumar · Feb 16, 2026 · Citations: 0

Critique Edit

We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers, where answering a query for business insights, our target use case, requires decomposing it into executable steps over structured tools…
BFS-PO: Best-First Search for Large Reasoning Models
Fiorenzo Parascandolo, Wenhui Tan, Enver Sangineto, Ruihua Song, Rita Cucchiara · Feb 16, 2026 · Citations: 0

Using different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase the LRM accuracy and shorten its answers.
Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research
Matteo Rinaldi, Rossella Varvara, Viviana Patti · Feb 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Learning State-Tracking from Code Using Linear RNNs
Julien Siems, Riccardo Grazzi, Kirill Kalinin, Hitesh Ballani, Babak Rahmani · Feb 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque
Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri · Feb 16, 2026 · Citations: 0

Physical commonsense reasoning represents a fundamental capability of human intelligence, enabling individuals to understand their environment, predict future events, and navigate physical spaces.
Overthinking Loops in Agents: A Structural Risk via MCP Tools
Yohan Lee, Jisoo Jang, Seoyeon Choi, Sangyeop Kim, Seungtaek Choi · Feb 16, 2026 · Citations: 0

Tool-using LLM agents increasingly coordinate real workloads by selecting and chaining third-party tools based on text-visible metadata such as tool names, descriptions, and return messages.
A Geometric Analysis of Small-sized Language Model Hallucinations
Emanuele Ricco, Elia Onofri, Lorenzo Cima, Stefano Cresci, Roberto Di Pietro · Feb 16, 2026 · Citations: 0

Long Horizon

Hallucinations -- fluent but factually incorrect responses -- pose a major challenge to the reliability of language models, especially in multi-step or agentic settings.
Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment
Laurène Vaugrante, Anietta Weckauff, Thilo Hagendorff · Feb 16, 2026 · Citations: 0

Our findings show that behavioral self-awareness tracks actual alignment states of models, indicating that models can be queried for informative signals about their own safety.
GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture
Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin · Feb 16, 2026 · Citations: 0
Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation
Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu · Feb 16, 2026 · Citations: 0

Pairwise PreferenceRubric Rating Multi Agent

Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined.
Leveraging Wikidata for Geographically Informed Sociocultural Bias Dataset Creation: Application to Latin America
Yannis Karmim, Renato Pino, Hernan Contreras, Hernan Lira, Sebastian Cifuentes · Feb 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Unlocking Reasoning Capability on Machine Translation in Large Language Models
Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio · Feb 16, 2026 · Citations: 0

Critique Edit Long Horizon

We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models.
Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers
Jonathan Lys, Vincent Gripon, Bastien Pasdeloup, Axel Marmoret, Lukas Mauch · Feb 16, 2026 · Citations: 0

Experiments on multiple benchmarks show that these strategies alleviate the representation misalignment and yield improvements, providing an efficient and general architectural enhancement for autoregressive Transformers.
Cognitive networks reconstruct mindsets about STEM subjects and educational contexts in almost 1000 high-schoolers, University students and LLM-based digital twins
Francesco Gariboldi, Emma Franchino, Edith Haim, Gianluca Lattanzi, Alessandro Grecucci · Feb 16, 2026 · Citations: 0

Human networks show greater overlapping between mathematics and anxiety than GPT-oss.
Rethinking the Role of LLMs in Time Series Forecasting
Xin Qiu, Junlong Tong, Yirong Sun, Yunpu Ma, Wei Zhang · Feb 16, 2026 · Citations: 0

We show that such conclusions stem from limited evaluation settings and do not hold at scale.
LLMStructBench: Benchmarking Large Language Model Structured Data Extraction
Sönke Tenckhoff, Mario Koddenbrock, Erik Rodner · Feb 16, 2026 · Citations: 0

We present LLMStructBench, a novel benchmark for evaluating Large Language Models (LLMs) on extracting structured data and generating valid JavaScript Object Notation (JSON) outputs from natural-language text.
Evolutionary System Prompt Learning for Reinforcement Learning in LLMs
Lunjun Zhang, Ryan Chen, Bradly C. Stadie · Feb 16, 2026 · Citations: 0

Building agentic systems that can autonomously self-improve from experience is a longstanding goal of AI.
Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks
Lukas Struppek, Adam Gleave, Kellin Pelrine · Feb 16, 2026 · Citations: 0

Red Team

We present the largest empirical study to date of prefill attacks, evaluating over 20 existing and novel strategies across multiple model families and state-of-the-art open-weight models.
Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography
Gianluca Vico, Jindřich Libovický · Feb 16, 2026 · Citations: 0

We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation.
Breaking Data Efficiency Dilemma: A Federated and Augmented Learning Framework For Alzheimer's Disease Detection via Speech
Xiao Wei, Bin Wen, Yuqin Lin, Kai Li, Mingyang gu · Feb 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Is Information Density Uniform when Utterances are Grounded on Perception and Discourse?
Matteo Gay, Coleman Haley, Mario Giulianelli, Edoardo Ponti · Feb 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
GradMAP: Faster Layer Pruning with Gradient Metric and Projection Compensation
Hao Liu, Guangyan Li, Wensheng Zhang, Yongqiang Tang · Feb 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Beyond the Prompt in Large Language Models: Comprehension, In-Context Learning, and Chain-of-Thought
Yuling Jiao, Yanming Lai, Huazhen Lin, Wensen Ma, Houduo Qi · Feb 16, 2026 · Citations: 0

Long Horizon

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Alignment Adapter to Improve the Performance of Compressed Deep Learning Models
Rohit Raj Rai, Abhishek Dhaka, Amit Awekar · Feb 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
The Wikidata Query Logs Dataset
Sebastian Walter, Hannah Bast · Feb 16, 2026 · Citations: 0

To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions.
MATEO: A Multimodal Benchmark for Temporal Reasoning and Planning in LVLMs
Gabriel Roccabruna, Olha Khomyn, Giuseppe Riccardi · Feb 16, 2026 · Citations: 0

AI agents need to plan to achieve complex goals that involve orchestrating perception, sub-goal decomposition, and execution.
Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation
Shefayat E Shams Adib, Ahmed Alfey Sani, Ekramul Alam Esham, Ajwad Abrar, Tareque Mohmud Chowdhury · Feb 16, 2026 · Citations: 0

We are using a zero-shot evaluation methodology and using BLEU and ROUGE metrics to evaluate performance without specialized fine-tuning.
Explainable Token-level Noise Filtering for LLM Fine-tuning Datasets
Yuchen Yang, Wenze Lin, Enhao Huang, Zhixuan Chu, Hongbin Zhou · Feb 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Beyond Translation: Evaluating Mathematical Reasoning Capabilities of LLMs in Sinhala and Tamil
Sukumar Kishanthan, Kumar Thushalika, Buddhi Jayasekara, Asela Hevapathige · Feb 16, 2026 · Citations: 0

These findings challenge the common assumption that models exhibiting strong multilingual performance can reason equally effectively across languages, and highlight the need for fine-grained, type-aware evaluation in multilingual settings.
Query as Anchor: Scenario-Adaptive User Representation via Large Language Model
Jiahao Yuan, Yike Xu, Jinyong Wen, Baokun Wang, Ziyi Gao · Feb 16, 2026 · Citations: 0

Evaluations on 10 Alipay industrial benchmarks show consistent SOTA performance, strong scalability, and efficient deployment.
Parameter-Efficient Fine-Tuning of LLMs with Mixture of Space Experts
Buze Zhang, Jinkai Tao, Zilang Zeng, Neil He, Ali Maatouk · Feb 16, 2026 · Citations: 0

Our experiments across diverse benchmarks demonstrate that MoSLoRA consistently outperforms strong baselines, achieving up to 5.6% improvement on MATH500 and 15.9% on MAWPS.
BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR
Md. Najib Hasan, Mst. Jannatun Ferdous Rain, Fyad Mohammed, Nazmul Siddique · Feb 16, 2026 · Citations: 0

Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators introduces concerns about label reliability, bias, and evaluation validity.
TikArt: Stabilizing Aperture-Guided Fine-Grained Visual Reasoning with Reinforcement Learning
Hao Ding, Zhichuan Yang, Weijie Ge, Ziqin Gao, Chaoyi Lu · Feb 16, 2026 · Citations: 0
HyperRAG: Reasoning N-ary Facts over Hypergraphs for Retrieval Augmented Generation
Wen-Sheng Lien, Yu-Kai Chan, Hao-Lung Hsiao, Bo-Kai Ruan, Meng-Fen Chiang · Feb 16, 2026 · Citations: 0

Extensive evaluations on WikiTopics (11 closed-domain datasets) and three open-domain QA benchmarks (HotpotQA, MuSiQue, and 2WikiMultiHopQA) validate HyperRAG's effectiveness.
Measuring and Mitigating Post-hoc Rationalization in Reverse Chain-of-Thought Generation
Guangyue Peng, Zongchao Chen, Wen Luo, Yuntao Wen, Wei Li · Feb 16, 2026 · Citations: 0

Experiments across open-ended reasoning benchmarks demonstrate that SSR-D achieves up to 10% improvement over suppression baselines while preserving out-of-distribution (OOD) generalization.
Robust Bias Evaluation with FilBBQ: A Filipino Bias Benchmark for Question-Answering Language Models
Lance Calvin Lim Gamboa, Yue Feng, Mark Lee · Feb 16, 2026 · Citations: 0

With natural language generation becoming a popular use case for language models, the Bias Benchmark for Question-Answering (BBQ) has grown to be an important benchmark format for evaluating stereotypical associations exhibited by…
Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report v1.5
Dongrui Liu, Yi Yu, Jie Zhang, Guanxu Chen, Qihao Lin · Feb 16, 2026 · Citations: 0

As Large Language Models (LLMs) general capabilities rapidly evolve and the proliferation of agentic AI, this version of the risk analysis technical report presents an updated and granular assessment of five critical dimensions: cyber…
Precedent-Informed Reasoning: Mitigating Overthinking in Large Reasoning Models via Test-Time Precedent Learning
Qianyue Wang, Jinwu Hu, Huanxiang Lin, Bolin Chen, Zhiquan Wen · Feb 16, 2026 · Citations: 0

Inspired by human reasoning patterns where people solve new problems by leveraging past related cases to constrain search spaces and reduce trial-and-error, we propose Precedent Informed Reasoning (PIR) transforming LRMs'reasoning paradigm…
Selective Synchronization Attention
Hasi Hays · Feb 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Synthetic Reader Panels: Tournament-Based Ideation with LLM Personas for Autonomous Publishing
Fred Zimmerman · Feb 16, 2026 · Citations: 0

Pairwise Preference

We present a system for autonomous book ideation that replaces human focus groups with synthetic reader panels -- diverse collections of LLM-instantiated reader personas that evaluate book concepts through structured tournament…
LLM-Guided Knowledge Distillation for Temporal Knowledge Graph Reasoning
Wang Xing, Wei Song, Siyu Lin, Chen Wu, Man Wang · Feb 16, 2026 · Citations: 0

Extensive experiments on multiple public TKG benchmarks with diverse backbone architectures demonstrate that the proposed approach consistently improves link prediction performance over strong distillation baselines, while maintaining a…
WavePhaseNet: A DFT-Based Method for Constructing Semantic Conceptual Hierarchy Structures (SCHS)
Kiyotaka Kasubuchi, Kazuo Fukiya · Feb 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Feature Recalibration Based Olfactory-Visual Multimodal Model for Enhanced Rice Deterioration Detection
Rongqiang Zhao, Hengrui Hu, Yijing Wang, Mingchun Sun, Jie Liu · Feb 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
TruthStance: An Annotated Dataset of Conversations on Truth Social
Fathima Ameen, Danielle Brown, Manusha Malgareddy, Amanul Haque · Feb 16, 2026 · Citations: 0

We provide a human-annotated benchmark of 1,500 instances across argument mining and claim-based stance detection, including inter-annotator agreement, and use it to evaluate large language model (LLM) prompting strategies.

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote