HFEPX Archive Slice

HFEPX Daily Archive: 2026-02-24

Updated from current HFEPX corpus (Apr 12, 2026). 121 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 121 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: ALFWorld. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 24, 2026.

Papers: 121 Last published: Feb 24, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

Analysis blocks are computed from the loaded sample (60 of 121 papers).

High-Signal Coverage

100.0%

60 / 60 papers are not low-signal flagged.

Benchmark Anchors

10.0%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

33.3%

Papers with reported metric mentions in extraction output.

1 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.

Get this digest every Friday →

Why This Time Slice Matters

9.1% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 38% of papers in this hub.
ALFWorld is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (0.8% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
Feb 24, 2026 · Citations: 0 · Score: 7.0

Eval: Automatic Metrics · Metrics: Accuracy
Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
Feb 24, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy, Pass@k
SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
Feb 24, 2026 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Cost
An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems
Feb 24, 2026 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Precision
Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG
Feb 24, 2026 · Citations: 0 · Score: 4.5

Eval: Automatic Metrics · Metrics: Success rate, Cost
Causal Decoding for Hallucination-Resistant Multimodal Large Language Models
Feb 24, 2026 · Citations: 0 · Score: 4.5

Eval: Automatic Metrics · Metrics: Faithfulness

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration Feb 24, 2026	Automatic Metrics	MMLU, GSM8K	Accuracy	Calibration
Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning Feb 24, 2026	Automatic Metrics	MATH 500, AIME	Accuracy, Pass@k	Not reported
SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery Feb 24, 2026	Automatic Metrics	Not reported	Cost	Not reported
An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems Feb 24, 2026	Automatic Metrics	Not reported	Precision	Not reported
Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG Feb 24, 2026	Automatic Metrics	Not reported	Success rate, Cost	Not reported
Causal Decoding for Hallucination-Resistant Multimodal Large Language Models Feb 24, 2026	Automatic Metrics	Not reported	Faithfulness	Not reported
The Headless Firm: How AI Reshapes Enterprise Boundaries Feb 24, 2026	Automatic Metrics	Not reported	Throughput, Cost	Not reported
Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages Feb 24, 2026	Automatic Metrics	Not reported	F1, F1 macro	Not reported
Towards Controllable Video Synthesis of Routine and Rare OR Events Feb 24, 2026	Automatic Metrics	Not reported	Recall	Not reported
Towards single-shot coherent imaging via overlap-free ptychography Feb 24, 2026	Automatic Metrics	Not reported	Throughput	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (9.1% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (1.7% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (0.8% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (10.7% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (9.1% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (12.4% vs 35% target).

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 1.7% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (9.1% coverage).
Annotation unit is under-specified (12.4% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (ALFWorld vs WebShop) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.

Recommended Queries

Judge vs Human Agreement Benchmark Slice: ALFWorld Metric Slice: accuracy IAA-Reported Evaluations Recent High-Signal Papers

Known Limitations

Only 1.7% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (9.1% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (46)
Simulation Env (7)
Human Eval (3)
Llm As Judge (3)

Top Metrics

Accuracy (5)
Cost (4)
Coherence (2)
Agreement (1)

Top Benchmarks

ALFWorld (1)
WebShop (1)

Quality Controls

Calibration (1)
Inter Annotator Agreement Reported (1)

Papers In This Archive Slice

Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG
Inderjeet Singh, Vikas Pahuja, Aishvariya Priya Rathina Sabapathy, Chiara Picardi, Amit Giloni · Feb 24, 2026 · Citations: 0

Current stateless defences for multimodal agentic RAG fail to detect adversarial strategies that distribute malicious semantics across retrieval, planning, and generation components.
MINAR: Mechanistic Interpretability for Neural Algorithmic Reasoning
Jesse He, Helen Jenne, Max Vargas, Davis Brown, Gal Mishne · Feb 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Causal Decoding for Hallucination-Resistant Multimodal Large Language Models
Shiwei Tan, Hengyi Wang, Weiyi Qin, Qi Xu, Zhigang Hua · Feb 24, 2026 · Citations: 0

Across captioning and QA benchmarks, our framework substantially lowers object-hallucination rates and achieves state-of-the-art faithfulness without degrading overall output quality.
Provably Safe Generative Sampling with Constricting Barrier Functions
Darshan Gadginmath, Ahmed Allibhoy, Fabio Pasqualetti · Feb 24, 2026 · Citations: 0

Long Horizon

However, a critical gap remains for their deployment in safety-critical domains: the lack of formal guarantees that generated samples will satisfy hard constraints.
On the Structural Non-Preservation of Epistemic Behaviour under Policy Transformation
Alexander Galozy · Feb 24, 2026 · Citations: 0

Reinforcement learning (RL) agents under partial observability often condition actions on internally accumulated information such as memory or inferred latent context.
ECHOSAT: Estimating Canopy Height Over Space And Time
Jan Pauls, Karsten Schrödter, Sven Ligensa, Martin Schwartz, Berkant Turan · Feb 24, 2026 · Citations: 0

Our experimental evaluation shows that our model improves state-of-the-art accuracies in the context of single-year predictions.
Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang · Feb 24, 2026 · Citations: 0

Evaluated on MATH-500 and AIME 2025, ACE composes seamlessly with existing methods and consistently improves the full Pass@k spectrum across all three model families and benchmarks.
The Headless Firm: How AI Reshapes Enterprise Boundaries
Tassilo Klein, Sebastian Wieczorek · Feb 24, 2026 · Citations: 0

Multi Agent

We argue that agentic AI induces a structural change in how coordination costs scale: in prior modular systems, integration cost grew with interaction topology (O(n^2) in the number of components); in protocol-mediated agentic systems, inte
FedVG: Gradient-Guided Aggregation for Enhanced Federated Learning
Alina Devkota, Jacob Thrasher, Donald Adjeroh, Binod Bhattarai, Prashnna K. Gyawali · Feb 24, 2026 · Citations: 0

Extensive experiments on both natural and medical image benchmarking datasets, across diverse model architectures, demonstrate that FedVG consistently improves performance, particularly in highly heterogeneous settings.
MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation
Daniel Tamayo, Iñaki Lacunza, Paula Rivera-Hidalgo, Severino Da Dalt, Javier Aula-Blasco · Feb 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Beyond Subtokens: A Rich Character Embedding for Low-resource and Morphologically Complex Languages
Felix Schneider, Maria Gogolev, Sven Sickert, Joachim Denzler · Feb 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
NeuroNarrator: A Generalist EEG-to-Text Foundation Model for Clinical Interpretation via Spectro-Spatial Grounding and Temporal State-Space Reasoning
Guoan Wang, Shihao Yang, Jun-en Ding, Hao Zhu, Feng Liu · Feb 24, 2026 · Citations: 0

Extensive evaluations across diverse benchmarks and zero-shot transfer tasks highlight NeuroNarrator's capacity to integrate temporal, spectral, and spatial dynamics, positioning it as a foundational framework for time-frequency-aware,…
Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages
Mohammadreza Ghaffarzadeh-Esfahani, Nahid Yousefian, Ebrahim Heidari-Farsani, Ali Akbar Omidvarian, Sepehr Ghahraei · Feb 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
The Mean is the Mirage: Entropy-Adaptive Model Merging under Heterogeneous Domain Shifts in Medical Imaging
Sameer Ambekar, Reza Nasirigerdeh, Peter J. Schuffler, Lina Felsner, Daniel M. Lang · Feb 24, 2026 · Citations: 0

We extensively evaluate our method with state-of-the-art baselines using two backbones across nine medical and natural-domain generalization image classification datasets, showing consistent gains across standard evaluation and challenging…
Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
Charafeddine Mouzouni · Feb 24, 2026 · Citations: 0

We validate across five benchmarks, five models from three families, and both synthetic and real data.
Towards Controllable Video Synthesis of Routine and Rare OR Events
Dominik Schneider, Lalithkumar Seenivasan, Sampath Rapuri, Vishalroshan Anil, Aiza Maksutova · Feb 24, 2026 · Citations: 0

An AI model trained and validated on the generated synthetic data achieved a RECALL of 70.13% in detecting near safety-critical events.
Towards single-shot coherent imaging via overlap-free ptychography
Oliver Hoidn, Aashwin Mishra, Steven Henke, Albert Vong, Matthew Seaberg · Feb 24, 2026 · Citations: 0

On synthetic benchmarks, reconstructions remain accurate at low counts (\sim\!10^4 photons/frame), and overlap-free single-shot reconstruction with an experimental probe reaches amplitude structural similarity (SSIM) 0.904, compared with…
Representation Theorems for Cumulative Propositional Dependence Logics
Juha Kontinen, Arne Meier, Kai Sauerwald · Feb 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives
Dmitrii Pantiukhin, Ivan Kuznetsov, Boris Shapkin, Antonia Anna Jost, Thomas Jung · Feb 24, 2026 · Citations: 0

Long Horizon

Here we present PANGAEA-GPT, a hierarchical multi-agent framework designed for autonomous data discovery and analysis.
Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment
Mengxuan Hu, Vivek V. Datla, Anoop Kumar, Zihan Guan, Sheng Li · Feb 24, 2026 · Citations: 0

Pairwise PreferenceRed Team

Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have improved the safety of large language models (LLMs).
Scaling View Synthesis Transformers
Evan Kim, Hyunwoo Ryu, Thomas W. Mitchel, Vincent Sitzmann · Feb 24, 2026 · Citations: 0

Across several compute levels, we demonstrate that our encoder-decoder architecture, which we call the Scalable View Synthesis Model (SVSM), scales as effectively as decoder-only models, achieves a superior performance-compute Pareto…
Equitable Evaluation via Elicitation
Elbert Du, Cynthia Dwork, Lunjia Hu, Reid McIlroy-Young, Han Shao · Feb 24, 2026 · Citations: 0

To obtain sufficient training data, we train an LLM to act as synthetic humans.
Test-Time Training with KV Binding Is Secretly Linear Attention
Junchen Liu, Sven Elflein, Or Litany, Zan Gojcic, Ruilong Li · Feb 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Multi-Vector Index Compression in Any Modality
Hanxiang Qin, Alexander Martin, Rohan Jha, Chunsheng Zuo, Reno Kriz · Feb 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Aletheia tackles FirstProof autonomously
Tony Feng, Junehyuk Jung, Sang-hyun Kim, Carlo Pagano, Sergei Gukov · Feb 24, 2026 · Citations: 0

We report the performance of Aletheia (Feng et al., 2026b), a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge.
Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu · Feb 24, 2026 · Citations: 0

Long Horizon

Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidate actions…
On Data Engineering for Scaling LLM Terminal Capabilities
Renjie Pi, Grace Lam, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro · Feb 24, 2026 · Citations: 0

Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed.
Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training
Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi · Feb 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
XMorph: Explainable Brain Tumor Analysis Via LLM-Assisted Hybrid Deep Intelligence
Sepehr Salem Ghahfarokhi, M. Moein Esfahani, Raj Sunderraman, Vince Calhoun, Mohammed Alser · Feb 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Efficient Hierarchical Any-Angle Path Planning on Multi-Resolution 3D Grids
Victor Reijgwart, Cesar Cadena, Roland Siegwart, Lionel Ott · Feb 24, 2026 · Citations: 0

Long Horizon

Hierarchical, multi-resolution volumetric mapping approaches are widely used to represent large and complex environments as they can efficiently capture their occupancy and connectivity information.
NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
Ishaan Rawal, Shubh Gupta, Yihan Hu, Wei Zhan · Feb 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data
Samah Fodeh, Linhai Ma, Yan Wang, Srivani Talakokkul, Ganesh Puthiaraju · Feb 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray · Feb 24, 2026 · Citations: 0

Long Horizon

Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning.
CG-DMER: Hybrid Contrastive-Generative Framework for Disentangled Multimodal ECG Representation Learning
Ziwei Niu, Hao Sun, Shujun Bian, Xihong Yang, Lanfen Lin · Feb 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
A Benchmark for Deep Information Synthesis
Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov · Feb 24, 2026 · Citations: 0

Tool Use

To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights.
SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
David Anugraha, Vishakh Padmakumar, Diyi Yang · Feb 24, 2026 · Citations: 0

Expert Verification Multi Agent

Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility.
"Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems
Xinfeng Li, Shenyu Dai, Kelong Zheng, Yue Xiao, Gelei Deng · Feb 24, 2026 · Citations: 0

Expert Verification

Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare.
Cooperative-Competitive Team Play of Real-World Craft Robots
Rui Zhao, Xihui Li, Yizheng Zhang, Yuzhen Liu, Zhong Zhang · Feb 24, 2026 · Citations: 0

Multi Agent

Multi-agent deep Reinforcement Learning (RL) has made significant progress in developing intelligent game-playing agents in recent years.
Attention-Based SINR Estimation in User-Centric Non-Terrestrial Networks
Bruno De Filippo, Alessandro Guidotti, Alessandro Vanelli-Coralli · Feb 24, 2026 · Citations: 0

These results enable the integration of DMHSA-based estimators into scheduling procedures, allowing the evaluation of multiple candidate user groups and the selection of those offering the highest average SINR and capacity.
Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning
Sanket Badhe, Deep Shah · Feb 24, 2026 · Citations: 0

These expressive instructions render the decision-making process transparent, allowing for full human verification of logic, making this approach ideal for regulated industries such as law, finance, and content moderation, as well as…
Probing Graph Neural Network Activation Patterns Through Graph Topology
Floriano Tori, Lorenzo Bini, Marco Sorbi, Stéphane Marchand-Maillet, Vincent Ginis · Feb 24, 2026 · Citations: 0

Pairwise Preference

However, it remains unclear how the topology of a graph interacts with the learned preferences of GNNs.
Beyond the Star Rating: A Scalable Framework for Aspect-Based Sentiment Analysis Using LLMs and Text Classification
Vishal Patil, Shree Vaishnavi Bacha, Revanth Yamani, Yidan Sun, Mayank Kejriwal · Feb 24, 2026 · Citations: 0

Using ChatGPT to analyze sampled restaurant reviews, we identified key aspects of dining experiences and developed sentiment classifiers using human-labeled reviews, which we subsequently applied to 4.7 million reviews collected over 17…
Localized Dynamics-Aware Domain Adaption for Off-Dynamics Offline Reinforcement Learning
Zhangjie Xia, Yu Yang, Pan Xu · Feb 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
The Initial Exploration Problem in Knowledge Graph Exploration
Claire McNamara, Lucy Hederman, Declan O'Sullivan · Feb 24, 2026 · Citations: 0

Drawing on theories from information behaviour and human-computer interaction, including ASK, exploratory search, information foraging, and cognitive load theory, we develop a conceptual framing of the IEP characterised by three…
Motivation is Something You Need
Mehdi Acheli, Walid Gaaloul · Feb 24, 2026 · Citations: 0

Inspired by the interplay of emotions and cognition in the human brain and more specifically the SEEKING motivational state, we design a dual-model framework where a smaller base model is trained continuously, while a larger motivated model…
Tool Building as a Path to "Superintelligence"
David Koplow, Tomer Galanti, Tomaso Poggio · Feb 24, 2026 · Citations: 0

In this work, we design a benchmark to measure γ on logical out-of-distribution inference.
An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems
Anna Martin-Boyle, William Humphreys, Martha Brown, Cara Leckey, Harmanpreet Kaur · Feb 24, 2026 · Citations: 0

Expert Verification

Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in…
VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation
Seongheon Park, Changdae Oh, Hyeong Kyu Choi, Xuefeng Du, Sharon Li · Feb 24, 2026 · Citations: 0

Existing LLM self-evaluation methods rely on a model's ability to estimate the correctness of its own outputs, which can improve deployment reliability; however, they depend heavily on language priors and are therefore ill-suited for…
Position-Aware Sequential Attention for Accurate Next Item Recommendations
Timur Nabiev, Evgeny Frolov · Feb 24, 2026 · Citations: 0

Experiments on standard next-item prediction benchmarks show that our positional kernel attention consistently improves over strong competing baselines.
PaperTrail: A Claim-Evidence Interface for Grounding Provenance in LLM-based Scholarly Q&A
Anna Martin-Boyle, Cara A. C. Leckey, Martha C. Brown, Harmanpreet Kaur · Feb 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
LogicGraph : Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification
Yanrui Wu, Lingling Zhang, Xinyu Zhang, Jiayu Chang, Pengyu Li · Feb 24, 2026 · Citations: 0

Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct proof.
MIP Candy: A Modular PyTorch Framework for Medical Image Processing
Tianhao Fu, Yucheng Chen · Feb 24, 2026 · Citations: 0

MIPCandy provides a complete, modular pipeline spanning data loading, training, inference, and evaluation, allowing researchers to obtain a fully functional process workflow by implementing a single method, build_network, while retaining…
HiSAC: Hierarchical Sparse Activation Compression for Ultra-long Sequence Modeling in Recommenders
Kun Yuan, Junyu Bi, Daixuan Cheng, Changfa Wu, Shuwen Xiao · Feb 24, 2026 · Citations: 0

Pairwise Preference

Modern recommender systems leverage ultra-long user behavior sequences to capture dynamic preferences, but end-to-end modeling is infeasible in production due to latency and memory constraints.
Generative Pseudo-Labeling for Pre-Ranking with LLMs
Junyu Bi, Xinting Niu, Daixuan Cheng, Kun Yuan, Tao Wang · Feb 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Multimodal MRI Report Findings Supervised Brain Lesion Segmentation with Substructures
Yubin Ge, Yongsong Huang, Xiaofeng Liu · Feb 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
Christian Simon, Masato Ishii, Wei-Yao Wang, Koichi Saito, Akio Hayakawa · Feb 24, 2026 · Citations: 0

We show in our experiments that our proposed method could achieve remarkable results on long-video to audio benchmarks, beating prior works in video-to-audio tasks.
CrystaL: Spontaneous Emergence of Visual Latents in MLLMs
Yang Zhang, Danyang Li, Yuxuan Li, Xin Zhang, Tianyu Xie · Feb 24, 2026 · Citations: 0

Extensive experiments on perception-intensive benchmarks demonstrate that CrystaL consistently outperforms state-of-the-art baselines, achieving substantial gains in fine-grained visual understanding while maintaining robust reasoning…
Toward an Agentic Infused Software Ecosystem
Mark Marron · Feb 24, 2026 · Citations: 0

Fully leveraging the capabilities of AI agents in software development requires a rethinking of the software ecosystem itself.
Evaluating Proactive Risk Awareness of Large Language Models
Xuan Luo, Yubin Chen, Zhiyu Hou, Linpu Yu, Geng Tu · Feb 24, 2026 · Citations: 0

As large language models (LLMs) are increasingly embedded in everyday decision-making, their safety responsibilities extend beyond reacting to explicit harmful intent toward anticipating unintended but consequential risks.
Linear Reasoning vs. Proof by Cases: Obstacles for Large Language Models in FOL Problem Solving
Yuliang Ji, Fuchen Shen, Jian Wu, Qiujie Xie, Yue Zhang · Feb 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote