HFEPX Archive Slice

HFEPX Weekly Archive: 2026-W06

Updated from current HFEPX corpus (Apr 12, 2026). 92 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 92 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Adjudication. Frequently cited benchmark: Chemcotbench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 8, 2026.

Papers: 92 Last published: Feb 8, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

Analysis blocks are computed from the loaded sample (60 of 92 papers).

High-Signal Coverage

100.0%

60 / 60 papers are not low-signal flagged.

Benchmark Anchors

23.3%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

36.7%

Papers with reported metric mentions in extraction output.

0 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.

Get this digest every Friday →

Why This Time Slice Matters

6.5% of papers report explicit human-feedback signals, led by expert verification.
automatic metrics appears in 31.5% of papers in this hub.
Chemcotbench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is adjudication (1.1% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
Feb 8, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Latency
How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors?
Feb 6, 2026 · Citations: 0 · Score: 6.0

Eval: Simulation Env · Metrics: Coherence
On Randomness in Agentic Evals
Feb 6, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Pass@k, Pass@1
Fine-Tuning and Evaluating Conversational AI for Agricultural Advisory
Feb 6, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy, F1
LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning
Feb 6, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Win rate, Task success
EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization
Feb 5, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Mse

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering Feb 8, 2026	Automatic Metrics	MLE Bench	Latency	Not reported
How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors? Feb 6, 2026	Simulation Env	Sp Abcbench	Coherence	Not reported
On Randomness in Agentic Evals Feb 6, 2026	Automatic Metrics	SWE Bench, SWE Bench Verified	Pass@k, Pass@1	Not reported
Fine-Tuning and Evaluating Conversational AI for Agricultural Advisory Feb 6, 2026	Automatic Metrics	Dg Eval	Accuracy, F1	Not reported
LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning Feb 6, 2026	Automatic Metrics	Chemcotbench	Win rate, Task success	Not reported
EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization Feb 5, 2026	Automatic Metrics	AIME, Olympiadbench	Mse	Not reported
SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild? Feb 3, 2026	Automatic Metrics	DROP	Accuracy	Not reported
Measuring Complexity at the Requirements Stage: Spectral Metrics as Development Effort Predictors Feb 6, 2026	Automatic Metrics	Not reported	Cost	Not reported
PACIFIC: Can LLMs Discern the Traits Influencing Your Preferences? Evaluating Personality-Driven Preference Alignment in LLMs Feb 6, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness Feb 4, 2026	Llm As Judge	Reliablebench	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (6.5% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (1.1% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (8.7% vs 35% target).
Moderate: Papers naming evaluation metrics

Coverage is usable but incomplete (23.9% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (6.5% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (7.6% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 1.1% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (6.5% coverage).
Annotation unit is under-specified (7.6% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (Chemcotbench vs DROP) before comparing methods.
Track metric sensitivity by reporting both accuracy and agreement.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries

LLM-as-Judge Protocols Benchmark Slice: Chemcotbench Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 1.1% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (6.5% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (29)
Simulation Env (3)
Llm As Judge (1)

Top Metrics

Accuracy (10)
Agreement (3)
Cost (3)
Relevance (3)

Top Benchmarks

Chemcotbench (1)
DROP (1)
HellaSwag (1)
LongBench (1)

Quality Controls

Adjudication (1)

Papers In This Archive Slice

Transforming Science Learning Materials in the Era of Artificial Intelligence
Xiaoming Zhai, Kent Crippen · Feb 8, 2026 · Citations: 0

However, these innovations also raise critical ethical and pedagogical concerns, including issues of algorithmic bias, data privacy, transparency, and the need for human oversight.
The Landscape of AI in Science Education: What is Changing and How to Respond
Xiaoming Zhai, Kent Crippen · Feb 8, 2026 · Citations: 0

At the same time, this chapter examines the ethical, social, and pedagogical challenges that arise, particularly issues of fairness, transparency, accountability, privacy, and human oversight.
IV Co-Scientist: Multi-Agent LLM Framework for Causal Instrumental Variable Discovery
Ivaxi Sheth, Zhijing Jin, Bryan Wilder, Dominik Janzing, Mario Fritz · Feb 8, 2026 · Citations: 0
AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Siheng Chen · Feb 8, 2026 · Citations: 0

Long Horizon

Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons.
Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents
Zhi Chen, Zhensu Sun, Yuling Shi, Chao Peng, Xiaodong Gu · Feb 8, 2026 · Citations: 0
LLMs Know More About Numbers than They Can Say
Fengting Yuchi, Li Du, Jason Eisner · Feb 8, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis
Jiale Qian, Hao Meng, Tian Zheng, Pengcheng Zhu, Haopeng Lin · Feb 8, 2026 · Citations: 0

Furthermore, to enable reliable evaluation of zero-shot SVS performance in practical scenarios, we construct SoulX-Singer-Eval, a dedicated benchmark with strict training-test disentanglement, facilitating systematic assessment in zero-shot…
VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos
Wenqi Liu, Yunxiao Wang, Shijie Ma, Meng Liu, Qile Su · Feb 8, 2026 · Citations: 0
PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification
Qiuming Luo, Yuebing Li, Feng Li, Chang Kong · Feb 8, 2026 · Citations: 0
Do We Need Adam? Surprisingly Strong and Sparse Reinforcement Learning with SGD in LLMs
Sagnik Mukherjee, Lifan Yuan, Pavan Jayasinha, Dilek Hakkani-Tür, Hao Peng · Feb 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
AngelSlim: A more accessible, comprehensive, and efficient toolkit for large model compression
Rui Cen, QiangQiang Hu, Hong Huang, Hong Liu, Song Liu · Feb 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
TernaryLM: Memory-Efficient Language Modeling via Native 1.5-Bit Quantization with Adaptive Layer-wise Scaling
Nisharg Nargund, Priyesh Shukla · Feb 7, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Beyond Accuracy: Risk-Sensitive Evaluation of Hallucinated Medical Advice
Savan Doshi · Feb 7, 2026 · Citations: 0
KRONE: Hierarchical and Modular Log Anomaly Detection
Lei Ma, Jinyang Liu, Tieying Zhang, Peter M. VanNostrand, Dennis M. Hofmann · Feb 7, 2026 · Citations: 0
How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors?
Yuxuan Li, Leyang Li, Hao-Ping Lee, Sauvik Das · Feb 6, 2026 · Citations: 0

A growing body of research assumes that large language model (LLM) agents can serve as proxies for how people form attitudes toward and behave in response to security and privacy (S&P) threats.
Measuring Complexity at the Requirements Stage: Spectral Metrics as Development Effort Predictors
Maximilian Vierlboeck, Antonio Pugliese, Roshanak Rose Nilchian, Paul T. Grogan, Rashika Sugganahalli Natesh Babu · Feb 6, 2026 · Citations: 0

Expert Verification

Complexity in engineered systems presents one of the most persistent challenges in modern development since it is driving cost overruns, schedule delays, and outright project failures.
PACIFIC: Can LLMs Discern the Traits Influencing Your Preferences? Evaluating Personality-Driven Preference Alignment in LLMs
Tianyu Zhao, Siqi Li, Yasser Shoukry, Salma Elmalaki · Feb 6, 2026 · Citations: 0

Pairwise Preference

Based on these findings, we introduce PACIFIC (Preference Alignment Choices Inference for Five-factor Identity Characterization), a personality-labeled preference dataset containing 1200 preference statements spanning diverse domains (e.g.,…
On Randomness in Agentic Evals
Bjarni Haukur Bjarnason, André Silva, Martin Monperrus · Feb 6, 2026 · Citations: 0

Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks.
Daily and Weekly Periodicity in Large Language Model Performance and Its Implications for Research
Paul Tschisgale, Peter Wulff · Feb 6, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
UnWeaving the knots of GraphRAG -- turns out VectorRAG is almost enough
Ryszard Tuora, Mateusz Galiński, Michał Godziszewski, Michał Karpowicz, Mateusz Czyżnikiewicz · Feb 6, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Personality as Relational Infrastructure: User Perceptions of Personality-Trait-Infused LLM Messaging
Dominik P. Hofer, David Haag, Rania Islambouli, Jan D. Smeddinck · Feb 6, 2026 · Citations: 0
Fine-Tuning and Evaluating Conversational AI for Agricultural Advisory
Sanyam Singh, Naga Ganesh, Vineet Singh, Lakshmi Pedapudi, Ritesh Kumar · Feb 6, 2026 · Citations: 0

We present a hybrid LLM architecture that decouples factual retrieval from conversational delivery: supervised fine-tuning with LoRA on expert-curated GOLDEN FACTS (atomic, verified units of agricultural knowledge) optimizes fact recall,…
Malicious Agent Skills in the Wild: A Large-Scale Security Empirical Study
Yi Liu, Zhihao Chen, Yanjun Zhang, Gelei Deng, Yuekang Li · Feb 6, 2026 · Citations: 0

Third-party agent skills extend LLM-based agents with instruction files and executable code that run on users' machines.
LogicSkills: A Structured Benchmark for Formal Reasoning in Large Language Models
Brian Rabern, Philipp Mondorf, Barbara Plank · Feb 6, 2026 · Citations: 0

Large language models perform well on many logical reasoning benchmarks, but it remains unclear which core logical skills they truly master.
From Conflict to Consensus: Boosting Medical Reasoning via Multi-Round Agentic RAG
Wenhao Wu, Zhentao Tang, Yafu Li, Shixiong Kai, Mingxuan Yuan · Feb 6, 2026 · Citations: 0

In the paper, we propose MA-RAG (Multi-Round Agentic RAG), a framework that facilitates test-time scaling for complex medical reasoning by iteratively evolving both external evidence and internal reasoning history within an agentic…
Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding
Daisuke Oba, Danushka Bollegala, Masahiro Kaneko, Naoaki Okazaki · Feb 6, 2026 · Citations: 0
LLM-Enhanced Rumor Detection via Virtual Node Induced Edge Prediction
Jiran Tao, Cheng Wang, Binyan Jiang · Feb 6, 2026 · Citations: 0
CALM: Class-Conditional Sparse Attention Vectors for Large Audio-Language Models
Videet Mehta, Liming Wang, Hilde Kuehne, Rogerio Feris, James R. Glass · Feb 6, 2026 · Citations: 0

Experiments on multiple few-shot audio and audiovisual classification benchmarks and tasks demonstrate that our method consistently outperforms state-of-the-art uniform voting-based approaches by up to 14.52%, 1.53%, 8.35% absolute gains…
LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning
Xinwu Ye, Yicheng Mao, Jia Zhang, Yimeng Liu, Li Hao · Feb 6, 2026 · Citations: 0

Long Horizon

Across diverse chemical reasoning benchmarks, LatentChem achieves a 59.88\% non-tie win rate over strong CoT-based baselines on ChemCoTBench, while delivering a 10.84\times average reduction in reasoning overhead.
RoPE-LIME: RoPE-Space Locality + Sparse-K Sampling for Efficient LLM Attribution
Isaac Picov, Ritesh Goru · Feb 6, 2026 · Citations: 0

Tool Use

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Protean Compiler: An Agile Framework to Drive Fine-grain Phase Ordering
Amir H. Ashouri, Shayan Shirahmad Gale Bagi, Kavin Satheeskumar, Tejas Srikanth, Jonathan Zhao · Feb 5, 2026 · Citations: 0

Traditionally, such locally optimized decisions are made by hand-coded algorithms tuned for a small number of benchmarks, often requiring significant effort to be retuned when the benchmark suite changes.
Self-Improving World Modelling with Latent Actions
Yifu Qiu, Zheng Zhao, Waylon Li, Yftah Ziser, Anna Korhonen · Feb 5, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs
Lizhuo Luo, Shenggui Li, Yonggang Wen, Tianwei Zhang · Feb 5, 2026 · Citations: 0

Extensive experiments across multiple models and benchmarks demonstrate that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs.
Rewards as Labels: Revisiting RLVR from a Classification Perspective
Zepeng Zhai, Meilin Chen, Jiaxuan Zhao, Junlang Qian, Lei Shen · Feb 5, 2026 · Citations: 0

Extensive experiments on mathematical reasoning benchmarks show that REAL improves training stability and consistently outperforms GRPO and strong variants such as DAPO.
Transport and Merge: Cross-Architecture Merging for Large Language Models
Chenhang Cui, Binyun Yang, Fei Shen, Yuxin Chen, Jingnan Zheng · Feb 5, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
LLM-driven Multimodal Recommendation
Yicheng Di · Feb 5, 2026 · Citations: 0
Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions
Yuntai Bao, Xuhong Zhang, Jintao Chen, Ge Su, Yuxiang Cai · Feb 5, 2026 · Citations: 0

Pairwise Preference

We hypothesize that this is because effective steering requires the faithful identification of internal model mechanisms, not the enforcement of external preferences.
Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions
Jinchuan Tian, Haoran Wang, Bo-Hao Su, Chien-yu Huang, Qingzheng Wang · Feb 5, 2026 · Citations: 0

In contrast, human intelligence processes audio holistically, seamlessly bridging physical signals with abstract cognitive concepts to execute complex tasks.
The Single-Multi Evolution Loop for Self-Improving Model Collaboration Systems
Shangbin Feng, Kishan Panaganti, Yulia Tsvetkov, Wenhao Yu · Feb 5, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization
Kevin Han, Yuhang Zhou, Mingze Gao, Gedi Zhou, Serena Li · Feb 5, 2026 · Citations: 0

Empirically, EBPO consistently outperforms GRPO and other established baselines across diverse benchmarks, including AIME and OlympiadBench.
GreekMMLU: A Native-Sourced Multitask Benchmark for Evaluating Language Models in Greek
Yang Zhang, Mersin Konomi, Christos Xypolopoulos, Konstantinos Divriotis, Konstantinos Skianis · Feb 5, 2026 · Citations: 0

Large Language Models (LLMs) are commonly trained on multilingual corpora that include Greek, yet reliable evaluation benchmarks for Greek-particularly those based on authentic, native-sourced content-remain limited.
Cross-talk based multi-task learning for fault classification of machine system influenced by multiple variables
Wonjun Yi, Rismaya Kumar Mishra, Yong-Hwa Park · Feb 5, 2026 · Citations: 0

We build on our previously introduced residual neural dimension reductor model, and extend its application to two benchmarks where system influenced by multiple variables.
SPARE: Self-distillation for PARameter-Efficient Removal
Natnael Mola, Leonardo S. B. Pereira, Carolina R. Kelsch, Luis H. Arribas, Juan C. S. M. Avedillo · Feb 4, 2026 · Citations: 0
CoT is Not the Chain of Truth: An Empirical Internal Analysis of Reasoning LLMs for Fake News Generation
Zhao Tong, Chunlin Gong, Yiping Zhang, Haichao Shi, Qiang Liu · Feb 4, 2026 · Citations: 0

From generating headlines to fabricating news, the Large Language Models (LLMs) are typically assessed by their final outputs, under the safety assumption that a refusal response signifies safe reasoning throughout the entire process.
When Silence Is Golden: Can LLMs Learn to Abstain in Temporal QA and Beyond?
Xinyu Zhou, Chang Jin, Carsten Eickhoff, Zhijiang Guo, Seyed Ali Bahrainian · Feb 4, 2026 · Citations: 0
Investigating Disability Representations in Text-to-Image Models
Yang Tian, Yu Fan, Liudmila Zavolokina, Sarah Ebling · Feb 4, 2026 · Citations: 0
A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness
Leo Schwinn, Moritz Ladenburger, Tim Beyer, Mehrnaz Mofakhami, Gauthier Gidel · Feb 4, 2026 · Citations: 0

Red Team

Automated LLM-as-a-Judge frameworks have become the de facto standard for scalable evaluation across natural language processing.
WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement Learning
Zelai Xu, Zhexuan Xu, Ruize Zhang, Chunyang Zhu, Shi Yu · Feb 4, 2026 · Citations: 0

Tool Use

To bridge this gap, we propose WideSeek-R1, a lead-agent-subagent framework trained via multi-agent reinforcement learning (MARL) to synergize scalable orchestration and parallel execution.
VILLAIN at AVerImaTeC: Verifying Image-Text Claims via Multi-Agent Collaboration
Jaeyoon Jung, Yejun Yoon, Kunwoo Park · Feb 4, 2026 · Citations: 0

Multi Agent

This paper describes VILLAIN, a multimodal fact-checking system that verifies image-text claims through prompt-based multi-agent collaboration.
Model-Dowser: Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models
Hyeontaek Hwang, Nguyen Dinh Son, Daeyoung Kim · Feb 4, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Forecasting Future Language: Context Design for Mention Markets
Sumin Kim, Jihoon Kwon, Yoon Kim, Nicole Kagan, Raffi Khatchadourian · Feb 4, 2026 · Citations: 0
Contextual Drag: How Errors in the Context Affect LLM Reasoning
Yun Cheng, Xingyu Zhu, Haoyu Zhao, Sanjeev Arora · Feb 4, 2026 · Citations: 0

Across evaluations of 11 proprietary and open-weight models on 8 reasoning tasks, contextual drag induces 10-20% performance drops, and iterative self-refinement in models with severe contextual drag can collapse into self-deterioration.
Expert Selections In MoE Models Reveal (Almost) As Much As Text
Amir Nuriyev, Gabriel Kulp · Feb 4, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Accelerating Scientific Research with Gemini: Case Studies and Common Techniques
David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo · Feb 3, 2026 · Citations: 0

Based on these experiences, we extract common techniques for effective human-AI collaboration in theoretical research, such as iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer.
Simulating Meaning, Nevermore! Introducing ICR: A Semiotic-Hermeneutic Metric for Evaluating Meaning in LLM Text Summaries
Natalie Perez, Sreyoshi Bhaduri, Aman Chadha · Feb 3, 2026 · Citations: 0

Meaning in human language is relational, context dependent, and emergent, arising from dynamic systems of signs rather than fixed word-concept mappings.
SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar · Feb 3, 2026 · Citations: 0

Web Browsing

To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts.
OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering
Yifan Zhu, Xinyu Mu, Tao Feng, Zhonghong Ou, Yuning Gong · Feb 3, 2026 · Citations: 0

Tool Use

To address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning.
$V_0$: A Generalist Value Model for Any Policy at State Zero
Yi-Kai Zhang, Zhiyuan Yao, Hongyan Hao, Yueqing Sun, Qi Gu · Feb 3, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
SEAD: Self-Evolving Agent for Multi-Turn Service Dialogue
Yuqin Dai, Ning Gao, Wei Zhang, Jie Wang, Zichen Luo · Feb 3, 2026 · Citations: 0

However, current methods exhibit suboptimal performance in service dialogues, as they rely on noisy, low-quality human conversation data.
SWE-Master: Unleashing the Potential of Software Engineering Agents via Post-Training
Huatong Song, Lisheng Huang, Shuang Sun, Jinhao Jiang, Ran Le · Feb 3, 2026 · Citations: 0

Long Horizon

In this technical report, we present SWE-Master, an open-source and fully reproducible post-training framework for building effective software engineering agents.

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now