
Tag: Coding

Involves software engineering or code-quality expertise.

Papers in tag: 310

Research Utility Snapshot

Evaluation Modes

  • Automatic Metrics (13)
  • Simulation Env (8)
  • Human Eval (1)

Human Feedback Types

  • Pairwise Preference (3)
  • Rubric Rating (3)
  • Critique Edit (2)

Required Expertise

  • Coding (20)
  • Math (2)
  • Multilingual (1)
The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems

Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He, Feijie Wu · Feb 17, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Coding
  • Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and information…
  • By introducing a Universal Visual Codec, we map heterogeneous reasoning traces into a shared continuous latent space and inject them directly into the receiver's visual pathway, effectively treating the vision encoder as a universal port for…
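The codec idea described above can be sketched in a few lines. Everything here is illustrative: the dimensions, the variable names, and the use of a plain linear map standing in for a learned codec are assumptions, not the paper's actual architecture.

```python
import numpy as np

# Hypothetical sketch of a "universal visual codec": project a sender
# model's hidden states into the patch-embedding space of a receiver's
# vision encoder, so reasoning is exchanged as continuous latents
# rather than decoded text. All shapes and names are invented.

rng = np.random.default_rng(0)

SENDER_DIM = 512   # sender LLM hidden size (assumed)
VISION_DIM = 768   # receiver vision-encoder embedding size (assumed)

# A learned linear codec; random weights stand in for training here.
W_codec = rng.normal(scale=0.02, size=(SENDER_DIM, VISION_DIM))

def encode_trace(hidden_states: np.ndarray) -> np.ndarray:
    """Map a (T, SENDER_DIM) reasoning trace to (T, VISION_DIM) pseudo-patches."""
    return hidden_states @ W_codec

trace = rng.normal(size=(16, SENDER_DIM))   # 16 reasoning steps
visual_tokens = encode_trace(trace)         # fed to the receiver's vision pathway
print(visual_tokens.shape)                  # (16, 768)
```

The point of the sketch is only the data flow: the receiver never sees text, just continuous tokens shaped like image patches.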
FrameRef: A Framing Dataset and Simulation Testbed for Modeling Bounded Rational Information Health

Victor De Lima, Jiqun Liu, Grace Hui Yang · Feb 17, 2026 · Citations: 0

Human Eval Simulation Env Coding
  • Within this framework, we construct framing-sensitive agent personas by fine-tuning language models with framing-conditioned loss attenuation, inducing targeted biases while preserving overall task competence.
  • Human evaluation further confirms that FrameRef's generated framings measurably affect human judgment.
OR-Agent: Bridging Evolutionary Search and Structured Research for Automated Algorithm Discovery

Qi Liu, Ruochen Hao, Can Li, Wanjing Ma · Feb 14, 2026 · Citations: 0

Simulation Env Coding
  • We present OR-Agent, a configurable multi-agent research framework designed for automated exploration in rich experimental environments.
  • OR-Agent organizes research as a structured tree-based workflow that explicitly models branching hypothesis generation and systematic backtracking, enabling controlled management of research trajectories beyond simple mutation-crossover loops.
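A tree-based workflow with explicit branching and backtracking, as described above, can be sketched minimally. The node fields, hypothesis strings, and scoring rule below are invented for illustration; they are not OR-Agent's actual interfaces.

```python
# Minimal sketch of a tree-structured research workflow: each node is a
# hypothesis, branching creates children, and backtracking returns to
# the best-scoring ancestor. All names and scores are illustrative.

class Node:
    def __init__(self, hypothesis, parent=None):
        self.hypothesis = hypothesis
        self.parent = parent
        self.children = []
        self.score = None   # filled in after an experiment runs

    def branch(self, hypothesis):
        child = Node(hypothesis, parent=self)
        self.children.append(child)
        return child

def backtrack_to_best(node):
    """Walk back toward the root, returning the highest-scoring node on the path."""
    best = node
    while node is not None:
        if node.score is not None and (best.score is None or node.score > best.score):
            best = node
        node = node.parent
    return best

root = Node("baseline heuristic"); root.score = 0.4
a = root.branch("add local search"); a.score = 0.7
b = a.branch("over-specialized variant"); b.score = 0.3
print(backtrack_to_best(b).hypothesis)  # "add local search"
```

The backtracking step is what distinguishes this from a flat mutation loop: a failed branch does not discard the trajectory, it returns control to the strongest ancestor.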
Small Reward Models via Backward Inference

Yike Wang, Faeze Brahman, Shangbin Feng, Teng Xiao, Hannaneh Hajishirzi, Yulia Tsvetkov · Feb 14, 2026 · Citations: 0

Rubric Rating Automatic Metrics Coding
  • However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require reference responses or explicit rubrics, limiting flexibility and broader accessibility.
  • Evaluations across four domains using 13 small language models show that FLIP outperforms LLM-as-a-Judge baselines by an average of 79.6%.
Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception

Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li · Feb 12, 2026 · Citations: 0

Automatic Metrics Coding
  • To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM.
  • To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA data spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global–regional "zoom…"
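The "training-time primitive" framing of zooming suggests a standard distillation setup: a teacher pass over a zoomed crop supervises the student's single full-image pass. The sketch below uses random logits and the textbook KL distillation loss; it is a hedged illustration of the general technique, not the paper's actual objective.

```python
import numpy as np

# Sketch of region-to-image distillation: logits from a zoomed-region
# "teacher" pass supervise a single full-image "student" pass via the
# standard KL distillation loss. Logits here are random stand-ins.

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

teacher_logits = rng.normal(size=10)   # from the zoomed region crop
student_logits = rng.normal(size=10)   # from one full-image forward pass

p, q = softmax(teacher_logits), softmax(student_logits)
kl = float(np.sum(p * (np.log(p) - np.log(q))))   # distillation loss
print(f"KL(teacher || student) = {kl:.4f}")
```

Minimizing this loss over many (image, region) pairs is what would internalize zooming into a single forward pass.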
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao, Bo Dong · Feb 11, 2026 · Citations: 0

Pairwise Preference Simulation Env Math Coding
  • We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency.
  • We focus on what matters most when building agents: sharp reasoning and fast, reliable execution.
UI-Venus-1.5 Technical Report

Venus Team, Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu, Shuheng Shen · Feb 9, 2026 · Citations: 0

Simulation Env Coding
  • GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging.
  • In this report, we present UI-Venus-1.5, a unified, end-to-end GUI Agent designed for robust real-world applications.
Document Reconstruction Unlocks Scalable Long-Context RLVR

Yao Xiao, Lei Wang, Yue Deng, Guanzheng Chen, Ziqi Jin, Jung-jae Kim · Feb 9, 2026 · Citations: 0

Rubric Rating Automatic Metrics Coding
  • However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming.
  • In this work, we investigate unsupervised approaches to enhance the long-context capabilities of LLMs, eliminating the need for heavy human annotations or teacher models' supervision.
VILLAIN at AVerImaTeC: Verifying Image-Text Claims via Multi-Agent Collaboration

Jaeyoon Jung, Yejun Yoon, Kunwoo Park · Feb 4, 2026 · Citations: 0

Automatic Metrics Coding
  • This paper describes VILLAIN, a multimodal fact-checking system that verifies image-text claims through prompt-based multi-agent collaboration.
  • For the AVerImaTeC shared task, VILLAIN employs vision-language model agents across multiple stages of fact-checking.
SWE-Master: Unleashing the Potential of Software Engineering Agents via Post-Training

Huatong Song, Lisheng Huang, Shuang Sun, Jinhao Jiang, Ran Le, Daixuan Cheng · Feb 3, 2026 · Citations: 0

Simulation Env Coding
  • In this technical report, we present SWE-Master, an open-source and fully reproducible post-training framework for building effective software engineering agents.
  • SWE-Master systematically explores the complete agent development pipeline, including teacher-trajectory synthesis and data curation, long-horizon SFT, RL with real execution feedback, and inference framework design.
Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026 · Citations: 0

Simulation Env Coding
  • While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning.
  • Unlike open-ended text generation, embodied agents must decompose high-level intent into actionable sub-goals while strictly adhering to the logic of a dynamic, observed environment.
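"Strictly adhering to the logic of a dynamic, observed environment" can be made concrete with a precondition check: a scene graph rules out proposed actions whose preconditions do not hold. The graph contents, action tuples, and the single precondition rule below are all invented for illustration.

```python
# Sketch of graph-informed action filtering for an embodied planner:
# an LLM proposes sub-goal actions, and a scene graph of the observed
# environment vetoes those whose preconditions fail. All entries are
# illustrative, not from the paper.

scene_graph = {
    "cup":    {"on": "table",   "state": "empty"},
    "kettle": {"on": "counter", "state": "full"},
}

def valid(action):
    verb, obj = action
    if obj not in scene_graph:
        return False                                  # object not observed
    if verb == "pour" and scene_graph[obj]["state"] != "full":
        return False                                  # nothing to pour
    return True

proposed = [("pick_up", "cup"), ("pour", "cup"), ("pour", "kettle")]
print([a for a in proposed if valid(a)])
# [('pick_up', 'cup'), ('pour', 'kettle')]
```

The planner stays free-form; the graph supplies the hard constraints that open-ended text generation lacks.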
Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization

Jingyi Xu, Xingyu Ren, Zhoupeng Shou, Yumeng Zhang, Zhiqiang You · Jan 24, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Coding
  • Large language models show potential in task-oriented dialogue systems, yet existing training methods often rely on token-level likelihood or preference optimization, which poorly align with long-horizon task success.
  • To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent.
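The decoupling described above, where one agent picks a strategy and another renders it into a reply, can be sketched with rule-based stand-ins. In GOPO both roles would be learned models optimized for task-level goals; the strategy set, state fields, and templates here are invented.

```python
# Illustrative sketch of decoupling dialogue strategy from execution:
# a planner ("Expert Agent" stand-in) maps state to a high-level action,
# and an executor ("Customer Service Agent" stand-in) renders it as text.
# Rules and templates are invented; in practice both would be learned.

STRATEGIES = ["ask_clarifying_question", "propose_solution", "escalate"]

def plan(dialogue_state: dict) -> str:
    """Choose a high-level strategy from the dialogue state."""
    if not dialogue_state.get("issue_identified"):
        return "ask_clarifying_question"
    return "propose_solution" if dialogue_state.get("solvable") else "escalate"

def execute(strategy: str) -> str:
    """Render the chosen strategy into a concrete reply."""
    templates = {
        "ask_clarifying_question": "Could you tell me more about the problem?",
        "propose_solution": "Here is a fix you can try:",
        "escalate": "I'm transferring you to a specialist.",
    }
    return templates[strategy]

state = {"issue_identified": True, "solvable": True}
print(execute(plan(state)))  # "Here is a fix you can try:"
```

Because the strategy is an explicit intermediate variable, it can be optimized against long-horizon task success independently of surface wording, which is the point of the decoupling.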
Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models

Haorui Yu, Xuehang Wen, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026 · Citations: 0

Rubric Rating Critique Edit Automatic Metrics Coding
  • Existing benchmarks assess perception without interpretation, and common evaluation proxies, such as automated metrics and LLM-judge averaging, are unreliable for culturally sensitive generative tasks.
  • We address this measurement gap with a tri-tier evaluation framework grounded in art-theoretical constructs (Section 2).
CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics

Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri, Trizal Garg · Dec 26, 2025 · Citations: 0

Expert Verification Automatic Metrics Coding Multilingual
  • To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
  • Recognizing linguistic diversity, we construct the benchmark in both English and Hindi, establishing a framework that is open for further extension to other regional languages.
Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces

Shreyas Rajesh, Pavan Holur, Chenda Duan, David Chong, Vwani Roychowdhury · Nov 10, 2025 · Citations: 0

Automatic Metrics Coding
  • On the Episodic Memory Benchmark (EpBench; Huet et al., 2025), comprising corpora ranging from 100k to 1M tokens in length, GSW outperforms existing RAG-based baselines by up to 20%.
  • More broadly, GSW offers a concrete blueprint for endowing LLMs with human-like episodic memory, paving the way for more capable agents that can reason over long horizons.