
Tag: Coding

Involves software engineering or code-quality expertise.

Papers in tag: 314

Research Utility Snapshot

Evaluation Modes

  • Automatic Metrics (17)
  • Simulation Env (4)
  • Human Eval (3)

Human Feedback Types

  • Pairwise Preference (6)
  • Demonstrations (3)

Required Expertise

  • Coding (20)
  • Multilingual (3)
  • Medicine (2)

World Simulation with Video Foundation Models for Physical AI

NVIDIA: Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji · Oct 28, 2025 · Citations: 0

Simulation Env Coding Multilingual
  • These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.
  • To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nv
Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language

Mena Attia, Aashiq Muhamed, Mai Alkhamissi, Thamar Solorio, Mona Diab · Oct 27, 2025 · Citations: 0

Human Eval Automatic Metrics Coding
  • We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural nuances.
  • Using figurative language as a proxy for cultural nuance and local knowledge, we design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation in Arabic and English.
Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning

Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan, Carl Yang · Oct 27, 2025 · Citations: 0

Pairwise Preference Human Eval Coding
  • Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation.
  • However, most LLM judges operate solely on intrinsic text-based reasoning, limiting their ability to verify complex constraints or perform accurate computation.
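A minimal sketch of the idea, assuming a toy calculator tool: the judge verifies each answer's arithmetic claims before stating a pairwise preference. The claim-extraction regex, the tool, and the verified-fraction scoring rule are illustrative assumptions, not the paper's RL-trained judge.

```python
import re

# Simple integer arithmetic claims like "12 * 12 = 144".
CLAIM = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")

def calculator(a: int, op: str, b: int) -> int:
    # The "tool": exact arithmetic the judge can trust, unlike its own
    # text-based computation.
    return {"+": a + b, "-": a - b, "*": a * b}[op]

def verified_fraction(answer: str) -> float:
    # Fraction of the answer's arithmetic claims that the tool confirms.
    claims = CLAIM.findall(answer)
    if not claims:
        return 0.0
    ok = sum(calculator(int(a), op, int(b)) == int(c) for a, op, b, c in claims)
    return ok / len(claims)

def judge(answer_a: str, answer_b: str) -> str:
    # Pairwise preference grounded in tool calls rather than intrinsic
    # text-based reasoning alone.
    sa, sb = verified_fraction(answer_a), verified_fraction(answer_b)
    if sa == sb:
        return "tie"
    return "A" if sa > sb else "B"

print(judge("12 * 12 = 144, so the area is 144.",
            "12 * 12 = 142, so the area is 142."))  # -> A
```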
Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

Yucheng Wang, Yifan Hou, Aydin Javadov, Mubashara Akhtar, Mrinmaya Sachan · Sep 28, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Coding
  • These inconsistencies stem from a lack of controlled evaluation frameworks and analysis of models' internals to isolate when and why modality interactions support or undermine reasoning.
  • We address this gap through a logic-grounded evaluation framework that categorizes multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined.
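The paper's six patterns are its own taxonomy; purely as an illustration of the recipe, crossing an assumed three-way fact-distribution axis with an assumed two-way logical-combination axis yields six such patterns:

```python
from itertools import product

# Placeholder axes, not the paper's actual categories: where the premise
# facts live, and how they must be logically combined.
distributions = ["both_in_text", "both_in_image", "split_across_modalities"]
combinations = ["conjunction", "chained_inference"]

patterns = [f"{d} / {c}" for d, c in product(distributions, combinations)]
print(len(patterns))  # 6
for p in patterns:
    print(p)
```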
RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility

Haoyu He, Haozheng Luo, Yan Chen, Qi R. Wang · Sep 27, 2025 · Citations: 0

Automatic Metrics Coding
  • Predicting human mobility is inherently challenging due to complex long-range dependencies and multi-scale periodic behaviors.
  • To address this, we introduce RHYTHM (Reasoning with Hierarchical Temporal Tokenization for Human Mobility), a unified framework that leverages large language models (LLMs) as general-purpose spatio-temporal predictors and trajectory reasoners.
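As a rough sketch of what hierarchical temporal tokenization can look like (token names and scales here are assumptions, not RHYTHM's exact scheme), each visit timestamp is discretized at several temporal scales so the LLM can attend to daily and weekly periodicity:

```python
from datetime import datetime

def temporal_tokens(ts: datetime) -> list:
    # One visit timestamp becomes tokens at three temporal scales.
    return [
        f"<hour_{ts.hour}>",                # fine scale: time of day
        f"<dow_{ts.weekday()}>",            # mid scale: day of week
        f"<week_{ts.isocalendar().week}>",  # coarse scale: week of year
    ]

print(temporal_tokens(datetime(2025, 9, 27, 8, 30)))
# ['<hour_8>', '<dow_5>', '<week_39>']
```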
CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures

Punya Syon Pandey, Yongjin Yang, Jiarui Liu, Zhijing Jin · Aug 16, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Law Coding
  • Game-theoretic interactions between agents with Large Language Models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified.
  • In this paper, we present the Conversational Robustness Evaluation Score: CORE, a metric to quantify the effectiveness of language use within multi-agent systems across different game-theoretic interactions.
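CORE's exact formula is defined in the paper; a much simpler proxy for the same kind of signal is a distinct-n ratio over the agents' utterances, which drops as agents start repeating themselves under game-theoretic pressure:

```python
def distinct_n(turns, n=2):
    """Fraction of unique n-grams across all agents' utterances."""
    ngrams = []
    for turn in turns:
        toks = turn.lower().split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

dialogue = ["I offer you 40 points.",
            "I offer you 40 points.",  # verbatim repetition under pressure
            "Deal, 40 points it is."]
print(round(distinct_n(dialogue), 2))  # 0.67
```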
LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?

Guozhao Mo, Wenliang Zhong, Jiawei Chen, Qianhao Yuan, Xuanang Chen, Yaojie Lu · Aug 3, 2025 · Citations: 0

Automatic Metrics Medicine Coding
  • Unfortunately, there is still a large gap between real-world MCP usage and current evaluations, which typically assume single-server settings and directly inject tools into the model's context, bypassing the challenges of large-scale retrieval.
  • The benchmark includes a ready-to-deploy tool suite of 70 servers with 527 tools, ensuring reproducibility without scattered API configuration.
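A toy sketch of the retrieval challenge this setup restores: with hundreds of tools spread across servers, the agent must first retrieve candidates rather than see every tool in context. The bag-of-words scorer and two-entry catalog below are assumptions; real agents typically use embedding search over tool descriptions.

```python
def retrieve_tools(query: str, catalog: list, k: int = 3) -> list:
    # Rank tools by word overlap between the query and each description.
    q = set(query.lower().split())
    scored = sorted(
        catalog,
        key=lambda t: -len(q & set(t["description"].lower().split())),
    )
    return scored[:k]

catalog = [
    {"server": "weather", "name": "get_forecast",
     "description": "get the weather forecast for a city"},
    {"server": "files", "name": "read_file",
     "description": "read a file from disk"},
]
print(retrieve_tools("what is the weather in Paris", catalog, k=1))
# the weather server's get_forecast tool ranks first
```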
GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning

Jie Peng, Jiarui Ji, Runlin Lei, Zhewei Wei, Yongchao Liu, Chuntao Hong · Jul 4, 2025 · Citations: 0

Automatic Metrics Coding
  • Additionally, prior work mainly focuses on discriminative tasks on DyTAGs, resulting in a lack of standardized task formulations and evaluation protocols tailored for DyTAG generation.
  • To address these critical issues, we propose Generative DyTAG Benchmark (GDGB), which comprises eight meticulously curated DyTAG datasets with high-quality textual features for both nodes and edges, overcoming limitations of prior datasets.
Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs

Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktaschel, Jakob Foerster · Jun 23, 2025 · Citations: 0

Demonstrations Automatic Metrics Coding
  • Though execution of instructions in training data remains less reliable than when instructions are given in-context, our results demonstrate that procedural knowledge can be noisily 'programmed' into LLMs through PBB, with important implications…
EuroGEST: Investigating gender stereotypes in multilingual language models

Jacqueline Rowe, Mateusz Klimaszewski, Liane Guillou, Shannon Vallor, Alexandra Birch · Jun 4, 2025 · Citations: 0

Human Eval Automatic Metrics Coding Multilingual
  • Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric.
  • EuroGEST builds on an existing expert-informed benchmark covering 16 gender stereotypes, expanded in this work using translation tools, quality estimation metrics, and morphological heuristics.
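A toy rendering of that expansion recipe, with invented sentences, quality-estimation scores, threshold, and a deliberately crude morphological heuristic; EuroGEST's actual tooling and filters differ:

```python
# Stubbed candidates: (machine translation, score from an external
# quality-estimation model). Both values are invented for illustration.
candidates = [
    ("Ona jest bardzo emocjonalna.", 0.91),  # Polish, feminine-marked
    ("Jest bardzo emocjonalny.", 0.54),      # low-quality candidate
]

QE_THRESHOLD = 0.8  # assumed cutoff

def feminine_marked(sentence: str) -> bool:
    # Toy heuristic: Polish feminine adjectives typically end in "-a".
    return sentence.rstrip(".").split()[-1].endswith("a")

for sent, qe in candidates:
    if qe >= QE_THRESHOLD:
        label = "feminine-marked" if feminine_marked(sent) else "unmarked"
        print(f"{sent} -> kept, {label}")
    else:
        print(f"{sent} -> discarded (QE {qe} < {QE_THRESHOLD})")
```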
HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning

Chuhao Zhou, Jianfei Yang · May 23, 2025 · Citations: 0

Automatic Metrics Simulation Env Coding
  • Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language.
  • In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities, such as LiDAR, infrared, mmWave radar, and WiFi, to enable seamless human perception and reasoning across…
Visual Planning: Let's Think Only with Images

Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen · May 16, 2025 · Citations: 0

Automatic Metrics Coding
  • In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions.
Multimodal Integrated Knowledge Transfer to Large Language Models through Preference Optimization with Biomedical Applications

Zhanliang Wang, Da Wu, Quan Nguyen, Zhuoran Xu, Kai Wang · May 9, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Medicine Coding
  • To address this challenge, we introduce MINT (Multimodal Integrated kNowledge Transfer), a framework that aligns unimodal large decoder models with domain-specific decision patterns from multimodal biomedical data through preference optimiz
  • While MINT supports different optimization techniques, we primarily implement it with the Odds Ratio Preference Optimization (ORPO) framework as its backbone.
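For reference, ORPO combines a standard supervised term on the chosen response with a log-odds-ratio penalty on the preference pair. A single-pair sketch, where logp_w and logp_l stand in for the model's length-normalized log-likelihoods of the chosen and rejected responses, and the weight lam is an illustrative value rather than MINT's setting:

```python
import math

def orpo_loss(logp_w: float, logp_l: float, lam: float = 0.1) -> float:
    def log_odds(logp: float) -> float:
        # odds(y) = p / (1 - p), computed in log space for stability
        return logp - math.log1p(-math.exp(logp))

    ratio = log_odds(logp_w) - log_odds(logp_l)
    l_or = math.log1p(math.exp(-ratio))  # -log sigmoid(ratio)
    l_sft = -logp_w                      # NLL on the chosen response
    return l_sft + lam * l_or

print(round(orpo_loss(math.log(0.6), math.log(0.3)), 3))  # 0.536
```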