- Improving Low-Resource Machine Translation via Round-Trip Reinforcement Learning
Ahmed Attia, Alham Fikri Aji · Jan 18, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Multi-Task Instruction Tuning via Data Scheduling for Low-Resource Arabic AudioLLMs
Hunzalah Hassan Bhatti, Firoj Alam, Shammur Absar Chowdhury · Jan 18, 2026 · Citations: 0
To support end-to-end Arabic speech summarization, we introduce AraMega-SSum, a first speech summarization resource for training and benchmarking Arabic-centric Audio-LLMs.
- Legal Experts Disagree With Rationale Extraction Techniques for Explaining ECtHR Case Outcome Classification
Mahammad Namazov, Tomáš Koref, Ivan Habernal · Jan 18, 2026 · Citations: 0
We study this task on decisions from the European Court of Human Rights (ECtHR), introducing a new ECtHR dataset with carefully curated positive (violation) and negative (non-violation) cases.
- Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Jan 18, 2026 · Citations: 0
Long Horizon
Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.
- Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents
Raffi Khatchadourian · Jan 17, 2026 · Citations: 0
Long Horizon
We introduce the Determinism-Faithfulness Assurance Harness (DFAH), a framework for measuring trajectory determinism, decision determinism, and evidence-conditioned faithfulness in tool-using agents deployed in financial services.
- PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning
Bingxuan Li, Jeonghwan Kim, Cheng Qian, Xiusi Chen, Eitan Anzenberg · Jan 17, 2026 · Citations: 0
Pairwise Preference Long Horizon
To enable a systematic study of this question, we introduce CalConflictBench, a benchmark for long-horizon calendar conflict resolution.
- Event Detection with a Context-Aware Encoder and LoRA for Improved Performance on Long-Tailed Classes
Abdullah Al Monsur, Nitesh Vamshi Bommisetty, Gene Louis Kim · Jan 17, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- The unreasonable effectiveness of pattern matching
Gary Lupyan, Blaise Agüera y Arcas · Jan 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- F-Actor: Controllable Conversational Behaviour in Full-Duplex Models
Maike Züfle, Ondrej Klejch, Nicholas Sanders, Jan Niehues, Alexandra Birch · Jan 16, 2026 · Citations: 0
Spoken conversational systems require more than accurate speech generation to have human-like conversations: to feel natural and engaging, they must produce conversational behaviour that adapts dynamically to the context.
- T$^\star$: Progressive Block Scaling for Masked Diffusion Language Models Through Trajectory Aware Reinforcement Learning
Hanchen Xia, Baoyou Chen, Yutang Ge, Guojiang Zhao, Siyu Zhu · Jan 16, 2026 · Citations: 0
Long Horizon
Starting from an AR-initialized small-block MDM, T$^\star$ transitions smoothly to larger blocks, enabling higher-parallelism decoding with minimal performance degradation on math reasoning benchmarks.
- The Growing Gains and Pains of Iterative Web Corpora Crawling: Insights from South Slavic CLASSLA-web 2.0 Corpora
Taja Kuzman Pungeršek, Peter Rupnik, Vít Suchomel, Nikola Ljubešić · Jan 16, 2026 · Citations: 0
- Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
Shaofeng Yin, Jiaxin Ge, Zora Zhiruo Wang, Chenyang Wang, Xiuyu Li · Jan 16, 2026 · Citations: 0
Long Horizon
To address this, we introduce VIGA (Vision-as-Inverse-Graphics Agent), an interleaved multimodal reasoning framework where symbolic logic and visual perception actively cross-verify each other.
- Generating metamers of human scene understanding
Ritik Raina, Abe Leite, Alexandros Graikos, Seoyoung Ahn, Dimitris Samaras · Jan 16, 2026 · Citations: 0
Human vision combines low-resolution "gist" information from the visual periphery with sparse but high-resolution information from fixated locations to construct a coherent understanding of a visual scene.
- Contextual Distributionally Robust Optimization with Causal and Continuous Structure: An Interpretable and Tractable Approach
Fenglin Zhang, Jie Wang · Jan 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- AJAR: Adaptive Jailbreak Architecture for Red-teaming
Yipu Dou, Wang Yang · Jan 16, 2026 · Citations: 0
Red Team
Large language model (LLM) safety evaluation is moving from content moderation to action security as modern systems gain persistent state, tool access, and autonomous control loops.
- A Confidence-Variance Theory for Pseudo-Label Selection in Semi-Supervised Learning
Jinshi Liu, Pan Liu, Lei He · Jan 16, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents
Kaiyu Zhou, Yongsen Zheng, Yicheng He, Meng Xue, Xueluan Gong · Jan 16, 2026 · Citations: 0
- Unified Optimization of Source Weights and Transfer Quantities in Multi-Source Transfer Learning: An Asymptotic Framework
Qingyue Zhang, Chang Chu, Haohao Fu, Tianren Peng, Yanru Wu · Jan 15, 2026 · Citations: 0
- Are Your Reasoning Models Reasoning or Guessing? A Mechanistic Analysis of Hierarchical Reasoning Models
Zirui Ren, Ziming Liu · Jan 15, 2026 · Citations: 0
- Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi · Jan 15, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure
Syed Naveed Mahmood, Md. Rezaur Rahman Bhuiyan, Tasfia Zaman, Jareen Tasneem Khondaker, Md. Sameer Sakib · Jan 15, 2026 · Citations: 0
Selective knowledge erasure from LLMs is critical for GDPR compliance and model safety, yet current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist beneath…
- Development of Ontological Knowledge Bases by Leveraging Large Language Models
Le Ngoc Luyen, Marie-Hélène Abel, Philippe Gouspillou · Jan 15, 2026 · Citations: 0
- Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering
Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang · Jan 15, 2026 · Citations: 0
Long Horizon
The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanni
- DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset
Hengyu Shen, Tiancheng Gu, Bin Qin, Lan Wu, Yuling Wu · Jan 15, 2026 · Citations: 0
- HumanLLM: Benchmarking and Improving LLM Anthropomorphism via Human Cognitive Patterns
Xintao Wang, Jian Yang, Weiyuan Li, Rui Xie, Jen-tse Huang · Jan 15, 2026 · Citations: 0
We present HumanLLM, a framework treating psychological patterns as interacting causal forces.
- AWED-FiNER: Agents, Web applications, and Expert Detectors for Fine-grained Named Entity Recognition across 36 Languages for 6.6 Billion Speakers
Prachuryya Kaushik, Ashish Anand · Jan 15, 2026 · Citations: 0
We introduce AWED-FiNER, an open-source collection of agentic tool, web application, and 53 state-of-the-art expert models that provide Fine-grained Named Entity Recognition (FgNER) solutions across 36 languages spoken by more than 6.6…
- Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
Cameron Tice, Puria Radmard, Samuel Ratnam, Andy Kim, David Africa · Jan 15, 2026 · Citations: 0
We share our models, data, and evaluations at AlignmentPretraining.ai.
- Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts
Sijia Luo, Xiaokang Zhang, Yuxuan Hu, Bohan Zhang, Ke Wang · Jan 15, 2026 · Citations: 0
Long Horizon
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Context Volume Drives Performance: Tackling Domain Shift in Extremely Low-Resource Translation via RAG
David Samuel Setiawan, Raphaël Merx, Jey Han Lau · Jan 15, 2026 · Citations: 0
Qualitative analysis confirms the LLM acts as a robust "safety net," repairing severe failures in zero-shot domains.
- Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz · Jan 14, 2026 · Citations: 0
Pairwise Preference Long Horizon
Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodie
- Creating a Hybrid Rule and Neural Network Based Semantic Tagger using Silver Standard Data: the PyMUSAS framework for Multilingual Semantic Annotation
Andrew Moore, Paul Rayson, Dawn Archer, Tim Czerniak, Dawn Knight · Jan 14, 2026 · Citations: 0
However, for the UCREL Semantic Analysis System (USAS) framework, no open extensive evaluation has been performed beyond lexical coverage or single language evaluation.
- Information Access of the Oppressed: A Problem-Posing Framework for Envisioning Emancipatory Information Access Platforms
Bhaskar Mitra, Nicola Neophytou, Sireesh Gururaja · Jan 14, 2026 · Citations: 0
Freire's theories provide a radically different lens for exploring IA's sociotechnical concerns relative to the current dominating frames of fairness, accountability, confidentiality, transparency, and safety.
- MVSS: A Unified Framework for Multi-View Structured Survey Generation
Yinqi Liu, Yueqi Zhu, Yongkang Zhang, Feiran Liu, Yutong Shen · Jan 14, 2026 · Citations: 0
In addition, we introduce a dedicated evaluation framework that systematically assesses generated surveys from multiple dimensions, including structural quality, comparative completeness, and citation fidelity.
- CLiMB: A Domain-Informed Novelty Detection Clustering Framework for Galactic Archaeology and Scientific Discovery
Lorenzo Monti, Tatiana Muraveva, Brian Sheridan, Davide Massari, Alessia Garofalo · Jan 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Frame of Reference: Addressing the Challenges of Common Ground Representation in Situational Dialogs
Biswesh Mohapatra, Théo Charlot, Giovanni Duca, Mayank Palan, Laurent Romary · Jan 14, 2026 · Citations: 0
With the increasing presence of embodied conversational agents and social robots, the ability to correctly ground this kind of conversational content in order to refer back later also becomes important for dialog systems.
- GIFT: Reconciling Post-Training Objectives via Finite-Temperature Gibbs Initialization
Zhengyang Zhao, Lu Ma, Yizhen Jiang, Xiaochen Ma, Zimo Meng · Jan 14, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ProFit: Leveraging High-Value Signals in SFT via Probability-Guided Token Selection
Tao Liu, Taiqiang Wu, Runming Yang, Shaoning Sun, Junjie Wang · Jan 14, 2026 · Citations: 0
Supervised fine-tuning (SFT) is a fundamental post-training strategy to align Large Language Models (LLMs) with human intent.
- CAST: Character-and-Scene Episodic Memory for Agents
Kexin Ma, Bojun Li, Yuhua Tang, Liting Sun, Ruochun Jin · Jan 14, 2026 · Citations: 0
Episodic memory is a central component of human memory, which refers to the ability to recall coherent events grounded in who, when, and where.
- Imagine-then-Plan: Agent Learning from Adaptive Lookahead with World Models
Youwei Liu, Jian Wang, Hanlin Wang, Beichen Guo, Wenjie Li · Jan 13, 2026 · Citations: 0
- ConvoLearn: A Dataset for Fine-Tuning Dialogic AI Tutors
Mayank Sharma, Roy Pea, Hari Subramonyam · Jan 13, 2026 · Citations: 0
- APEX-SWE
Abhi Kottamasu, Chirag Mahapatra, Sam Lee, Ben Pan, Aakash Barthwal · Jan 13, 2026 · Citations: 0
We introduce the AI Productivity Index for Software Engineering (APEX-SWE), a benchmark for assessing whether frontier AI models can execute economically valuable software engineering work.
- A Geolocation-Aware Multimodal Approach for Ecological Prediction
Valerie Zermatten, Chiara Vanalli, Gencer Sumbul, Diego Marcos, Devis Tuia · Jan 13, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Auditing Student-AI Collaboration: A Case Study of Online Graduate CS Students
Nifu Dan · Jan 13, 2026 · Citations: 0
- A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding
Dilara Torunoğlu-Selamet, Dogukan Arslan, Rodrigo Wilkens, Wei He, Doruk Eryiğit · Jan 13, 2026 · Citations: 0
Pairwise Preference
The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects.
- Get away with less: Need of source side data curation to build parallel corpus for low resource Machine Translation
Saumitra Yadav, Manish Shrivastava · Jan 13, 2026 · Citations: 0
To train translation systems, data acquisition relies primarily on human translations and digital parallel sources or, to a limited degree, synthetic generation.
- Rewriting Video: Text-Driven Reauthoring of Video Footage
Sitong Wang, Anh Truong, Lydia B. Chilton, Dingzeyu Li · Jan 13, 2026 · Citations: 0
A technical evaluation of the algorithm reveals a critical human-AI perceptual gap.
- PosIR: Position-Aware Heterogeneous Information Retrieval Benchmark
Ziyang Zeng, Dun Zhang, Yu Yan, Xu Sun, Cuiqiaoshu Pan · Jan 13, 2026 · Citations: 0
Pairwise Preference
To address these limitations, we introduce PosIR (Position-Aware Information Retrieval), the first standardized benchmark designed to systematically diagnose position bias in diverse retrieval scenarios.
- High-Fidelity Modeling of Stochastic Chemical Dynamics on Complex Manifolds: A Multi-Scale SIREN-PINN Framework for the Curvature-Perturbed Ginzburg-Landau Equation
Julian Evan Chrisnanto, Salsabila Rahma Alia, Nurfauzi Fadillah, Yulison Herry Chrisnanto · Jan 13, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures
Jifeng Song, Arun Das, Pan Wang, Hui Ji, Kun Zhao · Jan 12, 2026 · Citations: 0
To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry.
- Is Sentiment Banana-Shaped? Exploring the Geometry and Portability of Sentiment Concept Vectors
Laurits Lyngbaek, Pascale Feldkamp, Yuri Bizzoni, Kristoffer L. Nielbo, Kenneth Enevoldsen · Jan 12, 2026 · Citations: 0
Use cases of sentiment analysis in the humanities often require contextualized, continuous scores.
- DYCP: Dynamic Context Pruning for Long-Form Dialogue with LLMs
Nayoung Choi, Jonathan Zhang, Jinho D. Choi · Jan 12, 2026 · Citations: 0
Across three long-form dialogue benchmarks (LoCoMo, MT-Bench+, and SCM4LLMs) and multiple LLM backends, DyCP achieves competitive answer quality in downstream generation, with more selective context usage and improved inference efficiency.
- VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding
Haorui Yu, Diji Yang, Hang He, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026 · Citations: 0
Critique Edit
We introduce VULCA-Bench, a multicultural art-critique benchmark for evaluating Vision-Language Models' (VLMs) cultural understanding beyond surface-level visual perception.
- Multilingual, Multimodal Pipeline for Creating Authentic and Structured Fact-Checked Claim Dataset
Z. Melce Hüsünbeyi, Virginie Mouilleron, Leonie Uhling, Daniel Foppe, Tatjana Scheffler · Jan 12, 2026 · Citations: 0
Evaluation with G-Eval and human assessment demonstrates that our pipeline enables fine-grained comparison of fact-checking practices across different organizations or media markets, facilitates the development of more interpretable and…
- Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models
Haorui Yu, Xuehang Wen, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026 · Citations: 0
Rubric Rating Critique Edit
Existing benchmarks assess perception without interpretation, and common evaluation proxies, such as automated metrics and LLM-judge averaging, are unreliable for culturally sensitive generative tasks.
- Learning Through Dialogue: Engagement and Efficacy Matter More Than Explanations
Shaz Furniturewala, Gerard Christopher Yeo, Kokil Jaidka · Jan 12, 2026 · Citations: 0
We analyze the linguistic and interactional features from both LLM and participant chats across 397 human-LLM conversations about socio-political issues to identify the mechanisms and conditions under which LLM explanations shape changes in…
- GeoMotionGPT: Geometry-Aligned Motion Understanding with Large Language Models
Zhankai Ye, Bofan Li, Yukai Jin, Shuoqiu Li, Wei Wang · Jan 12, 2026 · Citations: 0
Extensive experiments show that our framework improves the aggregated Average by 22.4% over the strongest baseline on HumanML3D and by 14.4% on KIT-ML, while ablations confirm the effectiveness of the tokenizer, projection, and…
- Reward Modeling from Natural Language Human Feedback
Zongqi Wang, Rui Wang, Yuchuan Wu, Yiyao Yu, Pinyi Zhang · Jan 12, 2026 · Citations: 0
Pairwise Preference Critique Edit
To address this issue, we propose Reward Modeling from Natural Language Human Feedback (RM-NLHF), which leverages natural language feedback to obtain process reward signals, thereby mitigating the problem of limited solution space inherent…
- VLM-CAD: VLM-Optimized Collaborative Agent Design Workflow for Analog Circuit Sizing
Guanyuan Pan, Shuai Wang, Yugui Lin, Tiansheng Zhou, Pietro Liò · Jan 12, 2026 · Citations: 0
- NRR-Phi: Text-to-State Mapping for Ambiguity Preservation in LLM Inference
Kei Saito · Jan 12, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Measuring Iterative Temporal Reasoning with Time Puzzles
Zhengxiang Wang, Zeyu Dong · Jan 12, 2026 · Citations: 0