- Orthogonalized Policy Optimization:Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Jan 18, 2026
Long Horizon
Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.
- Event Detection with a Context-Aware Encoder and LoRA for Improved Performance on Long-Tailed Classes
Abdullah Al Monsur, Nitesh Vamshi Bommisetty, Gene Louis Kim · Jan 17, 2026
The current state of event detection research has two notable re-occurring limitations that we investigate in this study.
- Generating metamers of human scene understanding
Ritik Raina, Abe Leite, Alexandros Graikos, Seoyoung Ahn, Dimitris Samaras · Jan 16, 2026
Human vision combines low-resolution "gist" information from the visual periphery with sparse but high-resolution information from fixated locations to construct a coherent understanding of a visual scene.
- Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering
Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang · Jan 15, 2026
Long Horizon
The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanni
- AWED-FiNER: Agents, Web applications, and Expert Detectors for Fine-grained Named Entity Recognition across 36 Languages for 6.6 Billion Speakers
Prachuryya Kaushik, Ashish Anand · Jan 15, 2026
We introduce \textbf{AWED-FiNER}, an open-source collection of agentic tool, web application, and 53 state-of-the-art expert models that provide Fine-grained Named Entity Recognition (FgNER) solutions across 36 languages spoken by more than
- Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
Cameron Tice, Puria Radmard, Samuel Ratnam, Andy Kim, David Africa · Jan 15, 2026
We share our models, data, and evaluations at AlignmentPretraining.ai.
- Context Volume Drives Performance: Tackling Domain Shift in Extremely Low-Resource Translation via RAG
David Samuel Setiawan, Raphaël Merx, Jey Han Lau · Jan 15, 2026
Qualitative analysis confirms the LLM acts as a robust "safety net," repairing severe failures in zero-shot domains.
- Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz · Jan 14, 2026
Pairwise Preference Long Horizon
Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodie
- CLiMB: A Domain-Informed Novelty Detection Clustering Framework for Galactic Archaeology and Scientific Discovery
Lorenzo Monti, Tatiana Muraveva, Brian Sheridan, Davide Massari, Alessia Garofalo · Jan 14, 2026
In data-driven scientific discovery, a challenge lies in classifying well-characterized phenomena while identifying novel anomalies.
- CAST: Character-and-Scene Episodic Memory for Agents
Kexin Ma, Bojun Li, Yuhua Tang, Liting Sun, Ruochun Jin · Jan 14, 2026
Episodic memory is a central component of human memory, which refers to the ability to recall coherent events grounded in who, when, and where.
- A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding
Dilara Torunoğlu-Selamet, Dogukan Arslan, Rodrigo Wilkens, Wei He, Doruk Eryiğit · Jan 13, 2026
Pairwise Preference
The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects.
- FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures
Jifeng Song, Arun Das, Pan Wang, Hui Ji, Kun Zhao · Jan 12, 2026
To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry.
- VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding
Haorui Yu, Diji Yang, Hang He, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026
Critique Edit
We introduce VULCA-Bench, a multicultural art-critique benchmark for evaluating Vision-Language Models' (VLMs) cultural understanding beyond surface-level visual perception.
- Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models
Haorui Yu, Xuehang Wen, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026
Rubric RatingCritique Edit
Existing benchmarks assess perception without interpretation, and common evaluation proxies, such as automated metrics and LLM-judge averaging, are unreliable for culturally sensitive generative tasks.
- Reward Modeling from Natural Language Human Feedback
Zongqi Wang, Rui Wang, Yuchuan Wu, Yiyao Yu, Pinyi Zhang · Jan 12, 2026
Pairwise PreferenceCritique Edit
Reinforcement Learning with Verifiable reward (RLVR) on preference data has become the mainstream approach for training Generative Reward Models (GRMs).