- JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models
Junyu Liu, Zirui Li, Qian Niu, Zequn Zhang, Yue Xun · Jan 4, 2026 · Citations: 0
Red Team
To address these gaps, we introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating medical safety of LLMs for Japanese healthcare.
- Vision-language models lag human performance on physical dynamics and intent reasoning
Tianjun Gu, Jingyu Gong, Zhizhong Zhang, Yuan Xie, Lizhuang Ma · Jan 4, 2026 · Citations: 0
To evaluate TSI, we present EscherVerse, a large-scale open-world resource built from 11,328 real-world videos, including an 8,000-example benchmark and a 35,963-example instruction-tuning set.
- AppellateGen: A Benchmark for Appellate Legal Judgment Generation
Hongkun Yang, Lionel Z. Wang, Wei Fan, Yiran Hu, Lixu Wang · Jan 4, 2026 · Citations: 0
Multi Agent
To address this, we introduce AppellateGen, a benchmark for second-instance legal judgment generation comprising 7,351 case pairs.
- ARGUS: Adaptive Rotation-Invariant Geometric Unsupervised System
Anantha Sharma · Jan 3, 2026 · Citations: 0
Pairwise Preference
Detecting distributional drift in high-dimensional data streams presents fundamental challenges: global comparison methods scale poorly, projection-based approaches lose geometric structure, and re-clustering methods suffer from identity…
- Collusive Pricing Under LLM
Shengyu Cao, Ming Hu · Jan 3, 2026 · Citations: 0
Pairwise Preference
Above it, the system is bistable, with competitive and collusive pricing both locally stable and the realized outcome determined by the model's initial preference.
- EmoLoom-2B: Fast Base-Model Screening for Emotion Classification and VAD with Lexicon-Weak Supervision and KV-Off Evaluation
Zilin Li, Weiwei Xu, Xuanbo Lu, Zheda Liu · Jan 3, 2026 · Citations: 0
To ensure protocol-faithful and fair evaluation, we unify data loading, training, and inference under a single JSON input-output contract and remove avoidable variance by adopting KV-off decoding as the default setting.
- Improving Variational Autoencoder using Random Fourier Transformation: An Aviation Safety Anomaly Detection Case-Study
Ata Akbari Asanjan, Milad Memarzadeh, Bryan Matthews, Nikunj Oza · Jan 3, 2026 · Citations: 0
We showcase our findings with two low-dimensional synthetic datasets for data representation, and an aviation safety dataset, called Dashlink, for high-dimensional reconstruction-based anomaly detection.
- Sigmoid Head for Quality Estimation under Language Ambiguity
Tu Anh Dinh, Jan Niehues · Jan 2, 2026 · Citations: 0
As the Sigmoid Head does not rely on human-annotated quality data, it is more robust to out-of-domain settings compared to supervised QE.
- Fast-weight Product Key Memory
Tianyu Zhao, Llion Jones · Jan 2, 2026 · Citations: 0
Notably, in Needle-in-a-Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.
- A Language-Agnostic Hierarchical LoRA-MoE Architecture for CTC-based Multilingual ASR
Yuang Zheng, Dongxu Chen, Yuxiang Mei, Dongxing Xu, Jie Chen · Jan 2, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Deep Neural Networks as Discrete Dynamical Systems: Implications for Physics-Informed Learning
Abhisek Ganguly, Santosh Ansumali, Sauro Succi · Jan 1, 2026 · Citations: 0
- Toward Better Temporal Structures for Geopolitical Events Forecasting
Kian Ahrabian, Eric Boxer, Jay Pujara · Jan 1, 2026 · Citations: 0
Finally, we benchmark and analyze popular LLMs on our dataset, providing insights into 1) the positive impact of utilizing the HTKGH formalization compared to existing ones and 2) LLMs' adaptability and capabilities in complex forecasting…
- Do LLMs Judge Distantly Supervised Named Entity Labels Well? Constructing the JudgeWEL Dataset
Alistair Plum, Laura Bernardy, Tharindu Ranasinghe · Jan 1, 2026 · Citations: 0
We present judgeWEL, a dataset for named entity recognition (NER) in Luxembourgish, automatically labelled and subsequently verified using large language models (LLMs) in a novel pipeline.
- Parallel Universes, Parallel Languages: A Comprehensive Study on LLM-based Multilingual Counterfactual Example Generation
Qianli Wang, Van Bach Nguyen, Yihong Liu, Fedor Splitt, Nils Feldhus · Jan 1, 2026 · Citations: 0
We first conduct automatic evaluations on both directly generated counterfactuals in the target languages and those derived via English translation across six languages.
- Mitigating Latent Mismatch in cVAE-Based Singing Voice Synthesis via Flow Matching
Minhyeok Yun, Yong-Hoon Choi · Jan 1, 2026 · Citations: 0
- From Evidence-Based Medicine to Knowledge Graph: Retrieval-Augmented Generation for Sports Rehabilitation and a Domain Benchmark
Jinning Zhang, Jie Song, Wenhui Tu, Zecheng Li, Jingxuan Li · Jan 1, 2026 · Citations: 0
Rubric Rating · Expert Verification
Validated in sports rehabilitation, we release a knowledge graph (357,844 nodes, 371,226 edges) and a benchmark of 1,637 QA pairs.
- FCMBench: The First Large-scale Financial Credit Multimodal Benchmark for Real-world Applications
Yehui Yang, Dalu Yang, Fangxin Shang, Wenshuo Zhou, Jie Ren · Jan 1, 2026 · Citations: 0
- Speculative Decoding: Performance or Illusion?
Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung · Dec 31, 2025 · Citations: 0
Speculative decoding (SD) has become a popular technique to accelerate Large Language Model (LLM) inference, yet its real-world effectiveness remains unclear as prior evaluations rely on research prototypes and unrealistically small batch…
- The Agentic Leash: Extracting Causal Feedback Fuzzy Cognitive Maps with LLMs
Akash Kumar Panda, Olaoluwa Adigun, Bart Kosko · Dec 31, 2025 · Citations: 0
We design a large-language-model (LLM) agent system that extracts causal feedback fuzzy cognitive maps (FCMs) from raw text.
- RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment
Chenji Lu, Zhuo Chen, Hui Zhao, Zhenyi Wang, Pengjie Wang · Dec 31, 2025 · Citations: 0
While large language models (LLMs) have shown significant results on relevance tasks, existing benchmarks lack sufficient complexity for comprehensive model assessment, resulting in an absence of standardized relevance evaluation metrics…
- ADOPT: Adaptive Dependency-Guided Joint Prompt Optimization for Multi-Step LLM Pipelines
Minjun Zhao, Xinyu Zhang, Shuai Zhang, Deyang Li, Ruifeng Shi · Dec 31, 2025 · Citations: 0
Long Horizon
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem
Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao · Dec 31, 2025 · Citations: 0
Long Horizon
We introduce the Agentic Learning Ecosystem (ALE), a foundational infrastructure that optimizes the production pipeline for agentic models.
- Paragraph Segmentation Revisited: Towards a Standard Task for Structuring Speech
Fabian Retkowski, Alexander Waibel · Dec 30, 2025 · Citations: 0
First, we establish TEDPara (human-annotated TED talks) and YTSegPara (YouTube videos with synthetic labels) as the first benchmarks for the paragraph segmentation task.
- Multi-Agent LLMs for Generating Research Limitations
Ibrahim Al Azher, Zhishuai Guo, Hamed Alhoori · Dec 30, 2025 · Citations: 0
Multi Agent
We propose a multi-agent LLM framework for generating substantive limitations.
- Activation Steering for Masked Diffusion Language Models
Adi Shnaidman, Erin Feiglin, Osher Yaari, Efrat Mentel, Amit Levi · Dec 30, 2025 · Citations: 0
Using safety refusal as a deployment-relevant case study, we find that refusal behavior in multiple MDLMs is governed by a consistent, approximately one-dimensional activation subspace.
- WISE: Web Information Satire and Fakeness Evaluation
Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury · Dec 30, 2025 · Citations: 0
This study develops the WISE (Web Information Satire and Fakeness Evaluation) framework, which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as…
- VL-RouterBench: A Benchmark for Vision-Language Model Routing
Zhehao Huang, Baijiong Lin, Jingyuan Zhang, Jingying Wang, Yuhang Liu · Dec 29, 2025 · Citations: 0
The evaluation protocol jointly measures average accuracy, average cost, and throughput, and builds a ranking score from the harmonic mean of normalized cost and accuracy to enable comparison across router configurations and cost budgets.
- Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss
Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao · Dec 29, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
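
The VL-RouterBench entry above describes a ranking score built from the harmonic mean of normalized cost and accuracy. The snippet below is a minimal sketch of that idea, not the benchmark's actual implementation: the min-max normalization, the function names `normalize_cost` and `rank_score`, and the example routers are all assumptions made for illustration.

```python
def normalize_cost(cost: float, min_cost: float, max_cost: float) -> float:
    """Map a raw cost to [0, 1], where 1.0 is the cheapest option.

    Assumption: simple min-max scaling; the benchmark may normalize differently.
    """
    if max_cost == min_cost:
        return 1.0  # all options cost the same, so none is penalized
    return (max_cost - cost) / (max_cost - min_cost)


def rank_score(accuracy: float, cost_score: float) -> float:
    """Harmonic mean of accuracy and normalized cost score, both in [0, 1]."""
    total = accuracy + cost_score
    if total == 0:
        return 0.0  # avoid division by zero when both inputs are 0
    return 2 * accuracy * cost_score / total


# Hypothetical routers: (name, average accuracy, average cost per query).
routers = [
    ("router_a", 0.82, 4.0),
    ("router_b", 0.74, 1.0),
    ("router_c", 0.90, 9.0),
]

costs = [cost for _, _, cost in routers]
lo, hi = min(costs), max(costs)

# Rank routers from best to worst combined score.
ranked = sorted(
    routers,
    key=lambda r: rank_score(r[1], normalize_cost(r[2], lo, hi)),
    reverse=True,
)
for name, acc, cost in ranked:
    print(name, round(rank_score(acc, normalize_cost(cost, lo, hi)), 3))
```

Because the harmonic mean is dominated by the smaller of its two inputs, a router scores well only when it is both accurate and cheap; under this sketch the most expensive router gets a cost score of 0 and therefore a ranking score of 0, regardless of its accuracy.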