- PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing
Anthony Hughes, Vasisht Duddu, N. Asokan, Nikolaos Aletras, Ning Ma · Oct 8, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- LAD-RAG: Layout-aware Dynamic RAG for Visually-Rich Document Understanding
Zhivar Sourati, Zheng Wang, Marianne Menglin Liu, Yazhe Hu, Mengqing Guo · Oct 8, 2025 · Citations: 0
- EconCausal: A Context-Aware Causal Reasoning Benchmark for Large Language Models in Social Science
Donggyu Lee, Hyeok Yun, Meeyoung Cha, Sungwon Park, Sangyoon Park · Oct 8, 2025 · Citations: 0
To address this, we introduce EconCausal, a large-scale benchmark comprising 10,490 context-annotated causal triplets extracted from 2,595 high-quality empirical studies published in top-tier economics and finance journals.
- Biasless Language Models Learn Unnaturally: How LLMs Fail to Distinguish the Possible from the Impossible
Imry Ziv, Nur Lan, Emmanuel Chemla · Oct 8, 2025 · Citations: 0
Are large language models (LLMs) sensitive to the distinction between humanly possible and impossible languages?
- Search-R3: Unifying Reasoning and Embedding in Large Language Models
Yuntao Gui, James Cheng · Oct 8, 2025 · Citations: 0
Our extensive evaluations on diverse benchmarks demonstrate that Search-R3 significantly outperforms prior methods by unifying the reasoning and embedding generation processes.
- Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation
Vaibhav Srivastav, Steven Zheng, Eric Bezzam, Eustache Le Bihan, Nithin Rao Koluguri · Oct 8, 2025 · Citations: 0
We present the Open ASR Leaderboard, a reproducible benchmarking platform with community contributions from academia and industry.
- Multi-hop Deep Joint Source-Channel Coding with Deep Hash Distillation for Semantically Aligned Image Recovery
Didrik Bergström, Deniz Gündüz, Onur Günlü · Oct 8, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Exposing Citation Vulnerabilities in Generative Engines
Riku Mochizuki, Shusuke Komatsu, Souta Noguchi, Kazuto Ataka · Oct 8, 2025 · Citations: 0
- FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline
Haotian Wu, Shufan Jiang, Chios Chen, Yiyang Feng, Hehai Lin · Oct 8, 2025 · Citations: 0
Multi Agent
As large language models (LLMs) advance in role-playing (RP) tasks, existing benchmarks quickly become obsolete due to their narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios.
- PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs
Manuel Frank, Haithem Afli · Oct 8, 2025 · Citations: 0
- PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch
Shangjian Yin, Shining Liang, Wenbiao Ding, Yuli Qian, Zhouxing Shi · Oct 8, 2025 · Citations: 0
Pairwise Preference
Despite its small size, fine-tuning Llama-3-8B-Base on PiKa-SFT even outperforms the official Llama-3-8B-Instruct model trained on over 10M proprietary examples on widely used benchmarks such as AlpacaEval 2.0 and Arena-Hard.
- StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering
Zhihao Wen, Wenkang Wei, Yuan Fang, Xingtong Yu, Hui Zhang · Oct 8, 2025 · Citations: 0