- GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialectal ASR
Pouya Mehralian, Melissa Farasyn, Anne Breitbarth, Anne-Sophie Ghyselen, Hugo Van hamme · Mar 2, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- A Directed Graph Model and Experimental Framework for Design and Study of Time-Dependent Text Visualisation
Songhai Fan, Simon Angus, Tim Dwyer, Ying Yang, Sarah Goodwin · Mar 2, 2026 · Citations: 0
Exponential growth in the quantity of digital news, social media, and other textual sources makes it difficult for humans to keep up with rapidly evolving narratives about world events.
- RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks
Alexandra Diaconu, Mădălina Vînaga, Bogdan Alexe · Mar 2, 2026 · Citations: 0
- Detecting AI-Generated Essays in Writing Assessment: Responsible Use and Generalizability Across LLMs
Jiangang Hao · Mar 2, 2026 · Citations: 0
- Characterizing Memorization in Diffusion Language Models: Generalized Extraction and Sampling Effects
Xiaoyu Luo, Wenrui Yu, Qiongxiu Li, Johannes Bjerva · Mar 2, 2026 · Citations: 0
- Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training
Valentin Lacombe, Valentin Quesnel, Damien Sileo · Mar 2, 2026 · Citations: 0
- Tool Verification for Test-Time Reinforcement Learning
Ruotong Liao, Nikolai Röhrich, Xiaohan Wang, Yuhui Zhang, Yasaman Samadzadeh · Mar 2, 2026 · Citations: 0
- Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale
Hao Li, Chunjiang Mu, Jianhao Chen, Siyue Ren, Zhiyao Cui · Mar 2, 2026 · Citations: 0
Pairwise Preference
The rapid proliferation of Claude agent skills has raised the central question of how to effectively leverage, manage, and scale the agent skill ecosystem.
- Scaling Retrieval Augmented Generation with RAG Fusion: Lessons from an Industry Deployment
Luigi Medrano, Arush Verma, Mukul Chhabra · Mar 2, 2026 · Citations: 0
- Zero- and Few-Shot Named-Entity Recognition: Case Study and Dataset in the Crime Domain (CrimeNER)
Miguel Lopez-Duran, Julian Fierrez, Aythami Morales, Daniel DeAlcala, Gonzalo Mancera · Mar 2, 2026 · Citations: 0
- LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards
Guanzheng Chen, Michael Qizhe Shieh, Lidong Bing · Mar 2, 2026 · Citations: 0
- LLMs as Strategic Actors: Behavioral Alignment, Risk Calibration, and Argumentation Framing in Geopolitical Simulations
Veronika Solopova, Viktoria Skorik, Maksym Tereshchenko, Alina Haidun, Ostap Vykhopen · Mar 2, 2026 · Citations: 0
- Recursive Models for Long-Horizon Reasoning
Chenxiao Yang, Nathan Srebro, Zhiyuan Li · Mar 2, 2026 · Citations: 0
- Recursive Think-Answer Process for LLMs and VLMs
Byung-Kwan Lee, Youngchae Chee, Yong Man Ro · Mar 2, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- OmniRet: Efficient and High-Fidelity Omni Modality Retrieval
Chuong Huynh, Manh Luong, Abhinav Shrivastava · Mar 2, 2026 · Citations: 0
- ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels
Xiang Zheng, Han Li, Wenjie Luo, Weiqi Zhai, Yiyuan Li · Mar 2, 2026 · Citations: 0
Rubric Rating
However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows.
- Learning from Synthetic Data Improves Multi-hop Reasoning
Anmol Kabra, Yilun Yin, Albert Gong, Kamilė Stankevičiūtė, Dongyoung Go · Mar 2, 2026 · Citations: 0
- Modeling Grammatical Hypothesis Testing in Young Learners: A Sequence-Based Learning Analytics Study of Morphosyntactic Reasoning in an Interactive Game
Thierry Geoffre, Trystan Geoffre · Mar 2, 2026 · Citations: 0
Critique Edit
Analyzing 597 gameplay sessions (9,783 actions) from 100 students aged 8-11 in authentic classroom settings, we introduce Hamming distance to quantify proximity to valid grammatical solutions and examine convergence patterns across…
- What Exactly do Children Receive in Language Acquisition? A Case Study on CHILDES with Automated Detection of Filler-Gap Dependencies
Zhenghao Herbert Zhou, William Dai, Maya Viswanathan, Simon Charlow, R. Thomas McCoy · Mar 2, 2026 · Citations: 0
- GenDB: The Next Generation of Query Processing -- Synthesized, Not Engineered
Jiale Lao, Immanuel Trummer · Mar 2, 2026 · Citations: 0
Multi Agent
As a proof of concept, we present GenDB, an LLM-powered agentic system that generates instance-optimized and customized query execution code tailored to specific data, workloads, and hardware resources.
- Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning
Guilhem Fouilhé, Rebecca Eifler, Antonin Poché, Sylvie Thiébaux, Nicholas Asher · Mar 2, 2026 · Citations: 0
Pairwise Preference · Multi Agent
When automating plan generation for a real-world sequential decision problem, the goal is often not to replace the human planner, but to facilitate an iterative reasoning and elicitation process, where the human's role is to guide the AI…
- EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training
Aleksei Dorkin, Taido Purason, Emil Kalbaliyev, Hele-Andra Kuulmets, Marii Ojastu · Mar 2, 2026 · Citations: 0
Pairwise Preference
We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior.
- Learning to Read Where to Look: Disease-Aware Vision-Language Pretraining for 3D CT
Simon Ging, Philipp Arnold, Sebastian Walter, Hani Alnahas, Hannah Bast · Mar 2, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning
Jiachun Li, Shaoping Huang, Zhuoran Jin, Chenlong Zhang, Pengfei Cao · Mar 2, 2026 · Citations: 0
Despite their promise, MLLMs' reasoning abilities across different scenarios in real life remain largely unexplored and lack standardized benchmarks for evaluation.
- PonderLM-3: Adaptive Token-Wise Pondering with Differentiable Masking
He Li, Feichen Song, Boyi Zeng, Shixiang Song, Zhiqin John Xu · Mar 2, 2026 · Citations: 0
On downstream benchmarks, PonderLM-3 attains comparable performance to fixed-step PonderLM-2 under the same maximum number of additional computation steps, while using fewer inference FLOPs in practice.
- According to Me: Long-Term Personalized Referential Memory QA
Jingbiao Mei, Jinghong Chen, Guangyu Yang, Xinyu Hou, Margaret Li · Mar 2, 2026 · Citations: 0
However, existing Long-term Memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience.
- CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production
Yixin Nie, Lin Guan, Zhongyao Ma, Anchit Gupta, Yipin Zhou · Mar 2, 2026 · Citations: 0
We detail the CharacterFlywheel process which integrates data curation, reward modeling to estimate and interpolate the landscape of engagement metrics, supervised fine-tuning (SFT), reinforcement learning (RL), and both offline and online…
- AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations
Cheng Jiayang, Dongyu Ru, Lin Qiu, Yiyang Li, Xuezhi Cao · Mar 2, 2026 · Citations: 0
Long Horizon
Long-horizon interactions between users and LLM-based assistants necessitate effective memory management, yet current approaches face challenges in training and evaluation of memory.
- Semantic Similarity is a Spurious Measure of Comic Understanding: Lessons Learned from Hallucinations in a Benchmarking Experiment
Christopher Driggers-Ellis, Nachiketh Tibrewal, Rohit Bogulla, Harsh Khanna, Sangpil Youm · Mar 2, 2026 · Citations: 0
In this work, we present a preliminary benchmark of VLM performance on comic interpretation tasks.
- When Numbers Tell Half the Story: Human-Metric Alignment in Topic Model Evaluation
Thibault Prouteau, Francis Lareau, Nicolas Dugué, Jean-Charles Lamirel, Christophe Malaterre · Mar 2, 2026 · Citations: 0
Existing methods often rely on automated metrics like topic coherence and diversity, which may not fully align with human judgment.
- From Variance to Invariance: Qualitative Content Analysis for Narrative Graph Annotation
Junbo Huang, Max Weinig, Ulrich Fritsche, Ricardo Usbeck · Mar 2, 2026 · Citations: 0
To evaluate annotation quality, we employed a 6×3 factorial experimental design to examine the effects of narrative representation (six levels) and distance metric type (three levels) on inter-annotator agreement (Krippendorff's α),…
- AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth
Shixiang Song, He Li, Zitong Wang, Boyi Zeng, Feichen Song · Mar 2, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Demonstrating ViviDoc: Generating Interactive Documents through Human-Agent Collaboration
Yinghao Tang, Yupeng Xie, Yingchaojie Feng, Tingfeng Lan, Wei Chen · Mar 2, 2026 · Citations: 0
Expert Verification · Multi Agent
Recent LLM-based agents can automate content creation, but naively applying them yields uncontrollable and unverifiable outputs.
- FLANS at SemEval-2026 Task 7: RAG with Open-Sourced Smaller LLMs for Everyday Knowledge Across Diverse Languages and Cultures
Liliia Bogdanova, Shiran Sun, Lifeng Han, Natalia Amat Lefort, Flor Miriam Plaza-del-Arco · Mar 2, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Efficient RLVR Training via Weighted Mutual Information Data Selection
Xinyu Zhou, Boyu Zhu, Haotian Zhang, Huiming Wang, Zhijiang Guo · Mar 2, 2026 · Citations: 0
Extensive experiments demonstrate that InSight consistently achieves state-of-the-art performance and improves training efficiency, including a +1.41 average gain on Planning & Mathematics benchmarks, +1.01 improvement on general reasoning,…
- KDFlow: A User-Friendly and Efficient Knowledge Distillation Framework for Large Language Models
Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen · Mar 2, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Sovereign AI-based Public Services are Viable and Affordable
António Branco, Luís Gomes, Rodrigo Santos, Eduardo Santos, João Silva · Mar 2, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
Ziyi Zhu, Olivier Tieleman, Alexey Bukhtiyarov, Jinghong Chen · Mar 2, 2026 · Citations: 0
LLM-as-judge evaluation has become standard practice for open-ended model assessment; however, judges exhibit systematic biases that cannot be eliminated by increasing the number of scenarios or generations.
- Let the Agent Search: Autonomous Exploration Beats Rigid Workflows in Temporal Question Answering
Xufei Lv, Jiahui Yang, Yifu Gao, Linbo Qiao, Houde Liu · Mar 2, 2026 · Citations: 0
Building on this insight, we propose AT2QA, an autonomous, training-free agent for temporal question answering that iteratively interacts with the temporal knowledge graph via a general search tool for dynamic retrieval.
- OpenAutoNLU: Open Source AutoML Library for NLU
Grigory Arshinov, Aleksandr Boriskin, Sergey Senichev, Ayaz Zaripov, Daria Galimzianova · Mar 2, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- PleaSQLarify: Visual Pragmatic Repair for Natural Language Database Querying
Robin Shing Moon Chan, Rita Sevastjanova, Mennatallah El-Assady · Mar 2, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ALTER: Asymmetric LoRA for Token-Entropy-Guided Unlearning of LLMs
Xunlei Chen, Jinyu Guo, Yuang Li, Zhaokun Wang, Yi Gong · Mar 2, 2026 · Citations: 0
ALTER achieves SOTA performance on TOFU, WMDP, and MUSE benchmarks with over 95% forget quality and shows minimal side effects through preserving foundational tokens.
- Semantic Novelty Trajectories in 80,000 Books: A Cross-Corpus Embedding Analysis
Fred Zimmerman · Mar 2, 2026 · Citations: 0
Long Horizon
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- nchellwig at SemEval-2026 Task 3: Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis using Large Language Models
Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff · Mar 2, 2026 · Citations: 0
Evaluation across 6 languages and 8 language-domain combinations demonstrates that self-consistency with 15 executions yields statistically significant improvements over single-inference prompting, with our system (leveraging Gemma 3)…
- LLM-as-an-Annotator: Training Lightweight Models with LLM-Annotated Examples for Aspect Sentiment Tuple Prediction
Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff · Mar 2, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- FreeAct: Freeing Activations for LLM Quantization
Xiaohao Liu, Xiaobo Xia, Manyi Zhang, Ji-Fu Li, Xianzhi Yu · Mar 2, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Beyond the Resumé: A Rubric-Aware Automatic Interview System for Information Elicitation
Harry Stuart, Masahiro Kaneko, Timothy Baldwin · Mar 2, 2026 · Citations: 0
Rubric Rating
Effective hiring is integral to the success of an organisation, but it is very challenging to find the most suitable candidates because expert evaluation (e.g., interviews conducted by a technical manager) is expensive to deploy at scale.
- AnnoABSA: A Web-Based Annotation Tool for Aspect-Based Sentiment Analysis with Retrieval-Augmented Suggestions
Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff · Mar 2, 2026 · Citations: 0
Alongside manual annotation, AnnoABSA provides optional Large Language Model (LLM)-based retrieval-augmented generation (RAG) suggestions that offer context-aware assistance in a human-in-the-loop approach, keeping the human annotator in…
- Bootstrapping Embeddings for Low Resource Languages
Merve Basoz, Andrew Horne, Mattia Opper · Mar 2, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training
Jinluan Yang, Yuxin Liu, Zhengyu Chen, Chengcheng Han, Yueqing Sun · Mar 2, 2026 · Citations: 0
Training tool-use agents typically relies on outcome-based filtering: Supervised Fine-Tuning (SFT) on successful trajectories and Reinforcement Learning (RL) on pass-rate-selected tasks.
- Legal RAG Bench: an end-to-end benchmark for legal RAG
Abdur-Rahman Butler, Umar Butler · Mar 2, 2026 · Citations: 0
We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal RAG systems.
- Building a Strong Instruction Language Model for a Less-Resourced Language
Domen Vreš, Tjaša Arčon, Timotej Petrič, Dario Vajda, Marko Robnik-Šikonja · Mar 2, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- QIME: Constructing Interpretable Medical Text Embeddings via Ontology-Grounded Questions
Yixuan Tang, Zhenghong Lin, Yandong Sun, Wynne Hsu, Mong Li Lee · Mar 2, 2026 · Citations: 0
Experiments across biomedical semantic similarity, clustering, and retrieval benchmarks show that QIME consistently outperforms prior interpretable embedding methods and substantially narrows the gap to strong black-box biomedical encoders,…
- Surgical Post-Training: Cutting Errors, Keeping Knowledge
Wenye Lin, Kai Han · Mar 2, 2026 · Citations: 0
Pairwise Preference
While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover, and validate both theoretically and empirically, an overlooked yet critical mechanism: the implicit regularization inherent in Direct…
- Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations
Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Shuliang Liu · Mar 2, 2026 · Citations: 0
Extensive experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains across numerous benchmarks and base models.
- LexChronos: An Agentic Framework for Structured Event Timeline Extraction in Indian Jurisprudence
Anka Chandrahas Tummepalli, Preethu Rose Anish · Mar 2, 2026 · Citations: 0
We propose LexChronos, an agentic framework that iteratively extracts structured event timelines from Supreme Court of India judgments.
- Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning
Jiebin Zhang, Zhenghan Yu, Liang Wang, Nan Yang, Eugene J. Yu · Mar 2, 2026 · Citations: 0
We conducted extensive evaluations on five diverse LLMs and four distinct tasks.
- Measuring What VLMs Don't Say: Validation Metrics Hide Clinical Terminology Erasure in Radiology Report Generation
Aditya Parikh, Aasa Feragen, Sneha Das, Stella Frank · Mar 2, 2026 · Citations: 0
This paper investigates a critical blind spot in current model evaluation: the use of decoding strategies that lead to high aggregate token-overlap scores despite succumbing to template collapse, in which models generate only repetitive,…
- More Data, Fewer Diacritics: Scaling Arabic TTS
Ahmed Musleh, Yifan Zhang, Kareem Darwish · Mar 2, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Markovian ODE-guided scoring can assess the quality of offline reasoning traces in language models
Arghodeep Nandi, Ojasva Saxena, Tanmoy Chakraborty · Mar 2, 2026 · Citations: 0