- "Don't Do That!": Guiding Embodied Systems through Large Language Model-based Constraint Generation
Amin Seffo, Aladin Djuhera, Masataro Asai, Holger Boche · Jun 4, 2025 · Citations: 0
Web Browsing
Recent advancements in large language models (LLMs) have spurred interest in robotic navigation that incorporates complex spatial, mathematical, and conditional constraints from natural language into the planning problem.
- Watermarking Degrades Alignment in Language Models: Analysis and Mitigation
Apurv Verma, NhatHai Phan, Shubhendu Trivedi · Jun 4, 2025 · Citations: 0
In practice, sampling as few as two to four candidates largely restores unwatermarked alignment performance in truthfulness, safety, and helpfulness, without hurting watermark detection.
- Toward Beginner-Friendly LLMs for Language Learning: Controlling Difficulty in Conversation
Meiqing Jin, Liam Dugan, Chris Callison-Burch · Jun 4, 2025 · Citations: 0
We further introduce a new token-level evaluation metric, Token Miss Rate (TMR), that quantifies the proportion of incomprehensible tokens per utterance and correlates strongly with human judgments.
- High Accuracy, Less Talk (HALT): Reliable LLMs through Capability-Aligned Finetuning
Tim Franzmeyer, Archie Sravankumar, Lijuan Liu, Yuning Mao, Rui Hou · Jun 4, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models
Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang · Jun 4, 2025 · Citations: 0
Expert Verification
However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences…
- EuroGEST: Investigating gender stereotypes in multilingual language models
Jacqueline Rowe, Mateusz Klimaszewski, Liane Guillou, Shannon Vallor, Alexandra Birch · Jun 4, 2025 · Citations: 0
Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric.
- CyclicReflex: Improving Reasoning Models via Cyclical Reflection Token Scheduling
Chongyu Fan, Yihua Zhang, Jinghan Jia, Alfred Hero, Sijia Liu · Jun 4, 2025 · Citations: 0
Long Horizon
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Go-Browse: Training Web Agents with Structured Exploration
Apurva Gandhi, Graham Neubig · Jun 4, 2025 · Citations: 0
Web Browsing
To address this, we propose Go-Browse, a method for automatically collecting diverse and realistic web agent data at scale through structured exploration of web environments.
- Beyond Memorization: A Rigorous Evaluation Framework for Medical Knowledge Editing
Shigeng Chen, Linhao Luo, Zhangchi Qiu, Yanan Cao, Carl Yang · Jun 4, 2025 · Citations: 0
Despite their effectiveness on general-domain benchmarks, the applicability of knowledge editing methods to the complex medical domain remains largely unexplored.