- SODIUM: From Open Web Data to Queryable Databases
Chuxuan Hu, Philip Li, Maxwell Yang, Daniel Kang · Mar 19, 2026 · Citations: 0
Expert Verification Automatic Metrics Multi Agent
Existing systems struggle with SODIUM tasks: we evaluate 6 advanced AI agents on SODIUM-Bench, with the strongest baseline achieving only 46.5% accuracy.
- Measuring AI Ability to Complete Long Software Tasks
Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia · Mar 18, 2025 · Citations: 0
Expert Verification Automatic Metrics Tool Use
Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
- Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning
Navan Preet Singh, Xiaokun Wang, Anurag Garikipati, Madalina Ciobanu, Qingqing Mao · Apr 7, 2026 · Citations: 0
Expert Verification Automatic Metrics
These models remarkably achieve high enough accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark to establish new state-of-the-art (SOTA) results across the interactive Pedagogy Benchmark Leaderboard and surpass significantly…
- An Industrial-Scale Insurance LLM Achieving Verifiable Domain Mastery and Hallucination Control without Competence Trade-offs
Qian Zhu, Xinnan Guo, Jingjing Huo, Jun Li, Pan Liu · Mar 15, 2026 · Citations: 0
Expert Verification RLAIF Or Synthetic Feedback Automatic Metrics
Additionally, we release INSEva, the most comprehensive insurance benchmark to date (39k+ samples).
- Evaluation of LLMs in retrieving food and nutritional context for RAG systems
Maks Požarnik Vavken, Matevž Ogrinc, Tome Eftimov, Barbara Koroušić Seljak · Mar 10, 2026 · Citations: 0
Expert Verification Automatic Metrics
In this article, we evaluate four Large Language Models (LLMs) and their effectiveness at retrieving data within a specialized Retrieval-Augmented Generation (RAG) system, using a comprehensive food composition database.
- An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems
Anna Martin-Boyle, William Humphreys, Martha Brown, Cara Leckey, Harmanpreet Kaur · Feb 24, 2026 · Citations: 0
Expert Verification Automatic Metrics
Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in…
- LM-Lexicon: Improving Definition Modeling via Harmonizing Semantic Experts
Yang Liu, Jiaye Yang, Weikang Li, Jiahui Liang, Yang Li · Feb 15, 2026 · Citations: 0
Expert Verification Automatic Metrics
By decomposing the definition modeling task into specialized semantic domains, where small language models are trained as domain experts, LM-Lexicon achieves substantial improvements (+7% BLEU score compared with the prior state-of-the-art…
- Measuring Complexity at the Requirements Stage: Spectral Metrics as Development Effort Predictors
Maximilian Vierlboeck, Antonio Pugliese, Roshanak Rose Nilchian, Paul T. Grogan, Rashika Sugganahalli Natesh Babu · Feb 6, 2026 · Citations: 0
Expert Verification Automatic Metrics
Complexity in engineered systems presents one of the most persistent challenges in modern development since it is driving cost overruns, schedule delays, and outright project failures.
- GraphMERT: Efficient and Scalable Distillation of Reliable Knowledge Graphs from Unstructured Data
Margarita Belova, Jiaxin Xiao, Shikhar Tuli, Niraj K. Jha · Oct 10, 2025 · Citations: 0
Expert Verification Automatic Metrics
GraphMERT + KG is the first efficient and scalable neurosymbolic model to achieve state-of-the-art benchmark accuracy along with superior symbolic representations relative to baselines.
- "Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems
Xinfeng Li, Shenyu Dai, Kelong Zheng, Yue Xiao, Gelei Deng · Feb 24, 2026 · Citations: 0
Expert Verification Automatic Metrics
Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare.