- HyperMem: Hypergraph Memory for Long-Term Conversations
Juwei Yue, Chuanrui Hu, Jiawei Sheng, Zuyi Zhou, Wenyuan Zhang · Apr 9, 2026 · Citations: 0
Pairwise Preference · LLM As Judge · Automatic Metrics
Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues.
- Mind the Shift: Decoding Monetary Policy Stance from FOMC Statements with Large Language Models
Yixuan Tang, Yi Yang · Mar 15, 2026 · Citations: 0
LLM As Judge · Automatic Metrics · Long Horizon
Across four LLM backbones, DCS consistently outperforms supervised probes and LLM-as-judge baselines, achieving up to 71.1% accuracy on sentence-level hawkish–dovish classification.
- LLM-as-a-Judge for Time Series Explanations
Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, Dhruv Kumar · Apr 2, 2026 · Citations: 0
LLM As Judge · Automatic Metrics
Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference-based similarity metrics and consistency-checking models require ground-truth explanations, while traditional…
- LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data
Julian Valline, Cedric Lothritz, Siwen Guo, Jordi Cabot · Oct 28, 2025 · Citations: 0
LLM As Judge · Automatic Metrics
Following generation, we apply a quality-assurance process employing an LLM-as-a-judge approach, retaining 227,507 high-quality instruction-answer pairs.
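The snippet describes a judge-based filtering step but not its implementation; as a rough illustration, here is a minimal Python sketch of that kind of pipeline. The `judge_quality` function, the 1-5 scale, and the threshold of 4 are illustrative assumptions, not LuxIT's actual setup.

```python
# Minimal sketch of LLM-as-a-judge quality filtering for instruction-tuning
# pairs. judge_quality is a hypothetical stand-in: a real pipeline would
# prompt a judge LLM with a rubric and parse its numeric score.
def judge_quality(instruction: str, answer: str) -> int:
    """Hypothetical judge: returns a 1-5 quality score for a pair."""
    return 5 if answer and len(answer.split()) >= 3 else 2

def filter_pairs(pairs: list[dict], threshold: int = 4) -> list[dict]:
    """Keep only pairs the judge rates at or above the threshold."""
    return [p for p in pairs if judge_quality(p["instruction"], p["answer"]) >= threshold]

pairs = [
    {"instruction": "Wat ass d'Haaptstad vu Lëtzebuerg?",
     "answer": "D'Haaptstad ass d'Stad Lëtzebuerg."},
    {"instruction": "Iwwersetz 'Moien'.", "answer": ""},
]
print(filter_pairs(pairs))  # only the first pair survives the filter
```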
- LogiPart: Local Large Language Models for Data Exploration at Scale with Logical Partitioning
Tiago Fernandes Tavares · Sep 26, 2025 · Citations: 0
LLM As Judge · Automatic Metrics
A qualitative audit by an independent LLM-as-a-judge confirms the discovery of meaningful functional axes, such as policy intent, that thematic ground-truth labels fail to capture.
- Multi-Agent Dialectical Refinement for Enhanced Argument Classification
Jakub Bąba, Jarosław A. Chudziak · Mar 29, 2026 · Citations: 0
LLM As Judge · Automatic Metrics · Multi Agent
We introduce MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework that leverages dialectical refinement to resolve classification uncertainty.
- The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI
Dusan Bosnjakovic · Feb 19, 2026 · Citations: 0
LLM As Judge · Automatic Metrics · Multi Agent
As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral…
- Multi-Agent LLMs for Generating Research Limitations
Ibrahim Al Azher, Zhishuai Guo, Hamed Alhoori · Dec 30, 2025 · Citations: 0
LLM As Judge · Automatic Metrics · Multi Agent
We propose a multi-agent LLM framework for generating substantive limitations.
- Weakly Supervised Distillation of Hallucination Signals into Transformer Representations
Shoaib Sadiq Salehmohamed, Jinal Prashant Thakkar, Hansika Aredla, Shaik Mohammed Omar, Shalmali Ayachit · Apr 7, 2026 · Citations: 0
LLM As Judge · Automatic Metrics
We introduce a weak supervision framework that combines three complementary grounding signals: substring matching, sentence-embedding similarity, and an LLM-as-a-judge verdict to label generated responses as grounded or hallucinated without…
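The snippet names the three signals but not how they are combined; as a rough illustration, here is a minimal Python sketch that fuses them into a noisy label. The function names (`substring_signal`, `embed_signal`, `judge_signal`, `weak_label`), the Jaccard proxy for embeddings, and the averaging scheme are all illustrative assumptions, not the paper's method.

```python
# Minimal sketch of weak supervision over three grounding signals.
from difflib import SequenceMatcher

def substring_signal(response: str, source: str) -> float:
    """1.0 on a verbatim substring match, else a fuzzy overlap ratio."""
    if response in source:
        return 1.0
    return SequenceMatcher(None, response, source).ratio()

def embed_signal(response: str, source: str) -> float:
    """Stand-in for sentence-embedding cosine similarity (hypothetical:
    a real system would encode both texts with an embedding model)."""
    a, b = set(response.lower().split()), set(source.lower().split())
    return len(a & b) / max(len(a | b), 1)  # Jaccard as a cheap proxy

def judge_signal(response: str, source: str) -> float:
    """Stand-in for an LLM-as-a-judge verdict (hypothetical: a real system
    would prompt a judge model and parse grounded/hallucinated)."""
    return 1.0 if embed_signal(response, source) > 0.3 else 0.0

def weak_label(response: str, source: str, threshold: float = 0.5) -> str:
    """Average the three signals into one noisy grounded/hallucinated label."""
    score = (substring_signal(response, source)
             + embed_signal(response, source)
             + judge_signal(response, source)) / 3.0
    return "grounded" if score >= threshold else "hallucinated"

print(weak_label("The meeting is on Friday.",
                 "Minutes: the meeting is on Friday at 3pm."))
```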
- ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts
Trapoom Ukarapol, Nut Chukamphaeng, Kunat Pipatanakul, Pakhapoom Sarapat · Mar 5, 2026 · Citations: 0
LLM As Judge · Automatic Metrics
Using ThaiSafetyBench, we evaluate 24 LLMs, with GPT-4.1 and Gemini-2.5-Pro serving as LLM-as-a-judge evaluators.
- From XAI to Stories: A Factorial Study of LLM-Generated Explanation Quality
Fabian Lukassen, Jan Herrmann, Christoph Weisser, Benjamin Saefken, Thomas Kneib · Jan 5, 2026 · Citations: 0
LLM As Judge · Automatic Metrics
Using G-Eval, an LLM-as-a-judge evaluation method, with dual LLM judges and four evaluation criteria, we evaluate 660 explanations for time-series forecasting.
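The snippet specifies dual judges and four criteria but not the aggregation; as a rough illustration, here is a minimal Python sketch of a G-Eval-style scoring loop. The specific criteria listed, the 1-5 scale, and `call_judge` are illustrative assumptions (a real system would send a scoring prompt to each judge model), not the study's actual configuration.

```python
# Minimal sketch of dual-judge, multi-criterion G-Eval-style aggregation.
from statistics import mean

CRITERIA = ["coherence", "consistency", "fluency", "relevance"]  # assumed set

def call_judge(judge: str, explanation: str, criterion: str) -> float:
    """Hypothetical stand-in: a real system would prompt the judge LLM
    with the criterion's rubric and parse the numeric score it returns."""
    return 4.0 if criterion != "fluency" else 3.5

def g_eval(explanation: str, judges=("judge_a", "judge_b")) -> dict:
    """Average each criterion over both judges, then report an overall mean."""
    scores = {c: mean(call_judge(j, explanation, c) for j in judges)
              for c in CRITERIA}
    scores["overall"] = mean(scores.values())
    return scores

print(g_eval("The forecast rises because weekend demand historically peaks."))
```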
- Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images
Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou · Apr 8, 2026 · Citations: 0
LLM As Judge · Automatic Metrics
We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations.
- Overton Pluralistic Reinforcement Learning for Large Language Models
Yu Fu, Seongho Son, Ilija Bogunovic · Feb 24, 2026 · Citations: 0
LLM As Judge · Automatic Metrics
Existing alignment paradigms remain limited in capturing the pluralistic nature of human values.
- Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning
Justin Lovelace, Christian Belardi, Sofian Zalouk, Adhitya Polavaram, Srivatsa Kundurthy · Feb 24, 2026 · Citations: 0
LLM As Judge · Automatic Metrics
Evaluations show STAR-LDM significantly outperforms similar-sized models on language understanding benchmarks and achieves >70% win rates in LLM-as-judge comparisons for narrative coherence and commonsense reasoning.
- Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026 · Citations: 0
LLM As Judge · Automatic Metrics
We present Luna-2, a novel architecture that turns decoder-only small language models (SLMs) into a deterministic evaluation model to reliably compute complex, task-specific LLM-as-a-judge (LLMAJ) metrics (e.g.…
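The snippet's title points at single-token evaluation with an SLM; as a rough illustration, here is a minimal Python sketch of scoring by reading next-token probabilities with a HuggingFace causal LM. The `gpt2` checkpoint (a stand-in for a purpose-trained evaluation SLM), the prompt template, and the yes/no verdict tokens are all illustrative assumptions, not Luna-2's actual design.

```python
# Minimal sketch of single-token evaluation with a small decoder-only LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a purpose-trained evaluation SLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def single_token_score(context: str, response: str) -> float:
    """Return P('yes' | prompt), normalized against 'no', as a deterministic
    quality score read from a single next-token distribution."""
    prompt = (f"Context: {context}\nResponse: {response}\n"
              "Is the response faithful to the context? Answer yes or no:")
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    probs = torch.softmax(logits, dim=-1)
    yes_id = tok.encode(" yes")[0]
    no_id = tok.encode(" no")[0]
    # Normalize over the two verdict tokens so the score is comparable.
    return (probs[yes_id] / (probs[yes_id] + probs[no_id])).item()

print(single_token_score("The sky is blue.", "The sky is blue."))
```

Because the score is a single forward pass with no sampling, the same input always yields the same number, which is what makes this style of evaluation deterministic and cheap at scale.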
- CAST: Character-and-Scene Episodic Memory for Agents
Kexin Ma, Bojun Li, Yuhua Tang, Liting Sun, Ruochun Jin · Jan 14, 2026 · Citations: 0
LLM As Judge · Automatic Metrics
Episodic memory, a central component of human memory, refers to the ability to recall coherent events grounded in who, when, and where.
- Representing data in words: A context engineering approach
Amandine M. Caut, Amy Rouillard, Beimnet Zenebe, Matthias Green, Ágúst Pálmason Morthens · Jan 27, 2025 · Citations: 0
LLM As Judge · Automatic Metrics
Due to the absence of standardized benchmarks for this specific task, we conduct LLM-as-a-judge and human-as-a-judge evaluations to assess accuracy across the three applications.