A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
We argue that this temporal taskification step is not a neutral preprocessing choice, but a structural component of evaluation: different valid splits of the same stream can induce different CL regimes and therefore different benchmark…
Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task.
We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection.
Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy.
Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks.
We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry that containing total 30,664 poems, 10,276 are human-written poems and 20,388 poems are generated by four popular LLMs.
Browse by Topic
Jump directly into tag and hub pages to crawl deeper content clusters.
We implement SPI as a plugin for both FAISS and Qdrant backends and evaluate it across multiple RAG tasks including MS MARCO, Natural Questions, and multimodal retrieval benchmarks.
These constraints have led to isolated, task-specific developments in the field, with models and benchmarks focused on single prediction tasks.
Exploiting the shared structure of work-related data, we propose a unifying framework, combining a wide range of tasks in a multi-task ranking benchmark, and a flexible architecture tackling text-driven work tasks with a single model.
Inspired by human cognition of thinking while reading, we first design a streaming thinking paradigm for LLMs, where reasoning unfolds in the order of input and further adjusts its depth once reading is complete.
On the Jigsaw Toxic Comment benchmark, CoGate-LSTM achieves 0.881 macro-F1 (95% CI: [0.873, 0.889]) and 96.0% accuracy, outperforming fine-tuned BERT by 6.9 macro-F1 points (p < 0.001) and XGBoost by 4.7, while using only 7.3M parameters…
Human communication heavily relies on laconism and inferential pragmatics, allowing listeners to successfully reconstruct rich meaning from sparse, telegraphic speech.
We introduce ACE (Agentic Context Engineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation.
Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g., agent memory), consistently outperforming strong baselines: +10.6% on agents and +8.6% on finance, while significantly…
The rapid development of large language model (LLM)-based agents has unlocked new possibilities for autonomous multi-turn reasoning and tool-augmented decision-making.
Experiments on the BrowseComp-zh and DeepDiver benchmarks demonstrate that through the synergistic collaboration of these methods, AgentInfer reduces ineffective token consumption by over 50%, achieving an overall 1.8-2.5 times speedup with…
We present PRISM, a training-free framework that integrates refined system prompting, in-context learning (ICL), and lightweight multi-agent coordination for document and chunk ranking tasks.
Our primary contribution is a systematic empirical study of when each component provides value: prompt engineering delivers consistent performance with minimal overhead, ICL enhances reasoning for complex queries when applied selectively,…
A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); Finally the head agent composes…
Experiments on HotpotQA (text), HybridQA/TAT-QA (table+text), and MetaQA (KG) show consistent EM/F1 gains over strong single-pass, multi-hop, and agentic RAG baselines with high efficiency.