HFEPX Archive Slice

HFEPX Daily Papers for 2026-06-18

Daily archive slice for 2026-06-18 from the HFEPX corpus, generated from the corpus as of 2026-06-20. Covers 57 papers published on 2026-06-18.

Papers: 57 · Last published: Jun 18, 2026

Researcher Quick Triage

Use this archive page for time-slice monitoring: what changed in evaluation methods, metrics, and protocol quality during this period. Quality band: High.

High-Signal Coverage

100.0%

57 / 57 papers are not flagged as low-signal.

Benchmark Anchors

21.1%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

52.6%

Papers with reported metric mentions in extraction output.

2 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.


Why This Time Slice Matters

Use this archive slice to monitor protocol drift and shifts in evaluation methods for papers published on 2026-06-18.

Protocol Takeaways For This Period

Evaluation modes for this slice cluster around automatic_metrics and simulation_env.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems
Jun 18, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Cost, Token cost
Source-Grounded Data Generation for Text-to-JSON Learning
Jun 18, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Accuracy, Exact match
GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs
Jun 18, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Accuracy, Perplexity
CREDENCE: Claim Reduction for Decomposition & Enhanced Credibility -- Semantic Metrics and Convergence Analysis
Jun 18, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Accuracy, F1
Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning
Jun 18, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Accuracy, Cost
AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA
Jun 18, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Accuracy

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems Jun 18, 2026	Automatic Metrics	Herabench	Cost, Token cost	Not reported
Source-Grounded Data Generation for Text-to-JSON Learning Jun 18, 2026	Automatic Metrics	Stage Eval	Accuracy, Exact match	Not reported
GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs Jun 18, 2026	Automatic Metrics	GSM8K	Accuracy, Perplexity	Not reported
CREDENCE: Claim Reduction for Decomposition & Enhanced Credibility -- Semantic Metrics and Convergence Analysis Jun 18, 2026	Automatic Metrics	Wikisplitbench, Claimdecompbench	Accuracy, F1	Not reported
Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning Jun 18, 2026	Automatic Metrics	CommonsenseQA	Accuracy, Cost	Not reported
AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA Jun 18, 2026	Automatic Metrics	ChartQA	Accuracy	Not reported
NEST: Narrative Event Structures in Time for Long Video Understanding Jun 18, 2026	Automatic Metrics	Needle In A Haystack	F1	Not reported
Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users Jun 18, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse Jun 18, 2026	Automatic Metrics	Not reported	Accuracy	Calibration
Benchmarking Agentic Review Systems Jun 18, 2026	Automatic Metrics	Not reported	Accuracy, Recall	Not reported
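
The Quick Triage section advises prioritizing papers that have both benchmark and metric anchors. A minimal sketch of that filter over the matrix rows above, in Python, follows. The rows are transcribed by hand with shortened titles, and the tuple layout and the `has_both_anchors` helper are assumptions for illustration, not part of HFEPX.

```python
# Rows transcribed from the Protocol Matrix above: (paper, benchmarks, metrics).
# "Not reported" is the page's own marker for a missing field.
ROWS = [
    ("Beyond Global Replanning", "Herabench", "Cost, Token cost"),
    ("Source-Grounded Data Generation", "Stage Eval", "Accuracy, Exact match"),
    ("GEMS", "GSM8K", "Accuracy, Perplexity"),
    ("CREDENCE", "Wikisplitbench, Claimdecompbench", "Accuracy, F1"),
    ("Think Again or Think Longer?", "CommonsenseQA", "Accuracy, Cost"),
    ("AgentFinVQA", "ChartQA", "Accuracy"),
    ("NEST", "Needle In A Haystack", "F1"),
    ("Your Mouse and Eyes", "Not reported", "Accuracy"),
    ("The Register Gap", "Not reported", "Accuracy"),
    ("Benchmarking Agentic Review Systems", "Not reported", "Accuracy, Recall"),
]


def has_both_anchors(benchmarks: str, metrics: str) -> bool:
    """Hypothetical helper: True when neither field is 'Not reported'."""
    return "Not reported" not in (benchmarks, metrics)


# Keep only the papers usable for longitudinal comparison under this rule.
anchored = [paper for paper, b, m in ROWS if has_both_anchors(b, m)]
print(f"{len(anchored)} of {len(ROWS)} matrix rows have both anchors")
for paper in anchored:
    print(" -", paper)
```

Under this rule, the last three rows of the matrix drop out because no benchmark is reported for them.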

Researcher Workflow (Detailed)

Checklist

Gap: Human feedback

Human feedback is present in 10 of 57 papers.
Gap: Quality controls

Quality controls are present in 2 of 57 papers.
Gap: Benchmarks

Benchmarks are present in 12 of 57 papers.
Strong: Metrics

Metrics are present in 30 of 57 papers.
Gap: Known rater population

Known rater population is present in 2 of 57 papers.
Moderate: Known annotation unit

Known annotation unit is present in 12 of 57 papers.
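
The checklist counts can be turned into coverage shares with simple arithmetic. The sketch below, in Python, recomputes them from the counts listed above. The `share` helper and the one-decimal rounding are assumptions for illustration; the cutoffs behind the Gap, Moderate, and Strong labels are not stated on this page and are not modeled here.

```python
# Counts copied from the checklist above; TOTAL is the slice size (57 papers).
TOTAL = 57
COUNTS = {
    "Human feedback": 10,
    "Quality controls": 2,
    "Benchmarks": 12,
    "Metrics": 30,
    "Known rater population": 2,
    "Known annotation unit": 12,
}


def share(count: int, total: int = TOTAL) -> float:
    """Hypothetical helper: percentage of papers, rounded to one decimal."""
    return round(100 * count / total, 1)


for field, count in COUNTS.items():
    print(f"{field}: {count}/{TOTAL} = {share(count)}%")
```

The Benchmarks and Metrics shares computed this way match the Benchmark Anchors (21.1%) and Metric Anchors (52.6%) figures reported in the Quick Triage section.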

Strengths

Metrics are present in 30 of 57 papers.

Known Gaps

Human feedback is present in 10 of 57 papers.
Quality controls are present in 2 of 57 papers.
Benchmarks are present in 12 of 57 papers.

Suggested Next Analyses

Compare 2026-06-18 against neighboring archive slices to flag protocol drift.

Recommended Queries

Browse all HFEPX daily archives

Known Limitations

This synthetic archive page was generated on demand from extraction data because no cached payload was available for 2026-06-18.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (23)
Simulation Env (4)
LLM As Judge (1)

Top Metrics

Accuracy (11)
Cost (9)
F1 (7)
Recall (3)

Top Benchmarks

ALFWorld (1)
ChartQA (1)
Claimdecompbench (1)
Combeval (1)

Quality Controls

Calibration (2)

Papers In This Archive Slice

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents
Md Nayem Uddin, Amir Saeidi, Eduardo Blanco, Chitta Baral · Jun 18, 2026 · Citations: 0

Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies.
StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs
Shaghayegh Kolli, Timo Cavelius, Nafiseh Nikeghbal, Samantha Dalal, Jana Diesner · Jun 18, 2026 · Citations: 0

Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood.
Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems
Shu Yao, Yuhua Luo, Qian Long, Jingru Fan, Zhuoyuan Yu · Jun 18, 2026 · Citations: 0

We propose H-RePlan, a hierarchical replanning framework for multi-device agents with unified API–CLI–GUI execution.
Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users
Haw-Shiuan Chang, Jeffrey Gomez, Mehul Patwari, Aryan Sajith, Hamed Zamani · Jun 18, 2026 · Citations: 0

Pairwise Preference

To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict the human preference based on the response text.
Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology
Yusuf Salcan, Simon Ging, Robin Schirrmeister, Philipp Arnold, Elmar Kotter · Jun 18, 2026 · Citations: 0

On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves competitive results with specialized medical VLMs.
CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges
Helena Bonaldi, Genoveffa Martone, Marco Guerini · Jun 18, 2026 · Citations: 0

While LLMs represent a scalable solution for assisting humans in the generation of counterspeech for both threats, zero-shot models frequently generate repetitive and vague responses, underscoring the need for high-quality examples to steer…
Token-Operations-Oriented Inference Optimization Techniques for Large Models
Shiguo Lian, Kai Wang, Zhaoxiang Liu, Wen Liu, Minjie Hua · Jun 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback
Wei Xia, Jin Wu, Haoran Shi, Xiangyu Wang, Chanjin Zheng · Jun 18, 2026 · Citations: 0

Pairwise Preference · Critique Edit · Multi Agent

PsyScore comprises three key modules: a Trait-Adaptive Neural IRT Scorer that incorporates the Graded Partial Credit Model (GPCM) into a neural architecture, enabling the precise estimation of student ability while maintaining psychometric…
The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse
Celestine Achi · Jun 18, 2026 · Citations: 0

We introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema for Nigerian public discourse that separates surface sentiment from true communicative intent.
Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families
Abdul Rafay Syed · Jun 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
CzechDocs: A Multiway Parallel Dataset of Formatted Documents for Minority Languages in Czechia
Josef Jon, Ondřej Bojar · Jun 18, 2026 · Citations: 0

The dataset is designed to support the evaluation of machine translation systems that aim to preserve document formatting during translation.
Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact
Jelena Meyer, David Garcia, Dirk U. Wulff · Jun 18, 2026 · Citations: 0

Pairwise Preference

Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research.
Pitch Spelling Jazz Lead Sheets, Solo Transcriptions, Classical Piano and Monophonic Scores
Augustin Bouquillard, Florent Jacquemard · Jun 18, 2026 · Citations: 0

We present evaluations conducted on datasets comprising a variety of digital musical scores: jazz lead sheets taken from the Real Book, transcriptions of recordings of jazz soli and bass lines, traditional tunes, as well as classical scores…
ReNikud: Audio-Supervised Hebrew Grapheme-to-Phoneme Conversion
Maxim Melichov, Yakov Kolani, Morris Alper · Jun 18, 2026 · Citations: 0

Results on existing Hebrew G2P benchmarks and the new targeted MILIM benchmark for spoken Hebrew show that ReNikud surpasses previous state-of-the-art methods.
MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization
Aueaphum Aueawatthanaphisut · Jun 18, 2026 · Citations: 0

Expert Verification

The framework coordinates specialized agents for clinical text, longitudinal EHR, medical imaging, physiological sensor signals, guideline retrieval, uncertainty auditing, and referral planning.
NAMESAKES: Probing Identity Memorization in Text-to-Image Models
Morris Alper, Vasudha Varadarajan, Moran Yanuka, Angelina Wang, Hadar Averbuch-Elor · Jun 18, 2026 · Citations: 0

To benchmark this task, we present the NAMESAKES dataset of over one thousand names and faces of public figures spanning a wide range of fame levels, along with perturbed, less famous names.
From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models
Jiaxu Zuo, Mu You, Kaixin Lan, Tao Fang, Yujia Huo · Jun 18, 2026 · Citations: 0

Rubric Rating

Recent advances in Large Language Models (LLMs) have substantially transformed Automated Essay Scoring (AES), yet the internal mechanisms underlying LLM-based scoring remain poorly understood.
Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring
Po-Chin Chang, Nicholas Hogan, Aske Plaat, Michiel T. van der Meer · Jun 18, 2026 · Citations: 0

The simulation benchmark shows the router outperforming two static baselines (0.694 vs.
PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors
Masaya Kawamura, Yuma Shirahata, Kentaro Mitsui, Reo Shimizu · Jun 18, 2026 · Citations: 0

Further, PASQA shows stronger agreement with human accent-correctness judgments.
When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation
Elroy Galbraith · Jun 18, 2026 · Citations: 0

Tool Use

On the CRAG benchmark (1371 validation questions) we (i) measure the distribution of stabilization, (ii) derive a model-agnostic bound H on the portion of tool latency that can be hidden behind the user's remaining input, as a function of…
HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization
Zhentao Tan, Wei Chen, Jingyi Shen, Yao Liu, Xu Shen · Jun 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Self-Preference Is Weak or Absent in Verifiable Instruction-Following Revision: A Four-Model Test Under Genuine Authorship
William Guey, Pierrick Bougault · Jun 18, 2026 · Citations: 0

Pairwise Preference · Critique Edit

Across four mid-tier model families and 85 author-versus-fresh comparisons, we find no detectable self-preference: authors reject verified-good fixes to their own drafts at essentially the same rate as fresh models judging the same drafts…
IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources
Arash Ghafouri, Mahdi Firouzmandi, Hossein Saberi, Mohammad Reza Hasani Ahangar · Jun 18, 2026 · Citations: 0

Persian pretrained language models (PLMs) are still limited by the scarcity of large-scale, high-quality pretraining corpora and by insufficient evaluation beyond standard classification and NER tasks.
What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis
Xinghao Chen, Chak Tou Leong, Wenjin Guo, Jian Wang, Wenjie Li · Jun 18, 2026 · Citations: 0

Long Horizon

To measure these effects, we introduce the Unified Latent Probe (ULP), which quantifies the mutual information between latent trajectories and explicit reasoning steps.
Source-Grounded Data Generation for Text-to-JSON Learning
Sunghee Ahn, Guijin Son, Youngjae Yu · Jun 18, 2026 · Citations: 0

Evaluations on STAGE-Eval, our source-grounded benchmark with an 851-example test set, show that STAGE produces stronger training data than existing approaches.
Generative Engine Optimization at Scale: Measuring Brand Visibility Across AI Search Engines
Pratyush Kumar · Jun 18, 2026 · Citations: 0

Web Browsing

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents
Kaiyue Yang, Yuyan Bu, Jingwei Yi, Yuchi Wang, Biyu Zhou · Jun 18, 2026 · Citations: 0

Pairwise Preference · Tool Use

As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant.
Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning
Yanxi Chen, Weijie Shi, Yuexiang Xie, Boyi Hu, Yaliang Li · Jun 18, 2026 · Citations: 0

This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it solves a long…
Segment-Level Mandarin Chinese Speech-Based Cognitive Impairment Detection via an Autoencoder with Contrastive Learning
Yongqi Shao, Hong Huo, Flavio Bertini, Danilo Montesi, Tao Fang · Jun 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations
Masato Takagi, Masaya Kawamura, Reo Shimizu, Yuma Shirahata · Jun 18, 2026 · Citations: 0

We obtained MOS predictions for these speech samples from both human listeners and the model, and analyzed the differences in their perceptual characteristics.
GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs
Yu Deng · Jun 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Multi-Agent Transactive Memory
To Eun Kim, Xuhong He, Dishank Jain, Ambuj Agrawal, Negar Arabzadeh · Jun 18, 2026 · Citations: 0

Multi Agent

The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations.
Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal
Syeda Faiza Ahmed Sara, Shammur Absar Chowdhury · Jun 18, 2026 · Citations: 0

Cross-dataset evaluation on L2-ARCTIC shows consistent gains.
REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection
Guneesh Vats, Anubha Agrawal, Shikha Singhal, Ajita Dash, Praison Selvaraj · Jun 18, 2026 · Citations: 0

We present REDACT, a systematically controlled multilingual PII benchmark with 13,427 records, 324,078 entity annotations, 51 entity types, 4,127 surface-form patterns, and 25 languages across 9 scripts.
The Almost Intelligent Revolution: Options for Scaling Up Deliberation and Empowering People with AI
Serge Sharoff · Jun 18, 2026 · Citations: 0

Red Team

The increasing prominence of Large Language Models (LLMs) in public discourse presents both opportunities and challenges for democratic deliberation.
Large Language Models Do Not Always Need Readable Language
Jiayi Zhu, Haoxuan Peng, Junxi Wang, Liang Ke, Chen Zhang · Jun 18, 2026 · Citations: 0

Multi Agent

Large language models (LLMs) are commonly prompted and interfaced with human-readable natural language, even when the intended reader is another model.
Prompt, Plan, Extract: Zero-Shot Agentic LLMs Workflows for Lung Pathology Extraction from Clinical Narratives
Aman Pathak, Cheng Peng, Mengxian Lyu, Ziyi Chen, Reema Solan · Jun 18, 2026 · Citations: 0

In this study, we developed a zero-shot, agentic workflow, and evaluated five open-source generative Large Language Models (LLMs) to populate 13 College of American Pathologists synoptic fields from lung resection pathology reports.
AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts
Yanyu Yao, Shangze Li, Zhi Zheng, Hui Zheng, Qi Liu · Jun 18, 2026 · Citations: 0

Experiments on the LoCoMo benchmark confirm that AtomMem achieves state-of-the-art performance across various reasoning tasks, offering a scalable and economically viable solution for deploying intelligent personalized agents.
Leverage Is Not Reach: A Control-Window Law for Single-Neuron Steering in Language Models
Hongliang Liu · Jun 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines
Jianwen Sun, Chuanhao Li, Zizhen Li, Yukang Feng, Fanrui Zhang · Jun 18, 2026 · Citations: 0

We present JamSet and JamBench, the first project-level game code framework dataset and benchmark built on a professional game engine.
CREDENCE: Claim Reduction for Decomposition & Enhanced Credibility -- Semantic Metrics and Convergence Analysis
Phuong Huu Vu Tran, Thuan Duc Mai, Bach Xuan Le · Jun 18, 2026 · Citations: 0

We present Credence, a revised claim decomposition and evaluation framework addressing both shortcomings.
Clusters are All You Need: Pre-Training the Tsetlin Machine with Semantic Clusters from Language Models for Interpretability
Jiechao Gao, Rohan Kumar Yadav, Yuangang Li, Yuandong Pan, Jie Wang · Jun 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning
Sajib Acharjee Dip, Dawei Zhou, Liqing Zhang · Jun 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models
Yuxu Zhou, Ondřej Kuželka, Yuyi Wang, Yuanhong Wang, Yi Chang · Jun 18, 2026 · Citations: 0

We present CombEval, a dynamic benchmark for evaluating combinatorial counting in large language models.
AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA
Aravind Narayanan, Shaina Raza · Jun 18, 2026 · Citations: 0

Multi Agent

Yet existing chart-QA agents are accuracy-focused and opaque, and most assume proprietary API access; to our knowledge, none combines auditability with on-premise deployability without significant accuracy compromise.
Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models
Darrien McKenzie, Nicklas Hansen, Xiaolong Wang · Jun 18, 2026 · Citations: 0

Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance).
Benchmarking Agentic Review Systems
Dang Nguyen, Wanqing Hao, Yanai Elazar, Chenhao Tan · Jun 18, 2026 · Citations: 0

Pairwise Preference

A new class of agentic review systems are emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated.
Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings
Pranav Bhandari, Nicolas Fay, Amitava Datta, Usman Naseem, Mehwish Nasim · Jun 18, 2026 · Citations: 0

Pairwise Preference

Aligning language models with human preferences often requires optimising multiple behavioural objectives.
NRITYAM: Language Models Meet Art and Heritage of Dance
Punit Kumar Singh, Niladri Ghosh, Advait Joshi, Shailee Choudhary, Michael Färber · Jun 18, 2026 · Citations: 0

To address this gap, we present NRITYAM, a comprehensive benchmark for evaluating the cultural comprehension capabilities of language models in the context of global dance traditions.
Closing the Calibration Gap in Semantic Caching
Aditeya Baral, Radoslav Ralev, Iliya Sotirov Zhechev, Srijith Rajamohan, Jen Agarwal · Jun 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
FineREX: Fine-Tuned NER-RE for Human Smuggling Knowledge Graphs
Elijah Feldman, Dipak Meher, Carlotta Domeniconi · Jun 18, 2026 · Citations: 0

Court proceedings contain valuable evidence about human smuggling networks, but this information is often buried within unstructured, jargon-heavy legal documents.
NEST: Narrative Event Structures in Time for Long Video Understanding
Ali Asgarov, Kaushik Narasimhan, Najibul Haque Sarker, Hani Alomari, Chia-Wei Tang · Jun 18, 2026 · Citations: 0

Existing long video benchmarks focus on needle-in-a-haystack retrieval rather than evaluating how low-level actions form events, how events interact across time, and how narratives progress, for example, whether a model can connect an early…
TerraMARS: A Domain-Adapted Small-Language-Model Pipeline for Mars Terraforming Literature
Jyotsna Singh, Ash Black, Jeff Larsen, Scott R. Saleska · Jun 18, 2026 · Citations: 0

Researchers are interested in learning about Mars so that it may eventually become habitable for humans.
What sentiment analysis can't see: Measuring whether customers were helped, and what went wrong, across 70,000 support conversations
Jason Potteiger · Jun 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Efficiently Representing Algorithms With Chain-of-Thought Transformers
Yanhong Li, Anej Svete, Ashish Sabharwal, William Merrill · Jun 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Code-Switching Reveals Language Anchoring in Multilingual LLMs
Jeonghyun Park, Seunghyun Yoon, Yonghyun Jun, Hwanhee Lee · Jun 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference
Kaizhen Tan, Rong Gu, Mingyuan Li · Jun 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.