HFEPX Archive Slice

HFEPX Weekly Archive: 2025-W42

Updated from current HFEPX corpus (Apr 12, 2026). 72 papers are grouped in this weekly page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequently cited benchmark: APPS. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Oct 19, 2025.

Papers: 72 · Last published: Oct 19, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High.

Analysis blocks are computed from the loaded sample (60 of 72 papers).

High-Signal Coverage

100.0%

60 / 60 papers are not low-signal flagged.

Benchmark Anchors

10.0%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

35.0%

Papers with reported metric mentions in extraction output.

0 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.


Why This Time Slice Matters

11.1% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 29.2% of papers in this hub.
APPS is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Raters are mostly domain experts, and annotation is commonly trajectory-level; use this to scope replication staffing.
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling
Oct 17, 2025 · Citations: 0 · Score: 5.5
Eval: Automatic Metrics · Metrics: Accuracy

LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization
Oct 14, 2025 · Citations: 0 · Score: 5.0
Eval: Not reported · Metrics: Cost

Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers
Oct 15, 2025 · Citations: 0 · Score: 4.5
Eval: Automatic Metrics · Metrics: Cost

CoGate-LSTM: Prototype-Guided Feature-Space Gating for Mitigating Gradient Dilution in Imbalanced Toxic Comment Classification
Oct 19, 2025 · Citations: 0 · Score: 4.0
Eval: Automatic Metrics · Metrics: Accuracy, F1

FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution
Oct 18, 2025 · Citations: 0 · Score: 4.0
Eval: Automatic Metrics · Metrics: Latency

ScholarEval: Research Idea Evaluation Grounded in Literature
Oct 17, 2025 · Citations: 0 · Score: 4.0
Eval: Not reported · Metrics: Not reported

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Date	Eval Modes	Benchmarks	Metrics	Quality Controls
When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling	Oct 17, 2025	Automatic Metrics	MATH500, BBH	Accuracy	Not reported
LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization	Oct 14, 2025	Not reported	BIG Bench, BBH	Cost	Not reported
Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers	Oct 15, 2025	Automatic Metrics	Not reported	Cost	Not reported
CoGate-LSTM: Prototype-Guided Feature-Space Gating for Mitigating Gradient Dilution in Imbalanced Toxic Comment Classification	Oct 19, 2025	Automatic Metrics	Not reported	Accuracy, F1	Not reported
FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution	Oct 18, 2025	Automatic Metrics	Not reported	Latency	Not reported
ScholarEval: Research Idea Evaluation Grounded in Literature	Oct 17, 2025	Not reported	ScholarEval	Not reported	Not reported
BIOGEN: Evidence-Grounded Multi-Agent Reasoning Framework for Transcriptomic Interpretation in Antimicrobial Resistance	Oct 17, 2025	Automatic Metrics	Not reported	BERTScore, Hallucination rate	Not reported
HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination	Oct 17, 2025	Automatic Metrics	Not reported	Precision	Not reported
MNO: Multiscale Neural Operator for 3D Computational Fluid Dynamics	Oct 17, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
AI-BAAM: AI-Driven Bank Statement Analytics as Alternative Data for Malaysian MSME Credit Scoring	Oct 17, 2025	Automatic Metrics	Not reported	AUROC	Not reported
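
The advice to prioritize papers with both benchmark and metric anchors can be applied mechanically to the matrix rows. The following is a minimal sketch, not part of the HFEPX tooling: the `MATRIX` structure and the `doubly_anchored` helper are hypothetical names introduced for illustration, titles are shortened, name spellings are normalized, and an empty list stands in for "Not reported".

```python
# Rows transcribed from the Protocol Matrix (titles shortened for brevity).
# Each row is (title, benchmarks, metrics); an empty list means "Not reported".
# This layout is an assumption made for illustration only.
MATRIX = [
    ("When to Ensemble", ["MATH500", "BBH"], ["Accuracy"]),
    ("LLM Prompt Duel Optimizer", ["BIG Bench", "BBH"], ["Cost"]),
    ("Readers Prefer Outputs of AI", [], ["Cost"]),
    ("CoGate-LSTM", [], ["Accuracy", "F1"]),
    ("FrugalPrompt", [], ["Latency"]),
    ("ScholarEval", ["ScholarEval"], []),
    ("BIOGEN", [], ["BERTScore", "Hallucination rate"]),
    ("HypoSpace", [], ["Precision"]),
    ("MNO", [], ["Accuracy"]),
    ("AI-BAAM", [], ["AUROC"]),
]


def doubly_anchored(rows):
    """Return titles of rows that name at least one benchmark and one metric."""
    return [title for title, benchmarks, metrics in rows if benchmarks and metrics]


print(doubly_anchored(MATRIX))
# → ['When to Ensemble', 'LLM Prompt Duel Optimizer']
```

Under these assumptions only two of the ten rows carry both anchors, which matches the sparse benchmark coverage reported for this slice.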

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback
Coverage is a replication risk (11.1% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (0% vs 30% target).

Gap: Papers naming benchmarks/datasets
Coverage is a replication risk (9.7% vs 35% target).

Gap: Papers naming evaluation metrics
Coverage is a replication risk (19.4% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (6.9% vs 35% target).

Gap: Papers with known annotation unit
Coverage is a replication risk (16.7% vs 35% target).
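
To see which checklist gaps are largest, the coverage and target percentages above can be compared directly. This is a rough sketch under stated assumptions: the `CHECKLIST` dictionary and the `coverage_gaps` helper are hypothetical names, and the rule "a field is a gap when coverage is below target" is inferred from the checklist wording rather than documented by HFEPX.

```python
# (coverage %, target %) pairs copied from the checklist.
# The dictionary layout and field labels are illustrative assumptions.
CHECKLIST = {
    "explicit human feedback": (11.1, 45.0),
    "quality controls": (0.0, 30.0),
    "benchmarks/datasets named": (9.7, 35.0),
    "evaluation metrics named": (19.4, 35.0),
    "known rater population": (6.9, 35.0),
    "known annotation unit": (16.7, 35.0),
}


def coverage_gaps(checklist):
    """Return (field, shortfall in percentage points) for fields under target, largest first."""
    gaps = [
        (name, round(target - coverage, 1))
        for name, (coverage, target) in checklist.items()
        if coverage < target
    ]
    return sorted(gaps, key=lambda gap: gap[1], reverse=True)


for name, shortfall in coverage_gaps(CHECKLIST):
    print(f"{name}: {shortfall} points below target")
```

With the figures listed here, explicit human feedback shows the largest shortfall (33.9 points) and named evaluation metrics the smallest (15.6 points).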

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

No papers report quality controls (0% coverage); prioritize calibration/adjudication evidence.
Rater population is under-specified (6.9% coverage).
Annotation unit is under-specified (16.7% coverage).

Suggested Next Analyses

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.
Stratify by benchmark (APPS vs BBH) before comparing methods.
Track metric sensitivity by reporting both accuracy and coherence.

Recommended Queries

Human Eval Protocols
Benchmark Slice: APPS
Metric Slice: accuracy
Recent High-Signal Papers

Known Limitations

No papers report quality controls (0% coverage); prioritize calibration/adjudication evidence.
Rater population is under-specified (6.9% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (21)
Simulation Env (4)
Human Eval (2)

Top Metrics

Accuracy (5)
Coherence (3)
Cost (3)
F1 (2)

Top Benchmarks

APPS (1)
BBH (1)
BIG Bench (1)
GPQA (1)

Quality Controls

None reported (0 papers).

Papers In This Archive Slice

NeuCo-Bench: A Novel Benchmark Framework for Neural Embeddings in Earth Observation
Rikard Vinge, Isabelle Wittmann, Jannik Schneider, Michael Marszalek, Luis Gilch · Oct 19, 2025 · Citations: 0
CoGate-LSTM: Prototype-Guided Feature-Space Gating for Mitigating Gradient Dilution in Imbalanced Toxic Comment Classification
Noor Islam S. Mohammad · Oct 19, 2025 · Citations: 0

On the Jigsaw Toxic Comment benchmark, CoGate-LSTM achieves 0.881 macro-F1 (95% CI: [0.873, 0.889]) and 96.0% accuracy, outperforming fine-tuned BERT by 6.9 macro-F1 points (p < 0.001) and XGBoost by 4.7, while using only 7.3M parameters…
SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models
Chih-Kai Yang, Yen-Ting Piao, Tzu-Wen Hsu, Szu-Wei Fu, Zhehuai Chen · Oct 19, 2025 · Citations: 0

We introduce SAKE, the first benchmark for editing perceptual auditory attribute knowledge in large audio-language models (LALMs), which requires modifying acoustic generalization rather than isolated facts.
MA-SAPO: Multi-Agent Reasoning for Score-Aware Prompt Optimization
Wonduk Seo, Juhyeon Lee, Junseo Koh, Wonseok Choi, Hyunjin An · Oct 18, 2025 · Citations: 0

Critique Edit · Multi Agent

However, most existing frameworks treat evaluation as a black box, relying solely on outcome scores without explaining why prompts succeed or fail.
Prior Knowledge Makes It Possible: From Sublinear Graph Algorithms to LLM Test-Time Methods
Avrim Blum, Daniel Hsu, Cyrus Rashtchian, Donya Saless · Oct 18, 2025 · Citations: 0

Tool Use

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution
Syed Rifat Raiyan, Md Farhan Ishmam, Abdullah Al Imran, Mohammad Ali Moni · Oct 18, 2025 · Citations: 0

Human communication heavily relies on laconism and inferential pragmatics, allowing listeners to successfully reconstruct rich meaning from sparse, telegraphic speech.
ScholarEval: Research Idea Evaluation Grounded in Literature
Hanane Nour Moussa, Patrick Queiroz Da Silva, Daniel Adu-Ampratwum, Alyson East, Zitong Lu · Oct 17, 2025 · Citations: 0

Rubric Rating

As AI tools become increasingly common for research ideation, robust evaluation is critical to ensure the validity and usefulness of generated ideas.
SentinelNet: Safeguarding Multi-Agent Collaboration Through Credit-Based Dynamic Threat Detection
Yang Feng, Xudong Pan · Oct 17, 2025 · Citations: 0
In Generative AI We (Dis)Trust? Computational Analysis of Trust and Distrust in Reddit Discussions
Aria Pessianzadeh, Naima Sultana, Hildegarde Van den Bulck, David Gefen, Shahin Jabbari · Oct 17, 2025 · Citations: 0

The rise of generative AI (GenAI) has impacted many aspects of human life.
PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction
Simon Yu, Gang Li, Weiyan Shi, Peng Qi · Oct 17, 2025 · Citations: 0
BIOGEN: Evidence-Grounded Multi-Agent Reasoning Framework for Transcriptomic Interpretation in Antimicrobial Resistance
Elias Hossain, Mehrdad Shoeibi, Ivan Garibay, Niloofar Yousefi · Oct 17, 2025 · Citations: 0

Multi Agent

We present BIOGEN, an evidence-grounded multi-agent framework for post hoc interpretation of RNA-seq transcriptional modules.
HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination
Tingting Chen, Beibei Lin, Zifeng Yuan, Qiran Zou, Hongyu He · Oct 17, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Language Models are Injective and Hence Invertible
Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis · Oct 17, 2025 · Citations: 0
OffSim: Offline Simulator for Model-based Offline Inverse Reinforcement Learning
Woo-Jin Ahn, Sang-Ryul Baek, Yong-Jun Lee, Hyun-Duck Choi, Myo-Taeg Lim · Oct 17, 2025 · Citations: 0
Learning to Answer from Correct Demonstrations
Nirmit Joshi, Gene Li, Siddharth Bhandari, Shiva Prasad Kasiviswanathan, Cong Ma · Oct 17, 2025 · Citations: 0

Demonstrations

We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time.
MNO: Multiscale Neural Operator for 3D Computational Fluid Dynamics
Qinxuan Wang, Chuang Wang, Mingyu Zhang, Jingwei Sun, Peipei Yang · Oct 17, 2025 · Citations: 0

We evaluate MNO on diverse benchmarks, covering steady-state and unsteady flow scenarios with up to 300k points.
When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling
Heecheol Yun, Kwangmin Ki, Junghyun Lee, Eunho Yang · Oct 17, 2025 · Citations: 0

Our experiments on diverse benchmarks, including MATH500 and BBH, demonstrate that SAFE outperforms existing methods in both accuracy and efficiency, with gains achieved even when ensembling fewer than 1% of tokens.
AI-BAAM: AI-Driven Bank Statement Analytics as Alternative Data for Malaysian MSME Credit Scoring
Chun Chet Ng, Zhen Hao Chu, Jia Yu Lim, Yin Yin Boon, Wei Zeng Low · Oct 17, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Three-dimensional inversion of gravity data using implicit neural representations and scientific machine learning
Pankaj K Mishra, Sanni Laaksonen, Jochen Kamm, Anand Singh · Oct 17, 2025 · Citations: 0
SAG-Agent: Enabling Long-Horizon Reasoning in Strategy Games via Dynamic Knowledge Graphs
Chenwei Tang, Lin Long, Xinyu Liu, Jingyu Xing, Zizhou Wang · Oct 17, 2025 · Citations: 0
GUIrilla: A Scalable Framework for Automated Desktop UI Exploration
Sofiya Garkot, Maksym Shamrai, Ivan Synytsia, Mariya Hirna · Oct 16, 2025 · Citations: 0
Composition-Grounded Data Synthesis for Visual Reasoning
Xinyi Gu, Jiayuan Mao, Zhang-Wei Hong, Zhuoran Yu, Pengyuan Li · Oct 16, 2025 · Citations: 0
Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn Search Agents
Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao · Oct 16, 2025 · Citations: 0

Tool Use

In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training.
CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions
Lizhi Yang, Blake Werner, Massimiliano de Sa, Aaron D. Ames · Oct 16, 2025 · Citations: 0

Web Browsing

Reinforcement learning (RL), while powerful and expressive, can often prioritize performance at the expense of safety.
DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation
Yu Zhou, Sohyun An, Haikang Deng, Da Yin, Clark Peng · Oct 16, 2025 · Citations: 0

In this work, we study this question by constructing a new large-scale benchmark spanning six common English dialects.
Circuit Insights: Towards Interpretability Beyond Activations
Elena Golimblevskaia, Aakriti Jain, Bruno Puri, Ammar Ibrahim, Wojciech Samek · Oct 16, 2025 · Citations: 0
TRI-DEP: A Trimodal Comparative Study for Depression Detection Using Speech, Text, and EEG
Annisaa Fitri Nurfidausi, Eleonora Mancini, Paolo Torroni · Oct 16, 2025 · Citations: 0

However, existing studies are limited in scope, lack systematic comparisons of features, and suffer from inconsistent evaluation protocols.
Detecting Early and Implicit Suicidal Ideation via Longitudinal and Information Environment Signals on Social Media
Soorya Ram Shimgekar, Ruining Zhao, Agam Goyal, Violeta J. Rodriguez, Paul A. Bloom · Oct 16, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries
Divyat Mahajan, Sachin Goyal, Badr Youbi Idrissi, Mohammad Pezeshki, Ioannis Mitliagkas · Oct 16, 2025 · Citations: 0
Telling Speculative Stories to Help Humans Imagine the Harms of Healthcare AI
Xingmeng Zhao, Tongnian Wang, Dan Schumacher, Veronica Rammouz, Anthony Rios · Oct 16, 2025 · Citations: 0

Multi Agent

Many recent methods use AI to detect risks automatically, but this can reduce human engagement in understanding how harms arise and who they affect.
LUMI: Unsupervised Intent Clustering with Multiple Pseudo-Labels
I-Fan Lin, Faegheh Hasibi, Suzan Verberne · Oct 16, 2025 · Citations: 0

Our evaluation on four benchmark sets shows that our approach achieves competitive results, better than recent state-of-the-art baselines, while avoiding the need to estimate the number of clusters during embedding refinement, as is…
E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task
Jingyao Liu, Chen Huang, Zhizhao Guan, Wenqiang Lei, Yang Deng · Oct 16, 2025 · Citations: 0

Multi Agent

However, existing E2ESD benchmarks are limited by coarse-grained requirement specifications and unreliable evaluation protocols, hindering a true understanding of current framework capabilities.
PluriHopRAG: Exhaustive, Recall-Sensitive QA Through Corpus-Specific Document Structure Learning
Mykolas Sveistrys, Richard Kunert · Oct 16, 2025 · Citations: 0

To study this setting, we introduce PluriHopWIND, a multilingual diagnostic benchmark of 48 pluri-hop questions over 191 real wind-industry reports, with high repetitiveness to reflect the challenge of distractors in real-world datasets.
From Binary to Bilingual: How the National Weather Service is Using Artificial Intelligence to Develop a Comprehensive Translation Program
Joseph E. Trujillo-Falcon, Monica L. Bozeman, Liam E. Llewellyn, Samuel T. Halvorson, Meryl Mizell · Oct 16, 2025 · Citations: 0

We also integrated ethical AI practices throughout the program's design, ensuring that transparency, fairness, and human oversight guide how automated translations are created, evaluated, and shared with the public.
Understanding the Ability of LLMs to Handle Character-Level Perturbation
Anyuan Zhuo, Xuefei Ning, Ningyuan Li, Jingyi Zhu, Yu Wang · Oct 16, 2025 · Citations: 0

Surprisingly, even under severe perturbation, such as shuffling nearly all words character-wise to produce text that is almost unreadable to humans, or inserting invisible characters which are several times more than the visible ones as…
CodeEvolve: an open source evolutionary coding agent for algorithmic discovery and optimization
Henrique Assumpção, Diego Ferreira, Leandro Campos, Fabricio Murai · Oct 15, 2025 · Citations: 0

We evaluate CodeEvolve on benchmarks used to assess Google DeepMind's AlphaEvolve, and include direct comparisons with popular open-source frameworks for algorithmic discovery and heuristic design.
REAP the Experts: Why Pruning Prevails for One-Shot MoE compression
Mike Lasby, Ivan Lazarevich, Nish Sinnadurai, Sean Lie, Yani Ioannou · Oct 15, 2025 · Citations: 0
Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers
Tuhin Chakrabarty, Jane C. Ginsburg, Paramveer Dhillon · Oct 15, 2025 · Citations: 0

Pairwise Preference

In blind pairwise evaluations by 28 MFA-trained readers and 516 college-educated general readers, AI text from in-context prompting was strongly disfavored by MFA readers for stylistic fidelity (OR=0.16) and quality (OR=0.13), while general…
Assessing Web Search Credibility and Response Groundedness in Chat Assistants
Ivan Vykopal, Matúš Pikuliak, Simon Ostermann, Marián Šimko · Oct 15, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
DeDelayed: Deleting Remote Inference Delay via On-Device Correction
Dan Jacobellis, Mateen Ulhaq, Fabien Racapé, Hyomin Choi, Neeraja J. Yadwadkar · Oct 15, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion
Minjung Shin, Hyunin Cho, Sooyeon Go, Jin-Hwa Kim, Youngjung Uh · Oct 15, 2025 · Citations: 0
Closing the Gap Between Text and Speech Understanding in LLMs
Santiago Cuervo, Skyler Seto, Maureen de Seyssel, Richard He Bai, Zijin Gu · Oct 15, 2025 · Citations: 0

Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech…
MemoTime: Memory-Augmented Temporal Knowledge Graph Enhanced Large Language Model Reasoning
Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan · Oct 15, 2025 · Citations: 0

Comprehensive experiments on multiple temporal QA benchmarks show that MemoTime achieves overall state-of-the-art results, outperforming the strong baseline by up to 24.0%.
Assessing LLM Reasoning Through Implicit Causal Chain Discovery in Climate Discourse
Liesbeth Allein, Nataly Pineda-Castañeda, Andrea Rocci, Marie-Francine Moens · Oct 15, 2025 · Citations: 0

In a diagnostic evaluation framework, we instruct nine LLMs to generate all possible intermediate causal steps linking given cause-effect pairs in causal chain structures.
Embedding-Based Context-Aware Reranker
Ye Yuan, Mohammad Amin Shabani, Siqi Liu · Oct 15, 2025 · Citations: 0

We evaluate EBCAR against SOTA rerankers on the ConTEB benchmark, demonstrating its effectiveness for information retrieval requiring cross-passage inference and its advantages in both accuracy and efficiency.
Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models
Yizhou Peng, Yukun Ma, Chong Zhang, Yi-Wen Chao, Chongjia Ni · Oct 15, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging
Inha Kang, Youngsun Lim, Seonho Lee, Jiho Choi, Junsuk Choe · Oct 15, 2025 · Citations: 0

Our method significantly improves performance on challenging negation benchmarks with a lowered false positive rate, boosting NMS-AP by up to +10.8 points on OVDEval and demonstrating generalization to SoTA VLMs.
Putting on the Thinking Hats: A Survey on Chain of Thought Fine-tuning from the Perspective of Human Reasoning Mechanism
Xiaoshu Chen, Sihang Zhou, Ke Liang, Duanyang Yuan, Haoyuan Chen · Oct 15, 2025 · Citations: 0

It leverages both supervised and reinforced fine-tuning to cultivate human-like reasoning skills in LLMs, including detailed planning, divergent thinking, intuitive judgment, timely reflection, internal thinking, and fact perception, etc.
On the Reasoning Abilities of Masked Diffusion Language Models
Anej Svete, Ashish Sabharwal · Oct 15, 2025 · Citations: 0
LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization
Yuanchen Wu, Saurabh Verma, Justin Lee, Fangzhou Xiong, Poppy Zhang · Oct 14, 2025 · Citations: 0

Pairwise Preference

We propose the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization based on pairwise preference feedback from an LLM judge.
Schema for In-Context Learning
Pan Chen, Shaohong Chen, Mark Wang, Shi Xuan Leong, Priscilla Fung · Oct 14, 2025 · Citations: 0

Demonstrations

Inspired by cognitive science, specifically schema theory, which holds that humans interpret new information by activating pre-existing mental frameworks (schemas) to structure understanding, we introduce Schema-Activated In-Context…
Reveal-to-Revise: Explainable Bias-Aware Generative Modeling with Multimodal Attention
Noor Islam S. Mohammad, Md Muntaqim Meherab · Oct 14, 2025 · Citations: 0
Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
Julian Minder, Clément Dumas, Stewart Slocum, Helena Casademunt, Cameron Holmes · Oct 14, 2025 · Citations: 0

We demonstrate that these analyses contain crucial information by creating an LLM-based interpretability agent to understand the finetuning domain.
Toward LLM-Supported Automated Assessment of Critical Thinking Subskills
Marisa C. Peczuh, Nischal Ashok Kumar, Ryan Baker, Blair Lehman, Danielle Eisenberg · Oct 14, 2025 · Citations: 0

Rubric Rating

As the world becomes increasingly saturated with AI-generated content, disinformation, and algorithmic persuasion, critical thinking - the capacity to evaluate evidence, detect unreliable claims, and exercise independent judgment - is…
Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception
Ziyang Ma, Ruiyang Xu, Zhenghao Xing, Yunfei Chu, Yuxuan Wang · Oct 14, 2025 · Citations: 0
When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection
Lang Gao, Xuhui Li, Chenxi Wang, Mingzhe Li, Wei Liu · Oct 14, 2025 · Citations: 0

In this paper, we introduce \dataset, the first benchmark for evaluating detector robustness in personalized settings, built from literary and blog texts paired with their LLM-generated imitations.
Community size rather than grammatical complexity better predicts Large Language Model accuracy in a novel Wug Test
Nikoleta Pantelidou, Evelina Leivada, Raquel Montero, Paolo Morosi · Oct 14, 2025 · Citations: 0

The aim is to determine whether model accuracy approximates human competence and whether it is shaped primarily by linguistic complexity or by the size of the linguistic community, which affects the quantity of available training data.
PRoH: Dynamic Planning and Reasoning over Knowledge Hypergraphs for Retrieval-Augmented Generation
Xiangjun Zai, Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu · Oct 14, 2025 · Citations: 0

Long Horizon

Experiments across multiple domains demonstrate that PRoH achieves state-of-the-art performance, surpassing the prior SOTA model HyperGraphRAG by an average of 19.73% in F1 and 8.41% in Generation Evaluation (G-E) score, while maintaining…
An Order-Sensitive Conflict Measure for Random Permutation Sets
Ruolan Cheng, Yong Deng · Oct 14, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
MCP Security Bench (MSB): Benchmarking Attacks Against Model Context Protocol in LLM Agents
Dongsen Zhang, Zekun Li, Xu Luo, Xuannan Liu, Peipei Li · Oct 14, 2025 · Citations: 0
