HFEPX Archive Slice

HFEPX Weekly Archive: 2025-W21

Updated from current HFEPX corpus (Apr 17, 2026). 59 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 17, 2026). 59 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: ALFWorld. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from May 25, 2025.

Papers: 59 Last published: May 25, 2025 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High .

High-Signal Coverage

100.0%

59 / 59 papers are not low-signal flagged.

Benchmark Anchors

13.6%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

30.5%

Papers with reported metric mentions in extraction output.

2 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.

Get this digest every Friday →

Why This Time Slice Matters

22% of papers report explicit human-feedback signals, led by demonstration data.
automatic metrics appears in 28.8% of papers in this hub.
ALFWorld is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (3.4% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods
May 23, 2025 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy
VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
May 21, 2025 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy
One RL to See Them All: Visual Triple Unified Reinforcement Learning
May 23, 2025 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Iou
UltraEdit: Training-, Subject-, and Memory-Free Lifelong Editing in Language Models
May 20, 2025 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Accuracy
ALIEN: Aligned Entropy Head for Improving Uncertainty Estimation of LLMs
May 21, 2025 · Citations: 0 · Score: 4.5

Eval: Automatic Metrics · Metrics: Calibration error
MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision
May 21, 2025 · Citations: 0 · Score: 4.5

Eval: Automatic Metrics · Metrics: Accuracy, Cost

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods May 23, 2025	Automatic Metrics	TruthfulQA	Accuracy	Not reported
VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models May 21, 2025	Automatic Metrics	Verifybench	Accuracy	Not reported
One RL to See Them All: Visual Triple Unified Reinforcement Learning May 23, 2025	Automatic Metrics	Mega Bench	Iou	Not reported
UltraEdit: Training-, Subject-, and Memory-Free Lifelong Editing in Language Models May 20, 2025	Automatic Metrics	Ultraeditbench	Accuracy	Not reported
ALIEN: Aligned Entropy Head for Improving Uncertainty Estimation of LLMs May 21, 2025	Automatic Metrics	Not reported	Calibration error	Calibration
MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision May 21, 2025	Automatic Metrics	Not reported	Accuracy, Cost	Not reported
Structured Agent Distillation for Large Language Model May 20, 2025	Simulation Env	ALFWorld, WebShop	Not reported	Not reported
BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases May 23, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning May 23, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning May 23, 2025	Automatic Metrics	Not reported	Accuracy, Context length	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (22% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (3.4% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (10.2% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (18.6% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (8.5% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (11.9% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 3.4% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (8.5% coverage).
Annotation unit is under-specified (11.9% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (ALFWorld vs DROP) before comparing methods.
Track metric sensitivity by reporting both accuracy and recall.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries

LLM-as-Judge Protocols Benchmark Slice: ALFWorld Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 3.4% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (8.5% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (17)
Llm As Judge (2)
Simulation Env (2)

Top Metrics

Accuracy (9)
Recall (3)
Cost (2)
F1 (1)

Top Benchmarks

ALFWorld (1)
DROP (1)
HotpotQA (1)
LiveCodeBench (1)

Quality Controls

Calibration (2)

Papers In This Archive Slice

Consistency-based Abductive Reasoning over Perceptual Errors of Multiple Pre-trained Models in Novel Environments
Mario Leiva, Noel Ngu, Joshua Shay Kricheli, Aditya Taparia, Ransalu Senanayake · May 25, 2025 · Citations: 0
LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models
Aida Kostikova, Zhipin Wang, Deidamea Bajri, Ole Pütz, Benjamin Paaßen · May 25, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Do LLMs have a Gender (Entropy) Bias?
Sonal Prabhune, Balaji Padmanabhan, Kaushik Dutta · May 24, 2025 · Citations: 0

We investigate the existence and persistence of a specific type of gender bias in some of the popular LLMs and contribute a new benchmark dataset, RealWorldQuestioning (released on HuggingFace ), developed from real-world questions across…
Disentangling Knowledge Representations for Large Language Model Editing
Mengqi Zhang, Zisheng Zhou, Xiaotian Ye, Qiang Liu, Zhaochun Ren · May 24, 2025 · Citations: 0

To rigorously evaluate fine-grained irrelevant knowledge preservation, we construct FINE-KED, a new benchmark comprising fine-grained irrelevant knowledge at different levels of relational similarity to the edited knowledge.
ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps
Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song · May 24, 2025 · Citations: 0

To bridge this gap, we introduce ReasonMap, a novel benchmark specifically designed to evaluate these capabilities.
SEW: Self-Evolving Agentic Workflows for Automated Code Generation
Siwei Liu, Jinyuan Fang, Han Zhou, Yingxu Wang, Zaiqiao Meng · May 24, 2025 · Citations: 0
Knowledge Fusion of Large Language Models Via Modular SkillPacks
Guodong Du, Zhuo Li, Xuanning Zhou, Junlin Li, Zesheng Shi · May 24, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ShIOEnv: A Command Evaluation Environment for Grammar-Constrained Synthesis and Execution Behavior Modeling
Jarrod Ragsdale, Rajendra Boppana · May 23, 2025 · Citations: 0
BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases
Mathew J. Koretsky, Maya Willey, Owen Bianchi, Chelsea X. Alvarado, Tanay Nayak · May 23, 2025 · Citations: 0

Long Horizon

We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base.
One RL to See Them All: Visual Triple Unified Reinforcement Learning
Yan Ma, Linge Du, Xuyang Shen, Shaoxiang Chen, Pengfei Li · May 23, 2025 · Citations: 0

The final Orsta models improve over their backbones on MEGA-Bench, compare favorably with strong multi-task RL-VLM baselines, and transfer these gains to a broad set of downstream benchmarks.
Training with Pseudo-Code for Instruction Following
Prince Kumar, Rudra Murthy, Riyaz Bhat, Danish Contractor · May 23, 2025 · Citations: 0

Demonstrations

We evaluate our method on 12 publicly available benchmarks spanning instruction-following, mathematical reasoning, and commonsense reasoning, across six base models.
Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods
Shaina Raza, Rizwan Qureshi, Azib Farooq, Marcelo Lotif, Aman Chadha · May 23, 2025 · Citations: 0

Pairwise Preference

Unlike post-hoc filtering or preference-based alignment, immunization introduces direct negative supervision on labeled falsehoods.
Two-Stage Regularization-Based Structured Pruning for LLMs
Mingkuan Feng, Jinyang Wu, Siyuan Liu, Shuai Zhang, Ruihan Jin · May 23, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems
Yihe Fan, Wenqi Zhang, Xudong Pan, Min Yang · May 23, 2025 · Citations: 0
HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning
Chuhao Zhou, Jianfei Yang · May 23, 2025 · Citations: 0

In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities, such as LiDAR, infrared, mmWave radar, and WiFi, to enable seamless human perception and reasoning…
On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu · May 23, 2025 · Citations: 0

On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to +6 absolute percentage points over DAPO.
Refusal Direction is Universal Across Safety-Aligned Languages
Xinpeng Wang, Mingyang Wang, Yihong Liu, Hinrich Schütze, Barbara Plank · May 22, 2025 · Citations: 0

Red Team

Refusal mechanisms in large language models (LLMs) are essential for ensuring safety.
Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning
Adnan Oomerjee, Zafeirios Fountas, Haitham Bou-Ammar, Jun Wang · May 22, 2025 · Citations: 0
Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards
Punya Syon Pandey, Samuel Simko, Kellin Pelrine, Zhijing Jin · May 22, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Hiding in Plain Sight: A Steganographic Approach to Stealthy LLM Jailbreaks
Jianing Geng, Biao Yi, Zekun Fei, Ruiqi He, Lihai Nie · May 22, 2025 · Citations: 0
AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng · May 22, 2025 · Citations: 0

The rapid development and widespread adoption of Audio Large Language Models (ALLMs) demand rigorous evaluation of their trustworthiness.
Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task
Mengyang Qiu, Zoe Brisebois, Siena Sun · May 22, 2025 · Citations: 0

Pairwise Preference

Large language models (LLMs) are increasingly explored as substitutes for human participants in cognitive tasks, but their ability to simulate human behavioral variability remains unclear.
Dynamic Token Reweighting for Robust Vision-Language Models
Tanqiu Jiang, Jiacheng Liang, Rongyi Zhu, Jiawei Zhou, Fenglong Ma · May 22, 2025 · Citations: 0

Red Team

Large vision-language models (VLMs) are highly vulnerable to multimodal jailbreak attacks that exploit visual-textual interactions to bypass safety guardrails.
Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs
Amr Hegazy, Mostafa Elhoushi, Amr Alanwar · May 22, 2025 · Citations: 0

Red Team

Controlling undesirable Large Language Model (LLM) behaviors, such as the generation of unsafe content or failing to adhere to safety guidelines, often relies on costly fine-tuning.
Efficient PRM Training Data Synthesis via Formal Verification
Ryo Kamoi, Yusen Zhang, Nan Zhang, Sarkar Snigdha Sarathi Das, Ranran Haoran Zhang · May 21, 2025 · Citations: 0

However, existing approaches for constructing PRM training data remain costly and noisy, as they typically rely on human annotation or sampling-based labeling methods that require repeated LLM calls.
VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
Yuchen Yan, Jin Jiang, Zhenbang Ren, Yijun Li, Xudong Cai · May 21, 2025 · Citations: 0

Pairwise Preference

In this paper, we introduce VerifyBench and its challenging variant VerifyBench-Hard, two benchmarks specifically designed to assess reference-based reward systems.
Reward Is Enough: LLMs Are In-Context Reinforcement Learners
Kefan Song, Amir Moeini, Peng Wang, Lei Gong, Rohan Chandra · May 21, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
HDLxGraph: Bridging Large Language Models and HDL Repositories via HDL Graph Databases
Pingqing Zheng, Jiayin Qin, Fuqi Zhang, Niraj Chitla, Zishen Wan · May 21, 2025 · Citations: 0
Explainable embeddings with Distance Explainer
Christiaan Meijer, E. G. Patrick Bos · May 21, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ALIEN: Aligned Entropy Head for Improving Uncertainty Estimation of LLMs
Artem Zabolotnyi, Roman Makarov, Mile Mitrovic, Polina Proskura, Oleg Travkin · May 21, 2025 · Citations: 0

Experiments across seven classification datasets and two NER benchmarks, evaluated on five language models (RoBERTa, ELECTRA, LLaMA-2, Qwen2.5, and Qwen3), show that ALIEN consistently outperforms strong baselines across all considered…
Guided Policy Optimization under Partial Observability
Yueheng Li, Guangming Xie, Zongqing Lu · May 21, 2025 · Citations: 0
Understanding the Anchoring Effect of LLM with Synthetic Data: Existence, Mechanism, and Potential Mitigations
Yiming Huang, Biquan Bie, Zuqiu Na, Weilin Ruan, Songxin Lei · May 21, 2025 · Citations: 0

Combining refined evaluation metrics, we benchmark current widely used LLMs.
A quantitative analysis of semantic information in deep representations of text and images
Santiago Acevedo, Andrea Mascaretti, Riccardo Rende, Matéo Mahaut, Marco Baroni · May 21, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
SAKE: Structured Agentic Knowledge Extrapolation for Complex LLM Reasoning via Reinforcement Learning
Jiashu He, Jinxuan Fan, Bowen Jiang, Ignacio Houine, Dan Roth · May 21, 2025 · Citations: 0

Long Horizon

We propose SAKE (Structured Agentic Knowledge Extrapolation), a RL powered agentic framework that trains LLMs to autonomously retrieve and extrapolate structured knowledge through tool-augmented reinforcement learning.
MolLangBench: A Comprehensive Benchmark for Language-Prompted Molecular Structure Recognition, Editing, and Generation
Feiyang Cai, Jiahui Bai, Tao Tang, Guijuan He, Joshua Luo · May 21, 2025 · Citations: 0
Entailed Opinion Matters: Improving the Fact-Checking Performance of Language Models by Relying on their Entailment Ability
Gaurav Kumar, Ayush Garg, Debajyoti Mazumder, Aditya Kishore, Babu kumar · May 21, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision
Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Ryan Chin · May 21, 2025 · Citations: 0

Critique Edit Multi Agent

Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks.
REFLEX: Metacognitive Reasoning for Reflective Zero-Shot Robotic Planning with Large Language Models
Wenjie Lin, Jin Wei-Kocsis, Jiansong Zhang, Byung-Cheol Min, Dongming Gan · May 20, 2025 · Citations: 0

Demonstrations

Inspired by human metacognitive learning and creative problem-solving, we address this limitation by exploring a fundamental question: Can LLMs be empowered with metacognitive capabilities to reason, reflect, and create, thereby enhancing…
Language Models use Lookbacks to Track Beliefs
Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov · May 20, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
UltraEdit: Training-, Subject-, and Memory-Free Lifelong Editing in Language Models
Xiaojie Gu, Ziying Huang, Jia-Chen Gu, Kai Zhang · May 20, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
A Neuro-Symbolic Approach for Reliable Proof Generation with LLMs: A Case Study in Euclidean Geometry
Oren Sultan, Eitan Stern, Dafna Shahaf · May 20, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding
Jiaang Li, Yifei Yuan, Wenyan Li, Mohammad Aliannejadi, Daniel Hershcovich · May 20, 2025 · Citations: 0

To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-focused visual question…
AAPO: Enhancing the Reasoning Capabilities of LLMs with Advantage Margin
Jian Xiong, Jingbo Zhou, Jingyong Ye, Qiang Huang, Dejing Dou · May 20, 2025 · Citations: 0
Phonetic Perturbations Reveal Tokenizer-Rooted Safety Gaps in LLMs
Darpan Aswal, Siddharth D Jaiswal · May 20, 2025 · Citations: 0

Red Team

Safety-aligned LLMs remain vulnerable to digital phenomena like textese that introduce non-canonical perturbations to words but preserve the phonetics.
MSDformer: Multi-scale Discrete Transformer For Time Series Generation
Shibo Feng, Zhicheng Chen, Xi Xiao, Zhong Zhang, Qing Li · May 20, 2025 · Citations: 0
Integration of TinyML and LargeML: A Survey of 6G and Beyond
Thai-Hoc Vu, Ngo Hoang Tu, Thien Huynh-The, Kyungchun Lee, Sunghwan Kim · May 20, 2025 · Citations: 0
Not Minds, but Signs: Reframing LLMs through Semiotics
Davide Picca · May 20, 2025 · Citations: 0

Rather than assuming that LLMs understand language or simulate human thought, we propose that their primary function is to recombine, recontextualize, and circulate linguistic forms based on probabilistic associations.
Word length predicts word order: "Min-max"-ing drives language evolution
Hiram Ring · May 20, 2025 · Citations: 0

This paper proposes a general universal explanation for word order change based on a theory of communicative interaction (the Min-Max theory of language behavior) in which agents seek to minimize effort while maximizing information.
Efficient Agent Training for Computer Use
Yanheng He, Jiahe Jin, Pengfei Liu · May 20, 2025 · Citations: 0

Demonstrations Long Horizon

We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations.
Let's Verify Math Questions Step by Step
Chengyu Shen, Zhen Hao Wong, Runming He, Hao Liang, Meiyi Qiang · May 20, 2025 · Citations: 0

In this work, we present ValiMath, a benchmark consisting of 2147 human-verified mathematical questions covering a wide range of domains such as arithmetic, algebra, and geometry, which are synthesized and curated from the NuminaMath…
Structured Agent Distillation for Large Language Model
Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li · May 20, 2025 · Citations: 0

Demonstrations

Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks.
Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference
Jin Du, Li Chen, Xun Xian, An Luo, Fangqiao Tian · May 19, 2025 · Citations: 0

Rubric Rating

Current benchmarks usually involve simplified tasks.
Advancing Software Quality: A Standards-Focused Review of LLM-Based Assurance Techniques
Avinash Patil · May 19, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
A Reality Check of Language Models as Formalizers on Constraint Satisfaction Problems
Rikhil Amonkar, Ceyhun Efe Kayan, Qimei Lai, Ronan Le Bras, Li Zhang · May 19, 2025 · Citations: 0

We systematically investigate the formalization capability of LLMs on real-life constraint satisfaction problems on 4 benchmarks, 6 LLMs, and 2 types of formal languages.
What if Deception Cannot be Detected? A Cross-Linguistic Study on the Limits of Deception Detection from Text
Aswathy Velutharambath, Kai Sassenberg, Roman Klinger · May 19, 2025 · Citations: 0

We further benchmark against other English deception datasets following similar data collection protocols.
Iterative Formalization and Planning in Partially Observable Environments
Liancheng Gong, Wang Zhu, Jesse Thomason, Li Zhang · May 19, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
Guangda Liu, Chengwei Li, Zhenyu Ning, Jing Lin, Yiwu Yao · May 19, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Complexity counts: global and local perspectives on Indo-Aryan numeral systems
Chundra Cathcart · May 19, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
LEXam: Benchmarking Legal Reasoning on 340 Law Exams
Yu Fan, Jingwei Ni, Jakob Merane, Yang Tian, Yoan Hermstrüwer · May 19, 2025 · Citations: 0

Long Horizon

To address this, we introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels.

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now