HFEPX Hub

Human Eval Papers (Last 45 Days)

Updated from current HFEPX corpus (Apr 17, 2026). 38 papers are grouped in this hub page. Common evaluation modes: Human Eval, Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Multi Dim Rubric. Frequent quality control: Inter Annotator Agreement Reported. Frequently cited benchmark: Cpgbench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 22, 2026.

Papers: 38 · Last published: Mar 22, 2026

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage

100.0%

38 / 38 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

4 papers are replication-ready (benchmark + metric + explicit evaluation mode).
2 papers support judge-vs-human agreement analysis.
5 papers report explicit quality controls (calibration/adjudication/IAA).
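
The replication-ready and judge/human comparability filters above can be expressed as simple predicates over per-paper metadata. The sketch below is illustrative only: the `Paper` record and its field names are assumptions rather than the hub's actual schema, and the three sample rows are abbreviated from the protocol matrix on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    # Hypothetical record; field names are assumptions, not the hub's schema.
    title: str
    eval_modes: set = field(default_factory=set)
    benchmarks: list = field(default_factory=list)
    metrics: list = field(default_factory=list)


def is_replication_ready(p: Paper) -> bool:
    # Benchmark + metric + explicit evaluation mode must all be present.
    return bool(p.benchmarks) and bool(p.metrics) and bool(p.eval_modes)


def is_judge_comparable(p: Paper) -> bool:
    # Requires both human evaluation and LLM-as-judge in the same paper.
    return {"human_eval", "llm_as_judge"} <= p.eval_modes


papers = [
    Paper("AgentHER", {"human_eval", "llm_as_judge"},
          ["WebArena", "ToolBench"], ["Precision", "Pass@1"]),
    Paper("Doha Historical Dictionary", {"human_eval", "llm_as_judge"},
          [], ["Accuracy", "Kappa"]),
    Paper("CPGBench", {"human_eval"}, ["Cpgbench"], []),
]

replication_ready = [p.title for p in papers if is_replication_ready(p)]
judge_comparable = [p.title for p in papers if is_judge_comparable(p)]
print(replication_ready)
print(judge_comparable)
```

Keeping the two predicates separate makes it easy to see that a paper can support judge-vs-human analysis without being replication-ready, as the second sample row shows.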

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.

Why This Matters For Eval Research

43.5% of papers report explicit human-feedback signals, led by rubric ratings.
Human evaluation appears in 60.5% of papers in this hub.
Cpgbench is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways

2 sampled papers report both human evaluation and LLM-as-judge, supporting direct agreement checks.
Most common quality-control signal is inter-annotator agreement reporting (7.9% of papers).
Rater context is mostly domain experts, and annotation is commonly multi-dimensional rubrics; use this to scope replication staffing.

Benchmark Interpretation

Cpgbench appears in 2.6% of hub papers (1/38); use this cohort for benchmark-matched comparisons.
Frtr-Bench appears in 2.6% of hub papers (1/38); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 21.1% of hub papers (8/38); compare with a secondary metric before ranking methods.
Agreement is reported in 13.2% of hub papers (5/38); compare with a secondary metric before ranking methods.
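
When comparing shares across sections of this page, it helps to recompute each percentage from its raw count and an explicit denominator. The sketch below uses counts listed on this page (8 papers reporting accuracy, 5 reporting agreement, 38 hub papers, 23 papers tagged Human Eval). The `share` helper is a hypothetical name, and treating the Human Eval subset as an alternative denominator is an assumption made for illustration.

```python
def share(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)


HUB_PAPERS = 38         # papers grouped in this hub
HUMAN_EVAL_PAPERS = 23  # papers tagged Human Eval

metric_counts = {"accuracy": 8, "agreement": 5}

shares = {
    name: {
        "of_hub": share(n, HUB_PAPERS),
        "of_human_eval": share(n, HUMAN_EVAL_PAPERS),
    }
    for name, n in metric_counts.items()
}
for name, s in shares.items():
    print(f"{name}: {s['of_hub']}% of hub, "
          f"{s['of_human_eval']}% of Human Eval subset")
```

Stating the denominator next to each percentage avoids mixing hub-level and subset-level shares when ranking methods.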

Researcher Checklist

Moderate: Papers with explicit human feedback
Coverage is usable but incomplete (43.5% vs 45% target).

Moderate: Papers reporting quality controls
Coverage is usable but incomplete (21.7% vs 30% target).

Moderate: Papers naming benchmarks/datasets
Coverage is usable but incomplete (26.1% vs 35% target).

Strong: Papers naming evaluation metrics
Coverage is strong (82.6% vs 35% target).

Strong: Papers with known rater population
Coverage is strong (43.5% vs 35% target).

Moderate: Papers with known annotation unit
Coverage is usable but incomplete (34.8% vs 35% target).
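
The banding in this checklist can be reproduced with a simple threshold comparison. This is a sketch under an assumption inferred from the rows above: coverage at or above target is labeled Strong, and anything below is labeled Moderate. The hub may use additional bands or a different cutoff, and `band` is a hypothetical helper.

```python
def band(coverage_pct: float, target_pct: float) -> str:
    # Assumed rule: at or above target is "Strong", otherwise "Moderate".
    return "Strong" if coverage_pct >= target_pct else "Moderate"


# (coverage %, target %) pairs as listed in the checklist.
checklist = {
    "explicit human feedback": (43.5, 45.0),
    "quality controls": (21.7, 30.0),
    "benchmarks/datasets": (26.1, 35.0),
    "evaluation metrics": (82.6, 35.0),
    "known rater population": (43.5, 35.0),
    "known annotation unit": (34.8, 35.0),
}

bands = {item: band(cov, tgt) for item, (cov, tgt) in checklist.items()}
for item, label in bands.items():
    print(f"{label}: {item}")
```

Under this assumed rule the same 43.5% coverage lands in different bands depending on the target, which matches the two 43.5% rows in the checklist.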

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.
Agentic evaluation appears in 26.1% of papers.

Known Gaps

No dominant metadata gap detected in current extraction coverage.

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (Cpgbench vs Frtr-Bench) before comparing methods.
Track metric sensitivity by reporting both accuracy and agreement.
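
One common way to quantify judge-human agreement is Cohen's kappa, which corrects raw agreement for chance. The sketch below is a plain-Python illustration; the label lists are made-up example data, not results from any paper in this hub, and the hub does not specify which agreement statistic to use.

```python
from collections import Counter


def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical binary labels (1 = acceptable, 0 = not acceptable).
human = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
judge = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]

kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.2f}")
```

Computing kappa separately per benchmark or per time window, then comparing the values, gives one rough view of agreement drift.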

Recommended Queries

Judge vs Human Agreement
Benchmark Slice: Cpgbench
Metric Slice: accuracy
IAA-Reported Evaluations
Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabe…

Highest protocol score, with explicit human-feedback and evaluation signals plus a WebArena benchmark anchor.

Strongest benchmark reference

Personalized RewardBench: Evaluating Reward Models with Human Aligned…

Rewardbench with accuracy gives a fast comparison anchor.

Strongest recent paper

Is this Idea Novel? An Automated Benchmark for Judgment of Research I…

Useful for current practice scanning; published Mar 11, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Mar 22, 2026 · Citations: 0 · Score: 10.0
HF: Demonstrations · Eval: Human Eval, Llm As Judge · Benchmark: WebArena · Metric: Precision

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Apr 8, 2026 · Citations: 0 · Score: 8.0
HF: Pairwise Preference, Rubric Rating · Eval: Human Eval, Automatic Metrics · Benchmark: Rewardbench · Metric: Accuracy

Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas
Mar 11, 2026 · Citations: 0 · Score: 7.5
HF: Rubric Rating · Eval: Human Eval · Benchmark: Rinobench · Metric: Not Reported

LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
Mar 31, 2026 · Citations: 0 · Score: 7.5
HF: Rubric Rating · Eval: Human Eval · Benchmark: Not Reported · Metric: Kappa

Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith
Mar 25, 2026 · Citations: 0 · Score: 7.5
HF: Not Reported · Eval: Human Eval, Llm As Judge · Benchmark: Not Reported · Metric: Accuracy

A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations
Mar 26, 2026 · Citations: 0 · Score: 6.5
HF: Expert Verification · Eval: Human Eval · Benchmark: Cpgbench · Metric: Not Reported

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling (Mar 22, 2026)	Yes: Demonstrations	Human Eval, Llm As Judge	WebArena, ToolBench	Precision, Pass@1	Not Reported
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization (Apr 8, 2026)	Yes: Pairwise Preference, Rubric Rating	Human Eval, Automatic Metrics	Rewardbench	Accuracy, Helpfulness	Not Reported
Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas (Mar 11, 2026)	Yes: Rubric Rating	Human Eval	Rinobench	Not Reported	Gold Questions
LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias (Mar 31, 2026)	Yes: Rubric Rating	Human Eval	Not Reported	Kappa, Agreement	Inter Annotator Agreement Reported, Adjudication
Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith (Mar 25, 2026)	No: Not Reported	Human Eval, Llm As Judge	Not Reported	Accuracy, Kappa	Inter Annotator Agreement Reported
A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations (Mar 26, 2026)	Yes: Expert Verification	Human Eval	Cpgbench	Not Reported	Not Reported
CounselReflect: A Toolkit for Auditing Mental-Health Dialogues (Mar 31, 2026)	Yes: Rubric Rating, Expert Verification	Human Eval	Not Reported	Not Reported	Adjudication
Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning (Mar 29, 2026)	Yes: Expert Verification	Human Eval, Automatic Metrics	Not Reported	Accuracy	Not Reported
DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling (Apr 7, 2026)	No: Not Reported	Human Eval	Insightbench	Recall	Not Reported
Evaluating Austrian A-Level German Essays with Large Language Models for Automated Essay Scoring (Mar 6, 2026)	Yes: Rubric Rating	Human Eval	Not Reported	Agreement	Not Reported
PONTE: Personalized Orchestration for Natural Language Trustworthy Explanations (Mar 6, 2026)	Yes: Pairwise Preference	Human Eval	Not Reported	Agreement, Faithfulness	Not Reported
VRM: Teaching Reward Models to Understand Authentic Human Preferences (Mar 5, 2026)	Yes: Pairwise Preference	Human Eval	Not Reported	Coherence	Not Reported
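
For programmatic comparison, rows shaped like this matrix can be loaded into records with the standard `csv` module. The sketch below assumes a tab-separated export with the same six column headers; the two sample rows are abbreviated from the matrix, and `parse_matrix` and `split_cell` are hypothetical helper names.

```python
import csv
import io

# Two rows in the matrix's tab-separated layout; titles are abbreviated.
MATRIX_TSV = (
    "Paper\tHF Signal\tEval Modes\tBenchmarks\tMetrics\tQC\n"
    "AgentHER\tYes Demonstrations\tHuman Eval , Llm As Judge\t"
    "WebArena , ToolBench\tPrecision , Pass@1\tNot Reported\n"
    "LLM Essay Scoring\tYes Rubric Rating\tHuman Eval\tNot Reported\t"
    "Kappa , Agreement\tInter Annotator Agreement Reported , Adjudication\n"
)

LIST_COLUMNS = ("Eval Modes", "Benchmarks", "Metrics", "QC")


def split_cell(cell: str) -> list:
    # "Not Reported" becomes an empty list; otherwise split on commas and trim.
    values = [v.strip() for v in cell.split(",")]
    return [v for v in values if v and v.lower() != "not reported"]


def parse_matrix(text: str) -> list:
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for raw in reader:
        row = dict(raw)
        for col in LIST_COLUMNS:
            row[col] = split_cell(raw[col])
        rows.append(row)
    return rows


rows = parse_matrix(MATRIX_TSV)
for r in rows:
    print(r["Paper"], "|", r["Benchmarks"], "|", r["QC"])
```

Mapping "Not Reported" to an empty list lets later filters test for presence with a plain truthiness check instead of string comparison.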

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	AgentHER: Hindsight Experience Replay for LLM Agent…	Personalized RewardBench: Evaluating Reward Models…	Is this Idea Novel? An Automated Benchmark for Judg…
Human Feedback	Demonstrations	Pairwise Preference, Rubric Rating	Rubric Rating
Evaluation Modes	Human Eval, Llm As Judge	Human Eval, Automatic Metrics	Human Eval
Benchmarks	WebArena, ToolBench	Rewardbench	Rinobench
Metrics	Precision, Pass@1	Accuracy, Helpfulness	Not reported
Quality Controls	Not reported	Not reported	Gold Questions
Rater Population	Unknown	Unknown	Domain Experts
Annotation Unit	Trajectory	Pairwise	Multi Dim Rubric

Research Utility Snapshot

Human Feedback Mix

Rubric Rating (5)
Expert Verification (3)
Pairwise Preference (3)
Demonstrations (1)

Evaluation Modes

Human Eval (23)
Automatic Metrics (12)
Llm As Judge (2)
Simulation Env (2)

Top Benchmarks

Cpgbench (1)
Frtr-Bench (1)
Insightbench (1)
Rewardbench (1)

Top Metrics

Accuracy (8)
Agreement (5)
Bleu (2)
Cost (2)

Rater Population Mix

Domain Experts (9)
Mixed (1)

Quality Controls

Inter Annotator Agreement Reported (3)
Adjudication (2)
Gold Questions (1)

Coverage diagnostics (sample-based): human-feedback 26.3% · benchmarks 15.8% · metrics 57.9% · quality controls 13.2%.
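
The label mix and the coverage share answer different questions: the mix counts labels, while coverage counts papers. The sketch below tallies the human-feedback labels from the twelve protocol matrix rows on this page (titles abbreviated). It assumes the remaining hub papers carry no human-feedback label, which the matching totals suggest but the page does not state.

```python
from collections import Counter

# Human-feedback labels per paper, copied from the protocol matrix rows.
hf_by_paper = {
    "AgentHER": ["Demonstrations"],
    "Personalized RewardBench": ["Pairwise Preference", "Rubric Rating"],
    "Is this Idea Novel?": ["Rubric Rating"],
    "LLM Essay Scoring": ["Rubric Rating"],
    "Grounding Arabic LLMs": [],
    "CPGBench": ["Expert Verification"],
    "CounselReflect": ["Rubric Rating", "Expert Verification"],
    "Improving Clinical Diagnosis": ["Expert Verification"],
    "DataSTORM": [],
    "Austrian A-Level Essays": ["Rubric Rating"],
    "PONTE": ["Pairwise Preference"],
    "VRM": ["Pairwise Preference"],
}

HUB_PAPERS = 38  # total papers in the hub

# A paper with two labels is counted once per label, so the mix can sum
# to more than the number of papers that have any label.
mix = Counter(label for labels in hf_by_paper.values() for label in labels)
papers_with_hf = sum(1 for labels in hf_by_paper.values() if labels)
coverage = round(100 * papers_with_hf / HUB_PAPERS, 1)

print(mix.most_common())
print(f"{papers_with_hf} papers with human feedback, "
      f"{coverage}% of {HUB_PAPERS}")
```

Here the label counts sum to 12 while only 10 papers carry a label, which is why the mix totals and the coverage percentage should not be compared directly.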

Top Papers

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Liang Ding · Mar 22, 2026 · Citations: 0

Demonstrations · Human Eval · Llm As Judge · Long Horizon

LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
CounselReflect: A Toolkit for Auditing Mental-Health Dialogues
Yahan Li, Chaohao Du, Zeyang Li, Christopher Chun Kuizon, Shupeng Cheng · Mar 31, 2026 · Citations: 0

Rubric Rating · Expert Verification · Human Eval · Web Browsing

The system integrates two families of evaluation signals: (i) 12 model-based metrics produced by task-specific predictors, and (ii) rubric-based metrics that extend coverage via a literature-derived library (69 metrics) and user-defined…
Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas
Tim Schopf, Michael Färber · Mar 11, 2026 · Citations: 0

Rubric Rating · Human Eval

To address this, we introduce RINoBench, the first comprehensive benchmark for large-scale evaluation of research idea novelty judgments.
LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
Filip J. Kucia, Anirban Chakraborty, Anna Wróblewska · Mar 31, 2026 · Citations: 0

Rubric Rating · Human Eval

We present a systematic evaluation of instruction-tuned LLMs across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, and DREsS) that cover both holistic and analytic scoring.
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou · Apr 8, 2026 · Citations: 0

Pairwise Preference · Rubric Rating · Human Eval · Automatic Metrics

Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning
Zhiwen You, Xi Chen, Aniket Vashishtha, Simo Du, Gabriel Erion-Barner · Mar 29, 2026 · Citations: 0

Expert Verification · Human Eval · Automatic Metrics · Multi Agent

In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded.
Evaluating Austrian A-Level German Essays with Large Language Models for Automated Essay Scoring
Jonas Kubesch, Lena Huber, Clemens Havas · Mar 6, 2026 · Citations: 0

Rubric Rating · Human Eval

This paper investigates the application of state-of-the-art open-weight LLMs for the grading of Austrian A-level German texts, with a particular focus on rubric-based evaluation.
A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations
Andong Tan, Shuyu Dai, Jinglu Wang, Fengtao Zhou, Yan Lu · Mar 26, 2026 · Citations: 0

Expert Verification · Human Eval

To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations.
PONTE: Personalized Orchestration for Natural Language Trustworthy Explanations
Vittoria Vineis, Matteo Silvestri, Lorenzo Antonelli, Filippo Betello, Gabriele Tolomei · Mar 6, 2026 · Citations: 0

Pairwise Preference · Human Eval

To address these challenges, we present PONTE (Personalized Orchestration for Natural language Trustworthy Explanations), a human-in-the-loop framework for adaptive and reliable XAI narratives.
VRM: Teaching Reward Models to Understand Authentic Human Preferences
Biao Liu, Ning Xu, Junming Yang, Hao Xu, Xin Geng · Mar 5, 2026 · Citations: 0

Pairwise Preference · Human Eval

Large Language Models (LLMs) have achieved remarkable success across diverse natural language tasks, yet the reward models employed for aligning LLMs often encounter challenges of reward hacking, where the approaches predominantly rely on…
Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing
Anmol Gulati, Sahil Sen, Waqar Sarguroh, Kevin Paul · Mar 6, 2026 · Citations: 0

Human Eval · Automatic Metrics · Long Horizon

We introduce Beyond Rows to Reasoning (BRTR), a multimodal agentic framework for spreadsheet understanding that replaces single-pass retrieval with an iterative tool-calling loop, supporting end-to-end Excel workflows from complex analysis…
DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling
Shicheng Liu, Yucheng Jiang, Sajid Farook, Camila Nicollier Sanchez, David Fernando Castro Pena · Apr 7, 2026 · Citations: 0

Human Eval · Long Horizon

Deep research with Large Language Model (LLM) agents is emerging as a powerful paradigm for multi-step information discovery, synthesis, and analysis.
EvoScientist: Towards Multi-Agent Evolving AI Scientists for End-to-End Scientific Discovery
Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo · Mar 9, 2026 · Citations: 0

Human Eval · Multi Agent

To address this, we introduce EvoScientist, an evolving multi-agent AI scientist framework that continuously improves research strategies through persistent memory and self-evolution.
Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework
Jiling Zhou, Aisvarya Adeseye, Seppo Virtanen, Antti Hakkala, Jouni Isoaho · Apr 6, 2026 · Citations: 0

Human Eval · Automatic Metrics

However, its reliability in security-sensitive analytical tasks remains insufficiently examined, particularly under structured human evaluation.
Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith
Somaya Eltanbouly, Samer Rashwani · Mar 25, 2026 · Citations: 0

Human Eval · Llm As Judge

Gemini also serves as an LLM-as-a-judge system for automatic evaluation in our experiments.
Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization
Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mushfiqur Rahman, Niloy Kumar Mondal, Md. Mehedi Hasan Shawon · Mar 31, 2026 · Citations: 0

Human Eval · Automatic Metrics

Through controlled ablations on MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance.
Learning to Predict Future-Aligned Research Proposals with Language Models
Heng Wang, Pengcheng Jiang, Jiashuo Sun, Zhiyi Shi, Haofei Yu · Mar 28, 2026 · Citations: 0

Human Eval · Automatic Metrics

Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality.
How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality
Minzhu Tu, Shiyu Ni, Keping Bi · Apr 8, 2026 · Citations: 0

Human Eval · Automatic Metrics

Large language models (LLMs) have been widely adopted as a scalable surrogate for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases.
Voxtral TTS
Mistral-AI: Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg · Mar 26, 2026 · Citations: 0

Human Eval · Automatic Metrics

In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4% win rate over ElevenLabs Flash v2.5.
Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties
Jannis Vamvas, Ignacio Pérez Prat, Angela Heldstab, Dominic P. Fischer, Sina Ahmadi · Mar 26, 2026 · Citations: 0

Human Eval · Automatic Metrics

A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.
When Hate Meets Facts: LLMs-in-the-Loop for Check-worthiness Detection in Hate Speech
Nicolás Benjamín Ocampo, Tommaso Caselli, Davide Ceolin · Mar 26, 2026 · Citations: 0

Human Eval · Automatic Metrics

We validate it through extensive human evaluation, and show that our LLM-in-the-loop framework reduces human effort without compromising the annotation quality of the data.
Cross-Modal Rationale Transfer for Explainable Humanitarian Classification on Social Media
Thi Huyen Nguyen, Koustav Rudra, Wolfgang Nejdl · Mar 19, 2026 · Citations: 0

Human Eval · Automatic Metrics

Experiments are conducted over CrisisMMD benchmark dataset, and results show that our proposed method boosts the classification Macro-F1 by 2-35% while extracting accurate text tokens and image patches as rationales.
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
Xuanbo Su, Wenhao Hu, Haibo Su, Yunzhang Chen, Le Zhan · Apr 8, 2026 · Citations: 0

Human Eval · Simulation Env

We introduce SalesLLM benchmark, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with…
An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks
Gabriel Stefan, Adrian-Marius Dumitran · Apr 9, 2026 · Citations: 0

Human Eval

We propose an agentic evaluation architecture comprising a multimodal screening agent, a heterogeneous jury of five evaluative agents, and a meta-agent for verdict synthesis and human escalation.
STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems
Hongru Ji, Yuyin Fan, Meng Zhao, Xianghua Li, Lianwei Wu · Apr 8, 2026 · Citations: 0

Human Eval

To support effective learning, we develop a strategy-aware data refinement pipeline integrating LLM-based annotation, multi-model consistency-weighted evaluation, and dynamic sampling to construct high-quality training data aligned with…
PRCCF: A Persona-guided Retrieval and Causal-aware Cognitive Filtering Framework for Emotional Support Conversation
Yanxin Luo, Xiaoyu Zhang, Jing Li, Yan Gao, Donghong Han · Apr 2, 2026 · Citations: 0

Human Eval

Extensive experiments on the ESConv dataset demonstrate that PRCCF outperforms state-of-the-art baselines on both automatic metrics and human evaluations.
Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation
HyunJoon Jung, William Na · Apr 1, 2026 · Citations: 0

Human Eval

LLM-based agent judges are an emerging approach to evaluating conversational AI, yet a fundamental uncertainty remains: can we trust their assessments, and if so, how many are needed?
ContextClaim: A Context-Driven Paradigm for Verifiable Claim Detection
Yufeng Li, Rrubaa Panchendrarajan, Arkaitz Zubiaga · Mar 31, 2026 · Citations: 0

Human Eval

Through component analysis, human evaluation, and error analysis, we further examine when and why the retrieved context contributes to more reliable verifiability judgments.
Open Machine Translation for Esperanto
Ona de Gibert, Lluís de Gibert · Mar 31, 2026 · Citations: 0

Human Eval

In this work, we present the first comprehensive evaluation of open-source MT systems for Esperanto, comparing rule-based systems, encoder-decoder models, and LLMs across model sizes.
Measuring What Matters -- or What's Convenient?: Robustness of LLM-Based Scoring Systems to Construct-Irrelevant Factors
Cole Walsh, Rodica Ivan · Mar 26, 2026 · Citations: 0

Human Eval

These systems commonly achieve performance levels comparable to or superior than trained human raters, but have frequently been demonstrated to be vulnerable to the influence of construct-irrelevant factors (i.e., features of responses that…
LLMs Do Not Grade Essays Like Humans
Jerin George Mathew, Sumayya Taher, Anindita Kundu, Denilson Barbosa · Mar 24, 2026 · Citations: 0

Human Eval

Large language models have recently been proposed as tools for automated essay scoring, but their agreement with human grading remains unclear.
Preconditioned Test-Time Adaptation for Out-of-Distribution Debiasing in Narrative Generation
Hanwen Shen, Ting Ying, Jiajie Lu, Shanshan Wang · Mar 14, 2026 · Citations: 0

Human Eval

Across multiple benchmarks and human evaluations, CAP-TTA effectively reduces toxicity/bias score with significantly lower latency than standard optimization methods (e.g., AdamW or SGD).
Enhancing Debunking Effectiveness through LLM-based Personality Adaptation
Pietro Dell'Oglio, Alessandro Bondielli, Francesco Marcelloni, Lucia C. Passaro · Mar 10, 2026 · Citations: 0

Human Eval

To assess the effectiveness of these transformations, we employ a separate LLM as an automated evaluator simulating corresponding personality traits, thereby eliminating the need for costly human evaluation panels.
Evaluating LLM-Based Grant Proposal Review via Structured Perturbations
William Thorne, Joseph James, Yang Wang, Chenghua Lin, Diana Maynard · Mar 9, 2026 · Citations: 0

Human Eval

As AI-assisted grant proposals outpace manual review capacity in a kind of "Malthusian trap" for the research ecosystem, this paper investigates the capabilities and limitations of LLM-based grant reviewing for high-stakes evaluation.
TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation
Toms Bergmanis, Martins Kronis, Ingus Jānis Pretkalniņš, Dāvis Nicmanis, Jeļizaveta Jeļinska · Mar 9, 2026 · Citations: 0

Human Eval

Evaluation across multiple multilingual benchmarks shows that TildeOpen surpasses existing open-weight models in text generation and comprehension, particularly for Baltic, Finno-Ugric, and Slavic languages.
Accent Vector: Controllable Accent Manipulation for Multilingual TTS Without Accented Data
Thanathai Lertpetchpun, Thanapat Trachu, Jihwan Lee, Tiantian Feng, Dani Byrd · Mar 8, 2026 · Citations: 0

Human Eval

Objective and human evaluations confirm the effectiveness of Accent Vector for fine-grained and compositional accent control.
The Art That Poses Back: Assessing AI Pastiches after Contemporary Artworks
Anca Dinu, Andreiana Mihail, Andra-Maria Florescu, Claudiu Creanga · Mar 6, 2026 · Citations: 0

Human Eval

The analysis combines human evaluation with computational methods aimed at detecting visual and stylistic similarities or divergences between the original works and their AI-produced renditions.
TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning
Christian Greisinger, Steffen Eger · Mar 3, 2026 · Citations: 0

Human Eval

Extensive human evaluations with over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at…
