MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages

Chenxi Whitehouse, Sebastian Ruder, Tony Lin, Oksana Kurylo, Haruka Takagi, Janice Lam, Nicolò Busetto, Denise Diaz, Francisco Guzmán · Sep 30, 2025 · Citations: 0

Abstract

Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt-response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation (https://huggingface.co/datasets/facebook/menlo).
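
The abstract points to a released dataset on the Hugging Face Hub. Below is a minimal sketch for loading it with the `datasets` library; the abstract does not specify configs, splits, or field names, so the code only inspects whatever schema ships with the release.

    # Minimal sketch: load the released MENLO data with Hugging Face `datasets`.
    # Config/split/field names are not given in the abstract -- check the
    # dataset card; load_dataset may require an explicit config name.
    from datasets import load_dataset

    ds = load_dataset("facebook/menlo")   # dataset ID from the abstract
    print(ds)                             # lists the available splits
    first_split = next(iter(ds.keys()))
    print(ds[first_split][0].keys())      # inspect the record schema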

HFEPX Relevance Assessment

This paper carries direct human-feedback and evaluation-protocol signals and is likely useful for eval-pipeline design.

Eval-Fit Score

75/100 • High

Use this as a primary source when designing or comparing eval protocols.

Human Feedback Signal

Detected

Evaluation Signal

Detected

HFEPX Fit

High-confidence candidate

Human Data Lens

  • Uses human feedback: Yes
  • Feedback types: Pairwise Preference, Rubric Rating (a hypothetical record shape is sketched after this list)
  • Rater population: Unknown
  • Unit of annotation: Pairwise
  • Expertise required: Multilingual
  • Extraction source: Persisted extraction
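
Given the lens above (pairwise annotation unit, preference plus rubric feedback, 47 language varieties, four quality dimensions), one plausible record shape is sketched below. Every field name is an assumption for illustration; the abstract does not spell out the actual schema.

    # Hypothetical pairwise-preference record inferred from the lens above.
    # All field names are illustrative assumptions, not the real schema.
    from dataclasses import dataclass, field

    @dataclass
    class PreferencePair:
        language: str        # one of the 47 language varieties
        prompt: str          # prompt shown with both candidate responses
        response_a: str
        response_b: str
        dimension: str       # one of the four quality dimensions
        preference: str      # "a", "b", or "tie"
        rubric_scores: dict = field(default_factory=dict)  # per-rubric ratings

    pair = PreferencePair(
        language="pt-BR",
        prompt="Explique o que é aprendizado por reforço.",
        response_a="...",
        response_b="...",
        dimension="naturalness",
        preference="a",
    )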

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: None
  • Quality controls: Inter Annotator Agreement Reported
  • Confidence: 0.80
  • Flags: runtime_fallback_extraction

Protocol And Measurement Signals

Benchmarks / Datasets

No benchmark or dataset names were extracted from the available abstract.

Reported Metrics

agreement
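
The only metric extracted is "agreement", and the abstract does not say which agreement statistic the paper reports. As a stand-in, the sketch below computes two common choices for two raters over pairwise labels: raw percent agreement and Cohen's kappa.

    # Sketch: percent agreement and Cohen's kappa for two annotators.
    # The paper's actual agreement statistic is not stated in the abstract.
    from collections import Counter

    def percent_agreement(a, b):
        return sum(x == y for x, y in zip(a, b)) / len(a)

    def cohens_kappa(a, b):
        n = len(a)
        po = percent_agreement(a, b)          # observed agreement
        ca, cb = Counter(a), Counter(b)
        pe = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)  # chance
        return (po - pe) / (1 - pe)           # undefined if pe == 1

    r1 = ["a", "b", "a", "tie", "a"]
    r2 = ["a", "b", "b", "tie", "a"]
    print(percent_agreement(r1, r2))          # 0.8
    print(cohens_kappa(r1, r2))               # ~0.69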

Research Brief

Deterministic synthesis

The paper introduces MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. HFEPX signals include Pairwise Preference, Rubric Rating, and Automatic Metrics (extraction confidence 0.80). Updated from the current HFEPX corpus.

Generated Mar 3, 2026, 7:17 AM · Grounded in abstract + metadata only

Key Takeaways

  • The paper introduces MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms; zero-shot LLM judges benefit from pairwise evaluation under its structured rubrics (a judging sketch follows this list).
  • Fine-tuning with reinforcement learning, reward shaping, and multi-task learning yields substantial further improvements.
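
The first takeaway notes that zero-shot judges improve with pairwise evaluation under structured rubrics. A minimal sketch of that protocol shape follows; `call_llm` is a hypothetical placeholder for the judge model, and the rubric text is illustrative, not the paper's.

    # Sketch of rubric-guided pairwise judging. `call_llm` is a hypothetical
    # stand-in for a real judge-model call; the rubric text is illustrative.
    RUBRIC = (
        "Judge which response reads as more native-like in the target "
        "language, considering fluency, tone, and cultural fit. "
        "Answer with exactly one of: A, B, TIE."
    )

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("plug in your judge model here")

    def judge_pairwise(user_prompt, response_a, response_b, language):
        judge_input = (
            f"{RUBRIC}\n\nLanguage: {language}\nPrompt: {user_prompt}\n\n"
            f"Response A: {response_a}\n\nResponse B: {response_b}\n\nVerdict:"
        )
        verdict = call_llm(judge_input).strip().upper()
        return verdict if verdict in {"A", "B", "TIE"} else "TIE"  # safe default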

Researcher Actions

  • Compare its human-feedback setup against pairwise and rubric hubs.
  • Identify benchmark choices from the full text before operationalizing conclusions.
  • Validate metric comparability (agreement).

Caveats

  • Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
  • Extraction confidence is probabilistic and should be validated for critical decisions.

Research Summary

Contribution Summary

  • The paper introduces MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms.
  • Fine-tuning with reinforcement learning, reward shaping, and multi-task learning yields substantial improvements.
  • RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain (a reward-shaping sketch follows this list).
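
The last point (RL-trained judges acting as generative reward models) maps naturally onto a scalar reward-shaping step. The sketch below shows one way that conversion could look; the reward values and the `judge_pairwise` helper are assumptions carried over from the earlier sketch, not the paper's recipe.

    # Sketch: turn a pairwise judge verdict into a scalar reward for RL
    # fine-tuning. Reward values are illustrative assumptions, and
    # `judge_pairwise` is the hypothetical helper sketched earlier.
    REWARD = {"A": 1.0, "TIE": 0.5, "B": 0.0}  # reward for the candidate in slot A

    def preference_reward(user_prompt, candidate, reference, language, judge):
        """Score `candidate` against a fixed `reference` response."""
        verdict = judge(user_prompt, candidate, reference, language)
        return REWARD.get(verdict, 0.5)

    # Usage (hypothetical): feed into a policy-gradient loop, e.g.
    #   r = preference_reward(p, policy_out, baseline_out, "sw", judge_pairwise)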

Why It Matters For Eval

  • MENLO operationalizes the evaluation of native-like response quality with audience design-inspired mechanisms, offering a concrete protocol for multilingual eval design.
  • RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain.

Researcher Checklist

  • Pass: Human feedback protocol is explicit

    Detected: Pairwise Preference, Rubric Rating

  • Pass: Evaluation mode is explicit

    Detected: Automatic Metrics

  • Pass: Quality control reporting appears

    Detected: Inter Annotator Agreement Reported

  • Gap: Benchmark or dataset anchors are present

    No benchmark/dataset anchor extracted from abstract.

  • Pass: Metric reporting is present

    Detected: agreement
