EVALOOOP: A Self-Consistency-Centered Framework for Assessing Large Language Model Robustness in Programming

Sen Fang, Weiyuan Ding, Mengshi Zhang, Zihao Chen, Bowen Xu · May 18, 2025 · Citations: 0

Abstract

Evaluating the programming robustness of large language models (LLMs) is paramount for ensuring their reliability in AI-based software development. However, adversarial attacks exhibit fundamental limitations that compromise fair robustness assessment: they demonstrate contradictory evaluation outcomes where different attack strategies tend to favor different models, and more critically, they operate solely through external perturbations, failing to capture the intrinsic stability essential for autonomous coding agents where subsequent inputs are endogenously generated by the model itself. We introduce EVALOOOP, a novel assessment framework that evaluates robustness from a self-consistency perspective, leveraging the natural duality inherent in software engineering tasks (e.g., code generation and code summarization). EVALOOOP establishes a self-contained feedback loop where an LLM iteratively transforms between code and natural language until functional failure occurs, with robustness quantified by a novel Average Sustainable Loops (ASL) metric: the mean number of iterations that maintain functional correctness across benchmark tasks. This cyclical strategy intrinsically evaluates robustness without relying on external attack configurations, providing a unified metric that reveals how effectively LLMs preserve semantic integrity through sustained self-referential transformations. We evaluate 96 popular LLMs, ranging from 0.5B to 685B parameters, on EVALOOOP equipped with the MBPP Plus benchmark, and find that EVALOOOP typically induces a 2.65%-47.62% absolute drop in pass@1 accuracy within ten loops. Intriguingly, robustness does not always align with initial performance (i.e., one-time query); for instance, Qwen3-235B-A22B-Instruct-2507, despite inferior initial code generation compared to OpenAI's o-series models and DeepSeek-V3, demonstrates superior robustness (ASL score).
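The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the three callables (generate_code, summarize_code, passes_tests), the toy stand-ins, and the choice to count a loop only when its generated code passes the tests are all assumptions made for the example. The cap of ten loops mirrors the "within ten loops" figure in the abstract.

```python
from statistics import mean
from typing import Callable, Sequence


def sustained_loops(
    task_prompt: str,
    generate_code: Callable[[str], str],   # hypothetical: natural language -> code
    summarize_code: Callable[[str], str],  # hypothetical: code -> natural language
    passes_tests: Callable[[str], bool],   # hypothetical: runs the task's test suite
    max_loops: int = 10,
) -> int:
    """Count consecutive loops whose generated code still passes the tests."""
    prompt = task_prompt
    loops = 0
    for _ in range(max_loops):
        code = generate_code(prompt)
        if not passes_tests(code):
            break  # functional failure ends the chain for this task
        loops += 1
        # The model's own summary becomes the next prompt (endogenous input).
        prompt = summarize_code(code)
    return loops


def average_sustainable_loops(per_task_loops: Sequence[int]) -> float:
    """ASL as described in the abstract: mean sustained loops over tasks.

    Assumes a non-empty sequence of per-task loop counts.
    """
    return mean(per_task_loops)


# Toy stand-ins (assumptions, not real models): each round trip appends one
# "!" of drift, and the pretend test suite tolerates at most three of them.
def toy_generate(prompt: str) -> str:
    return prompt + "!"


def toy_summarize(code: str) -> str:
    return code


def toy_passes(code: str) -> bool:
    return code.count("!") <= 3
```

With these toy stand-ins, sustained_loops("add two numbers", toy_generate, toy_summarize, toy_passes) survives three loops before the fourth generation fails. In a real run the callables would wrap an LLM and a benchmark test harness such as the one shipped with MBPP Plus.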

HFEPX Relevance Assessment

This paper appears adjacent to HFEPX scope (human-feedback/eval), but does not show strong direct protocol evidence in metadata/abstract.

Eval-Fit Score

5/100 • Low

Treat as adjacent context, not a core eval-method reference.

Human Feedback Signal

Not explicit in abstract metadata

Evaluation Signal

Detected

HFEPX Fit

Adjacent candidate

If you are doing eval pipeline work, start here:

Human Eval Hub · LLM-as-Judge Hub · Pairwise Preference Hub · Tool-Use Eval Hub

Human Data Lens

Uses human feedback: No
Feedback types: None
Rater population: Unknown
Unit of annotation: Unknown
Expertise required: Coding
Extraction source: Persisted extraction

Evaluation Lens

Evaluation modes: Automatic Metrics
Agentic eval: None
Quality controls: Not reported
Confidence: 0.45
Flags: low_signal, possible_false_positive

Protocol And Measurement Signals

Benchmarks / Datasets

MBPP+, DROP

Reported Metrics

accuracy, pass@1

Research Brief

Deterministic synthesis

Adversarial attacks exhibit fundamental limitations that compromise fair robustness assessment: they demonstrate contradictory evaluation outcomes where different attack strategies tend to favor different models, and more… HFEPX signals include Automatic Metrics with confidence 0.45. Updated from current HFEPX corpus.

Generated Mar 3, 2026, 7:11 PM · Grounded in abstract + metadata only

Key Takeaways

Adversarial attacks exhibit fundamental limitations that compromise fair robustness assessment: they demonstrate contradictory evaluation outcomes where different attack…
We introduce EVALOOOP, a novel assessment framework that evaluates robustness from a self-consistency perspective, leveraging the natural duality inherent in software engineering…

Researcher Actions

Treat this as method context, then pivot to protocol-specific HFEPX hubs.
Cross-check benchmark overlap: MBPP+, DROP.
Validate metric comparability (accuracy, pass@1).

Caveats

Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
Low-signal flag detected: protocol relevance may be indirect.

Recommended Queries

human-eval protocol design · pairwise preference data quality · inter-rater agreement adjudication

Research Summary

Contribution Summary

Adversarial attacks exhibit fundamental limitations that compromise fair robustness assessment: they demonstrate contradictory evaluation outcomes where different attack strategies tend to favor different models, and more…
We introduce EVALOOOP, a novel assessment framework that evaluates robustness from a self-consistency perspective, leveraging the natural duality inherent in software engineering tasks (e.g., code generation and code summarization).
We evaluate 96 popular LLMs, ranging from 0.5B to 685B parameters, on EVALOOOP equipped with the MBPP Plus benchmark, and find that EVALOOOP typically induces a 2.65%-47.62% absolute drop in pass@1 accuracy within ten loops.
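To make the "absolute drop in pass@1 accuracy within ten loops" concrete, here is a small worked sketch. It assumes one sample per task per loop, so pass@1 at loop k is simply the fraction of tasks whose code is still correct at loop k, and it assumes the drop is measured in percentage points between the first and last loop. The function names and the sample data are hypothetical; the paper's exact bookkeeping is not available from the abstract.

```python
from typing import List, Sequence


def pass_at_1_by_loop(sustained: Sequence[int], max_loops: int = 10) -> List[float]:
    """pass@1 per loop, given how many loops each task survived.

    sustained[i] is the number of consecutive loops task i stayed correct
    (0 means the very first generation already failed).
    """
    n = len(sustained)
    return [
        sum(1 for s in sustained if s >= k) / n
        for k in range(1, max_loops + 1)
    ]


def absolute_drop_points(curve: Sequence[float]) -> float:
    """Percentage-point difference between the first and last loop."""
    return (curve[0] - curve[-1]) * 100


# Made-up per-task results for four tasks, purely for illustration.
example_sustained = [0, 1, 3, 10]
example_curve = pass_at_1_by_loop(example_sustained)
```

For the made-up data above, pass@1 starts at 0.75 on the first loop and ends at 0.25 on the tenth, a drop of 50 percentage points. An absolute drop is a difference in percentage points, not a relative change, which matters when comparing models with very different starting accuracy.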

Why It Matters For Eval

Adversarial attacks exhibit fundamental limitations that compromise fair robustness assessment: they demonstrate contradictory evaluation outcomes where different attack strategies tend to favor different models, and more…
We evaluate 96 popular LLMs, ranging from 0.5B to 685B parameters, on EVALOOOP equipped with the MBPP Plus benchmark, and find that EVALOOOP typically induces a 2.65%-47.62% absolute drop in pass@1 accuracy within ten loops.

Researcher Checklist

Gap: Human feedback protocol is explicit
No explicit human feedback protocol detected.

Pass: Evaluation mode is explicit
Detected: Automatic Metrics

Gap: Quality control reporting appears
No calibration/adjudication/IAA control explicitly detected.

Pass: Benchmark or dataset anchors are present
Detected: MBPP+, DROP

Pass: Metric reporting is present
Detected: accuracy, pass@1

Category-Adjacent Papers (Broader Context)

These papers are nearby in arXiv category and useful for broader context, but not necessarily protocol-matched to this paper.

Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning (Category Neighbor)
Citations: 0 · Relevance: 4.90
- Shared arXiv category (cs.CL, cs.LG)
- Shared metric mentions
- Shared terminology (accuracy, framework)

SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables (Category Neighbor)
Citations: 0 · Relevance: 3.60
- Shared arXiv category (cs.CL)
- Shared benchmark mentions
- Shared terminology (drop, framework)

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning (Category Neighbor)
Citations: 0 · Relevance: 3.30
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (accuracy, framework)

Confusion-Aware Rubric Optimization for LLM-based Automated Grading (Category Neighbor)
Citations: 0 · Relevance: 3.30
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (accuracy, framework)

Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance (Category Neighbor)
Citations: 0 · Relevance: 3.30
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (accuracy, framework)

SkillCraft: Can LLM Agents Learn to Use Tools Skillfully? (Category Neighbor)
Citations: 0 · Relevance: 3.20
- Shared arXiv category (cs.SE, cs.CL)

DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science (Category Neighbor)
Citations: 0 · Relevance: 2.85
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (accuracy)

When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation (Category Neighbor)
Citations: 0 · Relevance: 2.85
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (accuracy)
