DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science

HFEPX Relevance Assessment

This paper has direct human-feedback and/or evaluation protocol signal and is likely useful for eval pipeline design.

Eval-Fit Score

25/100 • Low

Treat as adjacent context, not a core eval-method reference.

Human Feedback Signal

Not explicit in abstract metadata

Evaluation Signal

Detected

HFEPX Fit

High-confidence candidate

If you are doing eval pipeline work, start here:

Human Eval Hub LLM-as-Judge Hub Pairwise Preference Hub Tool-Use Eval Hub

Protocol And Measurement Signals

Benchmarks / Datasets

Dare-Bench

Reported Metrics

accuracy

Research Brief

Deterministic synthesis

The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benchmarking. HFEPX signals include Automatic Metrics, Long Horizon with confidence 0.55. Updated from current HFEPX corpus.

Generated Mar 3, 2026, 7:09 AM · Grounded in abstract + metadata only

Key Takeaways

The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benchmarking.
There are two major gaps in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the…

Researcher Actions

Treat this as method context, then pivot to protocol-specific HFEPX hubs.
Cross-check benchmark overlap: Dare-Bench.
Validate metric comparability (accuracy).

Caveats

Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
Extraction confidence is probabilistic and should be validated for critical decisions.

Recommended Queries

human-eval protocol design agent eval benchmark comparison inter-rater agreement adjudication

Research Summary

Contribution Summary

The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benchmarking.
There are two major gaps in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training data.
To bridge these gaps, we introduce DARE-bench, a benchmark designed for machine learning modeling and data science instruction following.

Why It Matters For Eval

The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benchmarking.
To bridge these gaps, we introduce DARE-bench, a benchmark designed for machine learning modeling and data science instruction following.

Researcher Checklist

Gap: Human feedback protocol is explicit

No explicit human feedback protocol detected.
Pass: Evaluation mode is explicit

Detected: Automatic Metrics
Gap: Quality control reporting appears

No calibration/adjudication/IAA control explicitly detected.
Pass: Benchmark or dataset anchors are present

Detected: Dare-Bench
Pass: Metric reporting is present

Detected: accuracy

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.

AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG Protocol Overlap

Citations: 0 Relevance: 4.60 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions
AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications Protocol Overlap

Citations: 0 Relevance: 4.60 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions
Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations Protocol Overlap

Citations: 0 Relevance: 4.60 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions
BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios Protocol Overlap

Citations: 0 Relevance: 4.60 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions
Classroom Final Exam: An Instructor-Tested Reasoning Benchmark Protocol Overlap

Citations: 0 Relevance: 4.60 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions
Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors Protocol Overlap

Citations: 0 Relevance: 4.60 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions
Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics Protocol Overlap

Citations: 0 Relevance: 4.60 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions
D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models Protocol Overlap

Citations: 0 Relevance: 4.60 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions
GATES: Self-Distillation under Privileged Context with Consensus Gating Protocol Overlap

Citations: 0 Relevance: 4.60 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL Protocol Overlap

Citations: 0 Relevance: 4.60 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions
Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning Protocol Overlap

Citations: 0 Relevance: 4.60 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions
How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision? Protocol Overlap

Citations: 0 Relevance: 4.60 Shared tag: Long Horizon
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions