StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning

HFEPX Relevance Assessment

This paper carries direct human-feedback and/or evaluation-protocol signals and is likely useful for eval pipeline design.

Eval-Fit Score

65/100 • Medium

Useful as a secondary reference; validate protocol details against neighboring papers.

Human Feedback Signal

Detected

Evaluation Signal

Detected

HFEPX Fit

High-confidence candidate

Protocol And Measurement Signals

Benchmarks / Datasets

KernelBench

Reported Metrics

success rate

Research Brief

Deterministic synthesis

To address the challenge, in this work, we propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation, with three specialized agents: a Planner to orchestrate whole system design, a Coder dedicated to implementing it… HFEPX signals include Rubric Rating, Automatic Metrics, and Multi Agent, with confidence 0.80. Updated from the current HFEPX corpus.

Generated Mar 4, 2026, 4:38 AM · Grounded in abstract + metadata only

Researcher Actions

Compare its human-feedback setup against pairwise and rubric hubs.
Cross-check benchmark overlap: KernelBench.
Validate metric comparability (success rate).

Caveats

Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
Extraction confidence is probabilistic and should be validated for critical decisions.

Recommended Queries

human-eval protocol design
agent eval benchmark comparison
inter-rater agreement adjudication

Research Summary

Contribution Summary

To address the challenge, in this work, we propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation, with three specialized agents: a Planner to orchestrate whole system design, a Coder dedicated to implementing it…
To fundamentally improve the Coder's ability in end-to-end GPU programming, StitchCUDA integrates rubric-based agentic reinforcement learning over two atomic skills, task-to-code generation and feedback-driven code optimization, with…
Experiments on KernelBench show that StitchCUDA achieves a nearly 100% success rate on end-to-end GPU programming tasks, with a 1.72x better speedup than the multi-agent baseline and 2.73x better than the RL model baselines.
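
When validating metric comparability, it helps to pin down how the two reported quantities could be computed. The sketch below is illustrative only: the abstract does not define success rate or speedup, so it assumes success means a correct generated program and speedup is the reference runtime divided by the generated runtime, aggregated with a geometric mean over correct tasks. The `TaskResult` record and both function names are hypothetical.

```python
from dataclasses import dataclass
from statistics import geometric_mean


@dataclass
class TaskResult:
    """Outcome of one generated GPU program (hypothetical record layout)."""
    task_id: str
    correct: bool        # assumed: compiled and matched the reference output
    baseline_ms: float   # runtime of the reference implementation
    generated_ms: float  # runtime of the generated program


def success_rate(results):
    """Fraction of tasks whose generated program is correct."""
    if not results:
        return 0.0
    return sum(r.correct for r in results) / len(results)


def mean_speedup(results):
    """Geometric mean of baseline/generated runtime over correct tasks only."""
    ratios = [r.baseline_ms / r.generated_ms for r in results if r.correct]
    return geometric_mean(ratios) if ratios else 0.0


# Made-up numbers purely for illustration; not results from the paper.
results = [
    TaskResult("t1", True, 10.0, 5.0),   # 2.0x faster than the reference
    TaskResult("t2", True, 8.0, 8.0),    # same speed as the reference
    TaskResult("t3", False, 4.0, 0.0),   # failed task, excluded from speedup
    TaskResult("t4", True, 9.0, 18.0),   # 2x slower than the reference
]

print(f"success rate: {success_rate(results):.2f}")
print(f"mean speedup: {mean_speedup(results):.2f}x")
```

A figure such as "1.72x better speedup" would then correspond to the ratio of two such aggregates, one per system. Whether failed tasks are excluded or counted with a penalty changes that ratio, so this convention is worth checking in each paper being compared.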

Researcher Checklist

Pass: Human feedback protocol is explicit
Detected: Rubric Rating

Pass: Evaluation mode is explicit
Detected: Automatic Metrics

Gap: Quality control reporting appears missing
No calibration, adjudication, or inter-annotator agreement (IAA) control explicitly detected.

Pass: Benchmark or dataset anchors are present
Detected: KernelBench

Pass: Metric reporting is present
Detected: success rate

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.

Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation
Protocol Overlap · Citations: 0 · Relevance: 7.80 · Shared tags: Rubric Rating, Multi Agent
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
- Aligned agent-evaluation setup

CoAct-1: Computer-using Multi-Agent System with Coding Actions
Protocol Overlap · Citations: 0 · Relevance: 4.60 · Shared tag: Multi Agent
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions

EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis
Protocol Overlap · Citations: 0 · Relevance: 4.60 · Shared tag: Multi Agent
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions

Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning
Protocol Overlap · Citations: 0 · Relevance: 4.60 · Shared tag: Multi Agent
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Protocol Overlap · Citations: 0 · Relevance: 4.60 · Shared tag: Multi Agent
- Shared HFEPX protocol tags
- Aligned agent-evaluation setup
- Shared metric mentions

A Scalable Framework for Evaluating Health Language Models
Protocol Overlap · Citations: 0 · Relevance: 4.10 · Shared tag: Rubric Rating
- Shared HFEPX protocol tags
- Aligned human feedback protocol

APEX-Agents
Protocol Overlap · Citations: 0 · Relevance: 4.10 · Shared tag: Rubric Rating
- Shared HFEPX protocol tags
- Aligned human feedback protocol

Beyond the Resumé: A Rubric-Aware Automatic Interview System for Information Elicitation
Protocol Overlap · Citations: 0 · Relevance: 4.10 · Shared tag: Rubric Rating
- Shared HFEPX protocol tags
- Aligned human feedback protocol

Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
Protocol Overlap · Citations: 0 · Relevance: 4.10 · Shared tag: Rubric Rating
- Shared HFEPX protocol tags
- Aligned human feedback protocol

ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels
Protocol Overlap · Citations: 0 · Relevance: 4.10 · Shared tag: Rubric Rating
- Shared HFEPX protocol tags
- Aligned human feedback protocol

Confusion-Aware Rubric Optimization for LLM-based Automated Grading
Protocol Overlap · Citations: 0 · Relevance: 4.10 · Shared tag: Rubric Rating
- Shared HFEPX protocol tags
- Aligned human feedback protocol

Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models
Protocol Overlap · Citations: 0 · Relevance: 4.10 · Shared tag: Rubric Rating
- Shared HFEPX protocol tags
- Aligned human feedback protocol