LLM Agent Evaluation Scenario Writer

Design structured evaluation scenarios and gold-standard behaviors for LLM-based agents in a remote, part-time contractor role (20+ hrs/week). Pay is $18–$24/hr. The role requires QA-style thinking, basic Python/JavaScript, and strong written English.

Generative AI & RLHF · 100% Remote · Hourly

Compensation: $18–$24/hr
Eligibility: Worldwide
Experience: Intermediate
Posted: Jan 13, 2026

About OpenTrain AI

OpenTrain is the #1 platform for finding and building careers in AI training and data labeling. OpenTrain AI is the hiring and contracting organization for this role. Creating an OpenTrain account is free, and it is where contributors start and grow careers teaching AI.

We connect contributors with hands-on work that shapes how modern AI systems behave. If you want practical, remote work that directly influences state-of-the-art agents, this role is a great place to apply your QA and testing skills.

About AI Training Work

AI training (also called data labeling or human feedback work) is the human side of building AI: people create, evaluate, and refine examples that models learn from. This role focuses on evaluation and test design for LLM-based agents, a core activity in making conversational and task-oriented systems reliable.

Work in this industry is often 100% remote, flexible, and accessible to contributors with attention to detail, domain knowledge, or testing experience. You will help define what 'correct' agent behavior looks like and how to measure it.

The Role

You will design realistic, reusable evaluation scenarios that simulate real-world tasks for LLM-based agents. Each scenario will include a golden path (expected agent behavior), acceptable variations, annotated task steps, edge cases, and clear scoring logic.

You will also review agent outputs against your scenarios, iterate on scenarios for clarity and coverage, and collaborate with developers and other contributors to refine evaluation frameworks and ensure reproducibility.

What You'll Do

Design structured evaluation scenarios and test cases that simulate real user tasks for LLM agents.
Define gold-standard outputs, acceptable variations, and explicit scoring rules for each scenario.
Document annotated task steps, edge cases, and failure modes for reproducible testing.
Represent scenarios in structured formats such as JSON or YAML and keep them maintainable.
Review agent outputs against scenarios, identify gaps, and update scenarios to improve coverage.
Collaborate with developers and other evaluators to validate scoring logic and interpretation.
Work across topics quickly while following complex guidelines and standardized templates.
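
To illustrate the kind of artifact the tasks above produce, here is a minimal sketch in Python of one scenario and a scoring check. The scenario content, the field names (`golden_path`, `acceptable_variations`, `edge_cases`, `scoring`), the score values, and the `score_output` helper are all hypothetical assumptions for illustration, not an OpenTrain template or required format.

```python
import json

# Hypothetical scenario layout: every field name and value here is an
# illustrative assumption, not an official OpenTrain template.
scenario = {
    "id": "refund-request-001",
    "task": "User asks the agent to refund a duplicate charge.",
    # Golden path: the expected sequence of agent steps.
    "golden_path": [
        "verify_identity",
        "locate_charge",
        "issue_refund",
        "confirm_to_user",
    ],
    # Acceptable variations: other step orders that still count as correct.
    "acceptable_variations": [
        ["locate_charge", "verify_identity", "issue_refund", "confirm_to_user"],
    ],
    "edge_cases": ["charge not found", "refund window expired"],
    # Scoring logic: explicit score for each outcome class.
    "scoring": {"exact": 1.0, "variation": 0.8, "other": 0.0},
}


def score_output(scenario, agent_steps):
    """Score an agent's step sequence against the scenario's rules."""
    rules = scenario["scoring"]
    if agent_steps == scenario["golden_path"]:
        return rules["exact"]
    if agent_steps in scenario["acceptable_variations"]:
        return rules["variation"]
    return rules["other"]


# Serializing to JSON keeps the scenario easy to version and share.
scenario_json = json.dumps(scenario, indent=2)
```

Keeping the scoring values inside the scenario, rather than hard-coding them in the checker, is one way to make the scoring rules explicit and reviewable alongside the gold standard they apply to.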

Requirements

Bachelor’s and/or Master’s degree in CS, Software Engineering, Data Science/Analytics, AI/ML, Computational Linguistics/NLP, Information Systems, or a related field.
Prior experience in QA, software testing, test case design, data analysis, or NLP annotation.
Demonstrated ability to design reproducible test scenarios with strong coverage and edge cases.
Comfortable reading and authoring structured formats like JSON and/or YAML to describe scenarios.
Able to define gold-standard agent behavior, acceptable variations, and clear, testable scoring logic.
Basic working experience with Python and JavaScript (able to read and edit simple scripts).
Strong written English skills for producing clear, unambiguous documentation.
Comfortable working with AI-generated outputs, agent logs, and prompt-based behaviors.
Able to switch topics quickly and follow complex guidelines accurately.
Remote-ready setup: a reliable laptop, a stable internet connection, and consistent availability.

Who Should Apply

This role is a good fit for people with QA/test-case design experience, analytical attention to detail, and familiarity with structured data formats. Candidates with backgrounds in software testing, data analysis, or NLP annotation are especially well suited.

If you enjoy turning ambiguous behaviors into explicit acceptance criteria, documenting edge cases, and iterating with technical teams, you should apply.

Intermediate experience level: prior hands-on experience in testing, annotation, or data evaluation is expected.
Comfortable working as a contractor in a part-time capacity (20+ hours/week).

Compensation & Logistics

This is a remote, part-time contractor position requiring 20+ hours per week. Pay is USD $18–$24 per hour. Employment types: Contractor, Part-time.

Work is text-based evaluation (data type: TEXT) and will involve EVALUATION_RATING-style labeling and scenario-driven review. The role is open worldwide; you must be fully remote-ready with a reliable laptop and internet.

How It Works / How to Apply

To apply, create a free OpenTrain account and submit your profile and resume. If available, include examples of test scenarios, QA artifacts, or documentation that demonstrate your ability to define gold standards and scoring logic.

If selected, you'll work with OpenTrain AI as a contractor and collaborate with other evaluators and developers to build and refine evaluation frameworks. The role emphasizes clear documentation, reproducibility, and iterative improvement.
