LLM Agent Evaluation Scenario Writer

Design structured evaluation scenarios and gold-standard behavior for LLM-based agents in a remote, part-time contractor role (20+ hrs/week) paying $18–$24/hr. Requires QA/test-case experience and basic Python and JavaScript skills.

Generative AI & RLHF

100% Remote · Hourly

Compensation: $18–$24/hr
Eligibility: Worldwide
Experience: Intermediate
Posted: Jan 13, 2026


About OpenTrain

OpenTrain is the #1 platform for finding and building careers in AI training and data labeling. The platform helps people start and grow careers teaching AI: you can discover projects, build a profile, and apply for roles. Creating an OpenTrain account is free.

We connect talented contributors with hands-on work that directly shapes how state-of-the-art AI systems behave. This role is offered as a part-time contractor opportunity through OpenTrain's marketplace of projects.

About AI training and evaluation work

AI training (data labeling and human evaluation) is the human side of building modern AI: people create examples, define expected outputs, and judge model behavior so models learn reliably and safely.

This role focuses on evaluation design for LLM-based agents — writing reproducible scenarios, defining gold-standard responses, documenting edge cases, and applying clear scoring logic to improve agent performance.

The role

You will design realistic, reusable evaluation scenarios that simulate real-world tasks for LLM agents, define the golden path and acceptable behavior variations, and describe scoring logic and edge cases. You will also review agent outputs, iterate on scenarios for coverage and clarity, and collaborate with developers and contributors to refine evaluation frameworks.

This is remote, part-time contractor work with an expected commitment of 20+ hours per week. Tasks are text-based, and you will use structured formats (JSON/YAML) to describe scenarios and scoring rules.

Role type: Contractor, Part-time
Time requirement: 20+ hours/week
Data type: Text
Label type: Evaluation rating
Structured formats: JSON/YAML

What you'll do

Create structured evaluation scenarios that simulate practical tasks for agents and describe step-by-step expectations, acceptable variations, and edge cases.

Define gold-standard agent behavior, annotate expected outputs, and specify scoring logic so reviews are reproducible and objective.

Review agent outputs and logs, rate responses, and iterate on scenarios to improve clarity, coverage, and defensibility in collaboration with engineers and other contributors.

Write scenarios and scoring in JSON/YAML or similar structured formats
Document golden path, acceptable deviations, and failure cases
Run evaluations and provide constructive feedback to refine scenarios
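
To give a rough sense of what a scenario with a golden path, acceptable deviations, and scoring logic might look like, here is a minimal Python sketch. The field names (`golden_path`, `acceptable_variations`, `forbidden_actions`, `scoring`), the action names, and the score values are illustrative assumptions only; the actual templates and guidelines are provided by the project during onboarding.

```python
import json

# Hypothetical scenario layout. Every field name and value here is an
# assumption made for illustration, not a schema from the project.
SCENARIO_JSON = """
{
  "id": "refund-request-001",
  "task": "Customer asks for a refund on an order delivered 10 days ago.",
  "golden_path": ["lookup_order", "check_refund_policy", "issue_refund"],
  "acceptable_variations": [
    ["check_refund_policy", "lookup_order", "issue_refund"]
  ],
  "forbidden_actions": ["delete_account"],
  "scoring": {"golden": 1.0, "variation": 0.8, "failure": 0.0}
}
"""


def score_run(scenario: dict, actions: list[str]) -> float:
    """Score one agent run (a list of action names) against a scenario."""
    scoring = scenario["scoring"]
    # Failure case: any forbidden action fails the run outright.
    if any(action in scenario["forbidden_actions"] for action in actions):
        return scoring["failure"]
    # Golden path: the exact expected sequence of actions.
    if actions == scenario["golden_path"]:
        return scoring["golden"]
    # Acceptable deviation: a listed alternative sequence.
    if actions in scenario["acceptable_variations"]:
        return scoring["variation"]
    # Anything else counts as a failure.
    return scoring["failure"]


scenario = json.loads(SCENARIO_JSON)
golden_score = score_run(
    scenario, ["lookup_order", "check_refund_policy", "issue_refund"]
)
```

Because the rules live in the scenario data rather than in the reviewer's head, two reviewers scoring the same agent run should arrive at the same rating, which is the reproducibility the role asks for.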

Requirements

Candidates must meet the educational and experience requirements and be ready to work remotely with a reliable laptop and stable internet connection.

This role demands strong written English, QA-style thinking, and the ability to follow complex guidelines while switching topics quickly.

Bachelor’s and/or Master’s degree in CS, Software Engineering, Data Science/Analytics, AI/ML, Computational Linguistics/NLP, Information Systems, or a related field
Prior experience in QA, software testing, test case design, data analysis, or NLP annotation
Demonstrated ability to design reproducible test scenarios with comprehensive coverage and edge cases
Comfortable reading and authoring structured formats like JSON and/or YAML
Able to define gold-standard behavior, acceptable variations, and clear scoring logic
Basic working experience with Python and JavaScript (read/edit simple scripts)
Strong written English for clear, unambiguous documentation
Comfortable working with AI-generated outputs, agent logs, and prompt-based behavior
Reliable laptop, stable internet connection, and consistent availability for remote work

Who should apply

Intermediate-level contributors with a background in software testing, QA, data analysis, or NLP annotation who enjoy structured, analytical work should consider this role.

If you like designing reproducible test cases, writing clear documentation, and iterating with engineers to improve AI behavior, this position is a strong fit.

Ideal for QA engineers, test designers, annotation leads, or NLP-focused analysts
Good fit for people comfortable with JSON/YAML and basic scripting in Python/JavaScript

Compensation & scheduling

Pay is hourly at USD $18–$24 per hour, paid per the project’s payroll terms. The position is part-time contractor work with a recommended minimum of 20 hours per week. The work is remote, and scheduling is flexible within the project's needs.

OpenTrain connects you to the project; exact scheduling, onboarding, and payment cadence are handled by the project owner through the platform.

Hourly rate: $18–$24 USD/hour
Contractor, part-time: 20+ hrs/week

How it works

Create a free OpenTrain account to apply and build your contributor profile. Profiles highlight your skills, experience, and availability so project owners can match you to relevant tasks.

If selected, you'll receive onboarding materials, scenario templates, and evaluation guidelines. You will collaborate remotely with engineers and reviewers and submit scenario definitions and evaluation ratings per project instructions.

Apply via your OpenTrain profile (account creation is free)
Onboarding includes templates and example scenarios; follow provided guidelines closely