LLM Agent Evaluation Scenario Writer
Design structured evaluation scenarios and gold-standard behavior for LLM-based agents in a remote, part-time contractor role (20+ hrs/week) paying $18–$24/hr. Requires QA/test-case experience and basic Python and JavaScript skills.
Generative AI & RLHF
Compensation: $18–$24/hr
Eligibility: Open worldwide
Experience: Intermediate
Posted: Jan 13, 2026
About OpenTrain
OpenTrain is the #1 platform for finding and building careers in AI training and data labeling. The platform helps people start and grow careers teaching AI: you can discover projects, build a profile, and apply for roles. Creating an OpenTrain account is free.
We connect talented contributors with hands-on work that directly shapes how state-of-the-art AI systems behave. This role is offered as a part-time contractor opportunity through OpenTrain's marketplace of projects.
About AI training and evaluation work
AI training (data labeling and human evaluation) is the human side of building modern AI: people create examples, define expected outputs, and judge model behavior so models learn reliably and safely.
This role focuses on evaluation design for LLM-based agents: writing reproducible scenarios, defining gold-standard responses, documenting edge cases, and applying clear scoring logic to improve agent performance.
The role
You will design realistic, reusable evaluation scenarios that simulate real-world tasks for LLM agents, define the golden path and acceptable behavior variations, and describe scoring logic and edge cases. You will also review agent outputs, iterate scenarios for coverage and clarity, and collaborate with developers and contributors to refine evaluation frameworks.
This is remote, part-time contractor work with an expected commitment of 20+ hours per week. You will work with text-based tasks and use structured formats (JSON/YAML) to describe scenarios and scoring rules.
- Role type: Contractor, Part-time
- Time requirement: 20+ hours/week
- Data type: Text
- Label type: Evaluation rating
- Formats: structured formats such as JSON/YAML
What you'll do
Create structured evaluation scenarios that simulate practical tasks for agents and describe step-by-step expectations, acceptable variations, and edge cases.
Define gold-standard agent behavior, annotate expected outputs, and specify scoring logic so reviews are reproducible and objective.
Review agent outputs and logs, rate responses, and iterate on scenarios to improve clarity, coverage, and defensibility in collaboration with engineers and other contributors.
- Write scenarios and scoring in JSON/YAML or similar structured formats
- Document golden path, acceptable deviations, and failure cases
- Run evaluations and provide constructive feedback to refine scenarios
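To make the idea of a structured scenario with scoring logic concrete, here is a minimal sketch in Python. It is only an illustration: the field names (`golden_path`, `acceptable_variations`, `edge_cases`, `scoring`), the step names, and the score values are assumptions made up for this example, not the project's actual template, which is provided during onboarding.

```python
import json

# Hypothetical scenario definition. All field names, step names, and score
# values are illustrative assumptions, not an OpenTrain or project template.
SCENARIO_JSON = """
{
  "id": "refund-request-001",
  "task": "Customer asks for a refund on an order delivered 10 days ago.",
  "golden_path": ["lookup_order", "check_refund_policy", "issue_refund"],
  "acceptable_variations": [
    ["check_refund_policy", "lookup_order", "issue_refund"]
  ],
  "edge_cases": ["Order ID is missing from the customer's message."],
  "scoring": {"golden": 1.0, "variation": 0.8, "other": 0.0}
}
"""


def score_run(scenario: dict, agent_steps: list[str]) -> float:
    """Score one agent run by comparing its steps to the documented paths."""
    scoring = scenario["scoring"]
    # An exact match with the golden path earns the full score.
    if agent_steps == scenario["golden_path"]:
        return scoring["golden"]
    # A documented acceptable deviation earns partial credit.
    if agent_steps in scenario["acceptable_variations"]:
        return scoring["variation"]
    # Anything else is treated as a failure in this simplified sketch.
    return scoring["other"]


scenario = json.loads(SCENARIO_JSON)
print(score_run(scenario, ["lookup_order", "check_refund_policy", "issue_refund"]))  # 1.0
print(score_run(scenario, ["check_refund_policy", "lookup_order", "issue_refund"]))  # 0.8
print(score_run(scenario, ["issue_refund"]))  # 0.0
```

Keeping the scoring rules inside the scenario itself is one way to make reviews reproducible: two reviewers running the same agent log through the same rules get the same rating. Real scenarios would likely need richer matching than exact step sequences, and the `edge_cases` entries here serve as documentation only.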
Requirements
Candidates must meet the education and experience requirements listed below and be ready to work remotely with a reliable laptop and stable internet connection.
This role demands strong written English, QA-style thinking, and the ability to follow complex guidelines while switching topics quickly.
- Bachelor’s and/or Master’s degree in CS, Software Engineering, Data Science/Analytics, AI/ML, Computational Linguistics/NLP, Information Systems, or a related field
- Prior experience in QA, software testing, test case design, data analysis, or NLP annotation
- Demonstrated ability to design reproducible test scenarios with comprehensive coverage and edge cases
- Comfortable reading and authoring structured formats like JSON and/or YAML
- Able to define gold-standard behavior, acceptable variations, and clear scoring logic
- Basic working experience with Python and JavaScript (read/edit simple scripts)
- Strong written English for clear, unambiguous documentation
- Comfortable working with AI-generated outputs, agent logs, and prompt-based behavior
- Reliable laptop, stable internet connection, and consistent availability for remote work
Who should apply
Intermediate-level contributors with a background in software testing, QA, data analysis, or NLP annotation who enjoy structured, analytical work should consider this role.
If you like designing reproducible test cases, writing clear documentation, and iterating with engineers to improve AI behavior, this position is a strong fit.
- Ideal for QA engineers, test designers, annotation leads, or NLP-focused analysts
- Good fit for people comfortable with JSON/YAML and basic scripting in Python/JavaScript
Compensation & scheduling
Pay is hourly at USD $18–$24 per hour, paid per the project’s payroll terms. The position is part-time contractor work with an expected commitment of at least 20 hours per week; the work is remote, and scheduling is flexible within the project's needs.
OpenTrain connects you to the project; exact scheduling, onboarding, and payment cadence are handled by the project owner through the platform.
- Hourly rate: $18–$24 USD/hour
- Contractor, part-time: 20+ hrs/week
How it works
Create a free OpenTrain account to apply and build your contributor profile. Profiles highlight your skills, experience, and availability so project owners can match you to relevant tasks.
If selected, you'll receive onboarding materials, scenario templates, and evaluation guidelines. You will collaborate remotely with engineers and reviewers and submit scenario definitions and evaluation ratings per project instructions.
- Apply via your OpenTrain profile (account creation is free)
- Onboarding includes templates and example scenarios; follow provided guidelines closely