OpenTrain AI

Labelling Video Highlights for Food Videos

OpenTrain AI · Remote · Worldwide · Posted Jun 10, 2026

Per task · $0.05/label

# Video Moment Retrieval Annotation Guide - FOOD Domain

## Goal
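Identify and mark every temporal segment in a video that matches a natural language query.
For each segment, provide accurate start and end boundaries and domain-specific metadata (action type, objects present, visual description) for use as training data.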

## Instructions
1. **Read the Query**: Understand which action or event you need to find.
2. **Watch the Video**: Identify all segments that match the query.
3. **Mark Boundaries**: Use the timeline to mark precise start and end times for each segment.
4. **Classify**: Select the action type and the objects present.
5. **Describe**: Write a visual proxy description for CLIP training.
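
As a rough illustration, the output of the five steps for one segment could be captured in a record like the one below. This is only a sketch: the `SegmentAnnotation` class, its field names, and the `problems` helper are hypothetical and are not the platform's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class SegmentAnnotation:
    """One marked segment for a query (hypothetical schema, for illustration only)."""
    query: str                                        # step 1: the natural language query
    start_s: float                                    # step 3: segment start, in seconds
    end_s: float                                      # step 3: segment end, in seconds
    action: str                                       # step 4: action type
    objects: list[str] = field(default_factory=list)  # step 4: objects present
    description: str = ""                             # step 5: visual proxy description

    def problems(self) -> list[str]:
        """Return human-readable issues; an empty list means the record looks complete."""
        issues = []
        if self.start_s < 0:
            issues.append("start time is negative")
        if self.end_s <= self.start_s:
            issues.append("end time must be after start time")
        if not self.action.strip():
            issues.append("action type is missing")
        if not self.description.strip():
            issues.append("visual proxy description is missing")
        return issues


example = SegmentAnnotation(
    query="chef chops onions",
    start_s=12.4,
    end_s=18.9,
    action="chopping ingredients",
    objects=["knife", "onion", "cutting board"],
    description="hands dicing a white onion with a chef's knife on a wooden board",
)
print(example.problems())
```

A query that matches several moments in the video would produce one such record per segment.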

## Domain-Specific Actions
- chopping ingredients
- mise en place
- mixing ingredients
- kneading dough
- sautéing
- stirring the pot
- deglazing
- tasting food
- adding seasoning
- boiling
- grilling
- baking
- plating the dish
- garnishing
- sauce drizzle
- recipe introduction
- finished dish reveal
- eating reaction
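
If labels are checked programmatically, the list above can serve as a closed vocabulary. The sketch below assumes that action types are stored as plain strings and that differences in case and spacing should be ignored; the `FOOD_ACTIONS` constant and `normalize_action` function are hypothetical names.

```python
# The 18 action types listed in this guide, used as a closed vocabulary.
FOOD_ACTIONS = (
    "chopping ingredients",
    "mise en place",
    "mixing ingredients",
    "kneading dough",
    "sautéing",
    "stirring the pot",
    "deglazing",
    "tasting food",
    "adding seasoning",
    "boiling",
    "grilling",
    "baking",
    "plating the dish",
    "garnishing",
    "sauce drizzle",
    "recipe introduction",
    "finished dish reveal",
    "eating reaction",
)


def normalize_action(label: str) -> str:
    """Return the canonical action label, or raise ValueError if it is not in the list."""
    # Lowercase and collapse repeated whitespace before comparing (an assumption).
    cleaned = " ".join(label.strip().lower().split())
    if cleaned not in FOOD_ACTIONS:
        raise ValueError(f"unknown action type: {label!r}")
    return cleaned


print(normalize_action("  Kneading   Dough "))
```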

## Quality Guidelines
- **Boundary Accuracy**: Mark each start and end time within 0.5 seconds of the actual moment
- **Query Specificity**: Segment should clearly match the query
- **Visual Description**: Write what you SEE, not what you know
- **Complete Coverage**: Mark ALL segments that match, not just the first one
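
The boundary and coverage rules can be expressed as a simple check against a set of reference segments. This guide does not say how submissions are audited, so the reference segments (for example, a reviewer's markings) and the function names below are assumptions made for illustration.

```python
TOLERANCE_S = 0.5  # boundary tolerance from the Boundary Accuracy guideline

Segment = tuple[float, float]  # (start, end) in seconds


def within_tolerance(marked: Segment, reference: Segment,
                     tolerance_s: float = TOLERANCE_S) -> bool:
    """True if both the start and the end are within the tolerance of the reference."""
    marked_start, marked_end = marked
    ref_start, ref_end = reference
    return (abs(marked_start - ref_start) <= tolerance_s
            and abs(marked_end - ref_end) <= tolerance_s)


def missed_segments(marked: list[Segment], references: list[Segment],
                    tolerance_s: float = TOLERANCE_S) -> list[Segment]:
    """Return reference segments that no marked segment matches (coverage gaps)."""
    return [
        ref for ref in references
        if not any(within_tolerance(m, ref, tolerance_s) for m in marked)
    ]


marked = [(12.3, 18.6)]
references = [(12.0, 19.0), (40.0, 45.0)]
print(missed_segments(marked, references))  # the second reference was never marked
```

Marking only the first matching segment, as in this example, leaves the later one reported as missed.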

## Visual Proxy Guidelines

Write descriptions that are:
- **Visually grounded**: "person in white chef coat slicing red tomatoes"
- **Specific**: Not "cooking" but "stirring wooden spoon in stainless steel pot"
- **Action-focused**: Include the action and key objects
- **CLIP-friendly**: Use clear, descriptive language
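
A crude self-check can catch the most obvious violations before submitting. The sketch below is a heuristic only: the subjective word list and the four-word minimum are assumptions chosen for illustration, not rules from this guide, and passing the check does not mean a description is good.

```python
# Hypothetical word list and threshold; tune or replace as needed.
SUBJECTIVE_WORDS = {"delicious", "tasty", "yummy", "amazing", "interesting"}
MIN_WORDS = 4


def description_warnings(description: str) -> list[str]:
    """Return warnings for descriptions that look vague or subjective."""
    # Strip surrounding punctuation and lowercase each word before checking.
    words = [w.strip(".,!?\"'").lower() for w in description.split()]
    words = [w for w in words if w]
    warnings = []
    if len(words) < MIN_WORDS:
        warnings.append("too short to name an action and its key objects")
    found = sorted(SUBJECTIVE_WORDS.intersection(words))
    if found:
        warnings.append("subjective wording: " + ", ".join(found))
    return warnings


print(description_warnings("person in white chef coat slicing red tomatoes"))
print(description_warnings("delicious food"))
```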

## Common Mistakes to Avoid
- Marking a segment too short (missing context)
- Marking a segment too long (including unrelated content)
- Vague queries ("something interesting")
- Abstract descriptions ("delicious food" instead of "golden brown crust on pie")

## Confidence Levels
- **High**: Certain about boundaries (within 0.5 seconds)
- **Medium**: Boundaries approximate (within 1-2 seconds)
- **Low**: Action is ambiguous or boundaries are unclear
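
The three levels can be read as a mapping from the annotator's own estimate of boundary uncertainty to a label. The sketch below makes two assumptions that the guide does not state: an uncertainty between 0.5 and 1 second is treated as Medium, and anything beyond 2 seconds counts as unclear boundaries and therefore Low. The function name and parameters are hypothetical.

```python
def confidence_level(boundary_uncertainty_s: float, action_ambiguous: bool = False) -> str:
    """Map estimated boundary uncertainty (in seconds) to a confidence label."""
    if action_ambiguous:
        return "Low"       # ambiguous action is Low regardless of boundaries
    if boundary_uncertainty_s <= 0.5:
        return "High"      # certain about boundaries
    if boundary_uncertainty_s <= 2.0:
        return "Medium"    # approximate boundaries (0.5 to 1 s gap assumed Medium)
    return "Low"           # boundaries unclear (assumed for more than 2 s)


print(confidence_level(0.3))
print(confidence_level(1.5))
print(confidence_level(0.3, action_ambiguous=True))
```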