Skip to content
← Back to explorer

Are LLMs Ready to Replace Bangla Annotators?

Md. Najib Hasan, Touseef Hasan, Souvika Sarkar · Feb 18, 2026 · Citations: 0

Abstract

Large Language Models (LLMs) are increasingly used as automated annotators to scale dataset creation, yet their reliability as unbiased annotators--especially for low-resource and identity-sensitive settings--remains poorly understood. In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging, and annotator bias can have serious downstream consequences. We conduct a systematic benchmark of 17 LLMs using a unified evaluation framework. Our analysis uncovers annotator bias and substantial instability in model judgments. Surprisingly, increased model scale does not guarantee improved annotation quality--smaller, more task-aligned models frequently exhibit more consistent behavior than their larger counterparts. These results highlight important limitations of current LLMs for sensitive annotation tasks in low-resource languages and underscore the need for careful evaluation before deployment.

Human Data Lens

  • Uses human feedback: No
  • Feedback types: None
  • Rater population: Unknown
  • Unit of annotation: Unknown
  • Expertise required: General

Evaluation Lens

  • Evaluation modes: Human Eval
  • Agentic eval: None
  • Quality controls: Not reported
  • Confidence: 0.35
  • Flags: low_signal, possible_false_positive

Research Summary

Contribution Summary

  • Large Language Models (LLMs) are increasingly used as automated annotators to scale dataset creation, yet their reliability as unbiased annotators--especially for low-resource and identity-sensitive settings--remains poorly understood.
  • In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging, and annotator bias can have serious downstream consequences.
  • We conduct a systematic benchmark of 17 LLMs using a unified evaluation framework.

Why It Matters For Eval

  • In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging, and annotator bias can have serious downstream consequences.
  • We conduct a systematic benchmark of 17 LLMs using a unified evaluation framework.

Related Papers