BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali

Shefayat E Shams Adib, Ahmed Alfey Sani, Ekramul Alam Esham, Ajwad Abrar, Ishmam Tashdeed, Md Taukir Azam Chowdhury · May 29, 2026 · Citations: 0

How to use this page

Low trust

Use this as background context only. Do not make protocol decisions from this page alone.

Best use

Background context only

What to verify

Validate the exact study setup in the full paper before operational use.

Evidence quality

Low

Derived from extracted protocol signals and abstract evidence.

Abstract

Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large language models (LLMs) for Bengali. We introduce BenHalluEval, a fine-grained hallucination evaluation framework for Bengali covering four tasks: Generative Question Answering (GQA), Bangla-English Code-Mixed QA, Summarization, and Reasoning. We construct 12,000 hallucinated candidates using GPT-5.4 across twelve task-specific hallucination types, drawn from three existing Bengali datasets, and evaluate seven LLMs spanning reasoning-oriented, multilingual, and Bengali-centric categories under a dual-track protocol that independently measures false-positive rate on ground-truth instances (Track A) and hallucination detection rate on hallucinated candidates (Track B). To jointly penalise both failure modes and prevent inflated scores from uniform response bias, we propose BenHalluScore, a dual-track calibration metric that ranges from 7.72% to 55.42% across models and tasks, revealing substantial variation in hallucination calibration. Chain-of-thought prompting, applied as a mitigation strategy, shifts response distributions without consistently improving hallucination discrimination. BenHalluEval establishes the first dedicated hallucination benchmark for Bengali and highlights the inadequacy of single-track and prompting-only evaluation approaches for low-resource language settings. The dataset and code are available at https://anonymous.4open.science/r/BanglaHalluEval-EB77.
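The dual-track protocol in the abstract can be sketched as two separate rates. This is a minimal illustration, assuming each model judgment is stored as a boolean meaning "flagged as hallucinated"; the function name `flag_rate`, the variable names, and the sample judgments are hypothetical and are not taken from the paper or its code.

```python
from typing import Sequence


def flag_rate(flags: Sequence[bool]) -> float:
    """Fraction of instances the model flagged as hallucinated."""
    if not flags:
        raise ValueError("need at least one judgment")
    return sum(flags) / len(flags)


# Hypothetical judgments: True means the model said "hallucinated".
track_a_flags = [False, False, True, False]  # ground-truth instances
track_b_flags = [True, True, False, True]    # hallucinated candidates

# Track A: any flag on a ground-truth instance is a false positive.
false_positive_rate = flag_rate(track_a_flags)
# Track B: any flag on a hallucinated candidate is a correct detection.
detection_rate = flag_rate(track_b_flags)

print(f"Track A false-positive rate: {false_positive_rate:.2%}")
print(f"Track B detection rate: {detection_rate:.2%}")
```

Keeping the two tracks separate means a model cannot improve one number simply by shifting its overall tendency to answer "hallucinated".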

Abstract-only analysis — low confidence

All signals on this page are inferred from the abstract only and may be inaccurate. Do not use this page as a primary protocol reference.

This paper looks adjacent to evaluation work, but not like a strong protocol reference.
The available metadata is too thin to trust this as a primary source.
The abstract does not clearly describe the evaluation setup.

Should You Rely On This Paper?

This paper is adjacent to HFEPX scope and is best used for background context, not as a primary protocol reference.

Best use

Background context only

Use if you need

Background context only.

Main weakness

This paper looks adjacent to evaluation work, but not like a strong protocol reference.

Trust level

Low

Usefulness score

0/100 • Low

Treat as adjacent context, not a core eval-method reference.

Human Feedback Signal

Not explicit in abstract metadata

Evaluation Signal

Weak / implicit signal

Usefulness for eval research

Adjacent candidate

Extraction confidence 35%

What We Could Verify

These are the protocol signals we could actually recover from the available paper metadata. Use them to decide whether this paper is worth deeper reading.

Human Feedback Types

missing

None explicit

No explicit feedback protocol extracted.

"Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large language models (LLMs) for Bengali."

Evaluation Modes

missing

None explicit

Validate eval design from full paper text.

"Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large language models (LLMs) for Bengali."

Quality Controls

partial

Calibration

Calibration/adjudication style controls detected.

"To jointly penalise both failure modes and prevent inflated scores from uniform response bias, we propose BenHalluScore, a dual-track calibration metric that ranges from 7.72% to 55.42% across models and tasks, revealing substantial variation in hallucination calibration."
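To see why uniform response bias can inflate a single-track score, consider a judge that flags every input as hallucinated. The sketch below is illustrative only: the abstract does not give the BenHalluScore formula, so `joint_score` uses detection rate minus false-positive rate (Youden's J statistic) as a stand-in, and all names and sample judgments are hypothetical.

```python
def rate(flags):
    """Fraction of instances flagged as hallucinated."""
    return sum(flags) / len(flags)


def joint_score(track_a_flags, track_b_flags):
    """Stand-in joint score (Youden's J), NOT the paper's BenHalluScore:
    Track B detection rate minus Track A false-positive rate."""
    return rate(track_b_flags) - rate(track_a_flags)


# A biased judge that answers "hallucinated" for everything.
biased_a = [True, True, True, True]   # ground-truth instances
biased_b = [True, True, True, True]   # hallucinated candidates

# A judge that actually discriminates between the two tracks.
careful_a = [False, False, False, True]
careful_b = [True, True, True, False]

biased_detection = rate(biased_b)             # looks perfect on Track B alone
biased_joint = joint_score(biased_a, biased_b)
careful_joint = joint_score(careful_a, careful_b)

print(f"biased judge, Track B only: {biased_detection:.2f}")
print(f"biased judge, joint score:  {biased_joint:.2f}")
print(f"careful judge, joint score: {careful_joint:.2f}")
```

Under this stand-in, the always-flag judge scores perfectly on Track B alone but drops to zero once Track A false positives are counted, which is the kind of inflation a dual-track metric is meant to prevent.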

Benchmarks / Datasets

partial

BenHalluEval, BanglaHalluEval

Useful for quick benchmark comparison.

"We introduce BenHalluEval, a fine-grained hallucination evaluation framework for Bengali covering four tasks: Generative Question Answering (GQA), Bangla-English Code-Mixed QA, Summarization, and Reasoning."

Reported Metrics

missing

Not extracted

No metric anchors detected.

"Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large language models (LLMs) for Bengali."

Human Feedback Details

Uses human feedback: No
Feedback types: None
Rater population: Not reported
Expertise required: Coding, Multilingual

Evaluation Details

Evaluation modes: None explicit
Agentic eval: None
Quality controls: Calibration
Evidence quality: Low
Use this page as: Background context only

Protocol And Measurement Signals

Benchmarks / Datasets

BenHalluEval, BanglaHalluEval

Reported Metrics

No metric terms were extracted from the available abstract.

Research Brief

Metadata summary

Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large language models (LLMs) for Bengali.

Based on abstract + metadata only. Check the source paper before making high-confidence protocol decisions.

Key Takeaways

Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large language models (LLMs) for Bengali.
We introduce BenHalluEval, a fine-grained hallucination evaluation framework for Bengali covering four tasks: Generative Question Answering (GQA), Bangla-English Code-Mixed QA, Summarization, and Reasoning.
We construct 12,000 hallucinated candidates using GPT-5.4 across twelve task-specific hallucination types, drawn from three existing Bengali datasets, and evaluate seven LLMs spanning reasoning-oriented, multilingual, and Bengali-centric categories under a dual-track protocol that independently measures false-positive rate on ground-truth instances (Track A) and hallucination detection rate on hallucinated candidates (Track B).

Researcher Actions

Compare this paper against nearby papers in the same arXiv category before using it for protocol decisions.
Validate inferred eval signals (Automatic metrics) against the full paper.
Use related-paper links to find stronger protocol-specific references.

Caveats

Generated from abstract + metadata only; no PDF parsing.
Signals below are heuristic and may miss details reported outside the abstract.

Research Summary

Contribution Summary

We introduce BenHalluEval, a fine-grained hallucination evaluation framework for Bengali covering four tasks: Generative Question Answering (GQA), Bangla-English Code-Mixed QA, Summarization, and Reasoning.
To jointly penalise both failure modes and prevent inflated scores from uniform response bias, we propose BenHalluScore, a dual-track calibration metric that ranges from 7.72% to 55.42% across models and tasks, revealing substantial…
BenHalluEval establishes the first dedicated hallucination benchmark for Bengali and highlights the inadequacy of single-track and prompting-only evaluation approaches for low-resource language settings.

Why It Matters For Eval

We introduce BenHalluEval, a fine-grained hallucination evaluation framework for Bengali covering four tasks: Generative Question Answering (GQA), Bangla-English Code-Mixed QA, Summarization, and Reasoning.
BenHalluEval establishes the first dedicated hallucination benchmark for Bengali and highlights the inadequacy of single-track and prompting-only evaluation approaches for low-resource language settings.

Researcher Checklist

Gap: Human feedback protocol is explicit
No explicit human feedback protocol detected.

Gap: Evaluation mode is explicit
No clear evaluation mode extracted.

Pass: Quality control reporting appears
Detected: Calibration

Pass: Benchmark or dataset anchors are present
Detected: BenHalluEval, BanglaHalluEval

Gap: Metric reporting is present
No metric terms extracted.
