OpenTrain AI
No verified implementation yet

HUKUKBERT: Domain-Specific Language Model for Turkish Law

Mehmet Utku Öztürk, Tansu Türkoğlu, Buse Buz-Yalug

April 6, 2026
arXiv: 2604.04790
0 repos, ~a few days to reproduce

Abstract

Recent advances in natural language processing (NLP) have increasingly enabled LegalTech applications, yet existing studies specific to Turkish law have still been limited due to the scarcity of domain-specific data and models. Although extensive models like LEGAL-BERT have been developed for English legal texts, the Turkish legal domain lacks a domain-specific high-volume counterpart. In this paper, we introduce Huk...

Summary

HUKUKBERT is a domain-specific Turkish legal language model trained on an 18 GB cleaned legal corpus using a hybrid Domain-Adaptive Pre-Training pipeline with multiple masking strategies. The paper also introduces a novel Legal Cloze Test for masked legal term prediction. The model is evaluated on this benchmark and on the v12 Court Decision Segmentation dataset, where it outperforms prior Turkish baselines on accuracy, boundary metrics, and per-segment F1.

Key Contributions

  • Introduce HUKUKBERT, a high-volume Turkish legal language model trained on an 18 GB cleaned legal corpus with hybrid Domain-Adaptive Pre-Training.
  • Combine Whole-Word, Token Span, Word Span, and targeted Keyword Masking in the pre-training pipeline.
  • Design a 48K WordPiece tokenizer and compare it against general-purpose and existing Turkish domain-specific models.
  • Propose a novel Legal Cloze Test benchmark for masked legal term prediction in Turkish court decisions.
  • Report 84.40 Top-1 accuracy on the Legal Cloze Test and 99.0% accuracy on v12 Court Decision Segmentation, improving on prior Turkish models.
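
The masking strategies listed above can be illustrated with a minimal sketch of whole-word masking combined with targeted keyword masking over WordPiece tokens. This is not the authors' pipeline: the masking probability, the keyword set, the example tokenization, and the function names are all assumptions made for illustration, and the span-based strategies are not covered.

```python
import random

MASK = "[MASK]"

def group_wordpieces(tokens):
    """Group token indices into words; a '##' prefix continues the previous word."""
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

def mask_whole_words(tokens, mask_prob=0.15, keywords=frozenset(), seed=0):
    """Mask every piece of a selected word.

    A word is selected if it is in `keywords` (hypothetical legal term list)
    or, otherwise, with probability `mask_prob` (assumed value, not from the paper).
    Returns (masked_tokens, labels); labels hold the original piece at masked
    positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for group in group_wordpieces(tokens):
        # Rebuild the surface word by stripping the '##' continuation prefix.
        word = "".join(
            tokens[i][2:] if tokens[i].startswith("##") else tokens[i]
            for i in group
        )
        if word in keywords or rng.random() < mask_prob:
            for i in group:
                labels[i] = tokens[i]
                masked[i] = MASK
    return masked, labels

# Hypothetical tokenization of a short Turkish legal phrase.
example_tokens = ["mahkeme", "karar", "##ı", "temyiz", "edil", "##di"]
masked_tokens, mlm_labels = mask_whole_words(
    example_tokens, mask_prob=0.0, keywords=frozenset({"temyiz"})
)
```

With `mask_prob=0.0`, only the keyword is masked, which makes the keyword path easy to inspect in isolation; raising `mask_prob` adds random whole-word masking on top.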

Reproducibility Notes

  • No official or verified code repository is available; reproduction is paper-only.
  • Model training depends on an 18 GB cleaned Turkish legal corpus whose exact construction must be inferred from the paper.
  • Evaluation for Legal Cloze Test and v12 segmentation must be reimplemented from the described protocols.
  • Expect multi-day effort for corpus preparation, pre-training, and segmentation experiments; log all assumptions.
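
Because the segmentation evaluation has to be reimplemented, it may help to pin down one concrete reading of a boundary metric early. The sketch below treats a boundary as any position where the segment label changes and scores predicted boundaries against gold ones with precision, recall, and F1. The paper's exact boundary definition is not given here, so this definition, the label names, and the function names are assumptions.

```python
def boundaries(labels):
    """Return the set of positions where the label differs from the previous one."""
    return {i for i in range(1, len(labels)) if labels[i] != labels[i - 1]}

def boundary_prf(pred_labels, gold_labels):
    """Exact-match boundary precision, recall, and F1 for one document.

    Assumes both sequences label the same units (for example sentences)
    and therefore have equal length.
    """
    if len(pred_labels) != len(gold_labels):
        raise ValueError("prediction and gold sequences must have equal length")
    pred_b = boundaries(pred_labels)
    gold_b = boundaries(gold_labels)
    true_pos = len(pred_b & gold_b)
    precision = true_pos / len(pred_b) if pred_b else 0.0
    recall = true_pos / len(gold_b) if gold_b else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical segment labels for a five-sentence court decision.
gold = ["HEADER", "HEADER", "FACTS", "FACTS", "RULING"]
pred = ["HEADER", "FACTS", "FACTS", "FACTS", "RULING"]
scores = boundary_prf(pred, gold)
```

Exact-match scoring is strict; a tolerance window around each gold boundary is a common relaxation and would be worth recording as a separate assumption if used.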

Results & Benchmarks

Task                           Dataset                           Metric           Value
Masked legal term prediction   Legal Cloze Test (test)           Top-1 accuracy   84.40
Court decision segmentation    v12 Court Decision Segmentation   Accuracy         99.0
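
The Top-1 metric reported for the Legal Cloze Test can be sketched as follows: for each masked legal term, the model's highest-ranked candidate is compared with the gold term. The input format (a ranked candidate list per item), the example terms, and the function name are assumptions; how the benchmark handles multi-token terms or casing is not specified here.

```python
def top1_accuracy(ranked_predictions, gold_terms):
    """Percentage of items whose top-ranked candidate equals the gold term.

    ranked_predictions: list of candidate lists, best candidate first.
    gold_terms: list of gold strings, aligned with ranked_predictions.
    """
    if len(ranked_predictions) != len(gold_terms):
        raise ValueError("predictions and gold terms must be aligned")
    if not gold_terms:
        return 0.0
    correct = sum(
        1
        for candidates, gold in zip(ranked_predictions, gold_terms)
        if candidates and candidates[0] == gold
    )
    return 100.0 * correct / len(gold_terms)

# Hypothetical cloze items: two of three top candidates match the gold term.
predictions = [["davacı", "davalı"], ["temyiz"], ["karar", "hüküm"]]
gold = ["davacı", "istinaf", "karar"]
score = top1_accuracy(predictions, gold)
```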

Hardware Requirements

  • Expect multi-day setup and compute for a meaningful reproduction.

Best Implementation

Maintained implementation evidence is not confirmed for this paper yet.

Use the Implementation Status and Reproduction Path sections below for the current action plan.

Reproduction Path

Follow this baseline workflow to decide if this paper is worth immediate prototyping.

  1. Use the paper and benchmark evidence to scope a baseline reproduction plan.

  2. Track assumptions and missing details in an experiment log before coding.
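
One lightweight way to keep the experiment log described above is an append-only JSON Lines file with one record per assumption. The field names, file layout, and function names below are illustrative choices and not part of the paper or of this page's guidance.

```python
import json
from datetime import datetime, timezone

def log_assumption(path, topic, assumption, source="paper"):
    """Append one assumption record to a JSON Lines file and return it."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "topic": topic,            # e.g. "corpus", "tokenizer", "masking"
        "assumption": assumption,  # what was assumed and why
        "source": source,          # where the missing detail should have come from
    }
    with open(path, "a", encoding="utf-8") as f:
        # ensure_ascii=False keeps Turkish characters readable in the log.
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry

def load_assumptions(path):
    """Read all assumption records back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A call such as `log_assumption("assumptions.jsonl", "masking", "15% whole-word masking rate assumed")` records a single entry; the file name and the rate in that call are placeholders.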

Time to first repro: a few days (estimate based on a paper-only reproduction flow)

Additional Implementations

No additional verified repositories are listed.

Hugging Face Artifacts

No trustworthy direct or curated related Hugging Face artifacts were found yet.