Conditioning LLMs to Generate Code-Switched Text

Maite Heredia, Gorka Labaka, Jeremy Barnes, Aitor Soroa · Feb 18, 2025 · Citations: 0

Coding Pairwise Preference

How to use this page

Low trust

Use this as background context only. Do not make protocol decisions from this page alone.

Best use

Background context only

What to verify

Read the full paper before copying any benchmark, metric, or protocol choices.

Evidence quality

Low

Derived from extracted protocol signals and abstract evidence.

Abstract

Code-switching (CS) is still a critical challenge in Natural Language Processing (NLP), due to the limited availability of large-scale, diverse CS datasets for robust training and evaluation. Despite recent advances, the capabilities and limitations of LLMs in handling CS are still not fully understood. In this work, we investigate the extent to which LLMs can be used in a framework for CS text generation, focusing on the English-Spanish language pair. Our proposed methodology consists of back-translating natural CS sentences into monolingual English, and using the resulting parallel corpus to fine-tune LLMs to turn monolingual sentences into CS. We thoroughly analyse the models' performance through a study on human preferences, a qualitative error analysis, an evaluation with popular reference-based metrics and LLM-based judgment. Results show that fine-tuning can be a key step to ensure that current LLMs consistently generate fluent code-switched text and that our methodology generates high-quality outputs, expanding research opportunities in CS communication. We find that traditional metrics do not correlate with human judgement when assessing the quality of the generated CS data, but LLM-based judgment aligns more closely with human preferences. We release our code and generated dataset under a CC-BY-NC-SA license.
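The abstract's data-preparation step (pair each natural CS sentence with its monolingual English back-translation, then fine-tune on the English-to-CS direction) can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: the `ParallelPair` type, the instruction wording, the prompt/completion record layout, and the example sentence are all assumptions made up for this sketch, and the back-translation itself is assumed to have been produced elsewhere.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelPair:
    """One hypothetical corpus entry (names are illustrative, not from the paper)."""
    english: str        # monolingual English back-translation
    code_switched: str  # original, naturally occurring CS sentence


# Assumed instruction text; the paper's actual prompt is not given in the abstract.
INSTRUCTION = "Rewrite the English sentence as English-Spanish code-switched text."


def to_record(pair: ParallelPair) -> dict:
    """Turn a pair into a prompt/completion record: English in, CS out."""
    prompt = f"{INSTRUCTION}\n\nEnglish: {pair.english}\nCode-switched:"
    return {"prompt": prompt, "completion": " " + pair.code_switched}


def build_records(pairs: list[ParallelPair]) -> list[dict]:
    """Drop empty, unchanged, and duplicate pairs, then format the rest."""
    seen: set[tuple[str, str]] = set()
    records = []
    for pair in pairs:
        en, cs = pair.english.strip(), pair.code_switched.strip()
        if not en or not cs:
            continue  # one side is missing
        if en == cs:
            continue  # identical sides carry no code-switching signal
        if (en, cs) in seen:
            continue  # exact duplicate
        seen.add((en, cs))
        records.append(to_record(ParallelPair(en, cs)))
    return records


if __name__ == "__main__":
    # Invented example sentence, for illustration only.
    demo = [
        ParallelPair(
            english="I'm going to the store because I need milk.",
            code_switched="I'm going to the store porque necesito leche.",
        )
    ]
    for record in build_records(demo):
        print(json.dumps(record, ensure_ascii=False))
```

Each printed line is one JSON record, so the output can be redirected to a JSON Lines file and passed to whichever fine-tuning toolkit is in use. The filtering rules are a design choice for this sketch; the abstract does not describe any filtering.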

Low-signal caution for protocol decisions

Use this page for context, then validate protocol choices against stronger HFEPX references before making implementation decisions.

The available metadata is too thin to trust this as a primary source.
The abstract does not clearly describe the evaluation setup.
The abstract does not clearly name benchmarks or metrics.

Should You Rely On This Paper?

This paper is adjacent to HFEPX scope and is best used for background context, not as a primary protocol reference.

Use if you need

Background context only.

Main weakness

The available metadata is too thin to trust this as a primary source.

Trust level

Low

Usefulness score

40/100 • Low

Treat as adjacent context, not a core eval-method reference.

Human Feedback Signal

Detected

Evaluation Signal

Weak / implicit signal

Usefulness for eval research

Adjacent candidate

Extraction confidence

45%

What We Could Verify

These are the protocol signals we could actually recover from the available paper metadata. Use them to decide whether this paper is worth deeper reading.