Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving

Ferran Agullo, Joan Oliveras, Chen Wang, Alberto Gutierrez-Torre, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Ll. Berral · Feb 27, 2026 · Citations: 0

Automatic Metrics

Abstract

Large Language Model (LLM) adapters enable low-cost model specialization, but introduce complex caching and scheduling challenges in distributed serving systems where hundreds of adapters must be hosted concurrently. While prior work has largely focused on latency minimization, resource efficiency through throughput maximization remains underexplored. This paper presents a data-driven pipeline that, for a given workload, computes an adapter placement that serves the workload with the minimum number of GPUs while avoiding request starvation and GPU memory errors. To that end, the approach identifies the maximum feasible throughput attainable on each GPU by leveraging accurate performance predictions learned from real serving behavior. The proposed pipeline integrates three components: (i) a Digital Twin (DT) tailored to LLM-adapter serving, (ii) a distilled machine learning (ML) model trained on DT-generated data, and (iii) a greedy placement algorithm that exploits ML-based performance estimates to maximize GPU efficiency. The DT emulates real system dynamics with high fidelity, achieving below 5% throughput estimation error while executing up to 90 times faster than full LLM benchmarking across both predictable and unpredictable workloads. The learned ML models further accelerate performance estimation with marginal accuracy degradation, enabling scalable optimization. Experimental results demonstrate that the pipeline substantially improves GPU efficiency by reducing the number of GPUs required to sustain target workloads. Beyond GPU efficiency, the pipeline can be adapted to alternative objectives, such as latency minimization, highlighting its versatility for future large-scale LLM serving infrastructures.
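The abstract names a greedy placement algorithm driven by ML-based performance estimates but gives no details of it. As a rough illustration only, the sketch below packs adapters onto GPUs first-fit, in decreasing order of load. It opens a new GPU when the predicted maximum throughput or the memory limit would be exceeded, which stands in for avoiding starvation and memory errors. Every name here is hypothetical (`Adapter`, `toy_max_throughput`, `fits`, `greedy_placement`, and the 16 GB limit), and `toy_max_throughput` is a made-up stand-in for the learned model, not the paper's predictor.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Adapter:
    name: str
    request_rate: float  # offered load, e.g. requests/s (assumed unit)
    memory_gb: float     # GPU memory needed to keep the adapter loaded


def toy_max_throughput(adapters: List[Adapter]) -> float:
    """Hypothetical stand-in for the learned performance model:
    sustainable throughput shrinks as more adapters share one GPU."""
    return 100.0 / (1.0 + 0.05 * len(adapters))


def fits(gpu: List[Adapter], candidate: Adapter,
         predict: Callable[[List[Adapter]], float],
         memory_limit_gb: float) -> bool:
    """True if adding `candidate` keeps the GPU within memory and
    within the predicted maximum feasible throughput."""
    trial = gpu + [candidate]
    load = sum(a.request_rate for a in trial)
    memory = sum(a.memory_gb for a in trial)
    return memory <= memory_limit_gb and load <= predict(trial)


def greedy_placement(adapters: List[Adapter],
                     predict: Callable[[List[Adapter]], float] = toy_max_throughput,
                     memory_limit_gb: float = 16.0) -> List[List[Adapter]]:
    """First-fit-decreasing packing: heaviest adapters first, each placed
    on the first GPU that can absorb it, else on a newly opened GPU."""
    gpus: List[List[Adapter]] = []
    for adapter in sorted(adapters, key=lambda a: a.request_rate, reverse=True):
        for gpu in gpus:
            if fits(gpu, adapter, predict, memory_limit_gb):
                gpu.append(adapter)
                break
        else:
            # No existing GPU can take it; it must at least fit alone.
            if not fits([], adapter, predict, memory_limit_gb):
                raise ValueError(f"adapter {adapter.name} cannot be served by one GPU")
            gpus.append([adapter])
    return gpus


workload = [
    Adapter("a", 40.0, 2.0),
    Adapter("b", 40.0, 2.0),
    Adapter("c", 30.0, 2.0),
    Adapter("d", 20.0, 2.0),
    Adapter("e", 10.0, 2.0),
]
placement = greedy_placement(workload)
for index, gpu in enumerate(placement):
    print(index, [a.name for a in gpu])
```

First-fit decreasing is a common bin-packing heuristic chosen here for brevity; the paper's actual placement rule and predictor may differ.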

HFEPX Relevance Assessment

This paper appears adjacent to HFEPX scope (human-feedback/eval), but does not show strong direct protocol evidence in metadata/abstract.

Eval-Fit Score

0/100 • Low

Treat as adjacent context, not a core eval-method reference.

Human Feedback Signal

Not explicit in abstract metadata

Evaluation Signal

Detected

HFEPX Fit

Adjacent candidate

If you are doing eval pipeline work, start here:

Human Eval Hub · LLM-as-Judge Hub · Pairwise Preference Hub · Tool-Use Eval Hub

Human Data Lens

Uses human feedback: No
Feedback types: None
Rater population: Unknown
Unit of annotation: Unknown
Expertise required: General
Extraction source: Runtime deterministic fallback

Evaluation Lens

Evaluation modes: Automatic Metrics
Agentic eval: None
Quality controls: Not reported
Confidence: 0.35
Flags: low_signal, possible_false_positive, runtime_fallback_extraction

Protocol And Measurement Signals

Benchmarks / Datasets

No benchmark or dataset names were extracted from the available abstract.

Reported Metrics

accuracy, latency, throughput, cost

Research Brief

Deterministic synthesis

The DT emulates real system dynamics with high fidelity, achieving below 5% throughput estimation error while executing up to 90 times faster than full LLM benchmarking across both predictable and unpredictable workloads. HFEPX signals include Automatic Metrics with confidence 0.35. Updated from current HFEPX corpus.

Generated Mar 3, 2026, 8:37 PM · Grounded in abstract + metadata only

Key Takeaways

The DT emulates real system dynamics with high fidelity, achieving below 5% throughput estimation error while executing up to 90 times faster than full LLM benchmarking across both…
The learned ML models further accelerate performance estimation with marginal accuracy degradation, enabling scalable optimization.

Researcher Actions

Treat this as method context, then pivot to protocol-specific HFEPX hubs.
Identify benchmark choices from full text before operationalizing conclusions.
Validate metric comparability (accuracy, latency, throughput).

Caveats

Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
Low-signal flag detected: protocol relevance may be indirect.

Recommended Queries

human-eval protocol design · pairwise preference data quality · inter-rater agreement adjudication

Research Summary

Contribution Summary

The DT emulates real system dynamics with high fidelity, achieving below 5% throughput estimation error while executing up to 90 times faster than full LLM benchmarking across both predictable and unpredictable workloads.
The learned ML models further accelerate performance estimation with marginal accuracy degradation, enabling scalable optimization.
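To make the two headline figures concrete, the sketch below shows one way a throughput estimation error and a speedup factor could be computed. The relative-error definition and all numbers are assumptions chosen for illustration; they are not taken from the paper, which may define its error metric differently.

```python
def relative_error(estimated: float, measured: float) -> float:
    """Absolute estimation error as a fraction of the measured value."""
    if measured <= 0:
        raise ValueError("measured throughput must be positive")
    return abs(estimated - measured) / measured


def speedup(benchmark_seconds: float, twin_seconds: float) -> float:
    """How many times faster the emulation runs than the real benchmark."""
    if twin_seconds <= 0:
        raise ValueError("twin runtime must be positive")
    return benchmark_seconds / twin_seconds


# Hypothetical values, for illustration only.
measured_tput = 1200.0   # tokens/s observed on the real serving system
estimated_tput = 1150.0  # tokens/s predicted by the Digital Twin
err = relative_error(estimated_tput, measured_tput)

# A one-hour benchmark emulated in 40 seconds.
ratio = speedup(3600.0, 40.0)

print(f"relative error: {err:.2%}, speedup: {ratio:.0f}x")
```

With these made-up inputs, the error falls under the 5% bound and the speedup matches the "up to 90 times" upper figure quoted above.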

Why It Matters For Eval

The DT emulates real system dynamics with high fidelity, achieving below 5% throughput estimation error while executing up to 90 times faster than full LLM benchmarking across both predictable and unpredictable workloads.

Researcher Checklist

Gap: Human feedback protocol is explicit
No explicit human feedback protocol detected.

Pass: Evaluation mode is explicit
Detected: Automatic Metrics

Gap: Quality control reporting appears
No calibration/adjudication/IAA control explicitly detected.

Gap: Benchmark or dataset anchors are present
No benchmark/dataset anchor extracted from abstract.

Pass: Metric reporting is present
Detected: accuracy, latency, throughput, cost

Category-Adjacent Papers (Broader Context)

These papers are nearby in arXiv category and useful for broader context, but not necessarily protocol-matched to this paper.

Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning · Category Neighbor
Citations: 0 · Relevance: 7.75
- Shared arXiv category (cs.AI, cs.CL)
- Shared metric mentions
- Shared terminology (accuracy, cost, optimization)

Confusion-Aware Rubric Optimization for LLM-based Automated Grading · Category Neighbor
Citations: 0 · Relevance: 5.35
- Shared arXiv category (cs.AI, cs.CL)
- Shared metric mentions
- Shared terminology (accuracy, optimization, efficiency)

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking · Category Neighbor
Citations: 0 · Relevance: 4.80
- Shared arXiv category (cs.AI, cs.CL)

DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science · Category Neighbor
Citations: 0 · Relevance: 4.45
- Shared arXiv category (cs.AI, cs.CL)
- Shared metric mentions
- Shared terminology (accuracy)

Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance · Category Neighbor
Citations: 0 · Relevance: 4.45
- Shared arXiv category (cs.AI, cs.CL)
- Shared metric mentions
- Shared terminology (accuracy)

When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation · Category Neighbor
Citations: 0 · Relevance: 4.45
- Shared arXiv category (cs.AI, cs.CL)
- Shared metric mentions
- Shared terminology (accuracy)

IDP Accelerator: Agentic Document Intelligence from Extraction to Compliance Validation · Category Neighbor
Citations: 0 · Relevance: 4.10
- Shared arXiv category (cs.CL)
- Shared metric mentions
- Shared terminology (accuracy, latency)

LFQA-HP-1M: A Large-Scale Human Preference Dataset for Long-Form Question Answering · Category Neighbor
Citations: 0 · Relevance: 3.20
- Shared arXiv category (cs.AI, cs.CL)
