FSPO: Few-Shot Optimization of Synthetic Preferences Personalizes to Real Users

Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, Chelsea Finn · Feb 26, 2025 · Citations: 0

Open arXiv Find Implementation RSS feed Shortlist (0)

How to use this page

Provisional trust

This page is a lightweight research summary built from the abstract and metadata while deeper extraction catches up.

Best use

Background context only

What to verify

Read the full paper before copying any benchmark, metric, or protocol choices.

Evidence quality

Provisional

Derived from abstract and metadata only.

Abstract

Effective personalization of LLMs is critical for a broad range of user-interfacing applications such as virtual assistants and content curation. Inspired by the strong in-context capabilities of LLMs, we propose few-shot preference optimization (FSPO), an algorithm for LLM personalization that reframes reward modeling as a meta-learning problem. Under FSPO, an LLM learns to quickly infer a personalized reward function for a user via a few labeled preferences. FSPO also utilizes user description rationalization (RAT) to encourage better reward modeling and instruction following, recovering performance with the oracle user description. Since real-world preference data is challenging to collect at scale, we propose careful design choices to construct synthetic preference datasets for personalization, generating over 1M synthetic personalized preferences using publicly available LLMs. To successfully transfer from synthetic data to real users, we find it crucial for the data to exhibit both high diversity and coherent, self-consistent structure. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, education, and open-ended question answering. We also run a controlled human study. Overall, FSPO achieves an 87% Alpaca Eval winrate in generating responses that are personalized to synthetic users and a 70% winrate with real human users in open-ended question answering.

Abstract-only analysis — low confidence

All signals on this page are inferred from the abstract only and may be inaccurate. Do not use this page as a primary protocol reference.

This page is still relying on abstract and metadata signals, not a fuller protocol read.

Human Eval Hub LLM-as-Judge Hub Pairwise Preference Hub

Should You Rely On This Paper?

Signal extraction is still processing. This page currently shows metadata-first guidance until structured protocol fields are ready.

Best use

Background context only

Use if you need

A provisional background reference while structured extraction finishes.

Main weakness

This page is still relying on abstract and metadata signals, not a fuller protocol read.

Trust level

Provisional

Usefulness score

Unavailable

Eval-fit score is unavailable until extraction completes.

Human Feedback Signal

Not explicit in abstract metadata

Evaluation Signal

Weak / implicit signal

Usefulness for eval research

Provisional (processing)

Extraction confidence 0%

If you are doing eval pipeline work, start here:

Human Eval Hub LLM-as-Judge Hub Pairwise Preference Hub Tool-Use Eval Hub

What We Could Verify

These are the protocol signals we could actually recover from the available paper metadata. Use them to decide whether this paper is worth deeper reading.