Multitask Prompted Training Enables Zero-Shot Task Generalization
Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, Mehwish Bari, Canwen Xu, Urmish Thakker, Shanya Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng-Xin Yong, Harshit Pandey, Michael McKenna, Rachel Bawden, Thomas J. Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Févry, Jason Fries, Ryan Teehan, Tali Bers, Stella Biderman, Leo Gao, Thomas Wolf
Core AI workload signals detected from paper context and implementation/artifact evidence.
Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language models' pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping any natural language tasks into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts with diverse wording. These prompted datasets allow for benchmarking the ability of a model to perform completely held-out tasks. We fine-tune a pretrained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16x its size. Further, our approach attains strong performance on a subset of tasks from the BIG-bench benchmark, outperforming models up to 6x its size. All trained models are available at https://github.com/bigscience-workshop/t-zero and all prompts are available at https://github.com/bigscience-workshop/promptsource.
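To make the idea of a "human-readable prompted form" concrete, the sketch below renders one supervised example through several differently worded templates, producing one (input, target) text pair per template. It is only an illustration under stated assumptions: the `PromptTemplate` class, the `to_prompted` helper, the two template wordings, and the label-to-answer mapping are all hypothetical and are not taken from the paper or from the promptsource library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical mapping from a dataset's label names to natural language answers.
ANSWERS: Dict[str, str] = {
    "entailment": "Yes",
    "neutral": "Maybe",
    "contradiction": "No",
}


@dataclass(frozen=True)
class PromptTemplate:
    """One way of wording a task (hypothetical helper, not a promptsource class)."""

    name: str
    render_input: Callable[[Dict[str, str]], str]
    render_target: Callable[[Dict[str, str]], str]


# Two illustrative wordings of the same natural language inference task.
TEMPLATES: List[PromptTemplate] = [
    PromptTemplate(
        name="does_it_follow",
        render_input=lambda ex: (
            f'{ex["premise"]}\n'
            f'Question: Does it follow that "{ex["hypothesis"]}"? Yes, no, or maybe?'
        ),
        render_target=lambda ex: ANSWERS[ex["label"]],
    ),
    PromptTemplate(
        name="given_can_we_conclude",
        render_input=lambda ex: (
            f'Given that "{ex["premise"]}", can we conclude that '
            f'"{ex["hypothesis"]}"? Yes, no, or maybe?'
        ),
        render_target=lambda ex: ANSWERS[ex["label"]],
    ),
]


def to_prompted(example: Dict[str, str], templates: List[PromptTemplate]) -> List[Dict[str, str]]:
    """Render one supervised example into one text-to-text pair per template."""
    return [
        {
            "template": t.name,
            "input": t.render_input(example),
            "target": t.render_target(example),
        }
        for t in templates
    ]


# A made-up example record in the shape the templates expect.
EXAMPLE: Dict[str, str] = {
    "premise": "A dog is running through a field.",
    "hypothesis": "An animal is outdoors.",
    "label": "entailment",
}

if __name__ == "__main__":
    for pair in to_prompted(EXAMPLE, TEMPLATES):
        print(pair["template"], "->", pair["target"])
        print(pair["input"], end="\n\n")
```

Because every template emits plain text for both input and target, pairs of this shape can be fed to any text-to-text model, which is what allows many datasets to be combined into a single multitask mixture.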
Results & Benchmarks
No concrete benchmark grounding is available yet. Treat the page as context or an implementation starting point only.
Implementation Evidence Summary
Hannibal046/Awesome-LLM is the closest maintained adjacent implementation; it was matched only on the contextual method/domain keyword "language model". It is not paper-verified, so validate the algorithm and evaluation setup against the paper before trusting reported metrics. Community adoption signal: 26,923 GitHub stars.
Reproduction Risks
- Adjacent implementations, including the recommended repository, are not paper-verified.
Hardware Notes
Expect multi-day setup/compute for meaningful reproduction based on current guidance.
Evidence disclosure
Evidence graph: 3 refs, 3 links.
Utility signals: depth 70/100, grounding 75/100, status medium.
Implementation Status
There is no verified maintained implementation yet. Use this baseline plan to decide whether to prototype now or defer.
- No maintained paper-verified implementation was found; start with the closest related repositories below.
- Compare repo methods against the paper equations/algorithm before trusting metrics.
- Create a minimal baseline implementation from the paper and use adjacent repos as references.
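As a starting point for such a baseline, the sketch below covers only the evaluation scaffolding: it partitions datasets by task so that held-out tasks never enter the training mixture, and it scores predictions with exact-match accuracy. The function names, the placeholder task and dataset names, and the choice of exact-match scoring are assumptions made for illustration; they do not reproduce the paper's actual task split or scoring procedure.

```python
from typing import Dict, List, Sequence, Tuple


def split_by_task(
    datasets_by_task: Dict[str, List[str]],
    held_out_tasks: Sequence[str],
) -> Tuple[List[str], List[str]]:
    """Return (training datasets, held-out datasets), split at the task level.

    Every dataset of a held-out task is excluded from training, so evaluating
    on those datasets measures generalization to unseen tasks.
    """
    unknown = set(held_out_tasks) - set(datasets_by_task)
    if unknown:
        raise ValueError(f"Unknown held-out tasks: {sorted(unknown)}")

    held_out = set(held_out_tasks)
    train: List[str] = []
    evaluation: List[str] = []
    for task, datasets in datasets_by_task.items():
        (evaluation if task in held_out else train).extend(datasets)
    return train, evaluation


def accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Exact-match accuracy after trimming whitespace and lowercasing."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(
        p.strip().lower() == t.strip().lower()
        for p, t in zip(predictions, targets)
    )
    return correct / len(targets)


if __name__ == "__main__":
    # Placeholder task and dataset names, not the paper's mixture.
    mixture = {
        "task_a": ["dataset_a1", "dataset_a2"],
        "task_b": ["dataset_b1"],
        "task_c": ["dataset_c1"],
    }
    train_sets, eval_sets = split_by_task(mixture, held_out_tasks=["task_c"])
    print("train on:", train_sets)
    print("evaluate on:", eval_sets)
    print("accuracy:", accuracy(["Yes", "no"], ["yes", "maybe"]))
```

Splitting at the task level rather than the example level is the design choice that matters here: an example-level split would leak task formats into training and would no longer test performance on completely held-out tasks.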
Reproduction readiness
Hardware requirements
- Expect multi-day setup/compute for meaningful reproduction based on current guidance.
No verified implementation available
- No maintained repository has been identified for this paper. Check adjacent implementations or HF artifacts below.
No benchmark numbers could be verified. You will not be able to validate reproduction correctness against published numbers.
Closest related implementations
These are not paper-verified. Use them as reference points when no direct implementation is available.
- Hannibal046/Awesome-LLM (Adjacent; Confidence: Medium; Stars: 26,923)
Matches contextual method/domain keyword: language model
- RUCAIBox/LLMSurvey (Adjacent; Confidence: Medium; Stars: 12,170)
Matches contextual method/domain keyword: language model
Hugging Face artifacts
No trustworthy direct or curated related Hugging Face artifacts were found yet.
Continue with targeted Hugging Face searches derived from the paper title and method context.
Tip: start with models, then check datasets/spaces if you need evaluation data or demos.
Direct artifact matches are currently sparse. Use targeted Hugging Face searches to quickly locate candidate models, datasets, and demos.
Research context
Citations: 561
References: 132
Tasks
Computer science, Generalization, Benchmark (surveying), Task (project management), Set (abstract data type), Encoder, Benchmarking, Multi-task learning
Methods
Language model
Domains
Artificial intelligence, Machine learning, Natural language processing
Related papers
- COMPARISON OF BEST PRACTICE BENCHMARKING MODELS (2011) (Semantic similarity)
- Strategic Benchmarking: How to Rate Your Company's Performance against the World's Best (1993) (Semantic similarity)
- Theoretical Aspects of Benchmarking Theory (2004) (Semantic similarity)
- A guide for mental health clinicians to develop and undertake benchmarking activities (2010) (Semantic similarity)
- Comparing ourselves: using benchmarking techniques to measure performance between academic libraries (2009) (Semantic similarity)
- Benchmarking: Roadmap to best practices (2006) (Semantic similarity)