Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions

Q: How reproducible is "Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions"?

Estimated time to first reproduction: a few days. Risk flag: only a historical official implementation is available (nightingal3/llm-pretraining-behaviours).

Q: What framework is used to implement "Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions"?

No framework is identified for the primary implementation.

Emmy Liu, Amanda Bertsch, Lintang Sutawika, Lindia Tjuatja, Patrick Fernandes, Lara Marinov, Michael Chen, Shreya Singhal, Carolin Lawrence, Aditi Raghunathan, Kiril Gashteovski, Graham Neubig

Published: Mar 5, 2025

Historical official implementation (not recommended for new builds)

Evidence: Historical

Domain fit: AI-adjacent

Verified repos: 1

Top repo stars: 7

Paper appears method- or tooling-adjacent to AI workflows with partial ecosystem coverage.


Improvements in language model capabilities are often attributed to increasing model size or training data, but in some cases smaller models trained on curated data or with different architectural decisions can outperform larger ones trained on more tokens. What accounts for this? To quantify the impact of these design choices, we meta-analyze 92 open-source pretrained models across a wide array of scales, including state-of-the-art open-weights models as well as less performant models and those with less conventional design decisions. We find that by incorporating features besides model size and number of training tokens, we can achieve a relative 3-28% increase in ability to predict downstream performance compared with using scale alone. Analysis of model design decisions reveals insights into data composition, such as the trade-off between language and code tasks at 15-25% code, as well as the better performance of some architectural decisions such as choosing rotary over learned embeddings. Broadly, our framework lays a foundation for more systematic investigation of how model development choices shape final capabilities.

Technical details

Canonical key: arxiv-2503.03862

Cache status: Stale (SWR served)

Generated at: Apr 7, 2026, 11:07 PM

Artifact coverage: sparse

HF provider: ok (token)

PWC source used: Yes

LLM status: not_generated

LLM model: n/a

LLM generated: Unknown

LLM content type: n/a

HF policy: hf-relevance-v27

implementation starting point

Benchmarks: thin evidence


Results & Benchmarks

Freshness tier: cold

Direct + Inferred Evidence

Natural language processing

ARC

Baseline MAE

13.23

Source: paper fulltext

Natural language processing

GSM8K

Baseline MAE

15.65

Source: paper fulltext

Benchmark evidence drill-down

2 findings

Audit each benchmark finding before selecting an implementation path. Evidence refs map to the disclosure section below.

Task	Dataset	Metric	Value	Source	Evidence refs
Natural language processing	ARC	Baseline MAE	13.23	paper-derived	No explicit refs
Natural language processing	GSM8K	Baseline MAE	15.65	paper-derived	No explicit refs
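
The "Baseline MAE" metric above and the abstract's "relative 3-28%" claim can be related with a short calculation. The following is a minimal sketch, not the authors' code: it assumes the improvement is measured as a relative reduction in mean absolute error (the page does not spell out the formula), and every score and function name in it is hypothetical.

```python
from statistics import fmean


def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and observed benchmark scores."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return fmean(abs(p - a) for p, a in zip(predicted, actual))


def relative_error_reduction(baseline_mae, augmented_mae):
    """Fraction by which the augmented predictor lowers the baseline MAE."""
    return (baseline_mae - augmented_mae) / baseline_mae


# Hypothetical benchmark scores for four models (NOT taken from the paper).
observed = [40.0, 55.0, 62.0, 70.0]
# Hypothetical predictions from a predictor that sees only size and token count.
scale_only_pred = [50.0, 45.0, 72.0, 60.0]
# Hypothetical predictions from a predictor that also sees design features.
augmented_pred = [48.0, 47.0, 70.0, 62.0]

baseline = mean_absolute_error(scale_only_pred, observed)
augmented = mean_absolute_error(augmented_pred, observed)
reduction = relative_error_reduction(baseline, augmented)

print(f"scale-only MAE: {baseline:.2f}")
print(f"augmented MAE:  {augmented:.2f}")
print(f"relative error reduction: {reduction:.0%}")
```

Under this reading, a lower MAE for the augmented predictor translates directly into a positive relative reduction; the baseline values in the table would play the role of `baseline` here.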

Use This Implementation Because…

Confidence: low

Only historical official repository was found (nightingal3/llm-pretraining-behaviours).


Reproduction Risks

Only historical official implementation is available
No direct maintained implementation is currently verified.

Hardware Notes

Expect multi-day setup/compute for meaningful reproduction based on current guidance.

Evidence disclosure

Evidence graph: 2 refs, 1 link.

Utility signals: depth 95/100, grounding 68/100, status medium.

Implementation Comparison

Top 1 paths

Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.

nightingal3/llm-pretraining-behaviours

historical official

Maintenance: Recently updated

Confidence: High

Reproducibility: Moderate

Official implementation from Papers with Code · Repository link is mentioned in the paper metadata

Stars: 7
Last push: Feb 18, 2026 (57d ago)

CI, Dependencies

Risk flags

No tagged releases
No Docker setup

Best implementation now

Only a historical official implementation is available.

Use with caution for new projects; verify against current tooling and maintained community alternatives.

No maintained paper-verified implementation met reliability thresholds.

Reproduction readiness

Ready to Run

Time to first repro: days

Last checked: Apr 7, 2026


Ready to reproduce

· Clone nightingal3/llm-pretraining-behaviours and install dependencies from environment.yml.
· CI pipeline detected — automated tests are in place.
· Last updated 57 days ago.


Quick start

git clone https://github.com/nightingal3/llm-pretraining-behaviours.git
cd llm-pretraining-behaviours
conda env create -f environment.yml && conda activate <env-name>

Hugging Face artifacts

No trustworthy direct or curated related Hugging Face artifacts have been found yet.

Continue with targeted Hugging Face searches derived from the paper title and method context:

Models

arxiv:2503.03862 Not-Just-Scaling Not-Just-Scaling Laws

Datasets

arxiv:2503.03862 Not-Just-Scaling dataset

Spaces

arxiv:2503.03862 Not-Just-Scaling demo

Tip: start with models, then check datasets/spaces if you need evaluation data or demos.
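
The targeted searches listed above can be turned into browsable links with a few lines of Python. This is a rough sketch under stated assumptions: it relies on the public Hugging Face listing pages accepting a `search` query parameter, the helper name `hf_search_url` is hypothetical, and the query strings are copied as they appear on this page.

```python
from urllib.parse import urlencode

HF_BASE = "https://huggingface.co"
VALID_KINDS = ("models", "datasets", "spaces")


def hf_search_url(kind, query):
    """Build a Hugging Face listing URL for one artifact kind and search query."""
    if kind not in VALID_KINDS:
        raise ValueError(f"kind must be one of {VALID_KINDS}")
    # urlencode percent-escapes ':' and turns spaces into '+'.
    return f"{HF_BASE}/{kind}?{urlencode({'search': query})}"


# Query strings copied from the list above.
queries = {
    "models": "arxiv:2503.03862 Not-Just-Scaling Not-Just-Scaling Laws",
    "datasets": "arxiv:2503.03862 Not-Just-Scaling dataset",
    "spaces": "arxiv:2503.03862 Not-Just-Scaling demo",
}

urls = {kind: hf_search_url(kind, query) for kind, query in queries.items()}

for kind, url in urls.items():
    print(f"{kind}: {url}")
```

Opening the models link first and then the datasets and spaces links follows the order suggested in the tip.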



Research context

Tasks

Natural language processing

Methods

Transformer

Domains

Natural Language Processing

