MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

Q: What is the best open-source implementation of "MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research"?

The best maintained implementation is Purewhiter/mobilegym with 632 stars on GitHub. Confidence: medium. Reproducibility: Moderate.

Q: How reproducible is "MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research"?

Estimated time to first reproduction: a few days. Risk flags: dependency manifest is missing. Start with Purewhiter/mobilegym and validate the setup instructions in its README.

Dingbang Wu, Rui Hao, Haiyang Wang, Shuzhe Wu, Han Xiao, Zhenghong Li, Bojiang Zhou, Zheng Ju, Zichen Liu, Lue Fan, Zhaoxiang Zhang

Published: May 25, 2026

Best maintained implementation now

Evidence: Direct

Domain fit: AI-adjacent

Verified repos: 1

Top repo stars: 632

Paper appears method- or tooling-adjacent to AI workflows with partial ecosystem coverage.

Time to first repro: a few days

1 risk flag

We present MobileGym, a browser-hosted, lightweight, fully controllable environment for everyday mobile use, targeting interaction fidelity without replicating proprietary backends. It enables two capabilities previously out of reach for everyday apps: verifiable outcome signals through deterministic state-based judging over structured JSON state, and scalable online RL through low-cost parallel rollouts. The full environment state is captured, configured, forked, and compared as structured JSON, and a single server can host hundreds of parallel instances, with about 400 MB memory per instance and about 3 s cold start. A layered state model and a declarative task-definition framework keep state programmability and task creation practical at scale, and a single programmatic judging mechanism delivers both deterministic evaluation verdicts and dense RL rewards. The accompanying MobileGym-Bench provides 416 parameterized task templates, including 256 test and 160 train templates, over 28 apps, with deterministic judges and a structured AnswerSheet protocol that avoids free-text matching failures. In a Sim-to-Real case study, GRPO on Qwen3-VL-4B-Instruct gains +12.8 percentage points on the 256-task test set, and on a 59-task real-device signal subset, real-device execution retains 95.1% of the simulation-side training gain. Project page: https://mobilegym.github.io.
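The abstract describes judging as a deterministic comparison over structured JSON state, with one mechanism yielding both a verdict and a dense reward. The following is a minimal sketch of that idea only, not the MobileGym API: the function names, the dotted-path convention, and the "fraction of checks passed" reward are all assumptions made for illustration.

```python
from typing import Any


def get_path(state: dict, path: str) -> Any:
    """Walk a dotted path such as 'apps.clock.alarms' through nested dicts.

    The dotted-path convention is a hypothetical choice for this sketch.
    """
    node: Any = state
    for key in path.split("."):
        node = node[key]
    return node


def judge(final_state: dict, expected: dict[str, Any]) -> tuple[bool, float]:
    """Compare the final environment state against expected values.

    Returns (verdict, reward). The verdict is True only if every check
    passes; the reward is the fraction of checks that pass (an assumed
    stand-in for a dense reward, not the paper's formula).
    """
    if not expected:
        raise ValueError("at least one expected value is required")
    passed = 0
    for path, want in expected.items():
        try:
            if get_path(final_state, path) == want:
                passed += 1
        except (KeyError, TypeError):
            pass  # a missing or non-dict path counts as a failed check
    return passed == len(expected), passed / len(expected)
```

Because the comparison uses only equality on parsed JSON values, the same final state always produces the same verdict, which is the property the abstract emphasizes over free-text matching.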

Technical details

Canonical key: arxiv-2605.26114

Cache status: Stale (SWR served)

Generated at: Jun 19, 2026, 3:29 AM

Artifact coverage: direct

HF provider: ok (token)

PWC source used: No

HF policy: hf-relevance-v27

implementation starting point

Benchmarks: thin evidence

Results & Benchmarks

Freshness tier: hot

Direct + Inferred Evidence

Agentic tool use · Gemini 3.1 Pro · T256-Risk: 71.4 · Source: paper fulltext

Agentic tool use · Doubao-Seed-2.0-Pro · T256-Risk: 71.4 · Source: paper fulltext

Benchmark evidence drill-down

2 findings

Audit each benchmark finding before selecting an implementation path. Evidence refs map to the disclosure section below.

Task	Model	Metric	Value	Source	Evidence refs
Agentic tool use	Gemini 3.1 Pro	T256-Risk	71.4	paper-derived	No explicit refs
Agentic tool use	Doubao-Seed-2.0-Pro	T256-Risk	71.4	paper-derived	No explicit refs

Use This Implementation Because…

Confidence: medium

Purewhiter/mobilegym is the best available implementation candidate based on ranking signals, but recommendation confidence is not yet high. CI workflows are present. License is declared (Apache-2.0).

Reproduction Risks

Dependency manifest is missing

Hardware Notes

Expect multi-day setup/compute for meaningful reproduction based on current guidance.
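For sizing a host, the abstract's figure of about 400 MB of memory per instance gives a rough upper bound on parallel instances. The sketch below is back-of-the-envelope arithmetic only: the function name and the 8 GB reserve for the OS and other processes are assumptions, and CPU, GPU, and browser overhead are ignored.

```python
def max_instances(host_ram_gb: float,
                  per_instance_mb: float = 400.0,
                  reserve_gb: float = 8.0) -> int:
    """Estimate how many instances fit in RAM.

    per_instance_mb defaults to the ~400 MB figure from the abstract;
    reserve_gb is an assumed safety margin, not a documented value.
    """
    usable_mb = max(host_ram_gb - reserve_gb, 0.0) * 1024
    return int(usable_mb // per_instance_mb)


# Under these assumptions a 128 GB host fits a few hundred instances,
# consistent with the abstract's "hundreds of parallel instances".
print(max_instances(128))
```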

Evidence disclosure

Evidence graph: 3 refs, 3 links.

Utility signals: depth 95/100, grounding 85/100, status high.

Implementation Comparison

Top 3 paths

Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.

Purewhiter/mobilegym

best maintained

Maintenance: Active

Confidence: Medium

Reproducibility: Moderate

Matched via arXiv identifier search · Strong overlap with paper title keywords

Stars: 632
Last push: Jun 15, 2026 (5d ago)

CI · Releases

Risk flags

No Docker setup
Dependency manifest missing

xcho7i/mobilegym

alternative

Maintenance: Active

Confidence: Low

Reproducibility: Moderate

Matched via arXiv identifier search

Stars: 1
Last push: May 28, 2026 (23d ago)

Risk flags

No tagged releases
No Docker setup
Dependency manifest missing

Carbon-Glitch/pixel

alternative

Maintenance: Active

Confidence: Low

Reproducibility: Limited

Matched via arXiv identifier search

Stars: 1
Last push: Jun 18, 2026 (2d ago)

Risk flags

No CI pipeline detected
No tagged releases
No Docker setup

Best implementation now

Purewhiter/mobilegym

Confidence: Medium

Reproducibility: Moderate

MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research · Browser-hosted Android Simulator · Verifiable Evaluation · Scalable Online RL Training

Stars: 632

Forks: 102

Last push: Jun 15, 2026

License: Apache-2.0

Matched via arXiv identifier search

Strong overlap with paper title keywords

Community adoption signal (632 stars)

License ✓

CI ✓

Deps –

Docker –

Selected Purewhiter/mobilegym as the strongest maintained implementation for new work.
Includes CI workflow signals.
Repository activity is within the last 24 months.

Reproduction readiness

Major Work

Time to first repro: a few days

Last checked: Jun 19, 2026

Hardware requirements

Expect multi-day setup/compute for meaningful reproduction based on current guidance.

No dependency manifest — manual reconstruction required

· Purewhiter/mobilegym has no requirements.txt, environment.yml, pyproject.toml, or Dockerfile.
· You will need to reverse-engineer dependencies from import statements in the source code.

Additional implementations

No additional verified repositories beyond the primary recommendation.

Possible but unverified matches (2)

These repositories had low-confidence matching signals and are hidden by default.

xcho7i/mobilegym

Confidence: Low

Stars: 1

Carbon-Glitch/pixel

Confidence: Low

Stars: 1

Hugging Face artifacts

No trustworthy direct or curated related Hugging Face artifacts were found yet.

Continue with targeted Hugging Face searches derived from the paper title and method context:

Models

arxiv:2605.26114 MobileGym AI Agents

Datasets

arxiv:2605.26114 MobileGym dataset

Spaces

arxiv:2605.26114 MobileGym demo

Tip: start with models, then check datasets/spaces if you need evaluation data or demos.

Direct artifact matches are currently sparse. Use targeted Hugging Face searches to quickly locate candidate models, datasets, and demos.
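The three queries listed above can be turned into Hugging Face search links with the standard library. This is a small sketch, assuming the public `?search=` query parameter on the models, datasets, and spaces listing pages; the helper name is hypothetical, and whether the `arxiv:` prefix narrows results in the web search box is not confirmed here.

```python
from urllib.parse import urlencode

HF_BASE = "https://huggingface.co"

# Query strings copied from the listing above.
QUERIES = {
    "models": "arxiv:2605.26114 MobileGym AI Agents",
    "datasets": "arxiv:2605.26114 MobileGym dataset",
    "spaces": "arxiv:2605.26114 MobileGym demo",
}


def hf_search_urls(queries: dict[str, str]) -> dict[str, str]:
    """Build one search URL per artifact kind (models, datasets, spaces)."""
    return {
        kind: f"{HF_BASE}/{kind}?{urlencode({'search': query})}"
        for kind, query in queries.items()
    }


for kind, url in hf_search_urls(QUERIES).items():
    print(kind, url)
```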

Research context

Tasks

Agentic tool use, Scientific computing

Methods

Agentic systems

Domains

AI Agents

Evaluation & Human Feedback Data

Open this paper in HFEPX to review benchmark signals, evaluation modes, and human-feedback protocol context.
