Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization
Jiachen Zhu, Lingyu Yang, Rong Shan, Congmin Zheng, Zeyu Zheng, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin
Paper appears method- or tooling-adjacent to AI workflows with partial ecosystem coverage.
The rise of autonomous GUI agents has triggered adversarial countermeasures from digital platforms, yet existing research prioritizes utility and robustness over the critical dimension of anti-detection. We argue that for agents to survive in human-centric ecosystems, they must evolve Humanization capabilities. We introduce the ``Turing Test on Screen,'' formally modeling the interaction as a MinMax optimization prob ...
lem between a detector and an agent aiming to minimize behavioral divergence. We then collect a new high-fidelity dataset of mobile touch dynamics, and conduct our analysis that vanilla LMM-based agents are easily detectable due to unnatural kinematics. Consequently, we establish the Agent Humanization Benchmark (AHB) and detection metrics to quantify the trade-off between imitability and utility. Finally, we propose methods ranging from heuristic noise to data-driven behavioral matching, demonstrating that agents can achieve high imitability theoretically and empirically without sacrificing performance. This work shifts the paradigm from whether an agent can perform a task to how it performs it within a human-centric ecosystem, laying the groundwork for seamless coexistence in adversarial digital environments.
Results & Benchmarks
No concrete benchmark grounding is available yet. Treat the page as context or an implementation starting point only.
The rise of autonomous GUI agents has triggered adversarial countermeasures from digital platforms, yet existing research prioritizes utility and robustness over the critical dimension of anti-detection.
Implementation Evidence Summary
This is primarily a method paper. Reproduce it within a maintained framework baseline instead of chasing paper-specific repos.
Reproduction Risks
- No maintained paper-verified implementation is currently available
Evidence disclosure
Evidence graph: 2 refs, 1 links.
Utility signals: depth 55/100, grounding 58/100, status medium.
Implementation Status
There is no verified maintained implementation yet. Use this baseline plan to decide whether to prototype now or defer.
- This is primarily a method paper. Reproduce it within a maintained framework baseline instead of chasing paper-specific repos.
- Start with framework-native implementations (e.g. PyTorch optimizer module, Optax, or Transformers training loops).
- Replicate the paper ablation settings first, then compare against modern baselines.
Reproduction readiness
No verified implementation available
- · No maintained repository has been identified for this paper. Check adjacent implementations or HF artifacts below.
No benchmark numbers could be verified. You will not be able to validate reproduction correctness against published numbers.
Hugging Face artifacts
No trustworthy direct or curated related Hugging Face artifacts were found yet.
Continue with targeted Hugging Face searches derived from the paper title and method context:
Tip: start with models, then check datasets/spaces if you need evaluation data or demos.
Direct artifact matches are currently sparse. Use targeted Hugging Face searches to quickly locate candidate models, datasets, and demos.
Research context
Tasks
Agentic tool use
Methods
Agentic systems
Domains
AI Agents
Evaluation & Human Feedback Data
Open this paper in HFEPX to review benchmark signals, evaluation modes, and human-feedback protocol context.
Open in HFEPXExplore Similar Papers
Jump to Paper2Code search queries derived from this paper's research context.
Need human evaluators for your AI research? Scale annotation with expert AI Trainers.