Featured Papers
Popular high-signal papers with direct links to full protocol pages.
- LLMSurgeon: Diagnosing Data Mixture of Large Language Models
May 28, 2026 · Citations: 0
To evaluate, we introduce LLMScan, a recipe-verifiable evaluation suite built from open-source LLMs with transparent pretraining mixtures.
- SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations
May 28, 2026 · Citations: 0
We further construct a large-scale dataset of PCB schematics paired with user prompts via a human-agent collaborative pipeline that converts open-source hardware designs into our representation.
- Unlocking the Working Memory of Large Language Models for Latent Reasoning
May 28, 2026 · Citations: 0
In contrast, human cognition can use working memory to hold and manipulate information internally without the need to externalize intermediate thoughts.
- Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents
May 28, 2026 · Citations: 0
Multi-component LLM agents assemble probabilistic claims from components that each see only part of a joint problem; the composition can violate basic probability axioms even when every component is locally coherent.
- Demystifying Data Organization for Enhanced LLM Training
May 28, 2026 · Citations: 0
The abstract gives little direct detail on human feedback or evaluation protocols; use it as adjacent methodological context.
- COMPOSE: Composing Future Theorems from Citations and Formal Structure
May 28, 2026 · Citations: 0
To support this setting, we construct a dataset of 108K paired scientific-formal graph examples from arXiv and Mathlib, together with a benchmark of 47K future papers from 2024–2025.
- Reasoning with Sampling: Cutting at Decision Points
May 28, 2026 · Citations: 0
Across MATH500, HumanEval, GPQA Diamond, and AIME26, our method consistently improves over baselines and RL-trained models.
- On Language Generation in the Limit with Bounded Memory
May 28, 2026 · Citations: 0
The abstract gives little direct detail on human feedback or evaluation protocols; use it as adjacent methodological context.
- Resolution Diagnostics for Paired LLM Evaluation
May 28, 2026 · Citations: 0
Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9…
- MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings
May 28, 2026 · Citations: 0
We introduce a pipeline for generating clinically realistic HL7 FHIR R4 bundles from unstructured text, enabling controllable evaluation of clinical decision support systems.
- Self-Trained Verification for Training- and Test-Time Self-Improvement
May 28, 2026 · Citations: 0
The abstract gives little direct detail on human feedback or evaluation protocols; use it as adjacent methodological context.
- Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
May 28, 2026 · Citations: 0
Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data,…