Literature review
This review surveys prior work in four areas relevant to TerminalWorld (Marin issue #5866): (A) neural world models for code/web/interactive environments, (B) terminal agents and Terminal-Bench, (C) training agents inside learned simulators, and (D) co-evolution of policy and simulator. Each section concludes by flagging what is missing and the risks the proposal will need to confront.
A. Neural world models for code, web, and interactive environments
- CWM: An Open-Weights LLM for Research on Code Generation with World Models — Meta FAIR CodeGen team (Cummins, Leather, Markosyan, Pagliardini, Remez et al.), 2025. A 32B model “mid-trained on observation-action trajectories from Python interpreter and agentic Docker environments.” Demonstrates that supervising on execution traces gives 65.8% on SWE-bench Verified and enables step-by-step Python execution simulation inside the model itself. arxiv:2510.02387
- WebDreamer / “Is Your LLM Secretly a World Model of the Internet?” — Gu, Zheng et al., OSU NLP / Orby AI, TMLR 2025. Uses an LLM as a natural-language world model to simulate “what happens if I click X” for web planning; their distilled Dreamer-7B matches GPT-4o while being 4–5× more efficient than tree search in sandboxes. arxiv:2411.06559
- WebWorld: A Large-Scale World Model for Web Agent Training — Xiao, Tu, Zou et al., Qwen team / Zhejiang Univ., 2026. Cited by the issue. Treats large-scale web simulation as the training substrate for agents; releases code at QwenLM/WebWorld. arxiv:2602.14721v1
- R-WoM: Retrieval-augmented World Model for Computer-use Agents — Mei, Guo, Chang et al., AWS AI, 2025. Explicitly identifies that pure-LLM world models hallucinate and degrade in long-horizon planning; grounds simulations with retrieved tutorials. +23.4% on OSWorld, +16.3% on WebArena. arxiv:2510.11892
- WorldCoder — Tang, Key, Ellis, NeurIPS 2024. A model-based LLM agent that writes Python as its world model rather than simulating in latent space — a useful contrast to fully neural simulators. arxiv:2402.12275
- UI-Simulator: LLMs as Scalable, General-Purpose Simulators for Evolving Digital Agent Training — Wang, Yin, Cui et al., UCLA, 2025. Generates synthetic UI states/transitions for WebArena and AndroidWorld training. arxiv:2510.14969
- Genie 2 / Genie 3 — DeepMind, 2024/2025. Foundation world models that generate playable 3D environments from a single image; relevant only as analogy — they show that learned simulators can be coherent for minutes of interactive use. DeepMind blog
- UniSim / Sora-as-world-simulator — Yang et al., ICLR 2024; Brooks et al., 2024. Generative video models repurposed as physics/world simulators — again, only an analogy for what learned simulators of symbolic environments might look like. openreview
- ToolEmu: Identifying the Risks of LM Agents with an LM-Emulated Sandbox — Ruan, Dong et al., ICLR 2024 Spotlight. Prompts GPT-4 to emulate tool execution given only a tool spec and inputs, so LM agents can be exercised against arbitrary (un-implemented) tools. The closest direct precursor to TerminalWorld’s central idea (LM-as-environment) — but ToolEmu is prompted, not trained, and is aimed at safety red-teaming, not generating training rollouts. arxiv:2309.15817 · GitHub
Gap. To our knowledge, no prior work trains a model end-to-end to emit the bytes a real shell would emit given a command and prior session state. ToolEmu demonstrates that a prompted frontier LM can plausibly fake tool execution for evaluation, but stops short of training a dedicated simulator or using one as a training substrate. CWM trains on execution traces but does not position the model as the executor for downstream RL/SFT; WebDreamer, WebWorld, and DynaWeb (discussed in section C) cover the web; WorldCoder writes code instead of simulating it. The terminal (discrete, partially observable, with strict syntactic structure) remains an untouched substrate.
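One way to picture the training target named in this gap is as a supervised pair: the prior session plus the new command form the context, and the bytes a real shell returned form the target. The sketch below is illustrative only. `SimExample`, `build_example`, and the `$ ` prompt layout are hypothetical names and conventions, not taken from the TerminalWorld proposal or from any cited work.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SimExample:
    """One supervised example for a hypothetical shell simulator."""
    context: str  # prior session plus the new command (model input)
    target: str   # output recorded from a real shell (model target)


def build_example(
    history: List[Tuple[str, str]],
    command: str,
    real_output: str,
) -> SimExample:
    """Serialize prior (command, output) turns and the new command.

    Assumption: each recorded output already ends with a newline,
    and "$ " is used as a stand-in prompt marker.
    """
    parts = [f"$ {cmd}\n{out}" for cmd, out in history]
    parts.append(f"$ {command}\n")
    return SimExample(context="".join(parts), target=real_output)


example = build_example(
    history=[("pwd", "/root\n")],
    command="ls",
    real_output="a.txt\n",
)
```

Under this layout the loss would be taken only over `target`, so the simulator is scored on reproducing the shell's output rather than on echoing the session history.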
B. Terminal agents and Terminal-Bench
- Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces — Merrill et al., Stanford / Laude Institute, Jan 2026. 89 hand-authored tasks in real sandboxed terminals; frontier agents are still under ~65%. arxiv:2601.11868 · tbench.ai
- Terminus — Laude Institute, 2025/2026. The reference test-bed agent that drives Terminal-Bench via tmux; Terminus 2 + Claude Opus 4.5 currently sits near the top of TB 2.0 (~58%). tbench.ai/news/terminus
- ECHO: Terminal Agents Learn World Models for Free — Vaishnavh Shrivastava & Dimitris Papailiopoulos, Microsoft Research AI Frontiers, May 2026. In RL, ECHO is a hybrid objective, L_ECHO = L_GRPO(actions) + λ · L_env(observations): a length-normalized CE loss on terminal-output tokens added alongside the standard GRPO loss on action tokens; “same rollout and forward pass, just a different mask over the logits.” In SFT, the same idea collapses to simply dropping the assistant-only loss mask so the existing CE runs over environment tokens too. Headline numbers: TerminalBench-2.0 pass@1 nearly doubles at both 8B (2.7 → 5.2) and 14B (5.2 → 10.8); 2.3× speed-up to match plain GRPO on TB-Lite; recovers 50–104% of the gap that expert-SFT-then-GRPO opens over GRPO-from-base; and a “sparks of self-improvement” result where training with only the env loss (no GRPO, no verifier) still improves OOD task performance by 3.8–10.0 pp. ECHO explicitly positions itself as “the online, in-the-RL-loop, CLI-flavored version” of auxiliary world-modeling work including Agent Learning via Early Experience, VAGEN, RWML, and CWM. This is the direct empirical motivation TerminalWorld is extending. Digg writeup · original tweet
- Endless Terminals: Scaling RL Environments for Terminal Agents — Gandhi, Garg, Goodman, Papailiopoulos, Stanford / MSR, Jan 2026. Procedural pipeline generating 3,255 containerized terminal tasks with tests; Qwen2.5-7B jumps 10.7→53.3% on their held-out set and gets meaningful TB 2.0 lift. Concrete demonstration that environment quantity is the bottleneck. arxiv:2601.16443
- Nemotron-Terminal / “On Data Engineering for Scaling LLM Terminal Capabilities” — Pi, Lam, Shoeybi, Jannaty, Catanzaro, Ping, NVIDIA, 2026. Releases the Nemotron-Terminal-Corpus (~366k trajectories: ~226k dataset-adapter, ~140k skill-synthetic) and Nemotron-Terminal-8B/14B/32B; the 32B beats Qwen3-Coder-480B on TB 2.0 (27.4% vs 23.9%). This is the exact training corpus TerminalWorld will both consume and try to simulate. arxiv:2602.21193 · HF Corpus
- OpenHands (CodeAct 2.1) and SWE-agent — All-Hands AI / Princeton-Stanford, 2025. Open-source frameworks for terminal/shell-using software-engineering agents; OpenHands SDK hits 72% SWE-bench Verified with Sonnet 4.5. Together with Warp and Claude Code, they define the consumer of any TerminalWorld-trained policy. arxiv:2511.03690
Gap. Existing terminal-agent training assumes a real Docker container behind every rollout. Endless Terminals, Nemotron-Terminal, R2E-Gym, and SWE-rebench all reduce the cost of building tasks, but every rollout still pays the full container-startup and execution cost. To our knowledge, nobody has tried to replace the container with a neural network during training.
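The ECHO objective summarized in this section, L_ECHO = L_GRPO(actions) + λ · L_env(observations), can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the GRPO term is passed in as an already-computed scalar, the per-token log-probabilities are plain floats, and normalizing by the number of masked tokens in a single sequence is our reading of "length-normalized". The function names are hypothetical.

```python
from typing import Sequence


def length_normalized_ce(
    token_logprobs: Sequence[float],
    mask: Sequence[int],
) -> float:
    """Mean negative log-likelihood over positions where mask == 1.

    Returns 0.0 when the mask selects no tokens.
    """
    selected = [-lp for lp, m in zip(token_logprobs, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0


def echo_loss(
    grpo_loss: float,
    token_logprobs: Sequence[float],
    env_mask: Sequence[int],
    lam: float,
) -> float:
    """Hybrid loss: policy term plus lam times the env-token CE term.

    Assumption: grpo_loss was computed elsewhere over action tokens;
    env_mask marks terminal-output (observation) tokens.
    """
    return grpo_loss + lam * length_normalized_ce(token_logprobs, env_mask)


# Toy rollout: positions 1 and 3 are terminal-output tokens.
logprobs = [-1.0, -2.0, -3.0, -4.0]
env_mask = [0, 1, 0, 1]
loss = echo_loss(grpo_loss=0.5, token_logprobs=logprobs, env_mask=env_mask, lam=0.1)
```

The SFT variant described above corresponds to running the same cross-entropy with a mask that also covers the environment tokens, with no separate policy term.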
C. Training agents in learned simulators (model-based RL for LLM agents)
- DreamerV3 — Hafner et al., Nature 2025. Canonical model-based RL: learn a latent world model, train the policy via imagined rollouts. The first algorithm to collect diamonds in Minecraft from scratch, without human data or curricula. Provides the theoretical template TerminalWorld implicitly follows. arxiv:2301.04104
- DynaWeb: Model-Based Reinforcement Learning of Web Agents — Ding, Liu, Wang et al., Jan 2026. A “web world model” trained on millions of page-transition tuples that generates “dream” rollouts for on-policy RL of web agents; interleaves real-expert trajectories to combat compounding error. Closest prior work to TerminalWorld but for the browser. arxiv:2601.22149
- Agent Learning via Early Experience — Oct 2025. Pre-RL stage in which the agent generates its own action-consequence data and uses (a) implicit world modeling and (b) self-reflection to ground the policy in environment dynamics without reward. Same intellectual lineage as ECHO — train on consequences before / instead of training on rewards — but staged rather than in-loop, and cited explicitly by ECHO as inspiration. arxiv:2510.08558
- VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents — Oct 2025 (NeurIPS ’25). Adds a World Modeling Reward (dense, turn-level reward for state prediction) plus Bi-Level GAE to RL of VLM agents; a 3B VLM hits 0.82 across five agent benchmarks. Direct precedent for adding a world-modeling auxiliary signal during RL, in the visual domain. Cited by ECHO. arxiv:2510.16907 · GitHub
- RWML: Reinforcement World Model Learning for LLM-based Agents — Feb 2026. Self-supervised next-state-embedding alignment (not next-state-token CE) as a reward signal during RL; +6.9 / +5.7 over plain task-success RL on ALFWorld / τ²-Bench. Important caution for TerminalWorld: the authors argue that token-level next-state CE is prone to model collapse and reward hacking compared with embedding-space alignment — a counterpoint to the simple “predict the bytes” approach we plan to start with. Cited by ECHO. arxiv:2602.05842
- Kimi K2: Open Agentic Intelligence — Kimi Team / Moonshot AI, July 2025. A 1.04T-parameter MoE (32B active) whose post-training pipeline synthesizes large-scale agentic tool-use data via a three-stage loop: build a tool-spec repository → generate diverse agents and tasks → roll out successful multi-turn trajectories in simulated environments. The strongest existence proof to date that simulated-environment trajectories are a viable training substrate at scale; K2 hits 65.8% pass@1 on SWE-bench Verified. Simulator details aren’t open-sourced, and the shell isn’t broken out as its own track, so TerminalWorld’s narrower, open ablation is complementary. arxiv:2507.20534 · GitHub
- R2E-Gym — Jain et al., COLM 2025. 8.1k synthetic SWE problems with executable tests; 51% on SWE-bench Verified for open-weight agents. Real (not neural) execution. arxiv:2504.07164
- SWE-rebench — Badertdinov et al., 2025. Automated pipeline scraping real GitHub tasks for decontaminated SWE-agent training/eval. Real execution. arxiv:2505.20411
- AgentGen — Hu et al., KDD 2025. Uses an LLM to generate environments and tasks for planning agents, with a BI-EVOL difficulty curriculum. Llama-3-8B beats GPT-3.5 after AgentGen training. arxiv:2408.00764
- NNetNav — Murty et al., 2025. Unsupervised browser-agent training via 10k self-generated demonstrations; SOTA among unsupervised methods on WebArena. arxiv:2410.02907
Gap. With the single exception of DynaWeb (web, not terminal), no published work that we are aware of trains an LLM agent end-to-end inside another LLM-as-environment for a symbolic, execution-grounded domain. The terminal is arguably harder than the web because the “next observation” must respect deterministic POSIX semantics rather than free-form HTML.
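The DynaWeb entry in this section describes interleaving real expert trajectories with simulated "dream" rollouts to limit compounding error. A minimal sketch of that idea, as it might be applied to a terminal simulator, is a batch sampler that reserves a fixed share of each batch for real trajectories. The fixed `real_fraction` is an assumption made for illustration; the actual mixing schedule used by DynaWeb is not specified here, and `mix_rollouts` is a hypothetical helper.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_rollouts(
    real: Sequence[T],
    imagined: Sequence[T],
    batch_size: int,
    real_fraction: float,
    rng: random.Random,
) -> List[T]:
    """Sample a training batch with a fixed share of real trajectories.

    Assumption: sampling is with replacement, and the share of real
    data is constant rather than scheduled over training.
    """
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be in [0, 1]")
    n_real = round(batch_size * real_fraction)
    batch = [rng.choice(real) for _ in range(n_real)]
    batch += [rng.choice(imagined) for _ in range(batch_size - n_real)]
    rng.shuffle(batch)  # avoid a fixed real-then-imagined ordering
    return batch


rng = random.Random(0)
batch = mix_rollouts(
    real=["real_1", "real_2"],
    imagined=["sim_1", "sim_2", "sim_3"],
    batch_size=8,
    real_fraction=0.25,
    rng=rng,
)
```

Keeping some real trajectories in every batch anchors the policy to true environment dynamics even when the simulator's outputs drift over long horizons.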
D. Self-improvement / co-training of policy and simulator
- rStar-Math — Guan et al., Microsoft, 2025. Co-evolved policy SLM and process-reward model via MCTS; lifts Qwen2.5-Math-7B from 58.8% → 90.0% on MATH. Template for joint solver/verifier training. arxiv:2501.04519
- Self-Challenging Language Model Agents (SCA) — Zhou et al., Meta, June 2025. Challenger LLM proposes synthetic tool-use tasks; solver LLM is trained on them. +20.2% on tool-use evals with no human tasks. arxiv:2506.01716
- GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators — Princeton / ByteDance, Dec 2025. Most directly relevant: an LLM acts as an environment simulator that adapts task difficulty to the policy’s zone of proximal development; +40.3% on 7B baselines across five benchmarks. arxiv:2512.19682
- Kimi K2 agentic data synthesis (revisited) — Moonshot AI, July 2025. The same three-stage pipeline cited in §C also functions as automated task generation: the tool-spec repository plus an agent/task generator produces tasks before the simulator produces trajectories. Direct prior art for our bonus experiment (LLM-generated terminal tasks). arxiv:2507.20534
- SPIN / Self-Rewarding LM / DPO-style self-play. Standard reference points for closing the loop between a generator and an evaluator.
Gap. Co-evolution work to date pairs a task-generator with a policy — not a dynamics simulator with a policy. TerminalWorld’s “bonus” experiment (generating terminal tasks themselves) plus a learned dynamics model would, if both are co-trained with the policy, be the first such loop in the terminal domain.