Literature review

This review surveys prior work in four areas relevant to TerminalWorld (Marin issue #5866): (A) neural world models for code/web/interactive environments, (B) terminal agents and Terminal-Bench, (C) training agents inside learned simulators, and (D) co-evolution of policy and simulator. Each section concludes by flagging what is missing and the risks the proposal will need to confront.

A. Neural world models for code, web, and interactive environments

CWM: An Open-Weights LLM for Research on Code Generation with World Models — Meta FAIR CodeGen team (Cummins, Leather, Markosyan, Pagliardini, Remez et al.), 2025. A 32B model “mid-trained on observation-action trajectories from Python interpreter and agentic Docker environments.” Shows that supervising on execution traces yields 65.8% on SWE-bench Verified (with test-time scaling) and enables step-by-step simulation of Python execution inside the model itself. arxiv:2510.02387
WebDreamer / “Is Your LLM Secretly a World Model of the Internet?” — Gu, Zheng et al., OSU NLP / Orby AI, TMLR 2025. Uses an LLM as a natural-language world model to simulate “what happens if I click X” for web planning; their distilled Dreamer-7B matches GPT-4o while being 4–5× more efficient than tree search in sandboxes. arxiv:2411.06559
WebWorld: A Large-Scale World Model for Web Agent Training — Xiao, Tu, Zou et al., Qwen team / Zhejiang Univ., 2026. Cited by the issue. Treats large-scale web simulation as the training substrate for agents; releases code at QwenLM/WebWorld. arxiv:2602.14721v1
R-WoM: Retrieval-augmented World Model for Computer-use Agents — Mei, Guo, Chang et al., AWS AI, 2025. Explicitly identifies that pure-LLM world models hallucinate and degrade in long-horizon planning; grounds simulations with retrieved tutorials. +23.4% on OSWorld, +16.3% on WebArena. arxiv:2510.11892
WorldCoder — Tang, Key, Ellis, NeurIPS 2024. A model-based LLM agent that writes Python as its world model rather than simulating in latent space — a useful contrast to fully neural simulators. arxiv:2402.12275
UI-Simulator: LLMs as Scalable, General-Purpose Simulators for Evolving Digital Agent Training — Wang, Yin, Cui et al., UCLA, 2025. Generates synthetic UI states/transitions for WebArena and AndroidWorld training. arxiv:2510.14969
Genie 2 / Genie 3 — DeepMind, 2024/2025. Foundation world models that generate playable 3D environments from a single image; relevant only as analogy — they show that learned simulators can be coherent for minutes of interactive use. DeepMind blog
UniSim / Sora-as-world-simulator — Yang et al., ICLR 2024; Brooks et al., 2024. Generative video models repurposed as physics/world simulators — again, only an analogy for what learned simulators of symbolic environments might look like. openreview
ToolEmu: Identifying the Risks of LM Agents with an LM-Emulated Sandbox — Ruan, Dong et al., ICLR 2024 Spotlight. Prompts GPT-4 to emulate tool execution given only a tool spec and inputs, so LM agents can be exercised against arbitrary (un-implemented) tools. The closest direct precursor to TerminalWorld’s central idea (LM-as-environment) — but ToolEmu is prompted, not trained, and is aimed at safety red-teaming, not generating training rollouts. arxiv:2309.15817 · GitHub

Gap. To our knowledge, no prior work trains a model end-to-end to emit the bytes a real shell would emit given a command and prior session state. ToolEmu demonstrates that a prompted frontier LM can plausibly fake tool execution for evaluation, but stops short of training a dedicated simulator or using one as a training substrate. CWM trains on execution traces but does not position the model as the executor for downstream RL/SFT; WebDreamer, WebWorld, and DynaWeb (see §C) cover the web; WorldCoder writes code instead of simulating it. The terminal (discrete, partially observable, and strictly structured in its syntax) remains untouched as a simulation substrate.
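
The interface named in the gap above (command plus prior session state in, shell bytes out) could be sketched roughly as follows. This is a minimal illustration, not a design taken from any cited work: the `Transition` schema and both metric functions are hypothetical names introduced here, and byte-level exact match is only one possible fidelity criterion.

```python
from dataclasses import dataclass


@dataclass
class Transition:
    """One step of a terminal session (hypothetical schema for illustration)."""
    history: list[tuple[str, bytes]]  # prior (command, output) pairs in the session
    command: str                      # command the agent issues next
    output: bytes                     # bytes the real shell emitted in response


def byte_exact_match(predicted: bytes, reference: bytes) -> bool:
    """Strictest fidelity check: the simulator must reproduce every byte."""
    return predicted == reference


def prefix_match_fraction(predicted: bytes, reference: bytes) -> float:
    """Fraction of the reference reproduced before the first divergence."""
    if not reference:
        return 1.0 if not predicted else 0.0
    matched = 0
    for p, r in zip(predicted, reference):
        if p != r:
            break
        matched += 1
    return matched / len(reference)


# Usage: the session state is what makes `ls` print "demo" here.
step = Transition(history=[("mkdir demo", b"")], command="ls", output=b"demo\n")
print(byte_exact_match(b"demo\n", step.output))       # True
print(prefix_match_fraction(b"dem0\n", step.output))  # 0.6
```

A softer metric such as the prefix fraction may be useful early in training, when exact matches are rare; whether it correlates with downstream policy quality is an open question.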

B. Terminal agents and Terminal-Bench

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces — Merrill et al., Stanford / Laude Institute, Jan 2026. 89 hand-authored tasks in real sandboxed terminals; frontier agents are still under ~65%. arxiv:2601.11868 · tbench.ai
Terminus — Laude Institute, 2025/2026. The reference test-bed agent that drives Terminal-Bench via tmux; Terminus 2 + Claude Opus 4.5 currently sits near the top of TB 2.0 (~58%). tbench.ai/news/terminus
ECHO: Terminal Agents Learn World Models for Free — Vaishnavh Shrivastava & Dimitris Papailiopoulos, Microsoft Research AI Frontiers, May 2026. In RL, ECHO is a hybrid objective L_ECHO = L_GRPO(actions) + λ · L_env(observations) — a length-normalized CE loss on terminal-output tokens added alongside the standard GRPO loss on action tokens; “same rollout and forward pass, just a different mask over the logits.” In SFT, the same idea collapses to simply dropping the assistant-only loss mask so the existing CE runs over environment tokens too. Headline numbers: TerminalBench-2.0 pass@1 nearly doubles at both 8B (2.7 → 5.2) and 14B (5.2 → 10.8); 2.3× speed-up to match plain GRPO on TB-Lite; recovers 50–104% of the gap that expert-SFT-then-GRPO opens over GRPO-from-base; and a “sparks of self-improvement” result where training with only the env loss (no GRPO, no verifier) still improves OOD task performance by 3.8–10.0 pp. ECHO explicitly positions itself as “the online, in-the-RL-loop, CLI-flavored version” of auxiliary world-modeling work including Agent Learning via Early Experience, VAGEN, RWML, and CWM. This is the direct empirical motivation TerminalWorld is extending. Digg writeup · original tweet
Endless Terminals: Scaling RL Environments for Terminal Agents — Gandhi, Garg, Goodman, Papailiopoulos, Stanford / MSR, Jan 2026. Procedural pipeline generating 3,255 containerized terminal tasks with tests; Qwen2.5-7B jumps 10.7→53.3% on their held-out set and gets meaningful TB 2.0 lift. Concrete demonstration that environment quantity is the bottleneck. arxiv:2601.16443
Nemotron-Terminal / “On Data Engineering for Scaling LLM Terminal Capabilities” — Pi, Lam, Shoeybi, Jannaty, Catanzaro, Ping, NVIDIA, 2026. Releases the Nemotron-Terminal-Corpus (~366k trajectories: ~226k dataset-adapter, ~140k skill-synthetic) and Nemotron-Terminal-8B/14B/32B; the 32B beats Qwen3-Coder-480B on TB 2.0 (27.4% vs 23.9%). This is the exact training corpus TerminalWorld will both consume and try to simulate. arxiv:2602.21193 · HF Corpus
OpenHands (CodeAct 2.1) and SWE-agent — All-Hands AI / Princeton-Stanford, 2025. Open-source frameworks for terminal/shell-using software-engineering agents; OpenHands SDK hits 72% SWE-bench Verified with Sonnet 4.5. Together with Warp and Claude Code, they define the consumer of any TerminalWorld-trained policy. arxiv:2511.03690
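
The ECHO objective quoted above, L_ECHO = L_GRPO(actions) + λ · L_env(observations), can be illustrated with a small pure-Python sketch. It assumes per-token log-probabilities and a boolean environment-token mask are already available from the rollout's forward pass; the function names, the treatment of the GRPO loss as a precomputed scalar, and the default `lam=0.1` are assumptions made here, not details reported by the ECHO authors.

```python
import math


def masked_mean_nll(token_logprobs, mask):
    """Length-normalized cross-entropy over the tokens selected by `mask`."""
    selected = [-lp for lp, m in zip(token_logprobs, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0


def echo_style_loss(grpo_loss, token_logprobs, is_env_token, lam=0.1):
    """Hybrid objective: L_GRPO(actions) + lam * L_env(observations).

    `grpo_loss` is taken as given (computed elsewhere on action tokens).
    `lam` is a placeholder weight, not a value from the ECHO write-up.
    """
    env_loss = masked_mean_nll(token_logprobs, is_env_token)
    return grpo_loss + lam * env_loss


# Usage: four tokens, alternating action / terminal-output tokens.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.125)]
env_mask = [0, 1, 0, 1]
rl_total = echo_style_loss(grpo_loss=0.2, token_logprobs=logprobs,
                           is_env_token=env_mask)

# SFT variant: dropping the assistant-only mask means every token counts.
sft_total = masked_mean_nll(logprobs, [1] * len(logprobs))
```

The point of the sketch is that both terms read from the same sequence of log-probabilities; only the mask differs, which matches the "same rollout and forward pass" description.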

Gap. Existing terminal-agent training assumes a real Docker container behind every rollout. Endless Terminals, Nemotron-Terminal, R2E-Gym, and SWE-rebench all reduce the cost of building tasks, but every rollout still pays the full cost of container startup and execution. To our knowledge, no published work has tried to replace the container with a neural network during training.

C. Training agents in learned simulators (model-based RL for LLM agents)

DreamerV3 — Hafner et al., Nature 2025. Canonical model-based RL: learn a latent world model, then train the policy on imagined rollouts. First algorithm to collect diamonds in Minecraft from scratch, without human data. Provides the conceptual template TerminalWorld implicitly follows. arxiv:2301.04104
DynaWeb: Model-Based Reinforcement Learning of Web Agents — Ding, Liu, Wang et al., Jan 2026. A “web world model” trained on millions of page-transition tuples that generates “dream” rollouts for on-policy RL of web agents; interleaves real-expert trajectories to combat compounding error. Closest prior work to TerminalWorld but for the browser. arxiv:2601.22149
Agent Learning via Early Experience — Oct 2025. Pre-RL stage in which the agent generates its own action-consequence data and uses (a) implicit world modeling and (b) self-reflection to ground the policy in environment dynamics without reward. Same intellectual lineage as ECHO — train on consequences before / instead of training on rewards — but staged rather than in-loop, and cited explicitly by ECHO as inspiration. arxiv:2510.08558
VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents — Oct 2025 (NeurIPS ‘25). Adds a World Modeling Reward (dense, turn-level reward for state prediction) plus Bi-Level GAE to RL of VLM agents; a 3B VLM hits 0.82 across five agent benchmarks. Direct precedent for adding a world-modeling auxiliary signal during RL, in the visual domain. Cited by ECHO. arxiv:2510.16907 · GitHub
RWML: Reinforcement World Model Learning for LLM-based Agents — Feb 2026. Self-supervised next-state-embedding alignment (not next-state-token CE) as a reward signal during RL; +6.9 / +5.7 over plain task-success RL on ALFWorld / τ²-Bench. Important caution for TerminalWorld: the authors argue that token-level next-state CE is prone to model collapse and reward hacking compared with embedding-space alignment — a counterpoint to the simple “predict the bytes” approach we plan to start with. Cited by ECHO. arxiv:2602.05842
Kimi K2: Open Agentic Intelligence — Kimi Team / Moonshot AI, July 2025. A 1.04T-parameter MoE (32B active) whose post-training pipeline synthesizes large-scale agentic tool-use data via a three-stage loop: build a tool-spec repository → generate diverse agents and tasks → roll out successful multi-turn trajectories in simulated environments. The strongest existence proof to date that simulated-environment trajectories are a viable training substrate at scale; K2 hits 65.8% pass@1 on SWE-bench Verified. Simulator details aren’t open-sourced, and the shell isn’t broken out as its own track, so TerminalWorld’s narrower, open ablation is complementary. arxiv:2507.20534 · GitHub
R2E-Gym — Jain et al., COLM 2025. 8.1k synthetic SWE problems with executable tests; 51% on SWE-bench Verified for open-weight agents. Real (not neural) execution. arxiv:2504.07164
SWE-rebench — Badertdinov et al., 2025. Automated pipeline scraping real GitHub tasks for decontaminated SWE-agent training/eval. Real execution. arxiv:2505.20411
AgentGen — Hu et al., KDD 2025. LLM-generates environments + tasks for planning agents; BI-EVOL difficulty curriculum. Llama-3-8B beats GPT-3.5 after AgentGen training. arxiv:2408.00764
NNetNav — Murty et al., 2025. Unsupervised browser-agent training via 10k self-generated demonstrations; SOTA among unsupervised methods on WebArena. arxiv:2410.02907
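
The compounding-error concern raised in the DynaWeb entry can be made concrete with a back-of-the-envelope sketch. It assumes independent per-step simulator errors, which is a simplification, and the batch-mixing helper is a hypothetical illustration of interleaving real and simulated trajectories, not DynaWeb's actual procedure.

```python
import random


def rollout_survival(per_step_accuracy: float, horizon: int) -> float:
    """Probability that every one of `horizon` simulated steps is correct,
    assuming independent per-step errors (a simplifying assumption)."""
    return per_step_accuracy ** horizon


def mix_rollouts(real, dreamed, real_fraction, batch_size, seed=0):
    """Sample a training batch with a fixed share of real trajectories."""
    rng = random.Random(seed)
    n_real = round(batch_size * real_fraction)
    batch = rng.choices(real, k=n_real)
    batch += rng.choices(dreamed, k=batch_size - n_real)
    rng.shuffle(batch)
    return batch


# Even a simulator that is right 98% of the time per step yields a fully
# correct 30-step rollout only about 55% of the time under this model.
print(round(rollout_survival(0.98, 30), 3))

batch = mix_rollouts(real=["real_1", "real_2"],
                     dreamed=["dream_1", "dream_2", "dream_3"],
                     real_fraction=0.25, batch_size=8)
```

Under these assumptions, long terminal sessions are where simulated rollouts degrade fastest, which is one reason anchoring training with some real trajectories may matter.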

Gap. With the single exception of DynaWeb (web, not terminal), no published work that we know of trains an LLM agent end-to-end inside another LLM-as-environment for a symbolic, execution-grounded domain. The terminal is arguably harder than the web: the next observation must respect deterministic POSIX semantics, whereas a web world model only has to produce plausible free-form HTML.

D. Self-improvement / co-training of policy and simulator

rStar-Math — Guan et al., Microsoft, 2025. Co-evolves a policy SLM and a process-reward model via MCTS; lifts Qwen2.5-Math-7B from 58.8% to 90.0% on MATH. Template for joint solver/verifier training. arxiv:2501.04519
Self-Challenging Language Model Agents (SCA) — Zhou et al., Meta, June 2025. Challenger LLM proposes synthetic tool-use tasks; solver LLM is trained on them. +20.2% on tool-use evals with no human tasks. arxiv:2506.01716
GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators — Princeton / ByteDance, Dec 2025. Most directly relevant: an LLM acts as an environment simulator that adapts task difficulty to the policy’s zone of proximal development; +40.3% on 7B baselines across five benchmarks. arxiv:2512.19682
Kimi K2 agentic data synthesis (revisited) — Moonshot AI, July 2025. The same three-stage pipeline cited in §C also functions as automated task generation: the tool-spec repository plus an agent/task generator produces tasks before the simulator produces trajectories. Direct prior art for our bonus experiment (LLM-generated terminal tasks). arxiv:2507.20534
SPIN / Self-Rewarding LM / DPO-style self-play. Standard reference points for closing the loop between a generator and an evaluator.

Gap. Co-evolution work to date pairs a task generator with a policy, not a dynamics simulator with a policy. If TerminalWorld’s “bonus” experiment (generating the terminal tasks themselves) and a learned dynamics model are both co-trained with the policy, the result would be, to our knowledge, the first such loop in the terminal domain.

Overview Phase 0 — SWE-ZERO derisk