Evaluation methods

Three layers of evidence have emerged in the last two years for grading a learned environment model: (1) intrinsic token- and state-level fidelity against held-out real rollouts (perplexity, exact-match, functional-equivalence in the style of CRUXEval/REval/CWM); (2) long-horizon coherence under autoregressive drift (DreamerV3, MuZero, R-WoM, DynaWeb all measure it, all show steep decay); and (3) downstream policy gains — the only metric that ultimately matters for an environment intended as a training substrate. TerminalWorld inherits the shape of these protocols but the terminal forces some new constraints: outputs are discrete byte streams with strict POSIX semantics, sessions are long, and the consumer (a Qwen3-8B coding agent) will be graded externally on Terminal-Bench 2.0. We propose a three-tier protocol — intrinsic, probe-set, downstream — patterned on what the field already accepts but specialized to shell semantics.

Intrinsic simulator-quality metrics

The default first cut for any learned generator is likelihood and surface-form similarity against held-out ground truth. Perplexity, BLEU and ROUGE are easy to compute over real-shell transcripts, but they penalize outputs that differ only in harmless detail (e.g. timestamps, PIDs, file ordering from ls) and can score a fluent hallucination as well as a correct output. The code-generation community has been moving away from surface metrics for exactly this reason: HumanEval-style pass@k discards lexical comparison entirely and asks whether the output runs and behaves correctly (Chen et al., 2021; pass@k explainer). Exact-match against a deterministic gold trace is useful for narrow probes (deterministic commands, fixed inputs) but useless on the long tail of shell output.
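To make the surface-metric problem concrete, here is a minimal sketch of exact-match after masking volatile fields. The regexes, the placeholder tokens and the normalized_match helper are illustrative assumptions, not part of any existing TerminalWorld tooling; a real suite would need rules per command family.

```python
import re

# Hypothetical masking rules: each pattern is replaced by a fixed placeholder
# so that harmless run-to-run variation does not count as a mismatch.
_VOLATILE = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}\b"), "<TS>"),  # ISO-like timestamps
    (re.compile(r"\bpid[= ]\d+\b", re.IGNORECASE), "pid=<PID>"),          # "pid=123" / "pid 123"
]


def normalize(output: str, order_insensitive: bool = False) -> str:
    """Mask volatile fields, strip trailing whitespace, optionally sort lines."""
    for pattern, placeholder in _VOLATILE:
        output = pattern.sub(placeholder, output)
    lines = [line.rstrip() for line in output.strip().splitlines()]
    if order_insensitive:  # e.g. for listings whose order is not guaranteed
        lines = sorted(lines)
    return "\n".join(lines)


def normalized_match(real: str, sim: str, order_insensitive: bool = False) -> bool:
    """Exact match after normalization; still a surface metric, just a fairer one."""
    return normalize(real, order_insensitive) == normalize(sim, order_insensitive)
```

Normalization removes the false negatives but not the false positives: a fabricated listing with the right shape still has to be caught by a state-level check.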

The more interesting metric family is execution-equivalence, where the model is asked to predict what happens, not just what is printed. CRUXEval grades models on whether they can predict the output of short Python functions (Gu et al., ICML 2024); REval extends this to four runtime-behavior tasks (coverage, program state, execution path, output) and reports an incremental-consistency score that drops to 10.3 on average across models, showing how brittle execution reasoning is even on a single function (Chen et al., 2024). Meta’s CWM is the closest analog at scale: a 32B model that emits a stack-frame trace for Python execution and scores 94.3% on CRUXEval-O and 94% on its in-house HaltEval (Meta FAIR, 2025; model release). The key transferable idea, directly applicable to TerminalWorld, is probing intermediate state, not just final tokens: after mkdir foo && touch foo/bar, the simulator should be asked whether bar would appear in ls foo, and the answer graded as a state-equivalence check rather than a string match.
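The mkdir/touch example can be sketched as a state-equivalence check. This is a minimal illustration under stated assumptions: the ground truth is built with the Python standard library rather than a real shell, and parse_ls and state_equivalent are hypothetical helper names.

```python
import os
import tempfile


def real_listing_after_mkdir_touch() -> set[str]:
    """Ground truth for `mkdir foo && touch foo/bar` followed by `ls foo`.

    Built with stdlib filesystem calls instead of a shell, to keep the sketch portable.
    """
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "foo"))
        open(os.path.join(root, "foo", "bar"), "w").close()
        return set(os.listdir(os.path.join(root, "foo")))


def parse_ls(output: str) -> set[str]:
    """Parse plain `ls` output into a set of names.

    Assumption: no whitespace inside file names; a real harness needs stricter parsing.
    """
    return set(output.split())


def state_equivalent(sim_ls_output: str, oracle: set[str]) -> bool:
    """Compare the simulated listing to the oracle as sets, ignoring order and layout."""
    return parse_ls(sim_ls_output) == oracle
```

Comparing sets rather than strings means column layout and ordering differences pass, while a missing or invented entry fails.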

From video and physics world models we borrow the distributional metrics. FVD and its successors compare distributions of real vs. generated rollouts in a perceptual feature space, deliberately avoiding frame-level alignment (Unterthiner et al., 2018; known biases analyzed by Ge et al., CVPR 2024 and Beyond FVD, 2024). The analog for terminals is comparing summary statistics of the output distribution — line counts, command-success rates, error-message families — between real and simulated rollouts of the same command set. Finally, calibration: the simulator should know when it is guessing. The standard tools are Expected Calibration Error and Brier score, both well-defined for next-token distributions; recent work on language-model calibration shows ECE between 0.12 and 0.40 even for frontier models (Calibration of LLMs on Code Summarization, 2024; tokenized Brier score in ConfTuner, 2025).
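For reference, both calibration metrics are short to compute once each prediction has a confidence and a correctness label. The sketch below uses the common equal-width-bin ECE and the binary (top-1) Brier score; treating the top-1 token probability as the confidence is our assumption, and the tokenized variant used in the cited work may differ in detail.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def brier_score(confidences, correct):
    """Binary Brier score: mean squared gap between confidence and the 0/1 outcome."""
    return sum((c - h) ** 2 for c, h in zip(confidences, correct)) / len(confidences)
```

The same (confidence, correct) pairs, grouped by bin, are what a reliability diagram plots.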

Long-horizon / compounding-error evaluation

A terminal session is precisely the regime where autoregressive models break down — long conditioning on the model’s own previous outputs, no external oracle to correct drift. The classic exposure-bias literature (scheduled sampling, Bengio et al., 2015; exposure-bias quantification, ICLR 2020) gave us the conceptual scaffolding; the world-model literature has now produced the canonical rollout-length sweep protocol.

The pattern across DreamerV3, MuZero, R-WoM and DynaWeb is identical: report a target metric as a function of rollout horizon T, and look for the inflection point at which the curve falls off. MuZero’s learned dynamics is “highly accurate” for short horizons and “consistently grows” in error as T increases — yet still useful for planning because the search corrects local mistakes (Hafner et al., Nature 2025; What model does MuZero learn?, 2023). R-WoM makes the cleanest statement of the problem: pure-LLM world models “demonstrate strong short-term dynamics understanding but fail to maintain accuracy in full-procedure planning,” recovering +25.3% on OSWorld and +18.1% on WebArena only after retrieving relevant tutorials to ground each step (Mei et al., 2025). DynaWeb explicitly interleaves real expert trajectories with on-policy rollouts during training to combat compounding error in a learned web simulator (Ding et al., 2026). For narrative-coherence analogues from image/video models, Genie 3 reports that object state “remains coherent for about a minute” of interactive rollout (DeepMind blog) — the same kind of bounded-horizon caveat we should expect from a terminal simulator.

For TerminalWorld this implies three concrete things to measure: (a) degradation curves of exact-match and functional-equivalence as a function of session depth (turn 1, 5, 10, 25, 50); (b) state-divergence between simulated and real session state, measured via probe commands (ls, pwd, env, cat /etc/known-file) inserted at fixed depths; and (c) recovery experiments that periodically reseed the simulator with real ground truth and report how quickly fidelity returns — DynaWeb’s interleaving is the prior art and likely the right knob.
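Item (a) reduces to a small aggregation once each session is annotated with per-turn match flags. The sketch below assumes "fidelity at depth T" means the match rate at turn T itself (not a cumulative rate over the first T turns) and that sessions shorter than T are excluded at that depth; both are our reading, and degradation_curve is a hypothetical name.

```python
from collections import defaultdict


def degradation_curve(sessions, depths=(1, 5, 10, 25, 50)):
    """Match rate at each depth T.

    sessions: list of lists of booleans; sessions[i][t - 1] says whether the
    simulator's output at turn t of session i matched the real shell.
    Sessions shorter than T do not contribute to depth T.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for turns in sessions:
        for depth in depths:
            if depth <= len(turns):
                totals[depth] += 1
                hits[depth] += bool(turns[depth - 1])
    return {d: hits[d] / totals[d] for d in depths if totals[d]}
```

Running it once with exact-match flags and once with functional-equivalence flags gives the two curves to compare; the number of sessions surviving to each depth should be reported alongside, since deep points rest on fewer samples.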

Downstream-policy / sim2real evaluation

Intrinsic fidelity is necessary but not sufficient. The clean test for an environment intended as a training substrate is: does a policy trained inside the simulator transfer to the real environment? This is the sim2real gap from robotics applied to text agents, and recent web-agent work has good templates.

DynaWeb’s headline numbers, significant improvements on WebArena and WebVoyager from agents trained “inside the dream”, are the strongest existence proof that an LLM-as-environment can yield real downstream wins (arxiv). NNetNav fine-tunes Llama-3.1-8B on 10k self-generated browser demonstrations and improves WebArena by 15 points over zero-shot (Murty et al., 2025); AgentGen’s tuned Llama-3.1-8B surpasses GPT-3.5 on AgentBoard (Hu et al., KDD 2025); GenEnv reports +40.3% across five planning benchmarks via difficulty-aligned co-evolution (Guo, Yang et al., 2025); Kimi K2’s report describes a “tool simulator” with rubric-based LLM-judge evaluation that produced agent-quality training data at trillion-parameter scale (Moonshot AI, 2025). Across these papers, the conventional ablations are: (i) matched-data baselines (same number of tasks, real env vs. simulator), (ii) data-efficiency curves (benchmark score vs. number of simulated trajectories; for TerminalWorld the score is TB-2.0), and (iii) mixed real+simulated training to find the optimal interleaving ratio.

The classical model-based-RL view from DreamerV3 and MuZero is more pessimistic about simulator quality alone — both rely on planning or search to correct an imperfect dynamics model. We don’t have search in the SFT loop, so the bar for TerminalWorld is correspondingly higher: the simulator’s marginal token distributions need to be good enough that gradient updates against them generalize. The acid test is unambiguous: SFT Qwen3-8B inside the learned terminal, then evaluate the resulting policy on Terminal-Bench 2.0 (Merrill et al., 2026; tbench.ai), with the comparison baselines being standard assistant-masked SFT and ECHO-style unmasked SFT (loss over assistant + terminal tokens) — exactly the ablation the marin proposal is designed around.

Failure-mode characterization and qualitative eval

Aggregate numbers hide the failure modes that matter most. ToolEmu’s contribution was not its emulator but its risk taxonomy: an LLM-as-judge over agent trajectories that classifies failures into a small number of categories, validated against human raters (68.8% precision on real-world risks; Ruan et al., ICLR 2024). R-WoM uses a similar lens to identify hallucination types specific to GUI environments. The recent “Demystifying Errors in LLM Reasoning Traces” study curates 427 code snippets and develops a nine-category taxonomy of execution-reasoning failures: arithmetic, control-flow splits, index miscalculation, native-API misinterpretation, and so on (arxiv 2512.00215, 2025). These taxonomies are reusable as the evaluation rubric an LLM-judge applies to TerminalWorld outputs.

Shell-specific pathologies the simulator must be probed for:

Filesystem-state inconsistency — mkdir foo then ls failing to show foo; rm bar then cat bar returning content; cd /etc then pwd returning /home/....
Output hallucination — fabricated package versions in pip list, made-up file listings, fake process tables in ps, plausible but wrong man output.
Nondeterminism — timestamps, PIDs, hostnames, $RANDOM, mktemp paths: the simulator must produce plausibly-distributed values, not memorized strings. FVD-style distributional comparison is the right frame.
Error-message faithfulness — command not found vs. No such file or directory vs. Permission denied; correct exit codes ($?) after false, grep -q with no match, cat on a missing file.
Operator semantics — pipes (|), redirects (>, >>, 2>&1), command substitution ($(...)), backgrounding (&), signal handling.
Environment persistence — export FOO=bar visible in the next subshell? cd only within the current shell? source script.sh mutations persisting?
Adversarial probes — heredocs that look like file contents, echo of strings that resemble shell prompts, commands designed to elicit prompt-injection-style leakage.

Each pathology becomes one or more probe sequences in Tier 2 below, scored pass/fail against a real-shell oracle wherever the expected output is deterministic. ShellFuzzer-style grammar-based generation (Gholamian et al., 2024) is a free source of additional probes.
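One possible shape for a probe and its scorer is sketched below. The Probe fields, the simulator callable signature and the two example probes are all hypothetical; the expected values are what a typical Linux shell would produce and would in practice be recorded from the real-shell oracle rather than written by hand.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Probe:
    category: str                 # one of the pathology families listed above
    commands: tuple[str, ...]     # scripted session, in order
    expected_output: str          # stdout of the last command on a real shell
    expected_exit_code: int       # $? after the last command


# Assumed interface: the simulator takes the command sequence and returns
# (stdout of the last command, exit code of the last command).
Simulator = Callable[[Sequence[str]], tuple[str, int]]

EXAMPLE_PROBES = [
    Probe("error-message faithfulness", ("false",), "", 1),
    Probe("filesystem-state inconsistency", ("cd /etc", "pwd"), "/etc", 0),
]


def run_probe(probe: Probe, simulate: Simulator) -> bool:
    """Hard pass/fail: both the output and the exit code must match the oracle."""
    output, exit_code = simulate(probe.commands)
    return (output.strip() == probe.expected_output.strip()
            and exit_code == probe.expected_exit_code)
```

Checking the exit code separately from the output keeps "right text, wrong status" failures visible.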

Cross-domain transfer: what to borrow

Code-execution simulators (CRUXEval, REval, CWM, “Can LLMs Reason About Program Invariants”): the dominant idea is predict intermediate state, not just final output. For TerminalWorld this maps to inserting ls/cat/env probes mid-session and grading the simulator on those state read-outs, not only on the next command’s stdout. (Pei et al., ICML 2023; CRUXEval and CWM as cited above.)
Web/GUI world models (WebDreamer, R-WoM, DynaWeb, UI-Simulator): metric set converges on next-state prediction, full-procedure planning alignment, and milestone-transition recognition, plus downstream success rate on WebArena/OSWorld/WebVoyager (WebDreamer / Gu et al., TMLR 2025). The R-WoM split of “short-term dynamics” vs. “full-procedure planning” is directly portable to a terminal as “single-command fidelity” vs. “session-level state coherence.”
Game/physics world models (DreamerV3, Genie/Genie 3, UniSim): contribute the long-horizon coherence framing and the distributional metrics, but the most important transferable claim is more conceptual — these models are graded by what an agent can do inside them, not by perfect frame-by-frame reconstruction (UniSim, ICLR 2024; Genie, ICML 2024). UniSim’s RL and VLM policies that transfer zero-shot to the real world are the closest published analog of the experiment we want to run.
Tool-use simulators (ToolEmu, Kimi K2): contribute the LLM-judge methodology plus the rubric construction style. K2’s two-track verification — deterministic for verifiable outputs, LLM-judge for nuanced ones — is a sensible split for TerminalWorld where many probes have a deterministic real-shell oracle and a minority (e.g. man page paraphrases) need judge scoring.

A proposed evaluation protocol for TerminalWorld

We propose three tiers, run on every checkpoint, with clear pass thresholds.

Tier 1 — Intrinsic fidelity on held-out sessions. Take a held-out slice of the Nemotron-Terminal corpus plus newly recorded Docker rollouts that postdate training. Report (a) per-token cross-entropy and byte-level perplexity on real shell outputs; (b) exact-match rate of the simulator’s output against the real shell output for deterministic commands, broken out per command family; (c) functional-equivalence rate measured by replaying both real and simulated sessions and comparing state via probe commands; (d) calibration: ECE and tokenized Brier over the next-token distribution, with reliability diagrams. Report each as a curve over rollout depth T ∈ {1, 3, 10, 25, 50}. A positive result looks like exact-match staying above some floor (say 80% at T=1, 50% at T=10) on deterministic commands, with functional-equivalence degrading more gracefully than exact-match — confirming the simulator captures effects even when surface form diverges. Following DynaWeb, include a recovery curve: how quickly fidelity rebounds when the simulator is reseeded with ground truth every k turns.
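Because tokenizers differ, byte-level perplexity in (a) is the number that stays comparable across checkpoints and model families. A minimal sketch of the conversion, assuming per-token negative log-likelihoods in nats and UTF-8 byte counts (the function names are ours):

```python
import math


def byte_perplexity(token_nll_nats, text: str) -> float:
    """exp(total NLL / number of UTF-8 bytes): perplexity per byte, not per token."""
    n_bytes = len(text.encode("utf-8"))
    return math.exp(sum(token_nll_nats) / n_bytes)


def bits_per_byte(token_nll_nats, text: str) -> float:
    """The same quantity on a log2 scale: total NLL in bits divided by byte count."""
    n_bytes = len(text.encode("utf-8"))
    return sum(token_nll_nats) / (n_bytes * math.log(2))
```

For example, two tokens that each cost 2·ln 2 nats over the 4-byte string "abcd" give a byte perplexity of 2 and exactly 1 bit per byte.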

Tier 2 — Probe-set diagnostics. A hand-curated suite of 100–200 shell behaviors organized into the categories above (filesystem state, output hallucination, nondeterminism distribution, error messages, operator semantics, environment persistence, adversarial). Each probe is a short scripted session with a known real-shell oracle. Most are graded pass/fail by direct comparison; nondeterministic ones (timestamps, PIDs) are graded by distributional similarity (range, format, plausibility) à la FVD. Adversarial and man/help-text probes get LLM-judge scoring with a rubric in the style of ToolEmu/K2. Report per-category pass rates and a single weighted aggregate. A positive result is monotonic improvement across successive checkpoints, with no category dropping below 50%, and ideally exact-match parity with a real shell on the deterministic categories (filesystem state, exit codes, operator semantics).
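The Tier 2 report can be sketched as follows. The category weights are an open design choice that the protocol does not fix, so they are passed in explicitly; tier2_report is a hypothetical name and the 0.5 floor mirrors the 50% threshold above.

```python
def tier2_report(results, weights, floor=0.5):
    """Per-category pass rates, weighted aggregate, and categories under the floor.

    results: {category: [bool, ...]} with one entry per probe outcome.
    weights: {category: float}; relative importance, chosen by the evaluator.
    """
    rates = {cat: sum(outcomes) / len(outcomes) for cat, outcomes in results.items()}
    total_weight = sum(weights[cat] for cat in rates)
    aggregate = sum(weights[cat] * rates[cat] for cat in rates) / total_weight
    below_floor = sorted(cat for cat, rate in rates.items() if rate < floor)
    return rates, aggregate, below_floor
```

Returning the below-floor list separately matters because a healthy aggregate can hide a single failing category.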

Tier 3 — Downstream transfer (the acid test). SFT Qwen3-8B inside TerminalWorld using the Nemotron-Terminal task set, then evaluate on Terminal-Bench 2.0 with the Terminus harness, plus side benchmarks (SWE-bench Verified, the Endless Terminals held-out split). The three arms required by the proposal are: (a) standard assistant-masked SFT on the corpus: compute loss only on assistant-turn tokens, the conventional TRL assistant_only_loss=True recipe; (b) full-transcript SFT (the SFT-mode equivalent of ECHO): drop the assistant-only mask so the same single cross-entropy is computed over both assistant and terminal-response tokens (in the RL setting ECHO is a hybrid L_GRPO + λ · L_env; in SFT there is no separate policy-gradient term to weight against, so the auxiliary-loss formulation collapses to plain unmasking and there is no λ to tune); and (c) SFT against TerminalWorld: train on rollouts generated against the learned simulator rather than against a real Docker container. Report the headline TB-2.0 number, a data-efficiency curve (TB-2.0 vs. number of simulated trajectories at 1k/10k/100k), and a mixed-data curve (real-Docker / simulated mix swept from 0/100 to 100/0). A positive result is that the TerminalWorld arm (c) matches or exceeds the ECHO-style arm (b) at the same compute and beats standard assistant-masked SFT (a) by a margin that grows with simulated-trajectory count, replicating the qualitative shape of DynaWeb’s web-agent results in the terminal domain. A negative-but-informative result is the exposure-bias signature: the simulator-trained policy looks good on simulator-internal rewards but degrades on Terminal-Bench 2.0, which would be diagnostic of the same compounding-error failure mode R-WoM documents for GUIs.
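The only difference between the assistant-masked and full-transcript arms is which token positions contribute to the cross-entropy. A framework-free sketch, assuming each token carries a role tag ("system", "user", "assistant" or "terminal"); the role names, mode strings and helper names are illustrative and are not the TRL API:

```python
def loss_mask(token_roles, mode):
    """Return a 0/1 mask over tokens for the chosen SFT arm.

    mode="assistant_only":  loss on assistant tokens only.
    mode="full_transcript": loss on assistant and terminal-response tokens.
    System and user tokens are masked out in both modes (our assumption).
    """
    if mode == "assistant_only":
        trained_roles = {"assistant"}
    elif mode == "full_transcript":
        trained_roles = {"assistant", "terminal"}
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return [1 if role in trained_roles else 0 for role in token_roles]


def masked_mean_nll(token_nll, mask):
    """Single cross-entropy averaged over unmasked tokens; no auxiliary weight."""
    kept = sum(mask)
    return sum(nll * m for nll, m in zip(token_nll, mask)) / kept
```

Since both arms share one averaged loss, terminal-response tokens influence the gradient in proportion to their count, which is worth logging because terminal output is often much longer than the assistant turns.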

The three tiers are deliberately complementary: Tier 1 catches gross fidelity regressions, Tier 2 catches specific shell semantics the simulator gets wrong, and Tier 3 is the only number that ultimately answers the question this project exists to answer — is the learned terminal a useful training substrate?
