march 31, 2026 · software testing

Deterministic Simulation Testing

How seed-dependent determinism, IO fault models, and virtual clocks let you scale testing across space and time — finding bugs that only appear after hours of real-world execution.

by Chaitanya · founder, workers io

I’ve spent most of my career as an engineer, and if I look back, nearly all of my stressful hours have been spent debugging and fixing things that broke. Honestly, some of the most frustrating moments I’ve had as a user have also come from running into buggy software.

So let’s take a step back and try to understand how we got here in the first place.

How did modern software become so complex? Why is it under-tested, and how can we break the barriers of space and time to build confidence in the code we write?

THE PROBLEM

The current state of testing

Non-determinism in modern software

It’s easy to assume our code is deterministic. But as we build on top of abstractions, we lose sight of what’s actually happening at the system level.

The software we rely on every day performs hundreds of operations of varying complexity and uses external services, often spread across machines worldwide.

In such complex systems, the same code can behave very differently depending on what’s happening at the lower levels.

Challenge: How do we define “correctness” in a system where one can’t even reason about, or control, how it’s running?

A common answer to that is to handle every possible scenario. If you’ve ever shipped real software, you know that’s basically impossible.

You Can’t Control the Wall-Clock

Modern software isn’t just crunching numbers. It’s taking user inputs, talking to other services, writing to disk, handling failures, and much more.

Any software that actually matters has way more possible paths than you could ever test in a day.

A single E2E test can take minutes to complete; testing hundreds of such code paths would take hours. That’s just not practical, so we end up picking a handful of cases we think are interesting.

Challenge: Because of time constraints, we make assumptions about what real usage looks like. That isn’t useless, but more often than not these tests mirror what we have implemented rather than how the system is actually used, so they are unlikely to surface production failures.

Bad Things Happen

The last piece is the infrastructure itself. Most of us trust the APIs we call, the databases we use, and the machines our code runs on.

Even with the cloud hiding a lot of the messy details, things still fail or slow down in the real world. But our test environments are almost never as chaotic as production.

Challenge: You’re not going to catch bugs that only appear when the services you depend on slow down or stop responding. Most failures happen at the edges between systems, and those are usually the least-tested parts.

These are also the hardest bugs to reproduce. For example, maybe something only breaks when the event loop resolves things in just the right (or wrong) order, or when the packets between services are disrupted.

THE APPROACH

Scaling testing across space and time

Let me show you what happens if you tackle all these challenges head-on, using first principles.

Seed-Dependent Determinism

I forked Bun and made its execution deterministic based on a “seed”. The aim is to reduce the entropy of the application logic for a given seed to zero.

Bun’s runtime is probably one of the easiest places to add determinism. The event loop actually makes things simpler here than in languages like Go or C++.

In simpler words, for a given seed X, you can be certain your request will take the exact same path and experience the exact same latencies and failures.

Worker thread allocation, resolution timing, and ordering are all controlled
Math.random() always returns the same sequence of numbers
setTimeout() calls with the same delay always resolve in the same order
IO operations, latencies, and failures are all deterministic
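
To make the Math.random() point concrete, here’s a minimal sketch of seed-dependent randomness. It’s an illustration rather than the fork’s actual implementation: it assumes a small, well-known PRNG (mulberry32), and the sequence helper is a name made up for this example.

```javascript
// mulberry32: a small, well-known seeded PRNG (not cryptographically secure).
// Assumption: any seeded generator works here; this one is just compact.
function mulberry32(seed) {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // float in [0, 1)
  };
}

// Hypothetical helper: collect the first `count` numbers for a seed.
function sequence(seed, count) {
  const random = mulberry32(seed);
  return Array.from({ length: count }, () => random());
}

const runA = sequence(42, 5);
const runB = sequence(42, 5);

console.log(runA);
console.log(runA.join() === runB.join()); // true: same seed, same sequence
```

If every source of randomness in the runtime draws from one stream like this, rerunning with the same seed replays the same decisions.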

IO Fault Models

When we’re testing, we want to see all the edge cases. If your app falls over when a service is slow, you want to be able to trigger that on purpose.

So we add a fault model that intercepts IO requests and uses the seed to decide what happens. You can model fetches, SQLite, DNS, filesystem calls, or any combination, to create realistic scenarios.

Each fault scenario and mocked IO request depends on the seed, ensuring perfect reproducibility.

// Fail roughly 1% of writes under /sim/ with ENOSPC (no space left on device)
Bun.sim.fault({
  target: "fs_write",
  pattern: "/sim/*",
  fault: "errno",
  errno: "ENOSPC",
  probability: 0.01,
});

// OR: add 50 to 500 ms of latency to network traffic

Bun.sim.fault({
  target: "net_latency",
  fault: "delay",
  delayMs: { min: 50, max: 500 },
  probability: 1.0,
});

// OR: combine several faults into one scenario

Bun.sim.swarm({
  faults: [
    { target: "fs_open", fault: "ENOENT" },
    { target: "dns", fault: "timeout" },
    { target: "net_recv", fault: "connection_reset" },
  ],
});

We can also mock IO requests to create situations our app might hit in production.

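// Mock the endpoint with a canned 200 response
Bun.sim.fetch("https://api.test/data", { status: 200, body: "ok" });
// Inject ECONNREFUSED for requests to that URL (probability 1.0 means always)
Bun.sim.fault({
  target: "fetch",
  pattern: "https://api.test/data",
  error: "ECONNREFUSED",
  probability: 1.0,
});

// This request goes through the simulated IO layer, not the real network
const res = await fetch("https://api.test/data");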
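
To show how a seed can drive fault decisions, here’s a rough sketch that is independent of the Bun.sim API above. The names createFaultInjector and faultTrace, the rule shape, and the 0.3 probabilities are all assumptions made up for this illustration; the only real idea is that every decision draws from a seeded stream.

```javascript
// mulberry32: a small seeded PRNG, used as the single source of randomness.
function mulberry32(seed) {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical injector: returns a function that decides, per IO operation,
// which errno to inject, or null to let the operation succeed.
function createFaultInjector(seed, rules) {
  const random = mulberry32(seed);
  return (target) => {
    for (const rule of rules) {
      if (rule.target !== target) continue;
      // The only randomness is the seeded stream, so the same seed and the
      // same sequence of operations always produce the same decisions.
      if (random() < rule.probability) return rule.errno;
    }
    return null;
  };
}

// Replays a list of operations and records what the injector decided for each.
function faultTrace(seed, rules, operations) {
  const decide = createFaultInjector(seed, rules);
  return operations.map((op) => `${op}:${decide(op) ?? "ok"}`);
}

const rules = [
  { target: "fs_write", errno: "ENOSPC", probability: 0.3 },
  { target: "fetch", errno: "ECONNREFUSED", probability: 0.3 },
];
const operations = ["fetch", "fs_write", "fetch", "fetch", "fs_write", "fetch"];

const first = faultTrace(7, rules, operations);
const second = faultTrace(7, rules, operations);
console.log(first.join(" "));
console.log(first.join(" ") === second.join(" ")); // true: same seed, same faults
```

Changing the seed changes which operations fail, which is what lets a sweep over many seeds explore many failure combinations.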

Take Away the Wall-Clock

The last piece is running these tests as fast as possible, while modeling timeouts and network delays realistically. This lets us scale our testing time, and helps us find edge cases we’d probably never hit otherwise.

For example, maybe there’s a bug that only shows up after the app has been running for 10 hours and thousands of IO operations have interleaved.

Everything runs on a virtual clock, so we can see what production behavior would look like without actually waiting. If there’s a 30-second timeout, the virtual clock just jumps ahead by 30 seconds, and any events depending on that clock get triggered as it moves forward.
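
Here’s a minimal sketch of the virtual clock idea. The VirtualClock class and its methods are hypothetical names for this illustration, not the fork’s API: timers are kept in a queue, and instead of waiting, the clock jumps straight to the next due time.

```javascript
class VirtualClock {
  constructor() {
    this.now = 0; // virtual milliseconds since the simulation started
    this.timers = []; // pending timers: { at, id, callback }
    this.nextId = 0;
  }

  // Schedule a callback relative to the current virtual time.
  setTimeout(callback, delayMs) {
    this.timers.push({ at: this.now + delayMs, id: this.nextId++, callback });
  }

  // Fire timers in (due time, creation order) until none are left
  // or the next one is due after `untilMs`.
  run(untilMs = Infinity) {
    while (this.timers.length > 0) {
      this.timers.sort((a, b) => a.at - b.at || a.id - b.id);
      if (this.timers[0].at > untilMs) break;
      const timer = this.timers.shift();
      this.now = timer.at; // jump ahead instead of sleeping
      timer.callback();
    }
    if (Number.isFinite(untilMs)) this.now = Math.max(this.now, untilMs);
  }
}

const clock = new VirtualClock();
const log = [];

clock.setTimeout(() => log.push(`timeout fired at ${clock.now}ms`), 30000);
clock.setTimeout(() => {
  log.push(`heartbeat at ${clock.now}ms`);
  clock.setTimeout(() => log.push(`heartbeat at ${clock.now}ms`), 10000);
}, 10000);

clock.run();
console.log(log.join("\n"));
// heartbeat at 10000ms
// heartbeat at 20000ms
// timeout fired at 30000ms
```

Thirty virtual seconds pass in about as long as it takes to run three callbacks, and ties are broken by creation order, so the schedule stays deterministic.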

EXPERIMENT

Let’s See How It All Comes Together

Let’s try this out with a multiplayer chess game. We’ll simulate 10 hours of gameplay using different fault models and see what we find.

The Setup

The system under test is a multiplayer chess game: each game lasts 3 minutes, and clients communicate with the game server over socket connections.

It’s a simple example, but you can still get some really interesting bugs. If we want to fix them, we need a way to reproduce them deterministically.

Our simulator will run the entire system in deterministic mode, with clients and servers using the same seed, and we will play hundreds of games on a sped-up clock.

We want the tests to run at very high speed, but the virtual clock has to stay realistic: a human will not be making 1,000 moves a second, and a bug is valid only if it can occur in realistic human gameplay.

I had Claude Code review things several times to make sure it was really hard to break the game. This was intentional; I wanted to know whether simulation could find bugs that code review tools cannot find by just looking at the code.

Here is what our fault model looks like:

Messages might get lost
There could be network latency between clients and the server
Players can send illegal chess moves

The invariant we care about: Across runs, we want to ensure there is no state drift between clients and the server.
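
As a rough sketch of what checking this invariant could look like, here’s a toy version. It is not the real game code: the state is just a list of accepted moves rather than a chess board, and checkNoDrift, simulateGame, and the scripted moves are all made up for illustration.

```javascript
// mulberry32: a small seeded PRNG that decides which messages get lost.
function mulberry32(seed) {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Invariant: the client's view of the game must match the server's.
// Returns null when they agree, or a description of the drift.
function checkNoDrift(server, client) {
  const serverView = server.moves.join(" ");
  const clientView = client.moves.join(" ");
  return serverView === clientView
    ? null
    : `drift: server=[${serverView}] client=[${clientView}]`;
}

// Toy game loop: the server accepts each move and forwards it to one client
// over a channel that drops messages with the given probability.
function simulateGame(seed, dropProbability) {
  const random = mulberry32(seed);
  const server = { moves: [] };
  const client = { moves: [] };
  const script = ["e4", "e5", "Nf3", "Nc6", "Bb5", "a6"];

  for (let step = 0; step < script.length; step++) {
    server.moves.push(script[step]);
    const delivered = random() >= dropProbability;
    if (delivered) client.moves.push(script[step]);

    // Check after every step so a violation points at the exact move.
    const violation = checkNoDrift(server, client);
    if (violation) return { seed, step, violation };
  }
  return { seed, step: script.length, violation: null };
}

console.log(simulateGame(42, 0)); // reliable channel: violation is null
console.log(simulateGame(42, 1)); // every message dropped: drift at step 0
```

This naive protocol drifts on the first lost message, which is the point of the sketch: the check runs after every step, so a violation comes with the seed and the step needed to replay it.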

One nice side effect of determinism is that you can reproduce the same results every time, rewind changes, run faster than real time, and more.

This can be hard to picture, so I built a preview page where you can trigger a run for a single seed or a hundred seeds at once.

Round 1 — Single run

First, let’s run a deterministic test with two clients playing against each other, using seed 42. We’ll save the state so we can replay it if needed. The test runs at CPU speed, but the virtual clock keeps track of what wall-clock time would look like.

Valid moves come from Chess.js. Our fault model injects failures and checks simple invariants, such as clock and board consistency, between clients and the server.

No violations showed up in this run, which isn’t surprising. You’re lucky if you find a bug in one out of a hundred runs.

Round 2 — 100 parallel runs

Let’s make things interesting. We will now run the same fault model across 100 parallel games, each with a different seed and all sped up 10x, and see what turns up.

Nice, we found four invariant violations with seeds 28, 59, 68, and 94. Since the runtime is fully deterministic, we can reproduce these results exactly and see where things went wrong.

This is super helpful if you want to make sure your fix actually solves a weird race condition. Just update your code and rerun with the same seed.
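
The sweep-then-replay workflow can be sketched in a few lines. This is a toy, not the simulator from this post: simulate stands in for a full game, the "bug" (two dropped messages in a row) is invented, and the sweep runs sequentially rather than in parallel.

```javascript
// mulberry32: a small seeded PRNG; the seed fully determines each run.
function mulberry32(seed) {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical stand-in for one simulated game. The made-up invariant
// breaks when two messages in a row are dropped.
function simulate(seed) {
  const random = mulberry32(seed);
  const trace = [];
  let previousDropped = false;
  for (let step = 0; step < 50; step++) {
    const dropped = random() < 0.1;
    trace.push(dropped ? "drop" : "ok");
    if (dropped && previousDropped) {
      return { seed, ok: false, failedAt: step, trace };
    }
    previousDropped = dropped;
  }
  return { seed, ok: true, failedAt: null, trace };
}

// Run seeds 0..seedCount-1 and keep only the runs that violated the invariant.
function sweep(seedCount) {
  const failures = [];
  for (let seed = 0; seed < seedCount; seed++) {
    const result = simulate(seed);
    if (!result.ok) failures.push(result);
  }
  return failures;
}

const failures = sweep(100);
console.log("failing seeds:", failures.map((f) => f.seed));

// Replaying a failing seed reproduces the same failure at the same step.
for (const failure of failures) {
  const replay = simulate(failure.seed);
  const same =
    replay.failedAt === failure.failedAt &&
    replay.trace.join() === failure.trace.join();
  console.log(`seed ${failure.seed} reproduces:`, same);
}
```

After a fix, rerunning simulate with a previously failing seed is the regression test: same inputs, same faults, and the invariant should now hold.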

Round 3 — Deterministic reproduction

The last part is to see how and where things go wrong, what fault triggered the failure, and what the state of the clients and the server was at that point. All this information can help you or a coding agent resolve the bug faster, and the seed can help you validate the fix with confidence.

Looking at the re-run and trace logs, we can see that if the connection is interrupted and a player tries to rejoin, the game gets reset.

Claude Code missed this bug both during design and code review. Of course, our toy example is much simpler than most real-world apps, and our fault model is also pretty basic compared to what’s possible.

TAKEAWAY

Let’s all write better software

Layers of abstraction hide a lot of decisions that get made for you to get the outcomes you need. That’s great: it helps us ship faster and build more complex software with fewer lines of code.

But it also means we need to get better at testing, and make sure we’re doing it in an environment that’s realistic and true to what can actually happen in production.