Best Practice Release (2.3.0)

History of Hibernation and Persistence

At CodeSandbox, we built a product where users could treat their sandboxes like cloud-based laptops. When a user stepped away from a project, the sandbox would automatically hibernate after a period of inactivity. Upon resuming, the sandbox would restore to its exact previous state — both in memory and persistence — almost instantly.

Given that most sandboxes were small, short-lived projects, we introduced an automatic archiving mechanism. After 7 days of inactivity, a sandbox would be archived. This system allowed us to manage persistence without requiring user intervention. It was opinionated, reliable, and tailored to a single use case that worked well at our scale.

Additionally, the CodeSandbox product introduced a feature called Live Forking. This allowed users to fork a running sandbox with the new sandbox sharing memory from the original. This enabled seamless flows such as starting in a read-only, always-up-to-date main branch and then moving into a writable sandbox branch without any interruptions.

This approach provided:

A simple mental model for users: "My sandbox is always where I left it."
Cost-effective resource management: unused environments were automatically hibernated or archived.
Minimal user configuration: persistence and storage were handled behind the scenes.

What We Have Learned

When we pivoted the CodeSandbox product toward the SDK, the number of use cases expanded dramatically, and many of them were not accounted for in our original stack or assumptions. This shift introduced challenges in both scalability and adaptability.

The core friction in the current SDK stems from our opinionated approach to hibernation and persistence in the CodeSandbox product.

One major issue is automatic hibernation, which is controlled by a timeout that SDK users can configure. However, this timeout is only extended by specific protocol messages sent from the SDK client. Other forms of activity — such as direct HTTP requests or file system operations — do not affect the timeout. This design has proven confusing and brittle across different usage scenarios.

The timeout is also managed inside the sandbox, making it fragile. We've encountered multiple cases where this internal state drifted or failed, causing sandboxes to remain alive beyond the intended timeout — or to hibernate unexpectedly.

Additionally, while SDK users can configure sandboxes to wake on HTTP or WebSocket connections, those interactions do not reset the timeout. This has frustrated users who are trying to keep the VM alive.

In response to scaling pressures, we recently shortened the automatic archive window to 4 days, with periods of 2 days during peak loads to stabilize clusters. While this helped improve system stability, it also introduced less predictable resume behavior. If a sandbox is archived, it will boot from a fresh state (CLEAN) instead of a resumed memory snapshot (RESUME).

Even though this distinction is observable via the bootupType, it adds complexity for SDK integrators and hurts user experience. SDK users now have to:

Detect which state the sandbox is starting from
Account for significantly different startup durations
Handle potential end-user confusion when a sandbox suddenly takes longer to load
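The three responsibilities above could be folded into one small helper on the integrator's side. The following is a minimal sketch, not the SDK's API: `planForBoot`, `BootPlan`, and the time budgets are hypothetical names and values chosen for illustration. Only the CLEAN and RESUME values come from the text above, and how you read bootupType from the SDK is deliberately left out.

```typescript
// The two boot states described above.
type BootupType = "CLEAN" | "RESUME";

// Hypothetical shape for what an integration decides once it knows the boot type.
interface BootPlan {
  // How long to wait before treating the startup as failed (assumed values).
  startupBudgetMs: number;
  // Whether to tell the end user that loading takes longer than usual.
  showSlowStartNotice: boolean;
  // Whether processes must be started again because memory was not restored.
  rerunSetup: boolean;
}

function planForBoot(bootupType: BootupType): BootPlan {
  if (bootupType === "RESUME") {
    // Memory snapshot restored: running processes are still there.
    return { startupBudgetMs: 5000, showSlowStartNotice: false, rerunSetup: false };
  }
  // CLEAN boot: the sandbox starts from a fresh state, so allow more time
  // and redo whatever setup the previous session had in memory.
  return { startupBudgetMs: 90000, showSlowStartNotice: true, rerunSetup: true };
}
```

Keeping this decision in one place means the rest of the integration does not need to branch on the boot type, and the budgets can be tuned as you observe real startup times.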

Finally, the archive-recovery path increases the likelihood of edge-case failures, making integrations more error-prone and harder to support.

The Live Forking feature also introduced a significant scalability challenge. In some scenarios, thousands of sandboxes would simultaneously read from the memory of a single origin sandbox. This led to serious system bottlenecks and degraded performance across the platform.

As a final note, we’ve learned that SDK users — quite understandably — have done whatever they could to make our system work for their products. However, the wide range of new use cases has proven incompatible with our current hibernation, persistence, and forking behaviors. This fundamental mismatch is a key reason we've encountered so many reliability and scalability issues.

What We Are Shipping Today

SDK version 2.3.0 represents our Best Practices release. This is a NON-BREAKING change. We have also rewritten our docs to show you how to best take advantage of the current state of the service.

The following changes have been applied to the SDK:

Feat: Added a new delete method
Feat: Made id an optional field in connect and createSession, so you can default to the global user
Fix: .gitignore is now included in the template build
Fix: Privacy now defaults to public-hosts

What We Are Working On

A REST-based Sandbox Agent

The current SDK client requires a WebSocket connection, which complicates the interface and makes the connection harder to manage across different environments. With a REST-based Sandbox Agent, we simplify both the mental model and the interface, and there is no connection to manage.

Long-term persistence

When a Sandbox is hibernated, we create a snapshot. If the snapshot is not resumed within 2-7 days, depending on the health of the cluster, we archive the Sandbox. This makes resume times unpredictable: a resume normally takes 1-3 seconds, but can take up to 60 seconds when the Sandbox has been archived. With long-term snapshot persistence, we aim to stop archiving Sandboxes altogether, giving you predictable resume times of 1-3 seconds regardless of how long a Sandbox has been hibernated.

Replace the current hibernation timeout and automatic wakeup

If you want the best possible experience and control, we recommend adopting active lifecycle management. This is true for our current service and any planned future changes. That said, we also want to provide a simple and reliable timeout/wakeup mechanism. With the platform managing timeouts and wakeups, we want a default behavior that is easy to understand, but we expect that users will want to configure it. Here are some questions to reflect on:

Should hibernation timeout only extend when calling the Sandbox Agent? (It will have a health endpoint to use as heartbeat)
If hibernation timeout should extend on any request to the Sandbox, are there still certain requests that should not extend it?
If any request to the Sandbox wakes it up, are there still certain requests that should not wake it up?
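To make the first two questions concrete, here is a minimal sketch of a hibernation timer whose "what counts as activity" rule is a pluggable policy. Nothing here is a planned API: `HibernationTimer`, `SandboxRequest`, the two example policies, and the `/metrics` path are all hypothetical, and time is passed in explicitly so the behavior is easy to reason about.

```typescript
// Hypothetical description of a request reaching the sandbox.
interface SandboxRequest {
  kind: "agent" | "http" | "websocket";
  path: string;
}

// A policy decides whether a given request extends the hibernation timeout.
type ExtendsTimeout = (req: SandboxRequest) => boolean;

class HibernationTimer {
  private readonly timeoutMs: number;
  private readonly extendsTimeout: ExtendsTimeout;
  private deadlineMs: number;

  constructor(timeoutMs: number, extendsTimeout: ExtendsTimeout, nowMs: number) {
    this.timeoutMs = timeoutMs;
    this.extendsTimeout = extendsTimeout;
    this.deadlineMs = nowMs + timeoutMs;
  }

  // Record a request; the deadline moves only if the policy says so.
  record(req: SandboxRequest, nowMs: number): void {
    if (this.extendsTimeout(req)) {
      this.deadlineMs = nowMs + this.timeoutMs;
    }
  }

  // True once the deadline has passed without qualifying activity.
  shouldHibernate(nowMs: number): boolean {
    return nowMs >= this.deadlineMs;
  }
}

// Policy for the first question: only calls to the Sandbox Agent count.
const agentOnly: ExtendsTimeout = (req) => req.kind === "agent";

// Policy for the second question: any request counts, except an assumed
// monitoring path that should not keep the sandbox awake.
const anyExceptMetrics: ExtendsTimeout = (req) =>
  !(req.kind === "http" && req.path === "/metrics");
```

Under `agentOnly`, a client would keep a sandbox alive by calling the Agent's health endpoint as a heartbeat. Under `anyExceptMetrics`, ordinary traffic keeps it alive and only the excluded path is ignored.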

Feedback & Collaboration

We want to take this opportunity to thank our users for sticking with us as we’ve explored and adapted to a wide variety of new use cases. One of the most important things we’ve learned is that you want to manage sandboxes like a low-level, simple resource you have full control over — not like the high-level “laptop behavior” we originally designed for the CodeSandbox product.

Your feedback has been invaluable in shaping the direction of the SDK.

As we move forward with this new model, we invite you to reach out with:

• Comments or concerns
• Specific use cases you’d like to discuss
• Invitations to feedback sessions or implementation discussions

We’re committed to making this transition smooth and to giving you the tools and flexibility you need.

With love, The CodeSandbox SDK Team ❤️

christianalfoni/release_sdk.md
