This is an open-ended systems design and implementation challenge. There's no single correct architecture, and we don't expect you to finish every milestone in one day. We care about how you approach it: how you evaluate tools, how you use AI agents throughout, and the tradeoffs you make. We'll ask you to walk us through your thinking, so optimize for understanding and being able to explain your decisions, not just output.
You're encouraged to use AI coding agents (Claude Code, Cursor, Copilot, whatever you prefer) throughout this exercise. This isn't a test of whether you can write code from memory. We'll evaluate you on your understanding of the system you build and the decisions you made. Keep traces/logs of your agent interactions; we'll review them during the debrief.
Build a system that can autonomously develop and improve a web application using AI agents running in sandboxed environments. Think of it as a simplified version of Devin or Ramp's background agent, where you describe a feature in natural language and an agent builds it inside a sandboxed dev environment, then shows you the result.
We've provided a simple React todo app as the target application (see todo-app/).
```sh
cd todo-app
npm install
npm run dev
```

The app should be running at localhost:5173.
We'll provide API keys for LLM calls (Anthropic, OpenRouter). If you need access to other tools or services (sandbox providers, browser infra, etc.), just ask.
Work through these in order. Each builds on the previous one.
Get the provided todo app running locally. Understand how it works well enough to direct an agent to modify it. Poke around, add a todo, get familiar with the code.
If you'd prefer to build your own from scratch or swap in a different framework, go ahead, but don't spend too long here. The interesting work is in the next milestones.
This is the core infrastructure problem. Set up a system where an AI agent can:
- Run the todo app inside an isolated sandbox
- Make code changes (edit files, install packages)
- See the result, whether that's via a URL, port forwarding, screenshots, or something else
You choose the sandboxing technology. Research what's out there and pick what makes sense. Be ready to explain your reasoning during the debrief.
Deliverable: a working sandbox where the todo app boots, is accessible, and code changes are reflected.
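For reference, here is a minimal sketch of what one option could look like with Docker as the sandbox. The base image, mount paths, and wrapper names are illustrative assumptions, not a prescribed approach; a bind mount makes the agent's file edits visible inside the container, and publishing the dev-server port makes the app reachable from outside:

```python
# Sketch of a Docker-based sandbox wrapper (illustrative assumptions:
# node:20-slim image, /app workdir, Vite's default port 5173).
import subprocess


def build_run_cmd(app_dir: str, port: int = 5173) -> list[str]:
    """Build the `docker run` command: bind-mount the app so the agent's
    file edits are reflected live, and publish the Vite dev-server port."""
    return [
        "docker", "run", "--rm", "-d",
        "-v", f"{app_dir}:/app",  # live code changes via bind mount
        "-p", f"{port}:5173",     # expose the dev server on the host
        "-w", "/app",
        "node:20-slim",           # assumed base image
        "sh", "-c", "npm install && npm run dev -- --host",
    ]


def start_sandbox(app_dir: str, port: int = 5173) -> None:
    """Start the container; the app is then reachable at localhost:<port>."""
    subprocess.run(build_run_cmd(app_dir, port), check=True)
```

Note the `-- --host` flag: Vite binds to localhost by default, so it must be told to listen on all interfaces for the published port to work from the host.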
Connect the pieces. Build a workflow where:
- You provide a natural language prompt describing a feature (e.g. "add due dates to todos with a date picker")
- An AI agent receives the prompt, plans the changes, and implements them inside the sandbox
- The modified app is accessible so you can see the result
Some things to think about: how does the agent interact with the sandbox? How does it recover from errors? Can it verify its own work? How do you submit prompts and see results? A CLI is fine, a web UI is cool but not required.
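One way to frame those questions is as a propose/apply/verify loop. The sketch below shows only the control flow; `propose_patch`, `apply_patch`, and `verify` are placeholders for your LLM client and sandbox wrapper, not a real API:

```python
# Sketch of an agent loop with a verify/retry cycle. The callables are
# placeholders: propose_patch wraps the LLM, apply_patch edits files in
# the sandbox, verify runs a check (build, lint, smoke test) and returns
# (ok, feedback).
from typing import Callable


def run_feature(prompt: str,
                propose_patch: Callable[[str, str], str],
                apply_patch: Callable[[str], None],
                verify: Callable[[], tuple[bool, str]],
                max_attempts: int = 3) -> bool:
    """Ask the model for a change, apply it, verify, and feed failures
    back into the next attempt so the agent can recover from errors."""
    feedback = ""
    for _ in range(max_attempts):
        patch = propose_patch(prompt, feedback)  # LLM call
        apply_patch(patch)                       # mutate the sandbox
        ok, feedback = verify()                  # e.g. `npm run build` output
        if ok:
            return True
    return False
```

Feeding `verify` output back into the next `propose_patch` call is what gives the agent a chance to self-correct rather than failing silently.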
Demonstrate your system by building at least 2 features via prompt. Pick whatever features you want.
Deliverable: a working demo where you type a feature description, the agent builds it in the sandbox, and you can see the before/after.
Make your agent record a video of itself using the app it just modified. It should open a browser, navigate to the todo app, interact with it (create todos, use the new features), and produce a video of the session.
Hint: agent-browser
Deliverable: a video showing the agent autonomously using the app it built.
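If you end up rolling your own instead of using the hinted tool, browser automation libraries with built-in video capture (e.g. Playwright) can get you most of the way. A hedged sketch, assuming Playwright and guessing at the todo app's selectors:

```python
def record_demo(url: str, out_dir: str = "videos") -> None:
    """Drive the app in a headless browser and save a video of the session.
    Requires `pip install playwright && playwright install chromium`.
    The input selector below is an assumption about the app's markup."""
    from playwright.sync_api import sync_playwright  # imported lazily

    with sync_playwright() as p:
        browser = p.chromium.launch()
        # record_video_dir enables session recording for all pages
        # created in this context.
        context = browser.new_context(record_video_dir=out_dir)
        page = context.new_page()
        page.goto(url)
        page.fill("input", "try the new feature")  # assumed selector
        page.press("input", "Enter")
        context.close()   # the video file is finalized on context close
        browser.close()
```

The video is only written out when the context closes, so make sure the agent closes it cleanly rather than killing the process.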
Agent fluency. How well do you use AI agents throughout the exercise? Do you direct them well? Do you know when to step in vs let them run?
Technical decisions. How do you evaluate and choose between options? We want to see reasoning, not just "I picked the first thing that worked."
Systems thinking. How do the pieces fit together? What would you change with more time?
Scrappiness. When something doesn't work, how fast do you pivot?
Communication. Can you explain what you built and why?
At the end of the day, you'll present for about 10 minutes followed by Q&A:
- Live demo of what you built, walking through each milestone you completed
- Architecture walkthrough: whiteboard the system
- Walk us through 2-3 key decisions and how you arrived at them; show agent traces if relevant
- What would you do differently? What was harder than expected?
- You have the full working day. Pace yourself.
- If you're stuck for more than 20 minutes, ask for help. We'd rather see how you unblock with guidance than watch you spin.
- You can use your own machine or an interview laptop. We can provide API keys and any sandbox provider credentials you need.
- Keep a running log (even just bullet points) of decisions, pivots, and things you tried. This is part of your deliverable.
Good luck!