This is an open-ended systems design and implementation challenge. There's no single correct architecture, and we don't expect you to finish every milestone in one day. We care about how you approach it: how you evaluate tools, how you use AI agents throughout, and the tradeoffs you make. We'll ask you to walk us through your thinking, so optimize for understanding and being able to explain your decisions, not just output.
You're encouraged to use AI coding agents (Claude Code, Cursor, Copilot, whatever you prefer) throughout this exercise. This isn't a test of whether you can write code from memory. We'll evaluate you on your understanding of the system you build and the decisions you made. Keep traces/logs of your agent interactions; we'll review them during the debrief.
Build a system that can autonomously develop and improve a web application using AI agents running in sandboxed environments. Think of it as a simplified version of Devin or Ramp's background agent, where you describe a feature in natural language and an agent builds it inside a sandboxed dev environment, then shows you the result.
We've provided a simple React todo app as the target application (see todo-app/).
```sh
cd todo-app
npm install
npm run dev
```

The app should be running at localhost:5173.
We'll provide API keys for LLM calls (Anthropic, OpenRouter). If you need access to other tools or services (sandbox providers, browser infra, etc.), just ask.
Work through these in order. Each builds on the previous one.
Get the provided todo app running locally. Understand how it works well enough to direct an agent to modify it. Poke around, add a todo, get familiar with the code.
If you'd prefer to build your own from scratch or swap in a different framework, go ahead, but don't spend too long here. The interesting work is in the next milestones.
This is the core infrastructure problem. Set up a system where an AI agent can:
- Run the todo app inside an isolated sandbox
- Make code changes (edit files, install packages)
- See the result, whether that's via a URL, port forwarding, screenshots, or something else
You choose the sandboxing technology. Research what's out there and pick what makes sense. Be ready to explain your reasoning during the debrief.
Deliverable: a working sandbox where the todo app boots, is accessible, and code changes are reflected.
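For reference, here is a minimal sketch of what one option could look like with Docker as the sandbox. The base image, mount paths, and wrapper names are illustrative assumptions, not a prescribed approach; a bind mount makes the agent's file edits visible inside the container, and publishing the dev-server port makes the app reachable from outside:

```python
# Sketch of a Docker-based sandbox wrapper (illustrative assumptions:
# node:20-slim image, /app workdir, Vite's default port 5173).
import subprocess


def build_run_cmd(app_dir: str, port: int = 5173) -> list[str]:
    """Build the `docker run` command: bind-mount the app so the agent's
    file edits are reflected live, and publish the Vite dev-server port."""
    return [
        "docker", "run", "--rm", "-d",
        "-v", f"{app_dir}:/app",  # live code changes via bind mount
        "-p", f"{port}:5173",     # expose the dev server on the host
        "-w", "/app",
        "node:20-slim",           # assumed base image
        "sh", "-c", "npm install && npm run dev -- --host",
    ]


def start_sandbox(app_dir: str, port: int = 5173) -> None:
    """Start the container; the app is then reachable at localhost:<port>."""
    subprocess.run(build_run_cmd(app_dir, port), check=True)
```

Note the `-- --host` flag: Vite binds to localhost by default, so it must be told to listen on all interfaces for the published port to work from the host.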
Connect the pieces. Build a workflow where:
- You provide a natural language prompt describing a feature (e.g. "add due dates to todos with a date picker")
- An AI agent receives the prompt, plans the changes, and implements them inside the sandbox
- The modified app is accessible so you can see the result
Some things to think about: how does the agent interact with the sandbox? How does it recover from errors? Can it verify its own work? How do you submit prompts and see results? A CLI is fine, a web UI is cool but not required.
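One way to frame those questions is as a propose/apply/verify loop. The sketch below shows only the control flow; `propose_patch`, `apply_patch`, and `verify` are placeholders for your LLM client and sandbox wrapper, not a real API:

```python
# Sketch of an agent loop with a verify/retry cycle. The callables are
# placeholders: propose_patch wraps the LLM, apply_patch edits files in
# the sandbox, verify runs a check (build, lint, smoke test) and returns
# (ok, feedback).
from typing import Callable


def run_feature(prompt: str,
                propose_patch: Callable[[str, str], str],
                apply_patch: Callable[[str], None],
                verify: Callable[[], tuple[bool, str]],
                max_attempts: int = 3) -> bool:
    """Ask the model for a change, apply it, verify, and feed failures
    back into the next attempt so the agent can recover from errors."""
    feedback = ""
    for _ in range(max_attempts):
        patch = propose_patch(prompt, feedback)  # LLM call
        apply_patch(patch)                       # mutate the sandbox
        ok, feedback = verify()                  # e.g. `npm run build` output
        if ok:
            return True
    return False
```

Feeding `verify` output back into the next `propose_patch` call is what gives the agent a chance to self-correct rather than failing silently.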
Demonstrate your system by building at least 2 features via prompt. Pick whatever features you want.
Deliverable: a working demo where you type a feature description, the agent builds it in the sandbox, and you can see the before/after.
Make your agent record a video of itself using the app it just modified. It should open a browser, navigate to the todo app, interact with it (create todos, use the new features), and produce a video of the session.
Hint: agent-browser
Deliverable: a video showing the agent autonomously using the app it built.
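If you end up rolling your own instead of using the hinted tool, browser automation libraries with built-in video capture (e.g. Playwright) can get you most of the way. A hedged sketch, assuming Playwright and guessing at the todo app's selectors:

```python
def record_demo(url: str, out_dir: str = "videos") -> None:
    """Drive the app in a headless browser and save a video of the session.
    Requires `pip install playwright && playwright install chromium`.
    The input selector below is an assumption about the app's markup."""
    from playwright.sync_api import sync_playwright  # imported lazily

    with sync_playwright() as p:
        browser = p.chromium.launch()
        # record_video_dir enables session recording for all pages
        # created in this context.
        context = browser.new_context(record_video_dir=out_dir)
        page = context.new_page()
        page.goto(url)
        page.fill("input", "try the new feature")  # assumed selector
        page.press("input", "Enter")
        context.close()   # the video file is finalized on context close
        browser.close()
```

The video is only written out when the context closes, so make sure the agent closes it cleanly rather than killing the process.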
Agent fluency. How well do you use AI agents throughout the exercise? Do you direct them well? Do you know when to step in vs let them run?
Technical decisions. How do you evaluate and choose between options? We want to see reasoning, not just "I picked the first thing that worked."
Systems thinking. How do the pieces fit together? What would you change with more time?
Scrappiness. When something doesn't work, how fast do you pivot?
Communication. Can you explain what you built and why?
At the end of the day, you'll present for about 10 minutes followed by Q&A:
- Live demo of what you built, walking through each milestone you completed
- Architecture walkthrough: whiteboard the system
- Walk us through 2-3 key decisions and how you arrived at them; show agent traces if relevant
- What would you do differently? What was harder than expected?
- You have the full working day. Pace yourself.
- If you're stuck for more than 20 minutes, ask for help. We'd rather see how you unblock with guidance than watch you spin.
- You can use your own machine or an interview laptop. We can provide API keys and any sandbox provider credentials you need.
- Keep a running log (even just bullet points) of decisions, pivots, and things you tried. This is part of your deliverable.
Good luck!