<EPISODE 1> Overall Theme: The Dawn of Agentic AI in Software Development
The first episode of “Raising an Agent” introduces listeners to Thorsten Ball and Quinn Slack’s journey as they build an AI-powered coding assistant from the ground up. In this inaugural session, they share their excitement over the agent’s surprisingly autonomous problem-solving, explain the foundational “inversion of control” mindset that lets the model orchestrate tasks with minimal human prompting, and highlight the emergent behaviors (like self-debugging and tool use) that turned their rough prototype into a capable collaborator. Throughout, they offer a candid look at both the triumphs and the early challenges of wiring together language models, developer tooling, and rich feedback loops, setting the stage for a deeper exploration of what it means to treat an LLM not just as a generator of text but as an autonomous software engineer.
1. Podcast & Project Overview:
- Podcast Name: "Raising an Agent."
- Hosts: Quinn Slack (CEO of Sourcegraph) and Thorsten Ball (Software Engineer at Sourcegraph).
- Format: A limited-run special edition podcast acting as a "diary of excitement" to share their journey and learnings while hacking on a new AI-fueled code editing tool.
- Project Status: They've been hacking for 2-3 weeks on this "agentic tool." It's still a rough prototype, not production-ready, but already showing exciting capabilities and taking shape.
- Goal: To document their daily "holy shit" moments, discoveries, and the evolution of the tool, being open about successes and failures.
2. Core Excitement & Discoveries with the AI Agent:
- Amazing Model Capabilities: The underlying AI models (specifically mentioning Claude 3 Sonnet and Haiku) are incredibly capable, especially at tool use and understanding complex intent even from minimal information.
- Example: The agent correctly interpreted Quinn's intent for a refactoring task based on a diff that consisted mostly of single-character changes.
- Agentic Approach & "Inversion of Control":
- There's a significant mindset shift from traditional prompting (where the human meticulously crafts a detailed prompt to get specific code) to an agentic approach.
- Here, the human provides the agent with tools and high-level goals, and the agent figures out how to use the tools and orchestrate steps. Thorsten calls this an "inversion of control": "it's a big bird, it can catch its own food... you just have to present it with the food somehow." (A minimal sketch of such a loop follows this list.)
- Quinn describes it as the agent being able to do 90% of a refactoring task if the human does the first 10% and maybe the last 3%.
- Emergent Problem-Solving & Self-Correction:
- The agent demonstrates surprising levels of reasoning and problem-solving.
- Example 1 (Building a feature for itself): When Thorsten asked the agent to build the "recording feature" (which records user edits to be mimicked by the LLM), the agent not only generated the code but also provided a plan on how Thorsten should test it. It required only minor compiler error fixes.
- Example 2 (Self-debugging): When an edit tool call failed (due to sending a string instead of a JSON object, then an indentation issue), the agent tried again, then decided on its own to write a bash script to create a new file and move it over the old one to achieve the edit. When that failed (Null Pointer Exception), it then decided to add debug statements to its own generated code, re-ran it, got the output, and then fixed the bug.
- Unexpected Capabilities: The agent sometimes performs tasks or uses tools in ways the developers didn't explicitly design for but are highly effective.
- Example: Quinn asked the agent to build a new feature, and it also performed useful refactoring along the way, a capability he hadn't seen in other tools.
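To make the "inversion of control" idea above concrete, here is a minimal sketch of such a loop, assuming the Anthropic messages API with tool use. The tool set, the dispatchTool helper, and the model id are illustrative, not the hosts' actual implementation:

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { readFile } from "node:fs/promises";

const client = new Anthropic();

// Tool definitions: the human supplies capabilities, not step-by-step instructions.
const tools: Anthropic.Tool[] = [
  {
    name: "read_file",
    description: "Read the contents of a file at a relative path.",
    input_schema: {
      type: "object",
      properties: { path: { type: "string" } },
      required: ["path"],
    },
  },
  // ...edit_file, search, run_terminal_command, etc.
];

// Minimal dispatcher for the tools above.
async function dispatchTool(name: string, input: any): Promise<string> {
  if (name === "read_file") return readFile(input.path, "utf8");
  throw new Error(`unknown tool: ${name}`);
}

async function runAgent(goal: string): Promise<void> {
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: goal }];

  while (true) {
    const response = await client.messages.create({
      model: "claude-3-7-sonnet-latest", // placeholder model id
      max_tokens: 4096,
      tools,
      messages,
    });
    messages.push({ role: "assistant", content: response.content });

    // Inversion of control: the model decides whether and which tools to call.
    if (response.stop_reason !== "tool_use") break;

    const results: Anthropic.ToolResultBlockParam[] = [];
    for (const block of response.content) {
      if (block.type === "tool_use") {
        results.push({
          type: "tool_result",
          tool_use_id: block.id,
          content: await dispatchTool(block.name, block.input),
        });
      }
    }
    messages.push({ role: "user", content: results });
  }
}
```

Note there is no task-specific prompting in the loop itself: the model keeps requesting tools until it decides it is done.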
3. Key Architectural & Philosophical Learnings:
- The "Three Big Pieces": The core components of their agent are:
- Tools: Specific functionalities the agent can use (e.g., file editing, search, running terminal commands).
- The Model: The underlying Large Language Model (e.g., Claude 3 Sonnet).
- The Integration/Wiring: How the tools and model are connected and orchestrated. They found this part surprisingly less difficult than anticipated.
- Importance of Rich Feedback Loops: Instead of perfecting prompts, it's more effective to give the agent rich, iterative feedback (compiler errors, linter output, test results, diagnostics). The agent uses this feedback to refine its actions. (A sketch of collecting such feedback follows this list.)
- Opinionated Design Choices:
- Best Model, Not Model Selection: They plan to be opinionated and use what they determine to be the best model for the job, rather than offering users a model selector, to ensure the highest quality experience.
- Flexibility to Evolve: They're willing to "rip out" features (like manual checkpoints) if the agent becomes smart enough to handle those scenarios more natively (e.g., robust MCP tool usage or self-correction making explicit rollbacks less necessary).
- Developer-First Usability: They are building a tool that they (as experienced developers) want to use daily. The focus is on core utility and a great user experience for themselves first, with enterprise features (like SAML support) being secondary.
- Current AI Tooling is Suboptimal: Many existing AI coding tools present capabilities to LLMs in a generic, suboptimal way (e.g., one-size-fits-all system prompts). This doesn't leverage the full potential of specific models' strengths (like Claude's native tool use API).
- Git Commits as Rich Context: Giving the agent context from Git commits (which include commit messages, multi-file changes, test updates, and inherent relationships between files) is highly valuable.
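One way to picture the rich-feedback point above: after each agent action, run the project's real checkers and hand the raw output back to the model verbatim. A sketch, assuming a TypeScript project with tsc, ESLint, and Vitest on hand; the command set is illustrative:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const exec = promisify(execFile);

// After each agent edit, gather real diagnostics and hand them back,
// instead of trying to anticipate every failure in the prompt.
async function collectFeedback(): Promise<string> {
  const checks: [label: string, cmd: string, args: string[]][] = [
    ["typecheck", "npx", ["tsc", "--noEmit"]],
    ["lint", "npx", ["eslint", "."]],
    ["tests", "npx", ["vitest", "run"]],
  ];
  const feedback: string[] = [];
  for (const [label, cmd, args] of checks) {
    try {
      await exec(cmd, args);
      feedback.push(`${label}: ok`);
    } catch (err: any) {
      // Compiler, linter, and test failures become the agent's next input.
      feedback.push(`${label} failed:\n${err.stdout ?? ""}${err.stderr ?? ""}`);
    }
  }
  return feedback.join("\n\n");
}
```

The returned string would be appended to the conversation (for example, as a tool result), letting the agent self-correct against real diagnostics rather than a hand-tuned prompt.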
4. Practical Challenges & Future Potential:
- Cost and Latency: Current advanced models are relatively slow and can be expensive. Quinn mentioned their prototype usage cost ~$1000 in less than a month, though they'd happily pay that for their devs if the value is there.
- Future Scaling with Faster/Cheaper Models: If/when models become significantly faster (e.g., Cerebras offering 2000 tokens/sec) and cheaper, it could enable running thousands of agent "attempts" in parallel, using a fitness function to select the best outcome. This would dramatically improve robustness and quality. (A sketch of this best-of-N pattern follows this list.)
- Untapped Potential: They believe current model capabilities are far from fully tapped (Thorsten estimates <5% utilization) because of the way context is currently provided and interactions are structured.
- The "Criminal Situation": Quinn humorously states it's a "criminal situation" that every line of code isn't yet written with the assistance of an AI agent equipped with advanced tools like a perfectly instrumented time-travel debugger and 100ms full CI runs, because this is "just software engineering" away, not fundamental model research.
In essence, they are incredibly excited by the rapid progress and the emergent intelligent behavior they're seeing by treating the LLM less like a simple text generator and more like an autonomous agent that can be equipped with tools and learn from rich, iterative feedback. </EPISODE 1>
<EPISODE 2> Overall Theme: Exploring the Capabilities, Limitations, and Future of AI Coding Agents
The video is the second episode of "Raising an Agent," where Thorsten Ball (user/developer of the agent) and Quinn Slack (CEO of the company building the agent) discuss their experiences and insights from working with an AI coding agent prototype.
Key Learnings & Discussions:
- User Experience & "Addictiveness" of the Agent:
- Thorsten finds the AI agent highly "addicting." He attributes this to its ability to overcome the initial inertia or "blank page problem" in coding.
- He contrasts his usual coding process, which requires getting into a specific "state of mind" (likened to a "breakdancer's shuffle before a dance"), with the agent's ease of use. With the agent, he can simply write a "wishlist" in a text box, send it off, and the agent starts working.
- He prefers the agent generating some code, even if imperfect, over starting with a blank editor, as it provides a scaffold to iterate upon.
- Strengths and Effective Use Cases:
- Local, Well-Defined Tasks: The agent excels when tasks are well-defined and local to a specific component or file, especially for UI changes where a screenshot can be provided.
- Iterative Refinement & Learning: The agent can learn and improve. Thorsten described a scenario where he used an "eject handle" feature to dump the agent's "knowledge" (conversation history and reasoning) from a failed session into a markdown file (task.md). He then started a new agent, fed it this markdown file and the current code diff, and asked it to suggest improvements on the previous agent's work. This new agent was able to identify better approaches. (A sketch of this eject-and-resume flow follows this list.)
- No Token Limit "Magic": Both speakers agree that a key reason for the prototype's current effectiveness is the lack of aggressive optimization for token limits. This allows the agent to use more context, perform more internal reasoning steps, and self-correct, leading to better results. This contrasts with other AI tools that are heavily optimized for token efficiency (cost), which limits their capability.
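The eject-and-resume flow might look roughly like this sketch. The markdown format and the use of git diff for current state are assumptions, not the actual feature's implementation:

```typescript
import { readFile, writeFile } from "node:fs/promises";
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const exec = promisify(execFile);

// "Eject": persist the failed session's conversation and reasoning so a
// fresh agent can build on it instead of starting cold.
async function ejectToMarkdown(history: { role: string; content: string }[]) {
  const md = history.map((m) => `## ${m.role}\n\n${m.content}`).join("\n\n");
  await writeFile("task.md", md);
}

// Seed a brand-new session with the notes plus the current code diff.
async function resumeFromEject(): Promise<string> {
  const notes = await readFile("task.md", "utf8");
  const { stdout: diff } = await exec("git", ["diff"]);
  return [
    "A previous agent attempted this task. Its notes:",
    notes,
    "Current diff of the working tree:",
    diff,
    "Review the previous approach and suggest improvements.",
  ].join("\n\n");
}
```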
- Limitations and Failure Modes:
- Architectural Sense: The agent struggles with tasks requiring a broader "architectural sense" of the entire system or complex interactions between different parts of a codebase.
- Specific API Knowledge: It can fail when needing to use specific, complex APIs (like VS Code extension APIs) without explicit guidance. Thorsten gave an example where the agent tried to use DOM manipulation (document.querySelector) to interact with the VS Code editor, which is incorrect for that environment.
- "Going Off the Rails": Without proper guidance or understanding of the high-level architecture, the agent can produce incorrect or nonsensical solutions.
- Diagnosing Failures: When the agent fails, especially for users with less experience in the specific domain (e.g., VS Code extension development), it can be hard to understand why it failed and how to guide it correctly. This touches on the "uncanny valley" of AI performance.
- Test Quality: The agent can write tests that are superficial or don't properly verify the intended functionality (e.g., mocking the very thing it's supposed to test, or writing tests that pass even if the underlying code is commented out).
- Cost, Pricing Models, and Business Implications:
- Flat-Rate vs. Usage-Based: Quinn highlighted the difficulty of a "flat-rate pricing business" needing to serve both hobbyists (e.g., building "emoji games") and highly productive professional software developers. The current "magic" (due to high token usage) would be hard to maintain with aggressive cost optimization needed for a cheap flat rate.
- Value Proposition: Even if an AI-generated PR costs $5-$15 in tokens, it's significantly cheaper than a software developer's hourly rate, making it valuable for enterprises.
- Future of Pricing: They speculate that usage-based pricing or other models reflecting actual computation will become more common for powerful AI tools, moving away from simple $10-$20/month subscriptions, especially if the "no token limit" approach proves superior for complex tasks.
- Future Directions and Potential Enhancements:
- Higher-Level Tools: The idea of providing the agent with more sophisticated, encapsulated tools (e.g., an edit_file command that intrinsically handles diagnostics, formatting, and testing within that file) rather than just low-level file operations. (A sketch follows this list.)
- Multiple Agents / Supervisor Agent: Exploring architectures with multiple specialized agents (e.g., one for frontend, one for backend) or a "supervisor agent" that can orchestrate tasks. This introduces complexities like inter-agent communication and managing concurrent file modifications (perhaps using Git worktrees).
- Improving Agent "Autonomy" and "Agent-ness": The ability for the agent to run in loops, self-correct, and utilize tools without tight restrictions on turns or tokens is crucial for its "agent-like" behavior and effectiveness.
- Democratizing Complex Tasks: Thorsten gave an example of how he could give the agent an email bug report about his website, and the agent fixed it. This level of abstraction is powerful and could, for instance, allow non-technical users to make website changes via email or WhatsApp messages with annotated screenshots.
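The higher-level edit_file idea could look something like this sketch, which folds formatting and type-checking into the tool's result. Prettier and tsc stand in for whatever formatter and diagnostics a real implementation would use:

```typescript
import { writeFile } from "node:fs/promises";
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const exec = promisify(execFile);

// A higher-level edit tool: one call writes the file, formats it, and
// type-checks the project, returning diagnostics as part of the tool result.
async function editFile(path: string, contents: string): Promise<string> {
  await writeFile(path, contents);

  // Formatting failures are non-fatal; the edit already happened.
  await exec("npx", ["prettier", "--write", path]).catch(() => {});

  try {
    await exec("npx", ["tsc", "--noEmit"]);
    return `edited ${path}; no type errors`;
  } catch (err: any) {
    // The agent sees diagnostics immediately, without extra tool calls.
    return `edited ${path}; type errors:\n${err.stdout ?? err.stderr ?? ""}`;
  }
}
```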
- Developer Adoption and Mindset:
- Skepticism: Many developers are still skeptical or haven't fully grasped how to use these tools effectively.
- "Aha!" Moments: Personal experience and seeing the agent successfully solve a real problem in one's own environment is crucial for adoption.
- Trajectory Over Snapshot: It's important to see these tools as being on a rapid improvement trajectory rather than judging them solely on their current snapshot of capabilities.
- The "Steel Man" Approach: Users need to actively try to find the best ways to make the agent succeed ("steel man" its capabilities) to understand its true potential.
The conversation highlights a dynamic period of rapid prototyping and learning, where the focus is on exploring the upper bounds of AI agent capabilities before prematurely optimizing for cost, and figuring out the right human-agent interaction patterns. </EPISODE 2>
<EPISODE 3> Overall Theme: The Evolving Nature of Coding with AI Agents & The Importance of Rich Context and Feedback
In this episode, Thorsten Ball and Quinn Slack contrast the “ghost of a senior engineer” ideal with the real-world strengths and limits of today’s AI coding agents, highlighting how a paradigm shift—from hands-on typing to guiding sub-agents with clean context windows and rich feedback loops—reshapes developer workflows. They explore the critical role of curated tools, iterative diagnostics, and evolving codebases in empowering agents to act more autonomously and effectively.
- The "Ghost of a Senior Engineer" Ideal vs. Reality:
- Ideal: Thorsten posits the ultimate desire for a system where one can press a button, and a "ghost of a senior engineer" appears to answer any coding question perfectly. CEOs/CIOs would pay a lot for this.
- Reality: Quinn interjects that even senior engineers aren't perfect, and neither are current AI agents. However, the value is still immense even if the answers aren't always perfect.
- Current Agents: The AI agents they are building are a step towards this, but still have limitations (e.g., can be slow, don't always know if they're right or wrong).
- Developer Experience & Paradigm Shift in Coding:
- "Vibe Coders" vs. "Traditionalists": Thorsten notes a "culture war" where traditionalists are skeptical of AI-generated code because they don't fully understand its internals and fear unmaintainability or subtle bugs.
- Quinn's Personal Shift: Quinn, a lifelong coder, finds the AI agent is now writing ~85% of his code (acting as the "first drafter"), a massive change from perhaps 40% previously. He's still processing this "forever change" to his hobby/profession, finding it thrilling.
- Thorsten's "Laziness" and Mental Model Shift: Thorsten admits he's become "lazy" and now asks the agent to make even small, five-line changes. His mental approach has shifted from direct text manipulation to guiding/instructing the agent.
- The "Paint by Numbers" Metaphor: Thorsten describes his role as "drawing the lines" (defining the scope, architecture, and constraints) and letting the agent "fill in the colors" (the actual code). He needs to be confident in the lines he draws. As long as the agent stays within those lines, the specifics of the generated code are less critical if the outcome is correct.
- Understanding and Trusting AI-Generated Code:
- Taking the First Draft: Quinn emphasizes that developers can (and should) take the AI's output as a first draft and then invest time to understand it, especially for sensitive code (like authentication logic he worked on).
- The Cost of Human "Busy Work": Both agree that a lot of human coding time is spent on "toil" or "busy work" (formatting, imports, boilerplate, looking up syntax, minor refactoring). The agent significantly reduces this.
- Quinn's point: Changing even a single line of code involves a lot of process overhead (tickets, branches, stashing, builds, CI, PRs, reviews). An agent doing this asynchronously is highly appealing.
- Thorsten's Example: The agent refactored multiple markdown blocks in a blog post into a reusable component, handling colorization and formatting into an array of strings—a tedious task he could have done but was happy to offload.
- The Power of Context and Feedback Loops:
- Context Window is Sacred: Thorsten stresses that whatever is in the agent's context window heavily biases its output. Irrelevant or misleading information (like old, ignored migration files) can derail it.
- Sub-Agents for Focused Context: The "search agent" (an agent called by the main agent) was created to have its own context window. This prevents the main agent's context from being "dirtied" by the search agent's potentially noisy intermediate steps. The search agent does what a human would (keyword search, find plausible files, open them, glob for related files, list directories) and returns a concise report to the main agent. (A sketch follows this list.)
- Rich Feedback vs. Prompt Engineering: Instead of perfecting prompts, providing rich, iterative feedback (compiler errors, test results, diagnostics) is more effective for guiding the agent.
- "Knowing Better than the Model": Quinn highlights the ongoing challenge of figuring out when the human (or a deterministic tool) knows better than the LLM, and when to let the LLM take the lead. The sub-agent for search is an example where a more structured approach (human-defined tools for the sub-agent) is currently better.
- Improving Feedback Loops: The goal is to make feedback loops (tests, static analysis, browser rendering, deployment logs) more reliable and granular so the agent can use them effectively. This includes making the agent better at constructing and running the correct, minimal set of tests.
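The search sub-agent pattern, sketched under assumptions: runConversation and searchTools are hypothetical helpers standing in for a second agent loop with its own message history:

```typescript
// Hypothetical: the main agent calls this as a tool; the search runs as a
// second agent with its own message history, and only the final report
// crosses back into the caller's context.
declare function runConversation(
  messages: { role: "user" | "assistant"; content: string }[],
  tools: unknown[],
): Promise<string>; // assumed helper: a tool-use loop like the main agent's
declare const searchTools: unknown[]; // keyword search, open file, glob, list dir

async function searchSubAgent(query: string): Promise<string> {
  // Fresh context window: none of the main agent's history is carried over.
  const messages = [
    {
      role: "user" as const,
      content:
        `Find code relevant to: ${query}\n` +
        `Use keyword search, open plausible files, glob for related files, ` +
        `list directories, then write a concise report of files and symbols.`,
    },
  ];
  // Intermediate noise (every grep hit, every opened file) stays in here;
  // only the distilled report reaches the main agent.
  return runConversation(messages, searchTools);
}
```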
- Future of Codebases and Developer Tooling:
- Codebases Will Adapt to Agents: Quinn hypothesizes that codebases and build systems will change to better accommodate AI agents. The incentive to create an agent-friendly environment is high because agents can potentially provide massive productivity gains.
- Curated Tools are Key: Thorsten believes a curated set of well-tuned tools is more effective than thousands of generic tools for an agent.
- Defining "Code Search": The meaning of "code search" is evolving. It's not just TF-IDF or keyword search. The agent they built performs a multi-step research process, much like a human developer trying to understand a concept.
- Transparency vs. Abstraction: There's a tension. While some users want to "just get the code change" without seeing the agent's process, the Sourcegraph team (as developers of the agent) finds it crucial to see the agent's "work" (the tools it used, the reasoning) to understand its behavior, debug it, and build trust.
- Social and Adoption Aspects:
- Serendipitous Discovery via Shared Threads: Quinn compares sharing agent conversation threads to the way WIP (Work-In-Progress) Pull Requests in Git fostered collaboration and serendipitous discovery. Seeing how others successfully prompt and use the agent (social proof) is vital for wider adoption and learning.
- Overcoming Skepticism: Many developers are still learning how to effectively use these tools or are skeptical. The "aha!" moment often comes from seeing it solve a real problem in their own context.
In essence, they are building a system where the AI agent isn't just a code generator but a more autonomous entity that uses tools, reasons about problems, and benefits from rich, iterative feedback. This requires rethinking developer workflows, the tools provided to agents, and even the structure of codebases to maximize the agent's effectiveness. </EPISODE 3>
<EPISODE 4> Overall Theme: The Shifting Value of Code and the Evolving Role of Developers in an AI-Assisted World
In this episode, Thorsten Ball and Quinn Slack delve into how AI coding agents are fundamentally altering the perception and value of code itself, moving from a focus on meticulously crafted lines to a higher-level abstraction of intent and outcomes. They discuss the implications for developer tooling, open-source practices, and the very nature of a developer's skill set, suggesting a future where "cheaper" code generation doesn't diminish the need for expertise but rather redefines it towards guiding sophisticated AI systems and managing complex feedback loops.
1. AI-Generated Code vs. Traditional Libraries & Manual Effort:
- The "Colorize" Function Example (Thorsten): AI can generate simple utility functions, like a "colorize" function for CLI output, in less than a second.
- The "Sonar" Component Example (Quinn): For a UI component in Amp (a notification for "link copied to clipboard"), existing libraries were overly complex, incompatible, or required different styling. The AI agent wrote a very simple, exact SvelteKit component that was "way simpler and better" than available libraries. This illustrates that for small, well-defined tasks, AI can produce more tailored and efficient solutions than hunting for or adapting a library.
- Challenging Library Reliance: This capability raises the question: why pull in a library or spend time finding the perfect one if an AI can generate the needed functionality directly and quickly, tailored to the specific context?
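For scale, here is roughly what such a generated utility looks like, using standard ANSI escape codes; the exact function the agent wrote isn't shown in the episode, so this is illustrative:

```typescript
// The kind of one-off utility an agent produces in under a second:
// wrap a string in standard ANSI escape codes for colored terminal output.
const ANSI = { red: 31, green: 32, yellow: 33, blue: 34 } as const;

function colorize(text: string, color: keyof typeof ANSI): string {
  return `\x1b[${ANSI[color]}m${text}\x1b[0m`;
}

console.log(colorize("build passed", "green"));
console.log(colorize("3 warnings", "yellow"));
```

A dozen lines like these are often simpler to generate in place than to find, vet, and depend on a library for.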
2. The Changing Nature and Value of Code:
- Code is Getting "Cheaper" (Thorsten): The effort to produce functional code is decreasing. This means the way developers treat and value individual lines of code will change. It's less about the "preciousness" of hand-crafted code and more about the outcome.
- Spectrum of Code Quality (Thorsten): Code exists on a spectrum from beautifully handwritten and formatted to large, autogenerated, less-readable files (e.g., a 5000-line C file). AI will push more code towards the "generated" end of the spectrum, but it's generated by an agent (and modifiable by an agent) rather than being purely static/deterministic.
- The Value of Code on Average Will Change: The focus shifts. If code is easily generated, its intrinsic value as a manually created artifact decreases, while the value of the intent and the system it builds increases.
- Redefining "Bad Code" (Quinn): Traditionally, "bad code" is concerning because it implies a human misunderstanding that could be repeated, wasting significant human time and effort. If an AI generates "bad" or suboptimal code from a quick prompt for a non-critical, well-encapsulated part, and the feature still works, the cost/benefit might be acceptable. The "badness" is more random and less indicative of a persistent human flaw. The concern shifts to the agent making mistakes, but these are different in nature.
- Abstraction Level Shift (Thorsten): Worrying about minute details like camel case vs. kebab-case or exact function names becomes less critical because developers operate at a higher level of abstraction, guiding the agent. It's not "vibe coding" (ignoring quality) but a different way of interacting with code creation.
3. Evolving Developer Experience and Workflow:
- From Typing to Guiding (Thorsten & Beyang): Coding is becoming less about direct text manipulation and more about instructing or guiding an AI agent. Beyang's insight: "This is not just pair programming, this is us watching another thing write the code."
- Thorsten's "Laziness": He now asks the agent to perform even small refactors or code movements.
- Cursor Tab Example (Thorsten): Using an AI feature that suggests completing lines or navigating (like Cody's "cursor tab" for auto-edits) makes meticulous typing and Vim macro skills less central, as the AI handles much of the mechanical toil.
- Feedback Loops as First-Class Concepts (Quinn): The agent should be aware of and utilize feedback loops (tests, browser rendering, diagnostics). The rules file for an agent should specify how to run tests for a given part of the codebase. (A sketch of rules-driven test running follows this list.)
- The "Paint by Numbers" Analogy (Thorsten): The developer's role shifts to "drawing the lines" (defining scope, architecture, constraints) and letting the agent "fill in the colors" (generate the code).
- Embracing Iteration and Imperfection: The agent doesn't need to be perfect on the first try. If it's 70% correct and can be retried quickly (especially with faster models like Cerebras potentially offering 1000-2000 tokens/sec), it's still more efficient than manual coding, especially if loops can validate against language servers.
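The rules-file idea might be wired up like this sketch. The AGENT.md line format shown ("test <path-prefix>: <command>") is invented for illustration; the point is that the agent consults a declared, per-path test command and feeds the output back into its loop:

```typescript
import { readFile } from "node:fs/promises";
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const exec = promisify(execFile);

// Invented rules-file format, e.g. lines in AGENT.md such as:
//   test client/: npx vitest run --dir client
//   test server/: go test ./...
async function runTestsFor(changedPath: string): Promise<string> {
  const rules = await readFile("AGENT.md", "utf8");
  for (const line of rules.split("\n")) {
    const m = line.match(/^test (\S+): (.+)$/);
    if (m && changedPath.startsWith(m[1])) {
      const [cmd, ...args] = m[2].split(" ");
      try {
        const { stdout } = await exec(cmd, args);
        return stdout; // passing output still goes back to the agent
      } catch (err: any) {
        return err.stdout ?? err.stderr ?? String(err); // failures feed the loop
      }
    }
  }
  return `no test rule found for ${changedPath}`;
}
```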
4. Impact on Tooling and Code Ecosystem:
- Future of Code Hosts (Quinn & Thorsten): If AI generates the majority of code, the role of code hosts like GitHub might change.
- Commits might matter less; the transcript of why the code was made (the agent's conversation/thread) could become more important.
- Code hosts might become places to store not just static code but also the agents and prompts that generate/maintain it. They might become more like "calling your codebase" and having an AI edit it.
- Open Source Value Proposition (Thorsten & Quinn):
- Historically, open source made code "free" (reducing development cost) but introduced search, quality, and comprehensibility costs.
- AI-generated code is even cheaper, has a lower search cost, and potentially greater variety. This could be the "next open source."
- This raises questions about the value of things like the GitHub contribution chart if contributions are increasingly AI-assisted or generated.
- Developer Tooling is Abstract Text Transformation (Thorsten): A lot of dev tooling is fundamentally about generating or transforming code/text. LLMs are powerful text transformation engines.
- Static Site Generators (SSG) Example (Thorsten): SSGs are built to turn a specific input (e.g., Markdown) into a specific output (HTML). With an LLM, the input could be arbitrary (handwritten notes, photos, drawings, emojis, videos) and still produce the desired HTML. This changes the constraints and possibilities. (A sketch follows this list.)
- Demystifying and Guiding AI (Quinn): It's crucial to build tools that allow developers to understand when the AI fails and how to "nudge" it. Raw demos showing these failures and corrections are valuable. This involves building in guardrails and feedback mechanisms.
- Example: If an agent is asked to make a SvelteKit page shareable without authentication and it mistakenly clobbers key framework files, the system should ideally detect this class of error or allow the user to easily guide it away from such mistakes.
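To make the "arbitrary input" point concrete, here is a hypothetical SSG renderer where the LLM replaces the Markdown parser. It assumes the Anthropic messages API; the model id and prompt are illustrative:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// An SSG "renderer" whose input format is arbitrary: the LLM replaces the
// Markdown parser, so handwritten notes, emoji, or transcripts all work.
async function renderPage(rawInput: string): Promise<string> {
  const response = await client.messages.create({
    model: "claude-3-7-sonnet-latest", // placeholder model id
    max_tokens: 4096,
    messages: [
      {
        role: "user",
        content:
          "Convert the following content into a standalone HTML page. " +
          "Output only HTML.\n\n" +
          rawInput,
      },
    ],
  });
  const block = response.content[0];
  return block.type === "text" ? block.text : "";
}
```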
5. The Evolving Skill Set of a Software Engineer:
- Shift to Product Focus (Thorsten): The emphasis will likely be more on understanding the product, the problem to be solved, and guiding the AI at a higher level, rather than on the minutiae of code syntax.
- Expertise in Guiding AI (Quinn): While the specific skills may change, there will always be a skill set that allows a developer to make an LLM's output significantly more valuable. This new composite skill set will likely still be called "software engineering."
- Efficiency Over Craftsmanship (for some tasks) (Thorsten): While the joy of crafting code exists, the need for efficiency and not wasting time often trumps the desire to manually write every line, especially for repetitive or tedious tasks.
In summary, the conversation paints a picture of a rapidly approaching future where AI agents are not just tools but collaborators, fundamentally changing how code is created, valued, and managed, and shifting the developer's role from a primary typist to a high-level guide and system architect. </EPISODE 4>
<EPISODE 5> Overall Theme: Evolution of AI Coding Agents and Their Impact on Development
In this episode of "Raising an Agent," the focus is on how the underlying models are evolving and how those developments are changing AI coding agents in the real world. The three participants are Beyang (moderator) plus Quinn and Thorsten, who are building an AI-powered coding assistant.
Key Learnings and Discussions:
- The Moat Debate:
- The speakers discuss the notion that there is no "moat" or distinct advantage in AI coding agents because their core architecture is fundamentally the same.
- Thorsten argues that while the basic structure (tools, model, integration) is transparent, the difference lies in the details: tuning, curated tool selection, and handling of failure modes.
- Quinn observes that many perceive the agent's capabilities as stemming from a "secret sauce" in the prompt design, when in reality, much of the effectiveness comes from the underlying model and its training data.
- Thorsten highlights that the tools and the prompts influence how the models will evolve.
- User Experience and Target Audience:
- Beyang points out that while the speakers have "published their secrets," not everyone can successfully replicate their agent's performance.
- Quinn explains that this is due to the significant influence of the model and its training data. Developers trying to replicate the agent may lack the same training data or use models trained on generic tools and not have the same tight tool integration.
- Thorsten describes the target audience for their agent as experienced developers who want an AI "power tool." The focus is on enhancing the workflow of someone who already understands how to build and reason about complex systems, rather than trying to build a "one-line prompt" tool that does everything perfectly.
- Model Selection, Customization, and Scalability:
- Beyang raises the question of whether offering users the ability to choose models would improve the product.
- Thorsten believes the better approach is a well-curated set of tools tuned for a specific model, rather than diluting model-specific advantages and adding cognitive overhead with a model selector.
- They're experimenting with Gemini as a replacement for Claude 3.7, but believe its strengths lie in different areas (e.g., better handling of instructions, but less aggressive about modifying files).
- The discussion touches upon cost and scalability. For large organizations with extensive teams, the cost of AI assistant usage could become significant even with a 10-20% cost premium over a human developer.
- They suggest that the development of faster and cheaper models will eventually enable running thousands of agents in parallel and selecting the best output, thereby improving quality and efficiency.
- Emerging Workflows and Feedback:
- Beyang observes that users are developing more sophisticated workflows with agents, such as using multiple agents, orchestrating agents, and leveraging multi-file operations (like refactoring).
- Quinn and Thorsten discuss how they themselves now use multiple agents, sometimes delegating subtasks to specialized agents (e.g., a search agent that reports back to the main agent), and how this is pushing them to rethink the user interface and how developers interact with the agent.
- One user's "aha moment" involved building an MCP server that could interact with Emacs using textual commands, demonstrating the flexibility and potential for integrating AI with other tools.
- Another user mentioned breaking down a large, monolithic file into smaller parts to improve performance and debuggability. This aligns with their initial decision to give the agent granular access to individual components and files.
- The need for rich, explicit feedback is emphasized. Users need to understand that they can't rely on the LLM to infer their intentions from limited prompts; instead, they should provide clear instructions and feedback (including error messages, test output, and diagnostics) to guide the agent's behavior.
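A minimal sketch of such an Emacs bridge, assuming the official TypeScript MCP SDK and a running Emacs server reachable via emacsclient; the tool name and behavior are guesses at what that user built:

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const exec = promisify(execFile);

// An MCP server exposing one tool: evaluate Elisp in a running Emacs
// (started with `M-x server-start`) via the emacsclient CLI.
const server = new McpServer({ name: "emacs-bridge", version: "0.1.0" });

server.tool(
  "emacs_eval",
  "Evaluate an Elisp expression in the running Emacs and return the result.",
  { expression: z.string() },
  async ({ expression }) => {
    const { stdout } = await exec("emacsclient", ["--eval", expression]);
    return { content: [{ type: "text" as const, text: stdout }] };
  },
);

await server.connect(new StdioServerTransport());
```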
In summary, this episode emphasizes that while the core architecture of AI coding agents may be transparent, the real value lies in the details of model selection, tool optimization, and crafting the right workflows and feedback loops. The speakers also highlight the importance of developer education and managing expectations during this period of rapid model evolution and changing developer habits. </EPISODE 5>
<EPISODE 6> Overall Theme: The conversation focuses on the evolving philosophies and practical application of AI coding agents, particularly highlighting the impressive capabilities of the latest Anthropic model ("Sonnet 4" or "Claude 4"). Key topics include agent design, Reinforcement Learning (RL) implications, effective human-agent workflows, the role of background agents, and a comparison of different agent environment strategies (like full build environments vs. CI feedback, and the analogy to Cloud IDE adoption).
Key Learnings & Discussions in Detail:
- Diverging Philosophies in AI Agent Design & Model Training:
- Anthropic's "Practical Iterative Agent" ("Claude Sonnet 4"): Thorsten identifies Anthropic's approach as fostering an agent that excels at "figuring stuff out" through iteration and environmental feedback. This is a practical coding partner that can tackle problems by trying, failing, and learning, rather than just a one-shot app generator.
- Benefits of Iteration & Environmental Feedback: Quinn strongly endorses this, deeming it "strictly better." It leads to superior results and is potentially more sustainable business-wise, even with higher inference costs for longer interactions.
- Reinforcement Learning (RL) and Model "Grain":
- The specific capabilities and behavioral tendencies of an LLM (its "grain") are shaped by intentional choices during pre-training, fine-tuning, and RL.
- Anthropic's RL strategy appears to cultivate agents that are more dynamic, environment-reactive, and adept at iterative tool use to solve problems. This differs from models potentially optimized more for zero-shot, single-turn generation.
- The hosts suggest that major AI labs (Anthropic, Google, OpenAI) have distinct philosophies embedded in their models, influencing how their agents perform and what interaction styles are most effective.
- Thorsten's Evolving Coding Workflow with the Agent (Amp):
- He has moved from letting the agent "rip" on entire features to a more collaborative model:
- Agent implements a rough version of Thorsten's architectural idea.
- Thorsten manually refines and "moves the guardrails," a nuanced process hard to capture in a single prompt.
- The agent is then used for more focused tasks like UI components (Storybook) or type fixing.
- Quinn's Focus & Amp Development Insights:
- Waitlist Feedback & Core Development: Prioritizing bug fixes and performance alongside new features based on user feedback.
- Future: Background Agents:
- Concept: Enabling users to delegate longer-running, complex tasks (e.g., 10-15+ minutes) to an agent that works asynchronously, without requiring constant supervision. This allows users to, for example, kick off a task from their phone (like from their kid's soccer game) and get results later.
- Feedback Mechanism for Background Agents: The primary feedback loop for these agents would be Continuous Integration (CI). The agent pushes code, CI runs (tests, linters, builds), and the agent uses this pass/fail/diagnostic output to iterate and improve. (A sketch of this outer loop follows this list.)
- Practicality: This is seen as more practical than trying to perfectly replicate a user's full, complex local development environment in a cloud sandbox for every background task. Leveraging existing CI is more scalable and often already in place.
- Latency Tolerance: The asynchronous nature of background agents makes the latency of CI runs more acceptable.
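The CI-as-feedback-loop idea might look like this outer loop, sketched with the GitHub CLI (gh); the exact flags and error handling are assumptions, and feedBackToAgent is a hypothetical stand-in for the agent's next turn:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const exec = promisify(execFile);

declare function feedBackToAgent(feedback: string): Promise<void>; // next agent turn

// One iteration of a background agent's outer loop: push the branch, let the
// repo's existing CI run, and treat pass/fail output as the feedback signal.
async function backgroundIteration(branch: string): Promise<boolean> {
  await exec("git", ["push", "origin", branch]);

  try {
    // `gh pr checks --watch` waits for checks to finish and exits non-zero
    // if any fail; CI latency is acceptable because nobody is watching live.
    await exec("gh", ["pr", "checks", branch, "--watch"]);
    return true; // CI green: ready for human review
  } catch (err: any) {
    await feedBackToAgent(`CI failed:\n${err.stdout ?? err.stderr ?? ""}`);
    return false; // the agent iterates and pushes again
  }
}
```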
- Impressions and Capabilities of Claude Sonnet 4:
- Enhanced End-to-End Feature Generation & Complexity Handling: Significantly better at complex, multi-file tasks. Thorsten can now assign it more ambitious tasks.
- Superior Tool Use (especially Sub-Agents/Task Tool): The model is more "eager" and effective at using tools, particularly the "Task tool" which spawns sub-agents for parallelizable or decomposable work.
- YAML Frontmatter Editing Example: It intelligently used glob and then spawned four sub-agents to edit ~36 blog posts, distributing the workload. (A sketch of this fan-out pattern follows this list.)
- Benefit of Sub-Agents: This approach is powerful because each sub-agent operates within its own fresh context window. The main agent doesn't get overwhelmed by the context of processing all 36 files; it only needs to manage the sub-tasks.
- Improved Iteration & Problem Solving: Shows better ability to recover from errors and adapt based on environmental feedback.
- Summaries & Emojis: Produces summaries with clickable links for citation (a feature by Camden), which is useful despite the model's occasional use of emojis.
- Agent Environments: Full Build vs. CI vs. Cloud IDEs:
- Intentional Model Training: Quinn emphasizes that a model's ability to use tools or exhibit certain behaviors is an intentional choice by the model creators, baked in during training and post-training. Understanding this "grain of the model" is crucial.
- Full Build Environments (Devin, OpenAI Codex, Google Jules): These agents often run in a dedicated VM or container with a full build environment, allowing immediate command execution and feedback.
- CI as Feedback Loop (Anthropic/Amp Background Agents): This approach is lighter weight. The agent uses the existing CI system.
- Cloud IDEs – An Analogy for Complexity & Adoption:
- Quinn draws a parallel to Cloud IDEs. In theory, Cloud IDEs offer a perfect, consistent environment accessible anywhere.
- However, in practice, their adoption has been limited. Only tech giants like Meta and Google have successfully implemented them at scale internally, requiring massive investment.
- For most, Cloud IDEs often fall short due to a "long tail" of issues: missing extensions, flaky language servers, incompatibility with local tools. They're "never quite good enough."
- The difficulty lies in maintaining perfect parity across local, CI, and Cloud IDE environments; the Cloud IDE often becomes the neglected third wheel.
- Implication for Agents: If replicating dev environments perfectly is hard even for Cloud IDEs, it's an even bigger challenge for AI agents to do so for every interaction. Therefore, using more abstract or existing feedback mechanisms like CI can be more pragmatic.
- Context Window Management: The sub-agent strategy is a key way to manage context window limitations for complex, multi-file tasks.
- User Feedback and Waitlist:
- The Amp waitlist feedback has been positive, confirming their direction towards simplicity, quality, and an unobtrusive agent.
- Users appreciate Amp's simplicity, its focus on being the "best agent" for the job (rather than offering model selectors), its emphasis on quality, and its ability to "get out of the way."
- Practical Tips for Working with AI Agents:
- Playwright MCP Server: Use a Playwright MCP (Model Context Protocol) server so the agent can take screenshots of the browser and iterate on UI changes.
- Storybooks for UI Iteration: Utilize Storybooks for UI development; even a simple static HTML page (like the one at https://ampcode.com/storybook) is effective.
- Authentication Bypass for Local Dev: Use an environment variable (e.g., USERNAME=auth_bypass) so the agent can navigate a local development application without handling complex auth flows. (A hypothetical sketch follows this list.)
- Seed Data Generation: Ask the agent (with psql access) to generate seed data for your development database when needed for testing.
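The auth-bypass tip might look like this in a SvelteKit app. A hypothetical sketch: the AUTH_BYPASS_USER variable and the shape of event.locals.user are assumptions about the app, not Amp's actual mechanism:

```typescript
// src/hooks.server.ts: dev-only escape hatch so the agent can drive the app
// without a real login flow. The `user` field on locals is assumed to be
// declared in the app's app.d.ts.
import type { Handle } from "@sveltejs/kit";

export const handle: Handle = async ({ event, resolve }) => {
  if (process.env.NODE_ENV === "development" && process.env.AUTH_BYPASS_USER) {
    // Pretend the bypass user is logged in; never enable this in production.
    event.locals.user = { name: process.env.AUTH_BYPASS_USER };
  }
  return resolve(event);
};
```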
In essence, the episode highlights the rapid advancements in AI coding agents, particularly with models like Claude 4 Sonnet. The key to unlocking their power lies in understanding their inherent capabilities ("grain"), providing rich environmental feedback, and designing agentic systems (like sub-agents) that can manage complexity and context effectively. The Sourcegraph team is focused on building a highly capable, iterative agent, and they value user feedback immensely in this process. </EPISODE 6>