This is the final report for Google DeepMind Summer of Code 2025! I enhanced the Gemini integration in Roo Code! Come and read about my work, the challenges I faced, and what I learned!

🚀 Enhancing Gemini Integration in Roo Code (GSoC 2025 — Google DeepMind)

Author: Ton Hoang Nguyen (Bill)
Mentor: Paige Bailey (@dynamicwebpaige)
Org/Project: Google DeepMind × Roo Code (Matt Rubens and Hannes Rudolph)
Period: May–Aug 2025
Focus: Better Gemini integration for developer tooling (VS Code extensions/CLI), with UX improvements, long-context handling, and explicit context caching.

Contact Details:


Table of Contents

0) Backstory

I submitted my GSoC proposal thinking I understood context engineering pretty well.

I'd read the papers. Could explain how attention mechanisms work. Had all these ideas about token optimization and caching strategies.

Honestly? I was probably a little overconfident.

When Google DeepMind accepted the proposal, Paige Bailey asked me which VSCode coding agent I wanted to work on. There were several options - some with bigger names, more obvious choices.

I picked Roo Code mostly because they seemed... different. They run these weekly podcasts where they actually talk to their community. The whole thing felt less corporate, more human. Plus I liked that users could customize everything - swap out models, change prompting styles, create custom modes.

It felt like the kind of place where you could actually build something meaningful instead of just checking boxes.

Then I had my first chat with Hannes Rudolph. The guy has this infectious enthusiasm - within five minutes he had me excited about features I hadn't even thought of yet. But also slightly terrified, because I realized this wasn't some side project. Roo Code has hundreds of thousands of users. Whatever I built would immediately be in real developers' hands.

So I did what I always do when I'm nervous about something - I dove way too deep into research. Spent weeks reading through GitHub issues, Discord conversations, listening to their podcasts. Trying to understand not just how the code worked, but what users actually needed.

That's when the theoretical knowledge started bumping up against reality.

Users weren't asking for elegant context optimization algorithms. They were frustrated that their AI responses contained outdated information. They were getting confused by model naming conventions that changed every few weeks. They were spending way too much money on token costs for basic coding tasks.

The gap between what I thought mattered and what actually mattered was... pretty humbling.

But that's also when the real work started.

1) What I Actually Accomplished

My previous GSoC (2024) with Joplin focused on summarization UX, compression strategies, and running ML locally. That experience translated well to this project’s emphasis on developer experience, prompt/context management, and cost/latency trade-offs.

This summer I helped 800,000+ developers (VS Code downloads) get better AI agent responses with Gemini by adding Gemini tools (Grounding with Google Search and URL Context), progressively migrating to the new Gemini naming conventions, and designing a caching system that cuts repeat token costs by 75% - turning $100 agent sessions into $25 ones.

a) Real-time Grounding with Google Search and URL Context

  • Impact: Roo Code users can now get AI assistance backed by live Google Search results and URL Context - directly in their editor.

  • The Problem: Developers were getting outdated or hallucinated information from AI coding assistants. No way to verify claims or get current information.

  • What I Built: Integrated Google Search grounding into Gemini's responses with reliable source citations. Solved the hard technical problem of maintaining citation accuracy during token streaming.

  • Result: The feature is now available to 800k+ developers. A user on Reddit praised the new grounding feature, and the Head of Dev Engineering at Google Maps Platform was glad it was finally implemented.

Shipped in PR #5959

LinkedIn post by Roo Code: Real-Time Research with Gemini

b) 75% Token Cost Reduction via Explicit Caching

  • Impact: Designed infrastructure to reduce repeat context costs from ~$100 to ~$25 for typical agent sessions.

  • The Problem: AI agents re-upload the same code files dozens of times across conversation steps. A single debugging session could cost $50-100 in input tokens alone.

  • What I Designed (a sketch of the invalidation rules follows below):

    • File/folder-level caching that uploads context once, reuses everywhere
    • Smart invalidation (file hash changes, TTL expiry, manual refresh)
    • Integration with existing @file and @folder mentions users already know
    • Batch upload support to amortize API costs

  • The Math: First upload = full cost. Every subsequent use = 75% discount. For a 20-step coding session, that's massive savings.

Status: Comprehensive design complete
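To make the invalidation rules concrete, here is a minimal sketch of how a cached entry could be checked before reuse. It is illustrative only - the CachedContext shape, the hashFile helper, and the one-hour TTL are assumptions for this example, not Roo Code's actual implementation.

import { createHash } from "node:crypto"
import { readFile } from "node:fs/promises"

// Hypothetical shape of a cached context entry (not Roo Code's real type).
interface CachedContext {
  cacheId: string   // handle returned by the Gemini explicit-caching API
  fileHash: string  // SHA-256 of the file contents at upload time
  createdAt: number // epoch ms, used for TTL expiry
}

const CACHE_TTL_MS = 60 * 60 * 1000 // assumed 1-hour TTL for the example

async function hashFile(path: string): Promise<string> {
  const contents = await readFile(path)
  return createHash("sha256").update(contents).digest("hex")
}

// A cached entry can be reused at the discounted rate only while it is fresh;
// a manual refresh simply drops the entry and re-uploads the context.
async function isCacheValid(path: string, entry: CachedContext): Promise<boolean> {
  if (Date.now() - entry.createdAt > CACHE_TTL_MS) return false // TTL expiry
  return (await hashFile(path)) === entry.fileHash              // file unchanged
}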

c) Gemini Model Confusion and Configuration Hell

  • Impact: Eliminated configuration errors and API failures from outdated model names.

  • The Problem: Developers were stuck with confusing preview model names like gemini-2.5-pro-preview-0215 that Google had deprecated. Led to failed API calls and support tickets.

  • What I Built: Progressive migration system that automatically maps legacy model IDs to current ones, on both the frontend and backend (a sketch follows below). Clean, predictable model selection.

  • Result: Zero breaking changes for existing users. Clean slate for new users.
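As a rough illustration of the migration approach - the mapping entries below are examples, not the real table from the PR:

// Illustrative legacy -> current mapping; the real entries live in Roo Code's model metadata.
const LEGACY_GEMINI_IDS: Record<string, string> = {
  "gemini-2.5-pro-preview-0215": "gemini-2.5-pro",     // example only
  "gemini-2.5-flash-preview-0417": "gemini-2.5-flash", // example only
}

// Resolve whatever is stored in a user's saved configuration to a current model ID,
// so existing settings keep working without a breaking change.
function resolveGeminiModelId(configuredId: string): string {
  return LEGACY_GEMINI_IDS[configuredId] ?? configuredId
}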

Approved in PR #5990

d) The Streaming Citations Race Condition

  • Impact: Eliminated citation state leakage and incorrect source attribution that was breaking user trust in grounded responses.

  • The Problem Was Deeper Than I Initially Thought:

    My first solution of "buffer citations until the end" worked... until it didn't. Users started reporting citations from previous conversations bleeding into new ones. Classic race condition hell.

  • Root Cause Discovery:

    • Gemini's streaming API sends grounding metadata in chunks mixed with content
    • My initial approach collected sources within the stream processing logic
    • State was persisting between messages, causing citation leakage
    • Race conditions meant sources could appear as separate messages
  • The Real Solution - Architectural Redesign:

    • Instead of trying to patch the race condition, I rethought the entire approach
    • Decoupled grounding metadata from content stream chunks
    • Moved source collection from gemini.ts to Task.ts (higher level)
    • Created new ApiStreamGroundingChunk type for clean separation in stream.ts
    • Prevented state persistence by handling sources independently
  • Technical Details:

// Old: Sources mixed with content in stream processing
yield { type: "text", text: content + sources }

// New: Separate grounding chunks
yield { type: "grounding", sources: extractedSources }
yield { type: "text", text: cleanContent }
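A minimal sketch of what the new chunk union could look like - the field names are illustrative and the real definitions in stream.ts may differ:

// Sketch of the stream chunk union; ApiStreamGroundingChunk is the new type, the rest is illustrative.
interface GroundingSource {
  title: string
  url: string
}

interface ApiStreamTextChunk {
  type: "text"
  text: string
}

interface ApiStreamGroundingChunk {
  type: "grounding"
  sources: GroundingSource[]
}

type ApiStreamChunk = ApiStreamTextChunk | ApiStreamGroundingChunk

// Task.ts can then collect grounding chunks per message and render the sources once
// the stream ends, so no citation state lives in the Gemini handler between messages.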
  • Result: Clean architecture that prevents state leakage. Citations appear consistently at the end of responses without bleeding between conversations.

  • Why This Matters: This wasn't just a bug fix - it was a fundamental improvement to how Roo Code handles multi-modal API responses. The pattern can be applied to other streaming APIs with metadata.

Fixed and Merged in PR #7434

2) How I Solved The Hard Technical Problems

a) The Streaming Citations Challenge

Problem: When AI responses stream token-by-token, citations can appear at random positions, breaking user trust.

My initial solution: Buffer citations until the end of the response, then append them as a clean Sources: section. This preserves the streaming UX while maintaining traceability.

Why This Matters: Grounded answers are only useful if they're trustworthy. Perfect inline placement during streaming was technically impossible; end-of-message sources were the pragmatic solution that preserved both UX and accuracy.
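A rough sketch of the end-of-message approach, with a chunk shape like the one sketched in section 1d (illustrative only; the original implementation buffered sources inside the Gemini stream handler, which is what later caused the race condition):

// Illustrative: accumulate text, hold sources aside, and append them once at the end.
type StreamChunk =
  | { type: "text"; text: string }
  | { type: "grounding"; sources: { title: string; url: string }[] }

async function renderGroundedResponse(stream: AsyncIterable<StreamChunk>): Promise<string> {
  let text = ""
  const sources: { title: string; url: string }[] = []

  for await (const chunk of stream) {
    if (chunk.type === "text") text += chunk.text
    else sources.push(...chunk.sources)
  }

  if (sources.length === 0) return text
  const footer = sources.map((s, i) => `[${i + 1}] ${s.title} - ${s.url}`).join("\n")
  return `${text}\n\nSources:\n${footer}`
}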

b) The Token Counting Accuracy vs Speed Dilemma

Problem: Local token counting (tiktoken) is fast but inaccurate. API counting is accurate but slow and expensive.

My Approach: I designed a hybrid strategy that uses initial API samples to calibrate a safety factor, then switches to local estimates with API checkpoints near thresholds (sketched below).
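A minimal sketch of the hybrid idea that was explored - countTokensViaApi and countTokensLocally stand in for the real Gemini countTokens call and the tiktoken estimate, and the calibration and 90% checkpoint values are assumptions for the example:

// Hypothetical hybrid counter: calibrate local estimates against a few API samples,
// then only fall back to the API when the estimate gets close to the context limit.
class HybridTokenCounter {
  private safetyFactor = 1.0

  constructor(
    private countTokensViaApi: (text: string) => Promise<number>, // accurate, slow, costs a request
    private countTokensLocally: (text: string) => number,         // tiktoken-style estimate, fast
  ) {}

  // Learn how far off the local estimate runs from a handful of API samples.
  async calibrate(samples: string[]): Promise<void> {
    const ratios = await Promise.all(
      samples.map(async (s) => (await this.countTokensViaApi(s)) / this.countTokensLocally(s)),
    )
    this.safetyFactor = Math.max(...ratios, 1.0)
  }

  // Cheap local estimate padded by the safety factor; re-check via the API near the limit.
  async count(text: string, contextLimit: number): Promise<number> {
    const estimate = Math.ceil(this.countTokensLocally(text) * this.safetyFactor)
    const nearLimit = estimate > contextLimit * 0.9 // assumed 90% checkpoint
    return nearLimit ? this.countTokensViaApi(text) : estimate
  }
}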

Outcome: After extensive review, team chose to stick with API-based counting for simplicity and determinism. My analysis informed the decision and clarified maintenance priorities.

Explored in PR #5418

3) What Made This Work

Real Community Impact

  • Discord users immediately started sharing workflows with the grounded search feature
  • GitHub issues closed because my fixes resolved ongoing pain points
  • Maintainers highlighted the integration publicly as a success story
  • Zero regression reports after major model naming migration

Technical Depth That Matters

  • Solved streaming race conditions that could break citation placement
  • Designed caching invalidation that balances cost savings with data freshness
  • Built progressive migration that handles legacy configurations gracefully
  • Created UX flows that work with existing user mental models (@file, @folder)

Great Collaboration with Roo Code team

  • 60+ comment review cycles on complex PRs - stayed constructive throughout
  • Aligned technical decisions with business priorities (cost vs accuracy)
  • Broke large changes into reviewable chunks
  • Documented edge cases and trade-offs for future maintainers

4) The Numbers That Matter

  • 800,000+ developers using Roo Code (the potential reach of my improvements)

  • 75% cost reduction for repeat context via the designed caching system

  • 3 major PRs merged or approved

  • 60+ review comments addressed across complex integrations

5) The Explicit Caching Deep Dive

Impact: Designed caching infrastructure that reduces large codebase costs by up to 93.6% through pre-batching and context reuse.

This is the big one. The feature that changes the economics of AI coding assistants.

Current State: Token Cost Hell

Right now, every time an AI agent needs to reference your codebase:

  • Upload entire file contexts again
  • Pay full input token costs again
  • Wait for upload latency again

For a typical debugging session touching 10 files across 20 agent steps, you're paying to upload the same code 200 times.

That's $100+ just in context uploads.

Solution: Cache Once, Reuse Forever

User Flow

  • Reference @folder src/ in chat
  • The system creates a cached context automatically
  • All future references use the cached version at a 75% discount
  • The cache auto-invalidates when files change

The Math That Changes Everything

Using Gemini's context caching + smart batching:

  • Sequential uploads (current): 30 files × full token cost = $7.50 per session
  • Cached uploads (my design): First upload full cost, reuse at 75% discount = $3.81 per session
  • Pre-batched caching (optimal): Upload once, batch efficiently = $0.48 per session

That's 93.6% cost reduction when you do it right.
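To make the repeat-discount arithmetic inspectable, here is an illustrative calculation; the dollar figures above depend on actual token counts and Gemini pricing, so the numbers in this sketch are placeholders:

// Illustrative only: how a 75% discount on repeat context compounds over a session.
// costFirst is the full price of uploading the context once.
function repeatSavings(costFirst: number, steps: number, reuseDiscount = 0.75) {
  const withoutCache = costFirst * steps
  const withCache = costFirst * (1 + (steps - 1) * (1 - reuseDiscount))
  return { withoutCache, withCache, saved: 1 - withCache / withoutCache }
}

repeatSavings(5, 20) // -> { withoutCache: 100, withCache: 28.75, saved: 0.7125 }
// As the number of steps grows, "saved" approaches the 75% repeat discount;
// pre-batching also shrinks the first-upload cost, which is where the larger
// end-to-end reductions cited above come from.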

Real-World Impact:

  • Developer debugging 30-file codebase: $100 → $6.40
  • Team iterating on large project: $500/day → $32/day
  • Agency serving clients: $2000/month → $128/month

Technical Architecture

Why This Is A Game Changer: This isn't just cost optimization. It's democratizing access to AI coding assistants.

Right now, only well-funded teams can afford to iterate quickly on large codebases. With 93.6% cost reduction, individual developers and small teams get the same superpowers.

The Distribution Strategy Insight

I researched existing implementations (OpenRouter, Requesty) and found gaps:

  • Requesty (based on a conversation with their founder in London): doesn't fully integrate Gemini features
  • OpenRouter: has caching but not file-level granularity
  • Opportunity: Roo Code can be first with native Gemini explicit caching

Implementation Path

  • Phase 1: File/folder caching with basic UI
  • Phase 2: Full project snapshots with batch optimization
  • Phase 3: CAG integration with vector search (Cache-Augmented Generation)

6) Key Learnings (What I'd Tell Future GSoC Students)

Lead With User Impact, Not Technical Complexity

My early proposals focused on parameter tuning and model configurations. The work that really mattered was solving actual user pain points: cost, reliability, and trust.

Simple Solutions Beat Clever Ones

My hybrid token counting strategy was technically sound but added complexity. The team's choice to stick with API-based counting was the right call - it optimized for maintainability over cleverness.

Community Validation Is Everything

The features users actually adopted and talked about weren't always the ones I thought were most impressive. Real-world usage patterns informed better design decisions.

Cost Is A Feature

Explicit caching isn't just about saving money - it changes what's possible. When context costs drop 75%, users experiment more, iterate faster, and trust the system with larger problems.

7) Milestones & PRs

| Area | Link | Status |
| --- | --- | --- |
| Gemini Tools integration (real-time research) | PR #5959 | Merged |
| Progressive Gemini naming migration | PR #5990 | Approved / pending merge |
| Grounding with Google Search race-condition fix | PR #7434 | Approved / pending merge |
| Gemini Tools & citations streaming UX | PR #4895 | In long review |
| Hybrid token counting strategy | PR #5418 | Closed (team chose API-based) |
| Gemini integration proposal (early) | Issue #4483 | Discussion |
| Improve Gemini UX | Issue #4526 | In progress |
| Sources placement bug | Issue #6372 | Investigating |
| Additional Gemini settings | Issue #6616 | Planned |
| Misc. context/indexing follow-ups | Issue #5444 | In progress |
| Early PR (duplicate/conflict) | PR #4360 | Superseded |

8) Weekly Log

  • Week 1 (Jun 2–9): Onboarded Roo Code; opened #4360; proposal #4483; scouted long-context & caching approach.

  • Week 2 (Jun 16–20): Continued on Gemini controls; handled duplication conflict with #4360.

  • Week 3 (Jun 23–27): Pushed #4895; began #4526 branch (i/improve-gemini-ux-4526); drafted explicit caching plan.

  • Week 4 (Jun 30–Jul 4): Review cycles for #4895; expanded docs/explanations for large PR review.

  • Week 5 (Jul 7–11): Implemented #5418 (hybrid token counting); reached consensus on citations strategy (sources at end).

  • Week 6 (Jul 14–18): Finalized hybrid strategy; more #4895 iterations; opened #5444.

  • Week 7 (Jul 14–18, continued): Same sprint; prepared explicit caching proposal; discussed cost optimization.

  • Week 8 (Jul 21–25): Merged #5959 🎉; started #5990 (model naming migration).

  • Week 9 (Jul 28–Aug 1): Advanced #5990; investigated #6372; studied OpenRouter/Requesty caching approaches.

  • Week 10 & 11 (Aug 4–15): #5990 approved; deep-dive into #6372 race/state issues; finalized explicit caching proposal.

  • Week 12 (Aug 18–22): The scope of fixing the race/state condition in #6372 turned out to be too large, so the best path was to lean toward a code design change.

  • Week 13 (Aug 25–31): Finalized the #6372 fix via #7434 and worked on the Final Report for GSoC 2025 at Google DeepMind × Roo Code.

9) Acknowledgments

My deepest thanks to my incredible mentor Paige Bailey for putting up with my overengineered solutions and for her constant support and guidance, not only during GSoC but beyond it too.

Hannes Rudolph - I loved your enthusiasm and excitement. Thank you for introducing me to Roo Code and for keeping me updated and helping me.

Matt Rubens and Daniel - those 60+ review comments were brutal but necessary, and I enjoyed every minute of them and the great discussions we had.

The Roo Code community who tested stuff and reported bugs.

And thanks to Google DeepMind for the opportunity to work on something that actually ships, instead of just staying in a research repo somewhere.

10) Resources & Links


11) Closing

The grounded search feature is working. Users are actually using it, which still feels weird to say.

The race condition fix was probably the most satisfying to solve, even though it took way longer than I expected. Streaming APIs are tricky, and citation placement during real-time responses is one of those problems that seems simple until you actually try to implement it. The caching design... well, it's designed. Whether it gets built is up to the team now.

Honestly, the biggest thing I learned was how different real software development is from academic projects. Users don't care about elegant algorithms - they care about whether the thing works reliably and doesn't cost them a fortune.

That's probably obvious to most people. Took me a summer to figure it out.

Anyway, these features are out there now, helping developers get better AI responses. That's pretty cool.
