@ariel-frischer
Created January 25, 2026 20:58
Coding Agents: Semantic RAG Search vs Traditional Retrieval Methods

A comprehensive analysis of code retrieval approaches for AI coding agents, comparing semantic/RAG-based search against traditional methods (grep, keyword search) and agentic tool use. Focus: empirical data and benchmarks over opinions.


Executive Summary

| Approach | Accuracy | Speed | Cost | Scalability | Best For |
| --- | --- | --- | --- | --- | --- |
| Semantic RAG | High (+12-40%) | Medium | Medium | High | Large/complex codebases |
| Grep/Keyword | Moderate | Fast | Low | Medium | Well-organized code, exact matches |
| Agentic Tool Use | Highest | Slow | High | Low | Complex multi-file reasoning |
| Hybrid (RAG + Grep) | Highest | Medium | Medium | High | Production systems |

1. Benchmark Results: Hard Numbers

1.1 Cursor's Semantic Search (Production A/B Tests)

Source: Cursor Blog - Improving Agent with Semantic Search

| Metric | Result | Notes |
| --- | --- | --- |
| Average accuracy improvement | +12.5% | Range: 6.5% to 23.5% depending on model |
| Code retention increase | +0.3% overall | +2.6% on large codebases (1000+ files) |
| Dissatisfied follow-ups reduction | -2.2% | When semantic search available |

Key finding: Semantic search benefits scale with codebase size. The 2.6% improvement on large codebases vs. 0.3% overall suggests the gains are marginal for small projects.

1.2 CodeRAG-Bench Results (Academic Benchmark)

Source: CodeRAG-Bench Paper, Project Page

Improvement with retrieved context vs no retrieval:

| Task Type | Improvement | Notes |
| --- | --- | --- |
| MBPP (basic programming) | +15.6-17.8% | Even outperformed oracle setups |
| ODEX (open-domain) | +20.3-40.1% | Unfamiliar library tasks |
| DS-1000 (data science) | +20.3-40.1% | Complex API usage |
| SWE-Bench | +27.4% (GPT-4o) | Repository-level tasks |
| ODEX-hard subset | +6.9% (GPT-4o) | Hardest retrieval scenarios |

Critical gap: a 9-point gap remains between RAG performance and the oracle setting (gold documents provided), indicating that retrieval quality still limits downstream generation.

1.3 SWE-Bench Evolution (2024-2025)

Source: SWE-bench, Scale AI Leaderboard

| Period | Best Performance | Method |
| --- | --- | --- |
| 2024 (early) | 1.96% | BM25 retrieval + Claude 2 |
| 2024 (with oracle) | 4.8% | Gold files provided |
| 2024 (SWE-agent) | 12.47% | Agentic approach (GPT-4 Turbo) |
| 2024 (Devin) | 13.86% | Autonomous agent |
| 2025 (Verified) | ~75% | Modern agents + verified subset |
| 2025 (Full) | 29% | OpenHands (#1 open source) |
| 2025 (Pro) | 23.26% | GPT-5 |

Key insight: Agent-based approaches (12.47%+) dramatically outperformed pure RAG baselines (1.96-3.8%) on SWE-bench, a 3-6x improvement.

1.4 CoIR Code Retrieval Benchmark

Source: CoIR Paper

| Model | Mean Score | Variance | Notes |
| --- | --- | --- | --- |
| Voyage-Code-002 | 56.26 | High | Best overall, weaker generalization |
| BGE-M3 | Moderate | Lowest | Best robustness |
| E5-Mistral | High | - | 1840 ms latency, 2.3 GB index |
| E5-Base | Moderate | - | 7.4 ms latency, 0.3 GB index |

Trade-off: 250x latency difference between fast (E5-Base) and accurate (E5-Mistral) models.


2. Tool Comparisons: Production Data

2.1 GitHub Copilot vs Sourcegraph Cody

Source: Augment Code Comparison

| Metric | Copilot | Cody |
| --- | --- | --- |
| Usable code rate | 68% | 82% |
| Context scope | Single file | Full repository |
| Failure mode | Missing cross-file dependencies | Slower response |

Root cause of the gap: Copilot couldn't locate imports and internal utilities that weren't visible in the immediate file context.

2.2 DeepCS Semantic Search Study

Source: Kilo.ai Blog

| Metric | Semantic Search | Keyword Search |
| --- | --- | --- |
| Average result rank | 3.5 | 6.0 |
| Top-5 hit rate | 76% | Lower |
| MRR improvement (Samsung) | +41% | Baseline |

2.3 Jolt AI Large Codebase Benchmark

Source: Jolt AI Blog

Tested: Cursor, Windsurf, GitHub Copilot, Claude Code, OpenAI Codex, Augment, Jolt. Codebases: Django, Grafana, Kubernetes (6 closed PRs each).

| Tool Type | Speed | Thoroughness | Accuracy |
| --- | --- | --- | --- |
| Pure vector (Copilot) | <1 min | Low | Low (wrong files) |
| Agentic (Codex, Claude Code) | 3-5 min | High | High |
| Hybrid | Medium | High | High |

Trade-off: the agentic tools were 3-5x slower but delivered significantly more accurate file discovery.


3. Grep vs Semantic Search: When Each Wins

3.1 Augment Code's SWE-Bench Finding

Source: Jason Liu - Why Grep Beat Embeddings

"We explored adding various embedding-based retrieval tools, but found that for SWE-bench tasks this was not the bottleneck – grep and find were sufficient."

Why grep worked for SWE-bench:

  • Repositories smaller than real-world codebases
  • 90% of problems solvable by good engineer in <1 hour
  • Agent persistence compensated for simpler tools
  • Code has distinctive keywords (function names, error messages)

But: "In practice, we find that embedding-based tools are critical to deliver a great product experience."

3.2 Grep Query Expansion Hack

Source: Nuss and Bolts

Using GPT-3.5-mini to expand grep keywords improved retrieval performance nearly 10x by generating semantically related terms that were not present in the original queries.
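
A minimal sketch of this query-expansion pattern, assuming the OpenAI Python client and a small chat model as the expander (the model name, prompt, and helper names here are illustrative, not the setup from the source):

```python
import subprocess
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def expand_query(question: str, n_terms: int = 8) -> list[str]:
    """Ask a small model for code identifiers/keywords likely to co-occur with the answer."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for whichever small model you use
        messages=[{
            "role": "user",
            "content": (
                f"List up to {n_terms} code identifiers, function names, or error strings "
                f"likely to appear in source files relevant to: {question}\n"
                "Return one term per line, no explanations."
            ),
        }],
    )
    return [t.strip() for t in resp.choices[0].message.content.splitlines() if t.strip()]

def grep_expanded(terms: list[str], repo_path: str) -> dict[str, list[str]]:
    """Run a recursive grep for each expanded term and collect a capped set of matches."""
    hits: dict[str, list[str]] = {}
    for term in terms:
        out = subprocess.run(
            ["grep", "-rniI", "--include=*.py", "--", term, repo_path],
            capture_output=True, text=True,
        )
        if out.stdout:
            hits[term] = out.stdout.splitlines()[:20]  # cap per-term results
    return hits

# hits = grep_expanded(expand_query("where do we retry failed webhook deliveries?"), ".")
```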

3.3 Token Efficiency Comparison

Source: Milvus Blog

| Approach | Token Usage | Notes |
| --- | --- | --- |
| Grep-only | High (bloat) | Returns all matching lines without semantic filtering |
| Semantic + AST chunking | -40% tokens | Same recall, preserves function boundaries |
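
One way to get the function-boundary-preserving chunks the table refers to is to split files along the AST rather than by fixed token windows. A minimal sketch for Python files using the standard-library ast module (the embedding/indexing step is omitted; function names are illustrative):

```python
import ast
from pathlib import Path

def ast_chunks(path: str) -> list[dict]:
    """Split a Python file into top-level function/class chunks (never cuts a definition)."""
    source = Path(path).read_text()
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])  # end_lineno: Python 3.8+
            chunks.append({"name": node.name, "start": node.lineno,
                           "end": node.end_lineno, "text": text})
    return chunks

# Each chunk can then be embedded as one unit:
# for chunk in ast_chunks("app/auth.py"):
#     print(chunk["name"], chunk["end"] - chunk["start"] + 1, "lines")
```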

3.4 When to Use Each

| Use Case | Recommended Approach |
| --- | --- |
| Known function/variable name | Grep |
| Exact error message | Grep |
| "How does authentication work?" | Semantic |
| Finding usage patterns | Semantic |
| Cross-file dependencies | Semantic/Agentic |
| Small, well-organized codebase | Grep sufficient |
| Large enterprise codebase | Semantic essential |
| Multimodal (diagrams, images) | Semantic only |
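
If routing has to be automated rather than left to the agent, the table above can be approximated with a crude heuristic. A toy sketch (the thresholds and regex are assumptions, not derived from any of the cited studies):

```python
import re

IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_.:]*$")  # looks like a symbol, not a sentence

def choose_retriever(query: str, repo_file_count: int) -> str:
    q = query.strip()
    if IDENTIFIER.match(q) or (q.startswith('"') and q.endswith('"')):
        return "grep"          # exact symbol or quoted error message
    if repo_file_count < 1_000 and len(q.split()) <= 3:
        return "grep"          # small repo, short keyword query
    return "semantic"          # conceptual or cross-file question, or a large codebase

# choose_retriever("validate_token", 400)                    -> "grep"
# choose_retriever("How does authentication work?", 5_000)   -> "semantic"
```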

4. Cost and Efficiency Analysis

4.1 RAG vs Long Context

Source: LightOn Blog

Test scenario: a 1,000-page knowledge base (~600K tokens), 1,000 daily requests

| Approach | Cost Multiplier |
| --- | --- |
| Long context | 8-82x more expensive |
| RAG | Baseline |

RAG also provides lower latency because only the relevant chunks are processed.
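
A back-of-the-envelope version of that scenario shows where a multiplier of this order comes from. The per-token price and the RAG context size are assumptions (the source's 8-82x range depends on exactly these choices, plus caching), and embedding/retrieval costs are ignored:

```python
KB_TOKENS = 600_000          # entire knowledge base stuffed into the prompt
RAG_CONTEXT_TOKENS = 8_000   # assumed: question + a handful of retrieved chunks
REQUESTS_PER_DAY = 1_000
PRICE_PER_1M_INPUT = 2.50    # assumed $/1M input tokens

long_context_cost = KB_TOKENS * REQUESTS_PER_DAY / 1e6 * PRICE_PER_1M_INPUT
rag_cost = RAG_CONTEXT_TOKENS * REQUESTS_PER_DAY / 1e6 * PRICE_PER_1M_INPUT

print(f"long context: ${long_context_cost:,.0f}/day")        # $1,500/day
print(f"RAG:          ${rag_cost:,.0f}/day")                  # $20/day
print(f"multiplier:   {long_context_cost / rag_cost:.0f}x")   # 75x
```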

4.2 Context Window Degradation

Source: Chroma Research - Context Rot

  • LLM performance degrades as context length increases
  • Irrelevant information introduces noise that weakens attention to the relevant portions
  • GPT-4/5: Dramatic performance degradation around 400K characters (~133K tokens)
  • Response times jump 50x near context limits

Implication: Larger context windows don't guarantee better performance; retrieval remains essential.

4.3 Embedding Model Efficiency

Source: Voyage AI Blog

Voyage-3: 7.55% better than OpenAI v3 large while 2.2x cheaper

| Optimization | Benefit |
| --- | --- |
| Distillation | 50-80% cost reduction, <3% recall drop |
| Quantization | 40% latency reduction, <2% recall drop |
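
For intuition on the quantization row: storing the index as int8 instead of float32 cuts memory roughly 4x and speeds up scanning, at a small recall cost. A minimal symmetric-quantization sketch with NumPy (real systems typically use per-dimension calibration or the vector database's built-in quantization):

```python
import numpy as np

def quantize_int8(embeddings: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric int8 quantization of a float32 embedding matrix."""
    scale = float(np.abs(embeddings).max()) / 127.0
    return np.clip(np.round(embeddings / scale), -127, 127).astype(np.int8), scale

def cosine_top_k(query: np.ndarray, index_q: np.ndarray, scale: float, k: int = 5) -> np.ndarray:
    approx = index_q.astype(np.float32) * scale                    # dequantize on the fly
    approx /= np.linalg.norm(approx, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return np.argsort(approx @ q)[::-1][:k]                        # indices of the top-k chunks

index = np.random.randn(10_000, 768).astype(np.float32)           # toy corpus embeddings
index_q, scale = quantize_int8(index)                              # ~4x smaller than float32
top = cosine_top_k(np.random.randn(768).astype(np.float32), index_q, scale)
```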

5. Agentic Approaches

5.1 Agent vs RAG Performance

| Approach | Quality | Speed | Cost | Reliability |
| --- | --- | --- | --- | --- |
| Traditional RAG | Decent | Excellent | Low | Low (no correction) |
| Agent + Grep | Excellent | Poor | High | High |
| Agent + Embeddings | Excellent | Poor | High | High |

Source: RAGFlow Blog

5.2 Token Efficiency of Agentic Code Execution

Source: Anthropic Engineering

Agents that process tool results inside an execution environment achieved a 98.7% token reduction compared with returning raw data to the model context.
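
The shape of the technique, sketched below: run the search in the sandbox and hand the model a compact summary instead of the raw matches. This is an illustrative reconstruction, not Anthropic's implementation, and the helper names are made up:

```python
import subprocess
from collections import Counter

def summarize_search(pattern: str, repo_path: str, top_dirs: int = 5) -> str:
    """Run grep in the execution environment; return only a short summary to the model."""
    out = subprocess.run(["grep", "-rlI", "--", pattern, repo_path],
                         capture_output=True, text=True)
    files = out.stdout.splitlines()
    by_dir = Counter(f.rsplit("/", 1)[0] for f in files)
    summary = [f"{len(files)} files match '{pattern}'"]
    summary += [f"  {d}: {n} files" for d, n in by_dir.most_common(top_dirs)]
    return "\n".join(summary)   # tens of tokens instead of thousands of raw match lines

# print(summarize_search("refresh_token", "."))
```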

5.3 Multi-Hop Retrieval Necessity

For repository-level code generation, single-document retrieval is insufficient. Multi-hop approaches (a minimal loop is sketched after the source below):

  1. Retrieve initial candidates
  2. Extract entities/evidence
  3. Retrieve additional documents based on discovered relationships

Source: ACL Anthology
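
A minimal version of that loop, with `retrieve` and `extract_entities` left as placeholders for an embedding search and an LLM or AST pass respectively (the names and the hop/k defaults are assumptions):

```python
def multi_hop_retrieve(query: str, retrieve, extract_entities,
                       hops: int = 2, k: int = 5) -> list[dict]:
    """Iteratively expand the retrieved set by following entities found in earlier hits."""
    seen: set[str] = set()
    frontier, results = [query], []
    for _ in range(hops):
        next_frontier: list[str] = []
        for q in frontier:
            for doc in retrieve(q, k=k):                  # step 1: retrieve candidates
                if doc["id"] in seen:
                    continue
                seen.add(doc["id"])
                results.append(doc)
                # steps 2-3: extract entities from the hit, queue them for the next hop
                next_frontier.extend(extract_entities(doc["text"]))
        frontier = next_frontier
    return results
```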


6. Hybrid Approaches (State of the Art)

6.1 Cursor's Architecture

Combines:

  • Custom embedding model (trained on agent traces)
  • Grep for exact matches
  • Turbopuffer vector database

Result: the best outcomes come from the combination, not from either technique alone.

6.2 Rank Fusion Methods

| Method | Use Case |
| --- | --- |
| Min-max normalization | Comparable score distributions |
| Reciprocal Rank Fusion (RRF) | Incompatible score distributions |
| Cross-encoder reranking | Final precision boost |

Formula: `RRF_score(d) = Σ_i 1 / (k + rank_i(d))`, summed over the retrievers i, where rank_i(d) is the document's rank in retriever i and k is typically 60.

Source: Microsoft Azure Docs
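
A small sketch of RRF applied to this setting, fusing a grep-ranked list with a semantically ranked list of file paths (the file names are invented for illustration):

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Combine several ranked result lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

grep_hits = ["auth/token.py", "auth/middleware.py", "tests/test_auth.py"]
semantic_hits = ["auth/middleware.py", "auth/session.py", "auth/token.py"]
print(rrf_fuse([grep_hits, semantic_hits]))   # files found by both retrievers float to the top
```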

6.3 Production Recommendations

  1. Start with grep for well-structured, smaller codebases
  2. Add semantic search when:
    • Codebase exceeds 1,000 files
    • Developers ask conceptual questions
    • Cross-file reasoning needed
  3. Use hybrid for production systems
  4. Expose both as agent tools for maximum flexibility
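
One way to implement recommendation 4: register both retrievers as separate tools and let the agent pick per query. A sketch using the OpenAI function-calling tool schema (the tool names, descriptions, and parameters are illustrative):

```python
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "grep_search",
            "description": "Exact-match search. Use for known identifiers, error strings, or regexes.",
            "parameters": {
                "type": "object",
                "properties": {"pattern": {"type": "string"}},
                "required": ["pattern"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "semantic_search",
            "description": "Embedding-based search. Use for conceptual questions and cross-file behavior.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "top_k": {"type": "integer", "default": 8},
                },
                "required": ["query"],
            },
        },
    },
]
# Passed as `tools=TOOLS` in a chat.completions.create call; the model then chooses
# grep_search or semantic_search per request.
```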

7. Key Statistics Summary

| Claim | Statistic | Source |
| --- | --- | --- |
| Semantic search accuracy gain | +12.5% average | Cursor A/B tests |
| Large codebase benefit | +2.6% code retention | Cursor (1000+ files) |
| Cody vs Copilot accuracy | 82% vs 68% | Augment comparison |
| RAG vs long context cost | 8-82x cheaper | LightOn |
| Semantic vs keyword rank | 3.5 vs 6.0 | DeepCS study |
| Top-5 hit rate (semantic) | 76% | DeepCS |
| MRR improvement | +41% | Samsung deployment |
| Agent token reduction | 98.7% | Anthropic MCP |
| SWE-bench agent vs RAG | 12.47% vs 1.96% | SWE-agent vs baseline |
| Retrieval gain (open-domain) | +20-40% | CodeRAG-Bench |
| Context rot threshold | ~133K tokens | GPT-4/5 testing |

8. Conclusions

  1. Neither grep nor semantic search is universally better; context determines the optimal approach

  2. Semantic search advantages increase with:

    • Codebase size (inflection point ~1,000 files)
    • Code complexity and architectural depth
    • Natural language queries about intent
  3. Grep advantages:

    • Speed and simplicity
    • Exact matches (function names, errors)
    • No index maintenance overhead
    • Sufficient for SWE-bench-style tasks
  4. Production best practice: Hybrid systems combining both, exposed as agent tools

  5. The retrieval gap: the 9-point gap between RAG and oracle retrieval on SWE-bench indicates that retrieval quality remains a bottleneck for the hardest problems

  6. Agents change the equation: Persistence and multi-step reasoning can compensate for simpler retrieval tools, but at the cost of speed and tokens

