@PandaWhoCodes
Created April 2, 2026 07:08
Vertex AI RAG Ingestion Pipeline — Research & Plan (Saama)
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Vertex AI RAG Ingestion Pipeline — Research & Plan</title>
<link href="https://fonts.googleapis.com/css2?family=Crimson+Pro:ital,wght@0,300;0,400;0,500;0,600;1,300;1,400&family=Overpass+Mono:wght@300;400;500&family=Nunito+Sans:wght@300;400;600;700&display=swap" rel="stylesheet">
<style>
* { margin: 0; padding: 0; box-sizing: border-box; }
:root {
--bg: #f7f4ef;
--bg-card: #ffffff;
--bg-code: #1e1e2e;
--text: #2c2a26;
--text-soft: #5c5850;
--text-dim: #9c978d;
--accent: #2563eb;
--accent-dark: #1d4ed8;
--accent-green: #16a34a;
--accent-red: #dc2626;
--accent-orange: #ea580c;
--accent-purple: #7c3aed;
--accent-dim: rgba(37, 99, 235, 0.08);
--accent-green-dim: rgba(22, 163, 74, 0.08);
--serif: 'Crimson Pro', Georgia, serif;
--sans: 'Nunito Sans', -apple-system, sans-serif;
--mono: 'Overpass Mono', monospace;
}
body {
background: var(--bg);
color: var(--text);
font-family: var(--sans);
font-size: 16px;
line-height: 1.7;
-webkit-font-smoothing: antialiased;
}
body::after {
content: '';
position: fixed;
inset: 0;
background-image: url("data:image/svg+xml,%3Csvg viewBox='0 0 256 256' xmlns='http://www.w3.org/2000/svg'%3E%3Cfilter id='noise'%3E%3CfeTurbulence type='fractalNoise' baseFrequency='0.65' numOctaves='3' stitchTiles='stitch'/%3E%3C/filter%3E%3Crect width='100%25' height='100%25' filter='url(%23noise)' opacity='0.03'/%3E%3C/svg%3E");
pointer-events: none;
z-index: 9999;
}
.page { max-width: 720px; margin: 0 auto; padding: 3rem 1.5rem; }
/* Header */
.header { text-align: center; margin-bottom: 3rem; }
.header .tag {
font-family: var(--mono);
font-size: 0.55rem;
letter-spacing: 0.2em;
text-transform: uppercase;
color: var(--accent);
margin-bottom: 1rem;
display: block;
}
.header h1 {
font-family: var(--serif);
font-size: clamp(1.8rem, 5vw, 2.6rem);
font-weight: 300;
color: var(--text);
line-height: 1.2;
margin-bottom: 0.8rem;
}
.header .sub {
font-size: 0.85rem;
color: var(--text-dim);
max-width: 500px;
margin: 0 auto;
}
/* Ribbon */
.ribbon { width: 2px; height: 50px; background: var(--accent); margin: 2.5rem auto; opacity: 0.2; position: relative; }
.ribbon::after { content: ''; position: absolute; bottom: -6px; left: -3px; width: 8px; height: 8px; background: var(--accent); opacity: 0.4; border-radius: 50%; }
.divider { text-align: center; color: var(--text-dim); font-size: 1rem; letter-spacing: 0.5em; margin: 2rem 0; }
/* Section */
.section { margin-bottom: 3rem; }
.section h2 {
font-family: var(--serif);
font-size: 1.6rem;
font-weight: 400;
color: var(--text);
margin-bottom: 0.3rem;
}
.section .note {
font-size: 0.78rem;
color: var(--text-dim);
margin-bottom: 1.5rem;
}
.section h3 {
font-family: var(--sans);
font-size: 0.95rem;
font-weight: 700;
color: var(--text);
margin: 1.5rem 0 0.5rem;
}
p { margin-bottom: 1rem; color: var(--text-soft); font-size: 0.9rem; }
/* Architecture box */
.arch-box {
background: var(--bg-code);
color: #cdd6f4;
font-family: var(--mono);
font-size: 0.65rem;
line-height: 1.8;
padding: 1.5rem;
border-radius: 8px;
overflow-x: auto;
white-space: pre;
margin: 1.5rem 0;
}
.arch-box .hl { color: #89b4fa; }
.arch-box .hlg { color: #a6e3a1; }
.arch-box .hlo { color: #fab387; }
/* Cards */
.card {
background: var(--bg-card);
border: 1px solid rgba(0,0,0,0.05);
border-radius: 8px;
padding: 1.2rem 1.4rem;
margin-bottom: 1rem;
box-shadow: 0 1px 8px rgba(0,0,0,0.03);
}
.card-title {
font-family: var(--sans);
font-weight: 700;
font-size: 0.88rem;
color: var(--text);
margin-bottom: 0.3rem;
}
.card p { margin-bottom: 0.5rem; font-size: 0.82rem; }
/* Phase indicator */
.phase {
display: inline-flex;
align-items: center;
gap: 0.4rem;
font-family: var(--mono);
font-size: 0.55rem;
letter-spacing: 0.1em;
text-transform: uppercase;
padding: 0.2em 0.7em;
border-radius: 3px;
margin-bottom: 0.8rem;
}
.phase.blue { background: var(--accent-dim); color: var(--accent); }
.phase.green { background: var(--accent-green-dim); color: var(--accent-green); }
.phase.orange { background: rgba(234,88,12,0.08); color: var(--accent-orange); }
.phase.purple { background: rgba(124,58,237,0.08); color: var(--accent-purple); }
/* Code blocks */
.code-block {
background: var(--bg-code);
color: #cdd6f4;
font-family: var(--mono);
font-size: 0.68rem;
line-height: 1.7;
padding: 1.2rem 1.4rem;
border-radius: 6px;
overflow-x: auto;
margin: 1rem 0;
}
.code-label {
font-family: var(--mono);
font-size: 0.5rem;
letter-spacing: 0.1em;
text-transform: uppercase;
color: var(--text-dim);
margin-bottom: 0.3rem;
}
.code-block .kw { color: #cba6f7; }
.code-block .str { color: #a6e3a1; }
.code-block .cm { color: #6c7086; }
.code-block .fn { color: #89b4fa; }
/* Limits table */
.limits-table {
width: 100%;
border-collapse: collapse;
margin: 1rem 0;
font-size: 0.82rem;
}
.limits-table th {
font-family: var(--mono);
font-size: 0.55rem;
letter-spacing: 0.1em;
text-transform: uppercase;
color: var(--text-dim);
text-align: left;
padding: 0.5rem 0.8rem;
border-bottom: 2px solid rgba(0,0,0,0.08);
}
.limits-table td {
padding: 0.5rem 0.8rem;
border-bottom: 1px solid rgba(0,0,0,0.04);
color: var(--text-soft);
}
.limits-table td:last-child { font-family: var(--mono); font-size: 0.78rem; }
/* Decision cards */
.decision {
background: var(--bg-card);
border: 1px solid rgba(0,0,0,0.05);
border-left: 3px solid var(--accent);
border-radius: 0 6px 6px 0;
padding: 1rem 1.2rem;
margin-bottom: 0.8rem;
}
.decision .q {
font-family: var(--sans);
font-weight: 700;
font-size: 0.85rem;
color: var(--text);
margin-bottom: 0.3rem;
}
.decision .a {
font-size: 0.82rem;
color: var(--text-soft);
margin: 0;
}
.decision .verdict {
font-family: var(--mono);
font-size: 0.6rem;
color: var(--accent-green);
text-transform: uppercase;
letter-spacing: 0.1em;
margin-top: 0.3rem;
}
/* Timeline */
.timeline { margin: 1.5rem 0; }
.timeline-week {
display: flex;
gap: 1rem;
margin-bottom: 1.2rem;
align-items: flex-start;
}
.timeline-marker {
width: 36px;
height: 36px;
border-radius: 50%;
display: flex;
align-items: center;
justify-content: center;
font-family: var(--mono);
font-size: 0.6rem;
font-weight: 700;
color: white;
flex-shrink: 0;
}
.timeline-marker.w1 { background: var(--accent); }
.timeline-marker.w2 { background: var(--accent-green); }
.timeline-marker.w3 { background: var(--accent-orange); }
.timeline-marker.w4 { background: var(--accent-purple); }
.timeline-content { flex: 1; }
.timeline-content h4 {
font-size: 0.88rem;
font-weight: 700;
margin-bottom: 0.3rem;
}
.timeline-content ul {
list-style: none;
padding: 0;
}
.timeline-content li {
font-size: 0.8rem;
color: var(--text-soft);
padding: 0.15rem 0;
padding-left: 1rem;
position: relative;
}
.timeline-content li::before {
content: '→';
position: absolute;
left: 0;
color: var(--text-dim);
}
/* Industry patterns */
.pattern {
display: flex;
gap: 0.8rem;
align-items: flex-start;
margin-bottom: 1rem;
}
.pattern-num {
width: 28px;
height: 28px;
border-radius: 50%;
background: var(--accent-dim);
color: var(--accent);
display: flex;
align-items: center;
justify-content: center;
font-family: var(--mono);
font-size: 0.6rem;
font-weight: 700;
flex-shrink: 0;
}
.pattern-text { flex: 1; }
.pattern-text strong { font-size: 0.85rem; }
.pattern-text p { font-size: 0.8rem; margin: 0.2rem 0 0; }
/* Status badge */
.badge {
display: inline-block;
font-family: var(--mono);
font-size: 0.5rem;
letter-spacing: 0.08em;
text-transform: uppercase;
padding: 0.15em 0.5em;
border-radius: 3px;
margin-right: 0.3rem;
}
.badge.done { background: rgba(22,163,74,0.1); color: var(--accent-green); }
.badge.todo { background: rgba(37,99,235,0.1); color: var(--accent); }
.badge.rec { background: rgba(234,88,12,0.1); color: var(--accent-orange); }
/* Footer */
.footer {
margin-top: 3rem;
padding-top: 1.5rem;
border-top: 1px solid rgba(0,0,0,0.06);
text-align: center;
}
.footer p { font-size: 0.7rem; color: var(--text-dim); margin-bottom: 0.2rem; }
/* Scroll */
.fade-in { opacity: 0; transform: translateY(14px); transition: opacity 0.6s ease, transform 0.6s ease; }
.fade-in.visible { opacity: 1; transform: translateY(0); }
</style>
</head>
<body>
<div class="page">
<div class="header">
<span class="tag">Saama · Vertex AI · ADK · Research Notes</span>
<h1>RAG Ingestion Pipeline</h1>
<p class="sub">Research & plan for file upload → processing → Vertex AI RAG Engine. Based on Ashish & Abhijit's call, April 2, 2026.</p>
</div>
<div class="ribbon"></div>
<!-- PROBLEM -->
<div class="section fade-in">
<h2>The Problem</h2>
<p>Users upload files (up to <strong>1 GB</strong>, including zip archives) from a React frontend. These need to land in GCS without hitting the backend pod, get extracted/validated/split, maintain parent-child relationships in the UI, track status in the DB, and be ingested into Vertex AI RAG Engine with proper context and metadata.</p>
<p>Abhijit tried Airflow GCS hooks before and they didn't work reliably, so a better trigger mechanism is needed.</p>
</div>
<!-- ARCHITECTURE -->
<div class="section fade-in">
<h2>Architecture</h2>
<div class="arch-box"><span class="hl">React App</span> → Backend (<span class="hlg">signed URL</span>) → <span class="hlo">GCS /uploads/</span>
<span class="hlg">Eventarc trigger</span>
<span class="hl">Cloud Run processor</span>
┌─────────┴─────────┐
│ │
<span class="hlo">Zip? Extract</span> <span class="hlo">Single file</span>
│ │
└─────────┬─────────┘
<span class="hlg">Validate + Split</span>
(PDF by page, DOCX by heading)
<span class="hlo">GCS /processed/</span>
<span class="hl">Update DB status</span>
<span class="hlg">ImportRagFiles API</span>
(+ Layout Parser)
<span class="hl">Vertex AI RAG Corpus</span></div>
</div>
<!-- VERTEX LIMITS -->
<div class="section fade-in">
<h2>Vertex AI RAG Limits</h2>
<p class="note">These are the hard constraints your processing pipeline must respect</p>
<table class="limits-table">
<thead><tr><th>File Type</th><th>Max Size</th><th>Notes</th></tr></thead>
<tbody>
<tr><td>PDF</td><td>50 MB</td><td>500 pages max with Layout Parser</td></tr>
<tr><td>DOCX</td><td>50 MB</td><td>Split by heading structure</td></tr>
<tr><td>Text / Markdown</td><td>10 MB</td><td></td></tr>
<tr><td>HTML / JSON</td><td>10 MB</td><td></td></tr>
<tr><td>ZIP</td><td>❌ Not supported</td><td>Must extract first</td></tr>
<tr><td>Chunking default</td><td>1024 tokens</td><td>256 token overlap</td></tr>
</tbody>
</table>
</div>
<div class="divider">· · ·</div>
<!-- PHASE 1 -->
<div class="section fade-in">
<span class="phase blue">Phase 1 — Upload</span>
<h2>Browser → GCS via Signed URLs</h2>
<p>React requests a V4 signed URL from the backend, then uploads directly to GCS. <strong>Zero backend memory usage.</strong> This comfortably handles the 1 GB files here; for files over 5 GB, use the resumable upload protocol.</p>
<div class="card">
<div class="card-title">How it works</div>
<p>1. React calls <code>/api/upload/signed-url</code> with filename & content type</p>
<p>2. Backend generates V4 signed PUT URL (15 min expiry, scoped to user folder)</p>
<p>3. React does <code>fetch(signed_url, { method: 'PUT', body: file })</code> — direct to GCS</p>
<p>4. On completion, React notifies backend with the object path</p>
</div>
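<p>Steps 1 and 2 can be sketched as below. This is a minimal sketch using the <code>google-cloud-storage</code> client: the <code>uploads/{user_id}/{ts}/</code> layout follows this plan, while the helper names and endpoint wiring are illustrative, not committed code.</p>

```python
from datetime import timedelta


def build_object_path(user_id: str, ts: str, filename: str) -> str:
    """Scope each upload to a per-user folder, as the plan describes."""
    safe_name = filename.replace("/", "_")  # avoid path traversal into other folders
    return f"uploads/{user_id}/{ts}/{safe_name}"


def make_signed_put_url(bucket_name: str, object_path: str, content_type: str) -> str:
    """Generate a V4 signed PUT URL with a 15-minute expiry (requires GCS credentials)."""
    from google.cloud import storage  # deferred: needs google-cloud-storage installed

    blob = storage.Client().bucket(bucket_name).blob(object_path)
    return blob.generate_signed_url(
        version="v4",
        expiration=timedelta(minutes=15),
        method="PUT",
        content_type=content_type,  # the browser PUT must send the same Content-Type
    )
```

<p>Because the URL is signed for one object path and one method, a leaked URL can only PUT that one object for 15 minutes.</p>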
<div class="card">
<div class="card-title">Don't forget</div>
<p>• Set CORS on the GCS bucket to allow PUT from your frontend origin</p>
<p>• For resumable uploads (large files): initiate via JSON API, then upload in chunks</p>
<p>• Abhijit confirmed he's seen demos of this working — it's production-proven</p>
</div>
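<p>For the resumable path, the backend initiates a session and hands the session URL to the browser, which then uploads in chunks; GCS requires each chunk (except the last) to be a multiple of 256 KiB. A sketch, assuming the same <code>google-cloud-storage</code> client; the function names are illustrative:</p>

```python
CHUNK_UNIT = 256 * 1024  # resumable upload chunks must be multiples of 256 KiB


def align_chunk_size(requested_bytes: int) -> int:
    """Round a requested chunk size up to the nearest 256 KiB multiple."""
    units = max(1, -(-requested_bytes // CHUNK_UNIT))  # ceiling division
    return units * CHUNK_UNIT


def start_resumable_session(bucket_name: str, object_path: str,
                            content_type: str, frontend_origin: str) -> str:
    """Initiate a resumable upload; the returned session URL goes back to the browser."""
    from google.cloud import storage  # deferred: needs google-cloud-storage installed

    blob = storage.Client().bucket(bucket_name).blob(object_path)
    # `origin` pins the session to the frontend origin for CORS purposes.
    return blob.create_resumable_upload_session(
        content_type=content_type, origin=frontend_origin
    )
```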
</div>
<!-- PHASE 2 -->
<div class="section fade-in">
<span class="phase green">Phase 2 — Trigger</span>
<h2>Eventarc, Not Airflow Polling</h2>
<div class="decision">
<div class="q">Option A: Eventarc + Cloud Run <span class="badge rec">Recommended</span></div>
<p class="a">GCS <code>object.finalized</code> event → Cloud Run service. No polling, no missed files. Google's recommended pattern. This is the "TCP connector / GCS hook" you were asking about — but managed by Google.</p>
<div class="verdict">→ Use this as primary trigger</div>
</div>
<div class="decision">
<div class="q">Option B: Airflow / Cloud Composer</div>
<p class="a"><code>GCSObjectExistenceSensor</code> or trigger DAG via Cloud Functions. Better for complex orchestration. But Abhijit already tried GCS hooks in Airflow and they weren't reliable.</p>
<div class="verdict">→ Use only if you need complex DAG orchestration on top</div>
</div>
<div class="decision">
<div class="q">Option C: Backend notification (fallback)</div>
<p class="a">Frontend notifies backend after upload → backend triggers processing. Simple but what if frontend crashes mid-upload?</p>
<div class="verdict">→ Use as belt-and-suspenders alongside Eventarc</div>
</div>
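<p>When <code>object.finalized</code> fires, Eventarc delivers a CloudEvent whose JSON data carries the bucket and object name. The core of the handler can stay small and pure (a sketch; in the real Cloud Run service this would sit behind an HTTP or functions-framework handler, which is assumed here and not shown):</p>

```python
def parse_gcs_event(event_data: dict) -> tuple:
    """Extract (bucket, object_name) from a storage object.finalized payload."""
    return event_data["bucket"], event_data["name"]


def should_process(object_name: str) -> bool:
    """React only to raw uploads; writes to /processed/ must not re-trigger us."""
    return object_name.startswith("uploads/")
```

<p>Filtering on the <code>uploads/</code> prefix matters: the processor writes into <code>/processed/</code>, which also fires <code>object.finalized</code>, so without the check the service would trigger itself in a loop.</p>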
</div>
<!-- PHASE 3 -->
<div class="section fade-in">
<span class="phase orange">Phase 3 — Processing</span>
<h2>Extract, Validate, Split</h2>
<p>Cloud Run service handles all post-upload processing. Cloud Run supports up to 32GB memory — enough for 1GB zips.</p>
<h3>Step 1: Zip Extraction</h3>
<p>Download zip from GCS → extract to /tmp → filter supported types → upload individual files to <code>/processed/{user_id}/{parent_file_id}/</code></p>
<h3>Step 2: Validation</h3>
<p>Check file types, reject unsupported formats, scan for corruption. Update DB status to "failed" with reason if rejected.</p>
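<p>A minimal validator keyed on the limits table above (corruption scanning, e.g. attempting to open the file with <code>pypdf</code>, is omitted from this sketch):</p>

```python
from pathlib import Path

MAX_BYTES = {  # hard limits from the Vertex AI RAG table above
    ".pdf": 50 * 1024 * 1024,
    ".docx": 50 * 1024 * 1024,
    ".txt": 10 * 1024 * 1024,
    ".md": 10 * 1024 * 1024,
    ".html": 10 * 1024 * 1024,
    ".json": 10 * 1024 * 1024,
}


def validate(filename: str, size_bytes: int) -> tuple:
    """Return (ok, reason); the reason string goes into the DB 'failed' status."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".zip":
        return False, "zip must be extracted before ingestion"
    if suffix not in MAX_BYTES:
        return False, f"unsupported type: {suffix}"
    if size_bytes > MAX_BYTES[suffix]:
        return False, f"{suffix} exceeds {MAX_BYTES[suffix]} bytes; split first"
    return True, "ok"
```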
<h3>Step 3: Split Large Files</h3>
<div class="card">
<div class="card-title">PDF splitting</div>
<p>Split by pages (max 100 pages per chunk). Use <code>pypdf</code>. Preserve metadata: original title, page range, total pages.</p>
</div>
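<p>Page splitting can be sketched with <code>pypdf</code>; the output naming is illustrative, and the page-range math is kept pure so it can be checked independently:</p>

```python
def page_ranges(total_pages: int, max_pages: int = 100) -> list:
    """Compute (start, end) page ranges, end-exclusive, for splitting a PDF."""
    return [(start, min(start + max_pages, total_pages))
            for start in range(0, total_pages, max_pages)]


def split_pdf(src_path: str) -> list:
    """Write one sub-PDF per range, returning (path, start, end) for DB tracking."""
    from pypdf import PdfReader, PdfWriter  # deferred: needs pypdf installed

    reader = PdfReader(src_path)
    outputs = []
    for start, end in page_ranges(len(reader.pages)):
        writer = PdfWriter()
        for i in range(start, end):
            writer.add_page(reader.pages[i])
        out_path = f"{src_path}.part-{start}-{end}.pdf"
        with open(out_path, "wb") as fh:
            writer.write(fh)
        outputs.append((out_path, start, end))
    return outputs
```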
<div class="card">
<div class="card-title">DOCX splitting — the hard one ⚡</div>
<p>Abhijit's concern is valid: arbitrary byte splitting loses context. <strong>Split by heading structure</strong> (Heading 1 boundaries), not arbitrary cuts. Add a context preamble to each sub-doc: document title + section headers covered.</p>
</div>
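<p>The heading-based split reduces to grouping paragraphs at each Heading 1 boundary. A sketch over plain <code>(style_name, text)</code> pairs; with <code>python-docx</code> these would come from <code>[(p.style.name, p.text) for p in Document(path).paragraphs]</code>:</p>

```python
def split_by_heading1(paragraphs: list) -> list:
    """Group (style_name, text) pairs into sections at each 'Heading 1' boundary.

    Any front matter before the first heading stays with the first section.
    """
    sections, current = [], []
    for style, text in paragraphs:
        if style == "Heading 1" and current:
            sections.append(current)
            current = []
        current.append((style, text))
    if current:
        sections.append(current)
    return sections
```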
<h3>Step 4: Context Preservation (THE KEY)</h3>
<div class="card">
<div class="card-title">Document AI Layout Parser <span class="badge rec">Google's solution</span></div>
<p>Integrates directly with RAG Engine. Understands headings, tables, lists, sections. Creates context-aware chunks that respect layout. Just pass <code>layout_parser</code> config during import — one line. Max 20MB/500 pages per PDF.</p>
</div>
<div class="card">
<div class="card-title">Metadata-enriched chunks (custom splitting)</div>
<p>When splitting a 200-page PDF into 2×100-page chunks: add context preamble as metadata (doc title, section headings, page range). Store original document ID for traceability. RAG Engine's <code>chunk_overlap=256</code> tokens bridges boundaries.</p>
</div>
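<p>The preamble itself is a few lines of text prepended to each sub-document; the exact wording below is illustrative:</p>

```python
def context_preamble(doc_title: str, headings: list, page_range: tuple,
                     total_pages: int) -> str:
    """Build the preamble prepended to each sub-document so chunks keep context."""
    start, end = page_range  # end-exclusive, matching page_ranges()
    lines = [
        f"Document: {doc_title}",
        f"Pages {start + 1}-{end} of {total_pages}",
    ]
    if headings:
        lines.append("Sections: " + "; ".join(headings))
    return "\n".join(lines) + "\n\n"
```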
<h3>Step 5: Parent-Child Tracking</h3>
<div class="card">
<div class="card-title">DB Model (Abhijit already committed table changes)</div>
<p><strong>files table:</strong> id, user_id, original_filename, status, gcs_path, size, type, created_at</p>
<p><strong>file_chunks table:</strong> id, parent_file_id, chunk_index, gcs_path, page_range, size, status</p>
<p>User sees parent file in UI → system sends all child chunks to Vertex AI for retrieval</p>
</div>
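<p>Abhijit's committed schema is the source of truth; as an illustration of the columns listed above, the two tables look roughly like this (shown here with SQLite for portability):</p>

```python
import sqlite3

SCHEMA = """
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    user_id TEXT NOT NULL,
    original_filename TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'processing',
    gcs_path TEXT,
    size INTEGER,
    type TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE file_chunks (
    id INTEGER PRIMARY KEY,
    parent_file_id INTEGER NOT NULL REFERENCES files(id),
    chunk_index INTEGER NOT NULL,
    gcs_path TEXT,
    page_range TEXT,
    size INTEGER,
    status TEXT NOT NULL DEFAULT 'processing'
);
"""


def open_db(path=":memory:"):
    """Open a connection with both tracking tables created."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

<p>The UI queries <code>files</code> only; retrieval fans out to every row in <code>file_chunks</code> with the matching <code>parent_file_id</code>.</p>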
</div>
<!-- PHASE 4 -->
<div class="section fade-in">
<span class="phase purple">Phase 4 — Ingest</span>
<h2>Vertex AI RAG Import</h2>
<p>Call <code>rag.import_files()</code> with Layout Parser config. Built-in deduplication handles re-uploads. Log results to BigQuery for debugging.</p>
<div class="card">
<div class="card-title">Key config</div>
<p><code>chunk_size=1024</code> tokens, <code>chunk_overlap=256</code> tokens</p>
<p>Layout Parser processor for PDFs with tables/charts</p>
<p><code>import_result_sink</code> → BigQuery table for failure debugging</p>
<p><code>max_embedding_requests_per_min=900</code> (rate limiting)</p>
</div>
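<p>A sketch of the import call under the key config above. The <code>vertexai.preview.rag</code> surface has changed between SDK releases (newer versions wrap chunking in a transformation config object), so treat the argument names here as assumptions to verify against the installed version:</p>

```python
def gcs_import_paths(bucket: str, user_id: str, parent_id: str) -> list:
    """Build the /processed/ prefix the import call reads from."""
    return [f"gs://{bucket}/processed/{user_id}/{parent_id}/"]


def import_to_corpus(corpus_name: str, paths: list, result_sink: str):
    """Kick off ingestion with this plan's key config (deferred cloud import)."""
    from vertexai.preview import rag  # needs google-cloud-aiplatform installed

    return rag.import_files(
        corpus_name,
        paths=paths,
        chunk_size=1024,                  # key config from this plan
        chunk_overlap=256,
        max_embedding_requests_per_min=900,
        import_result_sink=result_sink,   # e.g. "bq://project.dataset.import_results"
    )
```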
</div>
<div class="ribbon"></div>
<!-- INDUSTRY -->
<div class="section fade-in">
<h2>What the Industry Does</h2>
<p class="note">You don't have to figure this out from scratch — it's been done</p>
<div class="pattern">
<div class="pattern-num">1</div>
<div class="pattern-text">
<strong>Unstructured.io</strong> (most popular OSS)
<p>Parses PDFs, DOCX, PPTX, emails, HTML. Partitions by document elements. Preserves hierarchy + metadata. Used by LangChain, LlamaIndex, most RAG pipelines.</p>
</div>
</div>
<div class="pattern">
<div class="pattern-num">2</div>
<div class="pattern-text">
<strong>LangChain Splitters</strong>
<p><code>RecursiveCharacterTextSplitter</code> with chunk_size + overlap. <code>PyPDFLoader</code> splits by pages with metadata. Good for simple documents.</p>
</div>
</div>
<div class="pattern">
<div class="pattern-num">3</div>
<div class="pattern-text">
<strong>Google DIY RAG Reference</strong>
<p>Document AI Layout Parser → context-aware chunking → Vertex AI Vector Search → Check Grounding API. Full Colab notebook available.</p>
</div>
</div>
<div class="pattern">
<div class="pattern-num">4</div>
<div class="pattern-text">
<strong>Event-driven Auto-sync</strong> (production pattern)
<p>GCS upload → Eventarc → Cloud Run → ImportRagFiles. Terraform for infra, Pub/Sub for batching. Blog: "Auto-Sync RAG Pipeline" by Suhas Mallesh.</p>
</div>
</div>
</div>
<div class="divider">· · ·</div>
<!-- PLAN -->
<div class="section fade-in">
<h2>4-Week Plan</h2>
<div class="timeline">
<div class="timeline-week">
<div class="timeline-marker w1">W1</div>
<div class="timeline-content">
<h4>Upload + Extraction</h4>
<ul>
<li>Signed URL endpoint (V4, 15-min expiry, per-user folder)</li>
<li>GCS bucket: <code>/uploads/{user_id}/{ts}/</code> raw, <code>/processed/{user_id}/{parent_id}/</code> extracted</li>
<li>Eventarc trigger on <code>object.finalized</code></li>
<li>Cloud Run processor: zip extraction + file splitting</li>
</ul>
</div>
</div>
<div class="timeline-week">
<div class="timeline-marker w2">W2</div>
<div class="timeline-content">
<h4>Validation + Processing + DB</h4>
<ul>
<li>Validation layer: check types, reject unsupported, scan corruption</li>
<li>Folder/subfolder creation for UI</li>
<li>Parent-child tracking in DB <span class="badge done">Abhijit committed tables</span></li>
<li>Real-time status updates (processing → available → failed)</li>
</ul>
</div>
</div>
<div class="timeline-week">
<div class="timeline-marker w3">W3</div>
<div class="timeline-content">
<h4>Ingestion Pipeline</h4>
<ul>
<li>Enable Document AI Layout Parser</li>
<li>RAG corpus: 1024 tokens, 256 overlap</li>
<li>ImportRagFiles from /processed/ with Layout Parser</li>
<li>Parent → child chunk mapping for UI retrieval</li>
<li>Import result sink to BigQuery</li>
</ul>
</div>
</div>
<div class="timeline-week">
<div class="timeline-marker w4">W4</div>
<div class="timeline-content">
<h4>Integration + Testing</h4>
<ul>
<li>ADK agent with <code>vertex_ai_rag_retrieval</code></li>
<li>Error handling: retry, dead letter queue</li>
<li>E2E test: 1GB zip → extract → split → ingest → chat</li>
</ul>
</div>
</div>
</div>
</div>
<!-- DECISIONS -->
<div class="section fade-in">
<h2>Decisions for Monday</h2>
<div class="decision">
<div class="q">1. Eventarc vs Airflow?</div>
<p class="a">Eventarc for file-arrival triggers. Abhijit's Airflow GCS hooks weren't reliable. Eventarc can optionally trigger an Airflow DAG if complex orchestration is needed.</p>
<div class="verdict">→ Eventarc primary, Airflow optional orchestration layer</div>
</div>
<div class="decision">
<div class="q">2. Cloud Run vs dedicated pod?</div>
<p class="a">Cloud Run scales to zero, handles burst, 32GB memory. Dedicated pod only if processing >60 min per file.</p>
<div class="verdict">→ Cloud Run for most cases</div>
</div>
<div class="decision">
<div class="q">3. Layout Parser vs default?</div>
<p class="a">Layout Parser for PDFs with tables/charts (higher accuracy, Document AI pricing). Default for plain text.</p>
<div class="verdict">→ Layout Parser for PDFs, default for text</div>
</div>
<div class="decision">
<div class="q">4. DOCX splitting strategy?</div>
<p class="a">Split by Heading 1 boundaries, not arbitrary bytes. Context preamble per chunk. This preserves the context Abhijit was worried about.</p>
<div class="verdict">→ Heading-based splitting + context preamble</div>
</div>
<div class="decision">
<div class="q">5. Parent-child UI model?</div>
<p class="a">User sees parent file, system sends child chunks to Vertex. DB: files + file_chunks tables. Abhijit already committed the schema changes.</p>
<div class="verdict">→ Build on Abhijit's committed tables</div>
</div>
<div class="decision">
<div class="q">6. Trigger approach?</div>
<p class="a">Ashish suggested TCP connector / GCS hooks. Eventarc is exactly this — event-driven, Google-managed, not polling.</p>
<div class="verdict">→ Eventarc = the GCS hook that actually works</div>
</div>
</div>
<div class="footer">
<p>Research compiled by Claw 🐾 for Ashish · April 2, 2026</p>
<p>Sources: Google Cloud docs, Vertex AI RAG Engine docs, industry patterns</p>
</div>
</div>
<script>
const observer = new IntersectionObserver(entries => {
entries.forEach(e => { if (e.isIntersecting) e.target.classList.add('visible'); });
}, { threshold: 0.1 });
document.querySelectorAll('.fade-in').forEach(el => observer.observe(el));
</script>
</body>
</html>