Created
April 2, 2026 07:08
Vertex AI RAG Ingestion Pipeline — Research & Plan (Saama)
| <!DOCTYPE html> | |
| <html lang="en"> | |
| <head> | |
| <meta charset="UTF-8"> | |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> | |
| <title>Vertex AI RAG Ingestion Pipeline — Research & Plan</title> | |
| <link href="https://fonts.googleapis.com/css2?family=Crimson+Pro:ital,wght@0,300;0,400;0,500;0,600;1,300;1,400&family=Overpass+Mono:wght@300;400;500&family=Nunito+Sans:wght@300;400;600;700&display=swap" rel="stylesheet"> | |
| <style> | |
| * { margin: 0; padding: 0; box-sizing: border-box; } | |
| :root { | |
| --bg: #f7f4ef; | |
| --bg-card: #ffffff; | |
| --bg-code: #1e1e2e; | |
| --text: #2c2a26; | |
| --text-soft: #5c5850; | |
| --text-dim: #9c978d; | |
| --accent: #2563eb; | |
| --accent-dark: #1d4ed8; | |
| --accent-green: #16a34a; | |
| --accent-red: #dc2626; | |
| --accent-orange: #ea580c; | |
| --accent-purple: #7c3aed; | |
| --accent-dim: rgba(37, 99, 235, 0.08); | |
| --accent-green-dim: rgba(22, 163, 74, 0.08); | |
| --serif: 'Crimson Pro', Georgia, serif; | |
| --sans: 'Nunito Sans', -apple-system, sans-serif; | |
| --mono: 'Overpass Mono', monospace; | |
| } | |
| body { | |
| background: var(--bg); | |
| color: var(--text); | |
| font-family: var(--sans); | |
| font-size: 16px; | |
| line-height: 1.7; | |
| -webkit-font-smoothing: antialiased; | |
| } | |
| body::after { | |
| content: ''; | |
| position: fixed; | |
| inset: 0; | |
| background-image: url("data:image/svg+xml,%3Csvg viewBox='0 0 256 256' xmlns='http://www.w3.org/2000/svg'%3E%3Cfilter id='noise'%3E%3CfeTurbulence type='fractalNoise' baseFrequency='0.65' numOctaves='3' stitchTiles='stitch'/%3E%3C/filter%3E%3Crect width='100%25' height='100%25' filter='url(%23noise)' opacity='0.03'/%3E%3C/svg%3E"); | |
| pointer-events: none; | |
| z-index: 9999; | |
| } | |
| .page { max-width: 720px; margin: 0 auto; padding: 3rem 1.5rem; } | |
| /* Header */ | |
| .header { text-align: center; margin-bottom: 3rem; } | |
| .header .tag { | |
| font-family: var(--mono); | |
| font-size: 0.55rem; | |
| letter-spacing: 0.2em; | |
| text-transform: uppercase; | |
| color: var(--accent); | |
| margin-bottom: 1rem; | |
| display: block; | |
| } | |
| .header h1 { | |
| font-family: var(--serif); | |
| font-size: clamp(1.8rem, 5vw, 2.6rem); | |
| font-weight: 300; | |
| color: var(--text); | |
| line-height: 1.2; | |
| margin-bottom: 0.8rem; | |
| } | |
| .header .sub { | |
| font-size: 0.85rem; | |
| color: var(--text-dim); | |
| max-width: 500px; | |
| margin: 0 auto; | |
| } | |
| /* Ribbon */ | |
| .ribbon { width: 2px; height: 50px; background: var(--accent); margin: 2.5rem auto; opacity: 0.2; position: relative; } | |
| .ribbon::after { content: ''; position: absolute; bottom: -6px; left: -3px; width: 8px; height: 8px; background: var(--accent); opacity: 0.4; border-radius: 50%; } | |
| .divider { text-align: center; color: var(--text-dim); font-size: 1rem; letter-spacing: 0.5em; margin: 2rem 0; } | |
| /* Section */ | |
| .section { margin-bottom: 3rem; } | |
| .section h2 { | |
| font-family: var(--serif); | |
| font-size: 1.6rem; | |
| font-weight: 400; | |
| color: var(--text); | |
| margin-bottom: 0.3rem; | |
| } | |
| .section .note { | |
| font-size: 0.78rem; | |
| color: var(--text-dim); | |
| margin-bottom: 1.5rem; | |
| } | |
| .section h3 { | |
| font-family: var(--sans); | |
| font-size: 0.95rem; | |
| font-weight: 700; | |
| color: var(--text); | |
| margin: 1.5rem 0 0.5rem; | |
| } | |
| p { margin-bottom: 1rem; color: var(--text-soft); font-size: 0.9rem; } | |
| /* Architecture box */ | |
| .arch-box { | |
| background: var(--bg-code); | |
| color: #cdd6f4; | |
| font-family: var(--mono); | |
| font-size: 0.65rem; | |
| line-height: 1.8; | |
| padding: 1.5rem; | |
| border-radius: 8px; | |
| overflow-x: auto; | |
| white-space: pre; | |
| margin: 1.5rem 0; | |
| } | |
| .arch-box .hl { color: #89b4fa; } | |
| .arch-box .hlg { color: #a6e3a1; } | |
| .arch-box .hlo { color: #fab387; } | |
| /* Cards */ | |
| .card { | |
| background: var(--bg-card); | |
| border: 1px solid rgba(0,0,0,0.05); | |
| border-radius: 8px; | |
| padding: 1.2rem 1.4rem; | |
| margin-bottom: 1rem; | |
| box-shadow: 0 1px 8px rgba(0,0,0,0.03); | |
| } | |
| .card-title { | |
| font-family: var(--sans); | |
| font-weight: 700; | |
| font-size: 0.88rem; | |
| color: var(--text); | |
| margin-bottom: 0.3rem; | |
| } | |
| .card p { margin-bottom: 0.5rem; font-size: 0.82rem; } | |
| /* Phase indicator */ | |
| .phase { | |
| display: inline-flex; | |
| align-items: center; | |
| gap: 0.4rem; | |
| font-family: var(--mono); | |
| font-size: 0.55rem; | |
| letter-spacing: 0.1em; | |
| text-transform: uppercase; | |
| padding: 0.2em 0.7em; | |
| border-radius: 3px; | |
| margin-bottom: 0.8rem; | |
| } | |
| .phase.blue { background: var(--accent-dim); color: var(--accent); } | |
| .phase.green { background: var(--accent-green-dim); color: var(--accent-green); } | |
| .phase.orange { background: rgba(234,88,12,0.08); color: var(--accent-orange); } | |
| .phase.purple { background: rgba(124,58,237,0.08); color: var(--accent-purple); } | |
| /* Code blocks */ | |
| .code-block { | |
| background: var(--bg-code); | |
| color: #cdd6f4; | |
| font-family: var(--mono); | |
| font-size: 0.68rem; | |
| line-height: 1.7; | |
| padding: 1.2rem 1.4rem; | |
| border-radius: 6px; | |
| overflow-x: auto; | |
| margin: 1rem 0; | |
| } | |
| .code-label { | |
| font-family: var(--mono); | |
| font-size: 0.5rem; | |
| letter-spacing: 0.1em; | |
| text-transform: uppercase; | |
| color: var(--text-dim); | |
| margin-bottom: 0.3rem; | |
| } | |
| .code-block .kw { color: #cba6f7; } | |
| .code-block .str { color: #a6e3a1; } | |
| .code-block .cm { color: #6c7086; } | |
| .code-block .fn { color: #89b4fa; } | |
| /* Limits table */ | |
| .limits-table { | |
| width: 100%; | |
| border-collapse: collapse; | |
| margin: 1rem 0; | |
| font-size: 0.82rem; | |
| } | |
| .limits-table th { | |
| font-family: var(--mono); | |
| font-size: 0.55rem; | |
| letter-spacing: 0.1em; | |
| text-transform: uppercase; | |
| color: var(--text-dim); | |
| text-align: left; | |
| padding: 0.5rem 0.8rem; | |
| border-bottom: 2px solid rgba(0,0,0,0.08); | |
| } | |
| .limits-table td { | |
| padding: 0.5rem 0.8rem; | |
| border-bottom: 1px solid rgba(0,0,0,0.04); | |
| color: var(--text-soft); | |
| } | |
| .limits-table td:last-child { font-family: var(--mono); font-size: 0.78rem; } | |
| /* Decision cards */ | |
| .decision { | |
| background: var(--bg-card); | |
| border: 1px solid rgba(0,0,0,0.05); | |
| border-left: 3px solid var(--accent); | |
| border-radius: 0 6px 6px 0; | |
| padding: 1rem 1.2rem; | |
| margin-bottom: 0.8rem; | |
| } | |
| .decision .q { | |
| font-family: var(--sans); | |
| font-weight: 700; | |
| font-size: 0.85rem; | |
| color: var(--text); | |
| margin-bottom: 0.3rem; | |
| } | |
| .decision .a { | |
| font-size: 0.82rem; | |
| color: var(--text-soft); | |
| margin: 0; | |
| } | |
| .decision .verdict { | |
| font-family: var(--mono); | |
| font-size: 0.6rem; | |
| color: var(--accent-green); | |
| text-transform: uppercase; | |
| letter-spacing: 0.1em; | |
| margin-top: 0.3rem; | |
| } | |
| /* Timeline */ | |
| .timeline { margin: 1.5rem 0; } | |
| .timeline-week { | |
| display: flex; | |
| gap: 1rem; | |
| margin-bottom: 1.2rem; | |
| align-items: flex-start; | |
| } | |
| .timeline-marker { | |
| width: 36px; | |
| height: 36px; | |
| border-radius: 50%; | |
| display: flex; | |
| align-items: center; | |
| justify-content: center; | |
| font-family: var(--mono); | |
| font-size: 0.6rem; | |
| font-weight: 700; | |
| color: white; | |
| flex-shrink: 0; | |
| } | |
| .timeline-marker.w1 { background: var(--accent); } | |
| .timeline-marker.w2 { background: var(--accent-green); } | |
| .timeline-marker.w3 { background: var(--accent-orange); } | |
| .timeline-marker.w4 { background: var(--accent-purple); } | |
| .timeline-content { flex: 1; } | |
| .timeline-content h4 { | |
| font-size: 0.88rem; | |
| font-weight: 700; | |
| margin-bottom: 0.3rem; | |
| } | |
| .timeline-content ul { | |
| list-style: none; | |
| padding: 0; | |
| } | |
| .timeline-content li { | |
| font-size: 0.8rem; | |
| color: var(--text-soft); | |
| padding: 0.15rem 0; | |
| padding-left: 1rem; | |
| position: relative; | |
| } | |
| .timeline-content li::before { | |
| content: '→'; | |
| position: absolute; | |
| left: 0; | |
| color: var(--text-dim); | |
| } | |
| /* Industry patterns */ | |
| .pattern { | |
| display: flex; | |
| gap: 0.8rem; | |
| align-items: flex-start; | |
| margin-bottom: 1rem; | |
| } | |
| .pattern-num { | |
| width: 28px; | |
| height: 28px; | |
| border-radius: 50%; | |
| background: var(--accent-dim); | |
| color: var(--accent); | |
| display: flex; | |
| align-items: center; | |
| justify-content: center; | |
| font-family: var(--mono); | |
| font-size: 0.6rem; | |
| font-weight: 700; | |
| flex-shrink: 0; | |
| } | |
| .pattern-text { flex: 1; } | |
| .pattern-text strong { font-size: 0.85rem; } | |
| .pattern-text p { font-size: 0.8rem; margin: 0.2rem 0 0; } | |
| /* Status badge */ | |
| .badge { | |
| display: inline-block; | |
| font-family: var(--mono); | |
| font-size: 0.5rem; | |
| letter-spacing: 0.08em; | |
| text-transform: uppercase; | |
| padding: 0.15em 0.5em; | |
| border-radius: 3px; | |
| margin-right: 0.3rem; | |
| } | |
| .badge.done { background: rgba(22,163,74,0.1); color: var(--accent-green); } | |
| .badge.todo { background: rgba(37,99,235,0.1); color: var(--accent); } | |
| .badge.rec { background: rgba(234,88,12,0.1); color: var(--accent-orange); } | |
| /* Footer */ | |
| .footer { | |
| margin-top: 3rem; | |
| padding-top: 1.5rem; | |
| border-top: 1px solid rgba(0,0,0,0.06); | |
| text-align: center; | |
| } | |
| .footer p { font-size: 0.7rem; color: var(--text-dim); margin-bottom: 0.2rem; } | |
| /* Scroll */ | |
| .fade-in { opacity: 0; transform: translateY(14px); transition: opacity 0.6s ease, transform 0.6s ease; } | |
| .fade-in.visible { opacity: 1; transform: translateY(0); } | |
| </style> | |
| </head> | |
| <body> | |
| <div class="page"> | |
| <div class="header"> | |
| <span class="tag">Saama · Vertex AI · ADK · Research Notes</span> | |
| <h1>RAG Ingestion Pipeline</h1> | |
| <p class="sub">Research & plan for file upload → processing → Vertex AI RAG Engine. Based on Ashish & Abhijit's call, April 2, 2026.</p> | |
| </div> | |
| <div class="ribbon"></div> | |
| <!-- PROBLEM --> | |
| <div class="section fade-in"> | |
| <h2>The Problem</h2> | |
| <p>Users upload files (up to <strong>1 GB</strong>, including zip archives) from a React frontend. These need to land in GCS without hitting the backend pod, get extracted/validated/split, maintain parent-child relationships in the UI, track status in the DB, and be ingested into Vertex AI RAG Engine with proper context and metadata.</p> | |
| <p>Abhijit tried Airflow GCS hooks before, but they didn't work reliably. We need a more dependable trigger mechanism.</p> | |
| </div> | |
| <!-- ARCHITECTURE --> | |
| <div class="section fade-in"> | |
| <h2>Architecture</h2> | |
| <div class="arch-box"><span class="hl">React App</span> → Backend (<span class="hlg">signed URL</span>) → <span class="hlo">GCS /uploads/</span> | |
| │ | |
| <span class="hlg">Eventarc trigger</span> | |
| │ | |
| <span class="hl">Cloud Run processor</span> | |
| │ | |
| ┌─────────┴─────────┐ | |
| │ │ | |
| <span class="hlo">Zip? Extract</span> <span class="hlo">Single file</span> | |
| │ │ | |
| └─────────┬─────────┘ | |
| │ | |
| <span class="hlg">Validate + Split</span> | |
| (PDF by page, DOCX by heading) | |
| │ | |
| <span class="hlo">GCS /processed/</span> | |
| │ | |
| <span class="hl">Update DB status</span> | |
| │ | |
| <span class="hlg">ImportRagFiles API</span> | |
| (+ Layout Parser) | |
| │ | |
| <span class="hl">Vertex AI RAG Corpus</span></div> | |
| </div> | |
| <!-- VERTEX LIMITS --> | |
| <div class="section fade-in"> | |
| <h2>Vertex AI RAG Limits</h2> | |
| <p class="note">These are the hard constraints your processing pipeline must respect</p> | |
| <table class="limits-table"> | |
| <thead><tr><th>File Type</th><th>Max Size</th><th>Notes</th></tr></thead> | |
| <tbody> | |
| <tr><td>PDF</td><td>50 MB</td><td>With Layout Parser: 20 MB and 500 pages max</td></tr> | |
| <tr><td>DOCX</td><td>50 MB</td><td>Split by heading structure</td></tr> | |
| <tr><td>Text / Markdown</td><td>10 MB</td><td></td></tr> | |
| <tr><td>HTML / JSON</td><td>10 MB</td><td></td></tr> | |
| <tr><td>ZIP</td><td>❌ Not supported</td><td>Must extract first</td></tr> | |
| <tr><td>Chunking default</td><td>1024 tokens</td><td>256 token overlap</td></tr> | |
| </tbody> | |
| </table> | |
| </div> | |
| <div class="divider">· · ·</div> | |
| <!-- PHASE 1 --> | |
| <div class="section fade-in"> | |
| <span class="phase blue">Phase 1 — Upload</span> | |
| <h2>Browser → GCS via Signed URLs</h2> | |
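| <p>React requests a V4 signed URL from the backend → uploads directly to GCS. <strong>Zero backend memory usage.</strong> A single signed PUT comfortably handles files of 1 GB and beyond. For very large files or unreliable connections, use the resumable upload protocol so an interrupted transfer can resume instead of restarting.</p> | |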
| <div class="card"> | |
| <div class="card-title">How it works</div> | |
| <p>1. React calls <code>/api/upload/signed-url</code> with filename & content type</p> | |
| <p>2. Backend generates V4 signed PUT URL (15 min expiry, scoped to user folder)</p> | |
| <p>3. React does <code>fetch(signed_url, { method: 'PUT', body: file })</code> — direct to GCS</p> | |
| <p>4. On completion, React notifies backend with the object path</p> | |
| </div> | |
| <div class="card"> | |
| <div class="card-title">Don't forget</div> | |
| <p>• Set CORS on the GCS bucket to allow PUT from your frontend origin</p> | |
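| <p>• For resumable uploads (large files): signed URLs work only with the XML API, so initiate the session with a signed POST that carries the <code>x-goog-resumable: start</code> header, then upload in chunks to the returned session URI</p> | |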
| <p>• Abhijit confirmed he's seen demos of this working — it's production-proven</p> | |
| </div> | |
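| <p>A minimal sketch of the object naming the backend could apply before signing the URL, assuming the <code>uploads/{user_id}/{ts}/</code> layout from the plan. The function name and the sanitizing rules are hypothetical; the signing itself (for example with the storage client's V4 signing call) is not shown.</p> | |

```python
import re
from datetime import datetime, timezone


def build_upload_object_name(user_id: str, filename: str, now: datetime) -> str:
    """Return an object name of the form uploads/{user_id}/{ts}/{filename}."""
    # Keep only the last path component so a client cannot escape its folder.
    base = filename.replace("\\", "/").split("/")[-1]
    # Replace anything outside a conservative character set (hypothetical rule).
    safe = re.sub(r"[^A-Za-z0-9._-]", "_", base).lstrip(".") or "upload"
    # UTC timestamp keeps repeated uploads of the same filename apart.
    ts = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"uploads/{user_id}/{ts}/{safe}"
```

| <p>Scoping the signed URL to a name built on the server, not one supplied verbatim by the browser, is what keeps each user inside their own folder.</p> | |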
| </div> | |
| <!-- PHASE 2 --> | |
| <div class="section fade-in"> | |
| <span class="phase green">Phase 2 — Trigger</span> | |
| <h2>Eventarc, Not Airflow Polling</h2> | |
| <div class="decision"> | |
| <div class="q">Option A: Eventarc + Cloud Run <span class="badge rec">Recommended</span></div> | |
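| <p class="a">GCS <code>object.finalized</code> event (Eventarc type <code>google.cloud.storage.object.v1.finalized</code>) → Cloud Run service. No polling, no missed files. Google's recommended pattern. Delivery is at-least-once, so the processor should be idempotent. This is the "TCP connector / GCS hook" you were asking about, but managed by Google.</p> | |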
| <div class="verdict">→ Use this as primary trigger</div> | |
| </div> | |
| <div class="decision"> | |
| <div class="q">Option B: Airflow / Cloud Composer</div> | |
| <p class="a"><code>GCSObjectExistenceSensor</code> or trigger DAG via Cloud Functions. Better for complex orchestration. But Abhijit already tried GCS hooks in Airflow and they weren't reliable.</p> | |
| <div class="verdict">→ Use only if you need complex DAG orchestration on top</div> | |
| </div> | |
| <div class="decision"> | |
| <div class="q">Option C: Backend notification (fallback)</div> | |
| <p class="a">Frontend notifies backend after upload → backend triggers processing. Simple, but the notification is lost if the frontend crashes or the tab closes before it is sent.</p> | |
| <div class="verdict">→ Use as belt-and-suspenders alongside Eventarc</div> | |
| </div> | |
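| <p>A sketch of how the Cloud Run processor could filter incoming Eventarc requests, assuming events arrive as HTTP CloudEvents with the type in the <code>ce-type</code> header and the object metadata as a JSON body. The function name and the <code>UPLOAD_PREFIX</code> constant are hypothetical.</p> | |

```python
import json

FINALIZED = "google.cloud.storage.object.v1.finalized"
UPLOAD_PREFIX = "uploads/"  # hypothetical: only raw uploads should start processing


def parse_gcs_event(headers: dict, body: bytes):
    """Return (bucket, name, size) for a finalized upload, or None to ignore it."""
    # Header names are case-insensitive, so normalize before the lookup.
    ce_type = {k.lower(): v for k, v in headers.items()}.get("ce-type")
    if ce_type != FINALIZED:
        return None
    data = json.loads(body)
    name = data.get("name", "")
    # Ignore objects the processor writes itself, otherwise it would re-trigger.
    if not name.startswith(UPLOAD_PREFIX) or name.endswith("/"):
        return None
    # The object size is delivered as a string in the event payload.
    return data["bucket"], name, int(data.get("size", 0))
```

| <p>Because the same event can arrive more than once, the caller would still need to check the DB status for the object before starting work.</p> | |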
| </div> | |
| <!-- PHASE 3 --> | |
| <div class="section fade-in"> | |
| <span class="phase orange">Phase 3 — Processing</span> | |
| <h2>Extract, Validate, Split</h2> | |
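| <p>A Cloud Run service handles all post-upload processing. Cloud Run supports up to 32 GiB of memory per instance, which is enough for 1 GB zips. Keep in mind that files written to <code>/tmp</code> live in the instance's in-memory filesystem and count against that limit.</p> | |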
| <h3>Step 1: Zip Extraction</h3> | |
| <p>Download zip from GCS → extract to /tmp → filter supported types → upload individual files to <code>/processed/{user_id}/{parent_file_id}/</code></p> | |
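| <p>The extraction step might look like the sketch below, which works on the zip bytes in memory and yields destination object names with their contents. The size cap, the supported-extension set (taken from the limits table) and the function name are assumptions; the GCS download and upload calls are left out.</p> | |

```python
import io
import zipfile
from pathlib import PurePosixPath

SUPPORTED = {".pdf", ".docx", ".txt", ".md", ".html", ".json"}
MAX_TOTAL_BYTES = 4 * 1024**3  # hypothetical cap on total uncompressed size


def plan_extraction(zip_bytes: bytes, user_id: str, parent_file_id: str):
    """Yield (destination_object_name, file_bytes) for each supported member."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        members = [m for m in zf.infolist() if not m.is_dir()]
        # Check declared sizes up front as a basic zip-bomb guard.
        if sum(m.file_size for m in members) > MAX_TOTAL_BYTES:
            raise ValueError("archive expands beyond the allowed size")
        for m in members:
            path = PurePosixPath(m.filename)
            # Skip entries that would land outside the parent folder.
            if path.is_absolute() or ".." in path.parts:
                continue
            if path.suffix.lower() not in SUPPORTED:
                continue
            # Keep the relative path so subfolders survive for the UI.
            dest = f"processed/{user_id}/{parent_file_id}/{path.as_posix()}"
            yield dest, zf.read(m)
```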
| <h3>Step 2: Validation</h3> | |
| <p>Check file types, reject unsupported formats, scan for corruption. Update DB status to "failed" with reason if rejected.</p> | |
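| <p>As a rough illustration of the validation step, the sketch below checks extension, a leading-bytes signature and the size limits from the table above. The status strings, the <code>needs_split</code> outcome and the signature check are assumptions, not a full corruption scan.</p> | |

```python
MB = 1024 * 1024
# Size limits copied from the limits table in this document.
MAX_BYTES = {".pdf": 50 * MB, ".docx": 50 * MB, ".txt": 10 * MB,
             ".md": 10 * MB, ".html": 10 * MB, ".json": 10 * MB}
# Leading bytes used as a cheap corruption check (DOCX files are zip containers).
MAGIC = {".pdf": b"%PDF-", ".docx": b"PK\x03\x04"}


def validate(name: str, data: bytes):
    """Return (status, reason); status is 'ok', 'needs_split' or 'failed'."""
    ext = "." + name.rsplit(".", 1)[-1].lower() if "." in name else ""
    if ext not in MAX_BYTES:
        return "failed", f"unsupported file type: {ext or 'none'}"
    if not data:
        return "failed", "empty file"
    magic = MAGIC.get(ext)
    if magic and not data.startswith(magic):
        return "failed", "header does not match extension"
    if len(data) > MAX_BYTES[ext]:
        # Oversized but otherwise valid files go on to the splitting step.
        return "needs_split", f"exceeds {MAX_BYTES[ext] // MB} MB limit"
    return "ok", ""
```

| <p>The returned reason is what would be written to the DB next to the "failed" status.</p> | |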
| <h3>Step 3: Split Large Files</h3> | |
| <div class="card"> | |
| <div class="card-title">PDF splitting</div> | |
| <p>Split by pages (max 100 pages per chunk). Use <code>pypdf</code>. Preserve metadata: original title, page range, total pages.</p> | |
| </div> | |
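| <p>The page arithmetic behind the split can be sketched without any PDF library. The helper names and the metadata keys are hypothetical; writing each range to its own file (for example with <code>pypdf</code>) is not shown.</p> | |

```python
def page_ranges(total_pages: int, max_pages: int = 100):
    """Return inclusive, 1-based (start, end) page ranges covering the document."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages >= 1")
    return [(start, min(start + max_pages - 1, total_pages))
            for start in range(1, total_pages + 1, max_pages)]


def chunk_metadata(title: str, total_pages: int, max_pages: int = 100):
    """Build one metadata record per sub-document: title, page range, total pages."""
    return [
        {"title": title, "chunk_index": i,
         "page_range": f"{start}-{end}", "total_pages": total_pages}
        for i, (start, end) in enumerate(page_ranges(total_pages, max_pages))
    ]
```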
| <div class="card"> | |
| <div class="card-title">DOCX splitting — the hard one ⚡</div> | |
| <p>Abhijit's concern is valid: arbitrary byte splitting loses context. <strong>Split by heading structure</strong> (Heading 1 boundaries), not arbitrary cuts. Add a context preamble to each sub-doc: document title + section headers covered.</p> | |
| </div> | |
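| <p>A sketch of heading-based splitting, assuming the paragraphs have already been read into <code>(style_name, text)</code> pairs (for instance from a DOCX library) and that top-level headings carry the style name "Heading 1". The function name and the preamble wording are hypothetical.</p> | |

```python
def split_by_heading1(doc_title: str, paragraphs):
    """Group (style, text) pairs into sections that start at each 'Heading 1'."""
    sections = []
    current = {"heading": None, "body": []}
    for style, text in paragraphs:
        if style == "Heading 1":
            # Close the section in progress, unless it is still empty.
            if current["heading"] is not None or current["body"]:
                sections.append(current)
            current = {"heading": text, "body": []}
        else:
            current["body"].append(text)
    if current["heading"] is not None or current["body"]:
        sections.append(current)

    parts = []
    for section in sections:
        heading = section["heading"] or "(front matter)"
        # The preamble repeats the document title and section on every sub-doc.
        preamble = f"Document: {doc_title}\nSection: {heading}"
        parts.append(preamble + "\n\n" + "\n".join(section["body"]))
    return parts
```

| <p>Lower-level headings stay inside their parent section, so a sub-document never starts in the middle of a Heading 1 block.</p> | |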
| <h3>Step 4: Context Preservation (THE KEY)</h3> | |
| <div class="card"> | |
| <div class="card-title">Document AI Layout Parser <span class="badge rec">Google's solution</span></div> | |
| <p>Integrates directly with RAG Engine. Understands headings, tables, lists, sections. Creates context-aware chunks that respect layout. Just pass <code>layout_parser</code> config during import — one line. Max 20MB/500 pages per PDF.</p> | |
| </div> | |
| <div class="card"> | |
| <div class="card-title">Metadata-enriched chunks (custom splitting)</div> | |
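| <p>When splitting a 200-page PDF into 2×100-page chunks: add a context preamble as metadata (doc title, section headings, page range). Store the original document ID for traceability. RAG Engine's <code>chunk_overlap=256</code> tokens bridges chunk boundaries within each imported file; it does not span two separately imported sub-documents, so the preamble is what carries context across a split.</p> | |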
| </div> | |
| <h3>Step 5: Parent-Child Tracking</h3> | |
| <div class="card"> | |
| <div class="card-title">DB Model (Abhijit already committed table changes)</div> | |
| <p><strong>files table:</strong> id, user_id, original_filename, status, gcs_path, size, type, created_at</p> | |
| <p><strong>file_chunks table:</strong> id, parent_file_id, chunk_index, gcs_path, page_range, size, status</p> | |
| <p>User sees parent file in UI → system sends all child chunks to Vertex AI for retrieval</p> | |
| </div> | |
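| <p>To make the parent-child model concrete, here is a sketch using SQLite with the columns listed above. The column types, the status values (taken from the plan's processing/available/failed states) and the roll-up rule are assumptions, not the committed schema.</p> | |

```python
import sqlite3

SCHEMA = """
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    user_id TEXT NOT NULL,
    original_filename TEXT NOT NULL,
    status TEXT NOT NULL,
    gcs_path TEXT NOT NULL,
    size INTEGER,
    type TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE file_chunks (
    id INTEGER PRIMARY KEY,
    parent_file_id INTEGER NOT NULL REFERENCES files(id),
    chunk_index INTEGER NOT NULL,
    gcs_path TEXT NOT NULL,
    page_range TEXT,
    size INTEGER,
    status TEXT NOT NULL
);
"""


def parent_status(conn: sqlite3.Connection, file_id: int) -> str:
    """Roll child chunk statuses up into the single status shown in the UI."""
    rows = conn.execute(
        "SELECT status FROM file_chunks WHERE parent_file_id = ?", (file_id,)
    ).fetchall()
    statuses = {row[0] for row in rows}
    if "failed" in statuses:
        return "failed"          # any failed chunk fails the parent
    if statuses == {"available"}:
        return "available"       # every chunk has been ingested
    return "processing"          # no chunks yet, or some still in flight
```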
| </div> | |
| <!-- PHASE 4 --> | |
| <div class="section fade-in"> | |
| <span class="phase purple">Phase 4 — Ingest</span> | |
| <h2>Vertex AI RAG Import</h2> | |
| <p>Call <code>rag.import_files()</code> with Layout Parser config. Built-in deduplication handles re-uploads. Log results to BigQuery for debugging.</p> | |
| <div class="card"> | |
| <div class="card-title">Key config</div> | |
| <p><code>chunk_size=1024</code> tokens, <code>chunk_overlap=256</code> tokens</p> | |
| <p>Layout Parser processor for PDFs with tables/charts</p> | |
| <p><code>import_result_sink</code> → BigQuery table for failure debugging</p> | |
| <p><code>max_embedding_requests_per_min=900</code> (rate limiting)</p> | |
| </div> | |
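| <p>One way to prepare the import call is to group processed files by parser first. The sketch below assumes PDFs within the 20 MB Layout Parser limit go to Layout Parser and everything else falls back to the default parser; the function name, the fallback rule and the settings dict are hypothetical, and how these values map onto the SDK's import call should be checked against the current docs.</p> | |

```python
LAYOUT_PARSER_MAX_BYTES = 20 * 1024 * 1024  # Layout Parser limit quoted above

# Settings named in the "Key config" card, collected in one place.
IMPORT_SETTINGS = {
    "chunk_size": 1024,
    "chunk_overlap": 256,
    "max_embedding_requests_per_min": 900,
}


def plan_imports(files):
    """Split (gcs_uri, size_bytes) pairs into a Layout Parser batch and a default batch."""
    plan = {"layout_parser": [], "default": []}
    for uri, size in files:
        if not uri.startswith("gs://"):
            raise ValueError(f"not a GCS URI: {uri}")
        is_pdf = uri.lower().endswith(".pdf")
        # Assumption: oversized PDFs fall back to the default parser.
        use_layout = is_pdf and size <= LAYOUT_PARSER_MAX_BYTES
        plan["layout_parser" if use_layout else "default"].append(uri)
    return plan
```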
| </div> | |
| <div class="ribbon"></div> | |
| <!-- INDUSTRY --> | |
| <div class="section fade-in"> | |
| <h2>What the Industry Does</h2> | |
| <p class="note">You don't have to figure this out from scratch — it's been done</p> | |
| <div class="pattern"> | |
| <div class="pattern-num">1</div> | |
| <div class="pattern-text"> | |
| <strong>Unstructured.io</strong> (most popular OSS) | |
| <p>Parses PDFs, DOCX, PPTX, emails, HTML. Partitions by document elements. Preserves hierarchy + metadata. Integrates with LangChain and LlamaIndex and is common in RAG pipelines.</p> | |
| </div> | |
| </div> | |
| <div class="pattern"> | |
| <div class="pattern-num">2</div> | |
| <div class="pattern-text"> | |
| <strong>LangChain Splitters</strong> | |
| <p><code>RecursiveCharacterTextSplitter</code> with chunk_size + overlap. <code>PyPDFLoader</code> splits by pages with metadata. Good for simple documents.</p> | |
| </div> | |
| </div> | |
| <div class="pattern"> | |
| <div class="pattern-num">3</div> | |
| <div class="pattern-text"> | |
| <strong>Google DIY RAG Reference</strong> | |
| <p>Document AI Layout Parser → context-aware chunking → Vertex AI Vector Search → Check Grounding API. Full Colab notebook available.</p> | |
| </div> | |
| </div> | |
| <div class="pattern"> | |
| <div class="pattern-num">4</div> | |
| <div class="pattern-text"> | |
| <strong>Event-driven Auto-sync</strong> (production pattern) | |
| <p>GCS upload → Eventarc → Cloud Run → ImportRagFiles. Terraform for infra, Pub/Sub for batching. Blog: "Auto-Sync RAG Pipeline" by Suhas Mallesh.</p> | |
| </div> | |
| </div> | |
| </div> | |
| <div class="divider">· · ·</div> | |
| <!-- PLAN --> | |
| <div class="section fade-in"> | |
| <h2>4-Week Plan</h2> | |
| <div class="timeline"> | |
| <div class="timeline-week"> | |
| <div class="timeline-marker w1">W1</div> | |
| <div class="timeline-content"> | |
| <h4>Upload + Extraction</h4> | |
| <ul> | |
| <li>Signed URL endpoint (V4, 15-min expiry, per-user folder)</li> | |
| <li>GCS bucket: <code>/uploads/{user_id}/{ts}/</code> raw, <code>/processed/{user_id}/{parent_id}/</code> extracted</li> | |
| <li>Eventarc trigger on <code>object.finalized</code></li> | |
| <li>Cloud Run processor: zip extraction + file splitting</li> | |
| </ul> | |
| </div> | |
| </div> | |
| <div class="timeline-week"> | |
| <div class="timeline-marker w2">W2</div> | |
| <div class="timeline-content"> | |
| <h4>Validation + Processing + DB</h4> | |
| <ul> | |
| <li>Validation layer: check types, reject unsupported, scan corruption</li> | |
| <li>Folder/subfolder creation for UI</li> | |
| <li>Parent-child tracking in DB <span class="badge done">Abhijit committed tables</span></li> | |
| <li>Real-time status updates (processing → available → failed)</li> | |
| </ul> | |
| </div> | |
| </div> | |
| <div class="timeline-week"> | |
| <div class="timeline-marker w3">W3</div> | |
| <div class="timeline-content"> | |
| <h4>Ingestion Pipeline</h4> | |
| <ul> | |
| <li>Enable Document AI Layout Parser</li> | |
| <li>RAG corpus: 1024 tokens, 256 overlap</li> | |
| <li>ImportRagFiles from /processed/ with Layout Parser</li> | |
| <li>Parent → child chunk mapping for UI retrieval</li> | |
| <li>Import result sink to BigQuery</li> | |
| </ul> | |
| </div> | |
| </div> | |
| <div class="timeline-week"> | |
| <div class="timeline-marker w4">W4</div> | |
| <div class="timeline-content"> | |
| <h4>Integration + Testing</h4> | |
| <ul> | |
| <li>ADK agent with <code>vertex_ai_rag_retrieval</code></li> | |
| <li>Error handling: retry, dead letter queue</li> | |
| <li>E2E test: 1GB zip → extract → split → ingest → chat</li> | |
| </ul> | |
| </div> | |
| </div> | |
| </div> | |
| </div> | |
| <!-- DECISIONS --> | |
| <div class="section fade-in"> | |
| <h2>Decisions for Monday</h2> | |
| <div class="decision"> | |
| <div class="q">1. Eventarc vs Airflow?</div> | |
| <p class="a">Eventarc for file-arrival triggers. Abhijit's Airflow GCS hooks weren't reliable. Eventarc can optionally trigger an Airflow DAG if complex orchestration is needed.</p> | |
| <div class="verdict">→ Eventarc primary, Airflow optional orchestration layer</div> | |
| </div> | |
| <div class="decision"> | |
| <div class="q">2. Cloud Run vs dedicated pod?</div> | |
| <p class="a">Cloud Run scales to zero, handles bursts, and offers up to 32 GiB of memory. Use a dedicated pod only if processing a single file takes longer than Cloud Run's 60-minute request timeout.</p> | |
| <div class="verdict">→ Cloud Run for most cases</div> | |
| </div> | |
| <div class="decision"> | |
| <div class="q">3. Layout Parser vs default?</div> | |
| <p class="a">Layout Parser for PDFs with tables/charts (higher accuracy, Document AI pricing). Default for plain text.</p> | |
| <div class="verdict">→ Layout Parser for PDFs, default for text</div> | |
| </div> | |
| <div class="decision"> | |
| <div class="q">4. DOCX splitting strategy?</div> | |
| <p class="a">Split by Heading 1 boundaries, not arbitrary bytes. Context preamble per chunk. This preserves the context Abhijit was worried about.</p> | |
| <div class="verdict">→ Heading-based splitting + context preamble</div> | |
| </div> | |
| <div class="decision"> | |
| <div class="q">5. Parent-child UI model?</div> | |
| <p class="a">User sees parent file, system sends child chunks to Vertex. DB: files + file_chunks tables. Abhijit already committed the schema changes.</p> | |
| <div class="verdict">→ Build on Abhijit's committed tables</div> | |
| </div> | |
| <div class="decision"> | |
| <div class="q">6. Trigger approach?</div> | |
| <p class="a">Ashish suggested TCP connector / GCS hooks. Eventarc is exactly this — event-driven, Google-managed, not polling.</p> | |
| <div class="verdict">→ Eventarc = the GCS hook that actually works</div> | |
| </div> | |
| </div> | |
| <div class="footer"> | |
| <p>Research compiled by Claw 🐾 for Ashish · April 2, 2026</p> | |
| <p>Sources: Google Cloud docs, Vertex AI RAG Engine docs, industry patterns</p> | |
| </div> | |
| </div> | |
| <script> | |
| const observer = new IntersectionObserver(entries => { | |
| entries.forEach(e => { if (e.isIntersecting) e.target.classList.add('visible'); }); | |
| }, { threshold: 0.1 }); | |
| document.querySelectorAll('.fade-in').forEach(el => observer.observe(el)); | |
| </script> | |
| </body> | |
| </html> |