Migration Plan: PR #2370 → PR #2375 (DUP-40)

Executive Summary

PR #2375 (DUP-40) introduces a new Claude Agents SDK-based pipeline that runs in parallel with the existing CLI-based pipeline. Our PR #2370 contains critical fixes and performance optimizations for the old CLI pipeline that need to be selectively migrated to the new SDK pipeline.

Key Insight: Not all changes from #2370 apply to #2375. The SDK pipeline has a fundamentally different architecture that makes some changes unnecessary while others remain critical.


Architecture Comparison

| Aspect | Old Pipeline (PR #2370) | New SDK Pipeline (PR #2375) |
|---|---|---|
| Orchestration | CLI-based, file polling | SDK query(), event-driven |
| Agent Location | sandbox/e2b/.claude/agents/ | sandbox/e2b/pipeline/.claude/agents/ |
| Skills Location | /home/user/skills/ | /home/user/pipeline/skills/ |
| Completion Signal | .status/ files | Structured JSON returns |
| Artifact Persistence | Files → S3 sync | Direct database writes |
| State Management | meta.json polling | StateMachine class |
| Event Emission | Observer parsing logs | EventEmitter class |
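To make the completion-signal difference concrete, here is a minimal TypeScript sketch of the two models. It is illustrative only: the names (pollStatusFiles, AgentResult, runAgent, pipelineEvents) are hypothetical and do not correspond to actual classes in either PR.

```typescript
import { readdir } from "node:fs/promises";
import { EventEmitter } from "node:events";

// Old pipeline (sketch): the orchestrator repeatedly lists the S3-mounted
// .status/ directory until one completion marker per expected agent appears.
async function pollStatusFiles(statusDir: string, expected: number): Promise<void> {
  for (;;) {
    const done = (await readdir(statusDir)).filter((name) => name.endsWith(".done"));
    if (done.length >= expected) return;
    await new Promise((resolve) => setTimeout(resolve, 5_000)); // poll every 5s
  }
}

// New SDK pipeline (sketch): each agent run resolves to a structured result,
// and progress is pushed through an EventEmitter instead of being polled.
interface AgentResult {
  agent: string;
  success: boolean;
  artifacts: Record<string, unknown>;
}

const pipelineEvents = new EventEmitter(); // stand-in for DUP-40's EventEmitter class

async function runAgent(agent: string): Promise<AgentResult> {
  pipelineEvents.emit("agent:start", { agent });
  const result: AgentResult = { agent, success: true, artifacts: {} }; // would come from the SDK call
  pipelineEvents.emit("agent:complete", result);
  return result;
}
```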

Migration Priority Matrix

🔴 CRITICAL (Must Apply)

These are bugs/safety issues that affect both pipelines:

1. Gather Agent "DO NOT do video work" Fix

Location: pipeline/.claude/agents/gather.md
Issue: DUP-40's gather.md is missing the critical section preventing gather agents from doing video work themselves.

Apply this section after "Performance Rules":

## DO NOT Do Video Work (CRITICAL)

**You are a GATHER agent, not a video-reviewer agent.**

NEVER:
- Download YouTube transcripts
- Write to `video-insights/` directory
- Analyze video content
- Use youtube-transcript skill

Video work is handled by dedicated `video-reviewer` agents launched in parallel by the orchestrator.
If video-insights files don't exist yet, that's expected - video-reviewers may still be running.

**Just do your gather work and write to `gathered/{slug}.md`.**

Why Critical: Without this, gather agents improvise and do video work, causing duplicate effort and inconsistent results.


2. PreToolUse Hooks for Safety

Create: pipeline/.claude/hooks/validate_bash.py
Create: pipeline/.claude/hooks/validate_read.py
Create: pipeline/.claude/settings.json

These hooks block dangerous operations:

  • validate_bash.py: Blocks mkdir commands (20-70s latency on S3)
  • validate_read.py: Validates sandbox path boundaries

Copy from our branch:

# From PR #2370 branch:
cp sandbox/e2b/.claude/hooks/validate_bash.py pipeline/.claude/hooks/
cp sandbox/e2b/.claude/hooks/validate_read.py pipeline/.claude/hooks/

Create settings.json:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [".claude/hooks/validate_bash.py"]
      },
      {
        "matcher": "Read",
        "hooks": [".claude/hooks/validate_read.py"]
      }
    ]
  }
}

Why Critical: Without hooks, agents can mkdir (causing massive latency) or read outside sandbox.
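Because the hooks simply read a JSON payload on stdin and print a decision (see Appendix A.2/A.3 for the exact fields they parse), they can be smoke-tested outside the sandbox before being registered in settings.json. A rough TypeScript harness, assuming python3 is on PATH and the hook sits at the path shown below:

```typescript
import { spawnSync } from "node:child_process";

// Feed the hook the same stdin shape it parses (tool_name / tool_input,
// per Appendix A.2) and read back its {decision, reason} output.
function runHook(hookPath: string, payload: unknown): { decision?: string; reason?: string } {
  const proc = spawnSync("python3", [hookPath], {
    input: JSON.stringify(payload),
    encoding: "utf8",
  });
  return JSON.parse(proc.stdout.trim());
}

const blocked = runHook("pipeline/.claude/hooks/validate_bash.py", {
  tool_name: "Bash",
  tool_input: { command: "mkdir -p /home/user/task/tmp" },
});
console.log(blocked.decision, blocked.reason); // expect "block" plus the mkdir explanation

const allowed = runHook("pipeline/.claude/hooks/validate_bash.py", {
  tool_name: "Bash",
  tool_input: { command: "ls gathered/" },
});
console.log(allowed.decision); // expect "approve"
```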


🟠 HIGH (Should Apply)

3. Video-Reviewer Parallel Downloads

Location: pipeline/.claude/agents/video-reviewer.md

Current DUP-40: Downloads transcripts sequentially
Our Fix: Downloads both transcripts in parallel using & and wait

Apply to Process section:

## Process

1. Read `candidates.json` for video URLs - **select only the 2 most relevant videos**.

2. **Download transcripts in PARALLEL (MANDATORY)**:

   ```bash
   mkdir -p {TASK_WORKSPACE}/tmp
   
   # Download both transcripts in parallel
   timeout 30 /home/user/.venv/bin/python /home/user/pipeline/skills/youtube-transcript/scripts/download-transcript.py \
     "https://www.youtube.com/watch?v=VIDEO_ID_1" \
     "{TASK_WORKSPACE}/tmp/transcript_1.txt" &
   PID1=$!
   
   timeout 30 /home/user/.venv/bin/python /home/user/pipeline/skills/youtube-transcript/scripts/download-transcript.py \
     "https://www.youtube.com/watch?v=VIDEO_ID_2" \
     "{TASK_WORKSPACE}/tmp/transcript_2.txt" &
   PID2=$!
   
   # Wait for both and check exit codes
   wait $PID1; EXIT1=$?
   wait $PID2; EXIT2=$?
   ```

3. Process whichever transcripts succeeded (handle partial failures).

Impact: 6-12min → 3-6min per video-reviewer (2x faster)


4. TASK_AGENT Identifier for Observer

Location: pipeline/.claude/agents/video-reviewer.md and shortlist-enricher.md

Add to frontmatter or prompt:

## Agent Identification

When starting, emit your agent type for observer tracking:
- Set `TASK_AGENT=video-reviewer` in your first bash command
- This enables per-agent usage tracking and debugging

Why: Without TASK_AGENT, observer can't distinguish which agent generated which usage/events.
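On the observer side, one way to consume the identifier is to parse the [agent:...] prefix that the agents are told to echo (see Appendix A.6) and attach it to usage events. A hedged sketch; the event shape and helper below are assumptions, not the actual observability.ts API:

```typescript
// Hypothetical log line produced by the agent's first bash command (Appendix A.6):
//   [agent:video-reviewer] Starting work on acme-widget
const AGENT_PREFIX = /^\[agent:([a-z0-9-]+)\]/;

interface UsageEvent {
  taskId: string;
  agent?: string; // filled in from TASK_AGENT when the prefix is present
  costUsd: number;
}

function tagUsageEvent(logLine: string, event: UsageEvent): UsageEvent {
  const match = AGENT_PREFIX.exec(logLine);
  // Fall back to "unknown" so per-agent rollups still sum to the task total.
  return { ...event, agent: match ? match[1] : event.agent ?? "unknown" };
}

// Example
const tagged = tagUsageEvent("[agent:video-reviewer] Starting work on acme-widget", {
  taskId: "task-123",
  costUsd: 0.042,
});
console.log(tagged.agent); // "video-reviewer"
```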


🟡 MEDIUM (Recommend Applying)

5. YouTube Transcript Script Fixes

Issue: DUP-40's download-transcript.py uses yt-dlp directly but may have issues with:

  • youtube-transcript-api v0.6.0+ API changes
  • Whisper fallback path
  • Error handling

Compare and merge fixes from our 8-commit fix sequence:

| Commit | Fix | Applies to DUP-40? |
|---|---|---|
| venv Python usage | Use /home/user/.venv/bin/python directly | ✅ Different but related |
| ENOENT log suppression | Don't log expected missing files | ✅ Check observer |
| Node.js for yt-dlp | Add --js-runtimes node flag | ✅ If using yt-dlp |
| youtube-transcript-api v0.6.0+ | Fix API method calls | ❓ DUP-40 uses yt-dlp directly |
| SKIP_WHISPER_FALLBACK | A/B testing env var | ✅ Useful for DUP-40 |

Recommendation: Review DUP-40's download-transcript.py and ensure it handles:

  • Timeout properly (30s default)
  • Fallback to Whisper when captions unavailable
  • Clean error messages for observer

6. Env Var Consistency

Issue: DUP-40 has GROQ_API_KEY as optional; we made it required.

DUP-40 current:

const SANDBOX_ENV_KEYS_OPTIONAL = [
  'GROQ_API_KEY', // For fast inference (agents fall back to Claude if missing)
  ...
]

Our change:

const SANDBOX_ENV_KEYS_REQUIRED = [
  'GROQ_API_KEY',  // Required for cost-efficient inference
  'GOOGLE_CSE_API_KEY',
  'GOOGLE_CSE_CX',
  'DD_API_KEY',
  ...
]

Recommendation:

  • If SDK pipeline uses Groq for schema/video-reviewer: Make GROQ required
  • If SDK pipeline uses direct Haiku: Keep optional (accept higher cost)
  • Keep GOOGLE_CSE_* required if shortlist-enricher needs image search
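Whichever way the GROQ decision goes, the required/optional split can be enforced once at sandbox startup instead of failing mid-pipeline. A minimal sketch, assuming the array names from the snippets above; the validation helper itself is hypothetical, not existing sandbox.ts code:

```typescript
// Keys taken from the snippets above; move GROQ_API_KEY between the two
// arrays depending on the decision for the SDK pipeline.
const SANDBOX_ENV_KEYS_REQUIRED = ["GOOGLE_CSE_API_KEY", "GOOGLE_CSE_CX", "DD_API_KEY"];
const SANDBOX_ENV_KEYS_OPTIONAL = ["GROQ_API_KEY"];

function validateSandboxEnv(env: NodeJS.ProcessEnv = process.env): void {
  const missing = SANDBOX_ENV_KEYS_REQUIRED.filter((key) => !env[key]);
  if (missing.length > 0) {
    // Fail fast before the sandbox boots rather than partway through a task.
    throw new Error(`Missing required sandbox env vars: ${missing.join(", ")}`);
  }
  for (const key of SANDBOX_ENV_KEYS_OPTIONAL) {
    if (!env[key]) {
      console.warn(`Optional sandbox env var ${key} not set; agents will fall back to Claude`);
    }
  }
}
```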

🟢 LOW (May Not Apply)

7. Local Workspace + S3 Sync Architecture

Our change: Local files → bulk S3 sync at end
DUP-40 approach: Direct database writes via ArtifactWriter

Analysis: DUP-40's architecture may not need this optimization because:

  • ArtifactWriter writes directly to database
  • No S3 FUSE mount latency for artifacts
  • Events go via EventEmitter, not file writes

Recommendation:

  • Check if DUP-40 still uses S3 for any file operations
  • If yes, apply sync architecture
  • If no (pure DB writes), skip this change

8. Poll Script Glob Counting

Our change: Replace ls | wc -l with bash glob arrays
DUP-40 approach: Event-driven, no polling

Analysis: SDK pipeline uses EventEmitter and StateMachine, not file polling.

Recommendation: Skip this change for SDK pipeline.


9. Observer Usage Cost Tracking

Our change: Enhanced sandbox-observer.ts with usage cost collection
DUP-40: Has observability.ts and artifact-writer.ts

Check:

  • Does DUP-40's observability.ts track usage costs?
  • Does artifact-writer.ts persist usage_cost_usd?

If not, the API-side fix (researchArtifacts.ts) may still be needed.
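If neither component persists the figure, the write itself is small. A hypothetical sketch of what the artifact-writer.ts side might look like; only the usage_cost_usd column name comes from this plan, everything else (table name, helper shape) is assumed:

```typescript
// Per-agent usage summary as the observer/SDK exposes it (shape assumed).
interface AgentUsage {
  inputTokens: number;
  outputTokens: number;
  costUsd: number; // the value that should land in usage_cost_usd
}

// Hypothetical persistence call; the real ArtifactWriter in DUP-40 may differ.
async function persistUsage(
  db: { execute: (sql: string, params: unknown[]) => Promise<void> },
  taskId: string,
  agent: string,
  usage: AgentUsage,
): Promise<void> {
  await db.execute(
    // usage_cost_usd is the column the API-side fix (researchArtifacts.ts) expects to read.
    "insert into agent_usage (task_id, agent, input_tokens, output_tokens, usage_cost_usd) values ($1, $2, $3, $4, $5)",
    [taskId, agent, usage.inputTokens, usage.outputTokens, usage.costUsd],
  );
}
```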


File-by-File Migration Checklist

Must Create/Copy

| File | Source (PR #2370) | Target (PR #2375) | Priority |
|---|---|---|---|
| validate_bash.py | .claude/hooks/ | pipeline/.claude/hooks/ | 🔴 CRITICAL |
| validate_read.py | .claude/hooks/ | pipeline/.claude/hooks/ | 🔴 CRITICAL |
| settings.json | .claude/settings.json | pipeline/.claude/settings.json | 🔴 CRITICAL |

Must Modify

| File | Changes Needed | Priority |
|---|---|---|
| pipeline/.claude/agents/gather.md | Add "DO NOT do video work" section | 🔴 CRITICAL |
| pipeline/.claude/agents/video-reviewer.md | Add parallel downloads, TASK_AGENT | 🟠 HIGH |
| pipeline/.claude/agents/shortlist-enricher.md | Add TASK_AGENT | 🟠 HIGH |
| sandbox/e2b/sandbox.ts | Review env var requirements | 🟡 MEDIUM |

May Need Review

| File | Check For | Priority |
|---|---|---|
| control-plane/observability.ts | Usage cost tracking | 🟡 MEDIUM |
| control-plane/artifact-writer.ts | usage_cost_usd persistence | 🟡 MEDIUM |
| pipeline/.claude/skills/youtube-transcript/ | Script compatibility | 🟡 MEDIUM |

Implementation Steps

Step 1: Apply Critical Safety Fixes (Day 1)

# From dupe-com repo root, on DUP-40 branch:

# 1. Create hooks directory
mkdir -p packages/researcher/sandbox/e2b/pipeline/.claude/hooks

# 2. Copy hook files from main (already merged from PR #2370)
cp packages/researcher/sandbox/e2b/.claude/hooks/validate_bash.py \
   packages/researcher/sandbox/e2b/pipeline/.claude/hooks/

cp packages/researcher/sandbox/e2b/.claude/hooks/validate_read.py \
   packages/researcher/sandbox/e2b/pipeline/.claude/hooks/

# 3. Create settings.json (see content above)

Step 2: Update Agent Markdown Files (Day 1-2)

  1. Edit gather.md - Add "DO NOT do video work" section
  2. Edit video-reviewer.md - Add parallel downloads and TASK_AGENT
  3. Edit shortlist-enricher.md - Add TASK_AGENT

Step 3: Review SDK Components (Day 2-3)

  1. Check observability.ts for usage cost tracking
  2. Check artifact-writer.ts for cost persistence
  3. Review download-transcript.py for compatibility

Step 4: Test Integration (Day 3-4)

  1. Run a test task with SDK pipeline
  2. Verify gather agents don't do video work
  3. Verify hooks block mkdir commands
  4. Verify usage costs are tracked
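A rough shape for the step 4 verification, assuming the test task's events and artifacts can be fetched after the run; every helper and event name below (fetchTaskEvents, fetchTaskArtifacts, "hook:block", "usage") is hypothetical and would need to be mapped to whatever DUP-40 actually exposes:

```typescript
// Hypothetical accessors - replace with however the SDK pipeline exposes run data.
declare function fetchTaskEvents(
  taskId: string,
): Promise<Array<{ agent: string; type: string; costUsd?: number }>>;
declare function fetchTaskArtifacts(
  taskId: string,
): Promise<Array<{ path: string; writtenBy: string }>>;

async function verifyTestTask(taskId: string): Promise<void> {
  const events = await fetchTaskEvents(taskId);
  const artifacts = await fetchTaskArtifacts(taskId);

  // 2. Gather agents should never have written video insights.
  const gatherVideoWrites = artifacts.filter(
    (a) => a.writtenBy === "gather" && a.path.includes("video-insights/"),
  );
  if (gatherVideoWrites.length > 0) throw new Error("Gather agent did video work");

  // 3. If any agent attempted mkdir, the validate_bash hook should have blocked it.
  const mkdirBlocks = events.filter((e) => e.type === "hook:block");
  console.log(`mkdir/hook blocks observed: ${mkdirBlocks.length}`);

  // 4. Every usage event should carry a cost so per-agent tracking works.
  const untracked = events.filter((e) => e.type === "usage" && e.costUsd === undefined);
  if (untracked.length > 0) throw new Error("Usage events missing cost");
}
```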

Questions for DUP-40 Branch Owner

  1. S3 Usage: Does the SDK pipeline use S3 for any file operations, or is everything direct-to-database?

  2. Transcript Approach: The SDK download-transcript.py uses yt-dlp directly. Has it been tested with videos that have:

    • No captions (requires Whisper fallback)?
    • Rate limiting?
    • Timeout scenarios?
  3. Usage Tracking: Where does the SDK pipeline track agent usage costs? Is it in observability.ts or elsewhere?

  4. Env Var Strategy: Should GROQ_API_KEY be required (for cost efficiency) or optional (with Haiku fallback)?

  5. Hook Registration: Is there a reason hooks aren't registered in the pipeline .claude/settings.json? Was this intentional or an oversight?


Risk Assessment

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Gather agents do video work | HIGH | HIGH | Apply "DO NOT" section immediately |
| mkdir latency in SDK pipeline | MEDIUM | HIGH | Add validate_bash.py hook |
| Missing usage cost tracking | MEDIUM | MEDIUM | Verify observability.ts handles it |
| Transcript download failures | LOW | MEDIUM | Test download-transcript.py thoroughly |

Summary

Must Do (Before Merge):

  1. ✅ Add "DO NOT do video work" to gather.md
  2. ✅ Add PreToolUse hooks (validate_bash.py, validate_read.py, settings.json)
  3. ✅ Add parallel downloads to video-reviewer.md

Should Do (Soon After):

  1. Add TASK_AGENT identifiers for observability
  2. Review env var requirements
  3. Test YouTube transcript edge cases

Can Skip (SDK handles differently):

  1. Local workspace + S3 sync (SDK writes to DB)
  2. Poll script optimizations (SDK is event-driven)
  3. Observer parsing changes (SDK uses EventEmitter)

Generated by analyzing PR #2370 (video-reviewer orchestration + local workspace performance) and PR #2375 (Claude Agents SDK migration)


Appendix A: Code Snippets to Apply

A.1 - gather.md "DO NOT do video work" Section

Add this section after "Performance Rules" in pipeline/.claude/agents/gather.md:

## DO NOT Do Video Work (CRITICAL)

**You are a GATHER agent, not a video-reviewer agent.**

NEVER:
- Download YouTube transcripts
- Write to `video-insights/` directory  
- Analyze video content
- Use youtube-transcript skill
- Reference or wait for video data

Video work is handled by dedicated `video-reviewer` agents launched in parallel by the orchestrator.
If `video-insights/` files don't exist yet, that's expected - video-reviewers may still be running.

**Just do your gather work and write to `gathered/{slug}.md`.**

If you find yourself wanting to process YouTube videos: STOP. That's not your job.
Write what you have and let video-reviewer handle videos.

A.2 - validate_bash.py (Full File)

Create pipeline/.claude/hooks/validate_bash.py:

#!/usr/bin/env python3
"""
PreToolUse hook that blocks mkdir commands in Bash tool calls.

mkdir on S3 FUSE mounts takes 20-70 seconds. All directories are pre-created
by init-task.sh, so mkdir is never needed. This hook provides hard enforcement
rather than relying on agent compliance with markdown instructions.
"""

import json
import re
import sys


def main():
    # Read hook input from stdin
    hook_input = json.load(sys.stdin)
    
    tool_name = hook_input.get("tool_name", "")
    tool_input = hook_input.get("tool_input", {})
    
    # Only check Bash tool
    if tool_name != "Bash":
        # Allow other tools
        print(json.dumps({"decision": "approve"}))
        return
    
    command = tool_input.get("command", "")
    
    # Check for mkdir in command
    # Match: mkdir, mkdir -p, mkdir -m, etc.
    mkdir_pattern = r'\bmkdir\b'
    
    if re.search(mkdir_pattern, command, re.IGNORECASE):
        # Block the command
        result = {
            "decision": "block",
            "reason": (
                "mkdir is blocked for performance. "
                "All directories are pre-created by init-task.sh. "
                "If you need a directory, it should already exist. "
                "Check your path or use an existing directory."
            )
        }
        print(json.dumps(result))
        return
    
    # Allow the command
    print(json.dumps({"decision": "approve"}))


if __name__ == "__main__":
    main()

A.3 - validate_read.py (Full File)

Create pipeline/.claude/hooks/validate_read.py:

#!/usr/bin/env python3
"""
PreToolUse hook that validates Read tool file paths stay within sandbox.
"""

import json
import os
import sys


ALLOWED_PREFIXES = [
    "/home/user/",
    "/tmp/",
]


def main():
    hook_input = json.load(sys.stdin)
    
    tool_name = hook_input.get("tool_name", "")
    tool_input = hook_input.get("tool_input", {})
    
    # Only check Read tool
    if tool_name != "Read":
        print(json.dumps({"decision": "approve"}))
        return
    
    file_path = tool_input.get("file_path", "") or tool_input.get("filePath", "")
    
    if not file_path:
        print(json.dumps({"decision": "approve"}))
        return
    
    # Normalize path
    normalized = os.path.normpath(file_path)
    
    # Check if path is within allowed prefixes
    allowed = any(normalized.startswith(prefix) for prefix in ALLOWED_PREFIXES)
    
    if not allowed:
        result = {
            "decision": "block",
            "reason": (
                f"Read path '{file_path}' is outside allowed directories. "
                f"Allowed: {', '.join(ALLOWED_PREFIXES)}"
            )
        }
        print(json.dumps(result))
        return
    
    print(json.dumps({"decision": "approve"}))


if __name__ == "__main__":
    main()

A.4 - settings.json (Full File)

Create pipeline/.claude/settings.json:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [".claude/hooks/validate_bash.py"]
      },
      {
        "matcher": "Read", 
        "hooks": [".claude/hooks/validate_read.py"]
      }
    ]
  }
}

A.5 - video-reviewer.md Parallel Downloads Section

Replace the transcript download section in pipeline/.claude/agents/video-reviewer.md:

## Process

1. Read `candidates.json` for video URLs - **select only the 2 most relevant videos**.

2. **Download transcripts in PARALLEL (MANDATORY)**:

   All YouTube transcript downloads MUST use the helper script. Download both videos simultaneously:

   ```bash
   mkdir -p {TASK_WORKSPACE}/tmp
   
   # Download transcript 1 in background
   timeout 30 /home/user/.venv/bin/python \
     /home/user/pipeline/skills/youtube-transcript/scripts/download-transcript.py \
     "URL_1" "{TASK_WORKSPACE}/tmp/transcript_1.txt" 2>&1 &
   PID1=$!
   
   # Download transcript 2 in background  
   timeout 30 /home/user/.venv/bin/python \
     /home/user/pipeline/skills/youtube-transcript/scripts/download-transcript.py \
     "URL_2" "{TASK_WORKSPACE}/tmp/transcript_2.txt" 2>&1 &
   PID2=$!
   
   # Wait for both downloads
   wait $PID1
   EXIT1=$?
   wait $PID2
   EXIT2=$?
   
   echo "Download 1 exit: $EXIT1, Download 2 exit: $EXIT2"

   ```

   Handle partial failures: If one download fails, process the successful one. If both fail, return success=false with error reason.

3. Summarize each successful transcript using Groq or Haiku fallback.

4. Write combined insights to video-insights/{product-slug}.md.

5. Clean up: rm -f {TASK_WORKSPACE}/tmp/transcript_*.txt


A.6 - TASK_AGENT Identifier

Add to video-reviewer.md and shortlist-enricher.md frontmatter or first bash command:

## Agent Identification (for Observer)

Your first bash command should set TASK_AGENT for observer tracking:

```bash
export TASK_AGENT=video-reviewer  # or shortlist-enricher
echo "[agent:${TASK_AGENT}] Starting work on {product-slug}"
```

This enables per-agent usage tracking and debugging in the observer.


Appendix B: Quick Reference

Files in Our PR #2370

packages/researcher/sandbox/e2b/
├── .claude/
│   ├── agents/
│   │   ├── gather.md                    # DO NOT do video work
│   │   ├── video-reviewer.md            # Parallel downloads, TASK_AGENT
│   │   ├── schema.md                    # Simplified (no Groq)
│   │   └── ...
│   ├── hooks/
│   │   ├── validate_bash.py             # Block mkdir
│   │   └── validate_read.py             # Path validation
│   ├── settings.json                    # Hook registration
│   └── skills/
│       └── youtube-transcript/
│           ├── SKILL.md                 # Updated docs
│           ├── reference.md             # NEW troubleshooting
│           └── scripts/
│               └── download-transcript.py  # v0.6.0+ fixes
├── scripts/
│   ├── init-task.sh                     # Local workspace (may not apply)
│   ├── sync-to-s3.sh                    # NEW bulk sync (may not apply)
│   ├── sync-phase-to-s3.sh              # NEW phase sync (may not apply)
│   └── poll-*.sh                        # Glob counting (may not apply)
├── sandbox.ts                           # Env var promotion
└── cli.ts                               # Enhanced error handling


Files in DUP-40 That Need Changes

packages/researcher/sandbox/e2b/
├── pipeline/
│   └── .claude/
│       ├── agents/
│       │   ├── gather.md                # ← ADD "DO NOT do video work"
│       │   ├── video-reviewer.md        # ← ADD parallel downloads
│       │   └── shortlist-enricher.md    # ← ADD TASK_AGENT
│       ├── hooks/                       # ← CREATE directory
│       │   ├── validate_bash.py         # ← CREATE
│       │   └── validate_read.py         # ← CREATE
│       └── settings.json                # ← CREATE
└── sandbox.ts                           # ← REVIEW env vars


⚠️ IMPORTANT CORRECTION: mkdir Conflict

Issue Discovered: The migration plan includes validate_bash.py that blocks ALL mkdir commands, but DUP-40's video-reviewer.md contains mkdir -p {TASK_WORKSPACE}/tmp.

This will cause a conflict! The hook will block the mkdir that agents are instructed to run.

Resolution

Step 1: Update init-task.sh to pre-create tmp/

In packages/researcher/sandbox/e2b/scripts/init-task.sh, change line 75:

# FROM:
mkdir -p "$WORKSPACE"/{gathered,summaries,structured,video-insights,repairs}

# TO:
mkdir -p "$WORKSPACE"/{gathered,summaries,structured,video-insights,repairs,tmp}

Step 2: Remove mkdir from agent markdown files

In video-reviewer.md, remove the mkdir line and add a note:

## Process

1. Read `candidates.json` for video URLs.

2. **Download transcripts in PARALLEL**:

   Note: `tmp/` directory is pre-created by init-task.sh. Do NOT run mkdir.

   ```bash
   # tmp/ already exists - download both transcripts in parallel
   timeout 30 /home/user/.venv/bin/python \
     /home/user/pipeline/skills/youtube-transcript/scripts/download-transcript.py \
     "URL_1" "{TASK_WORKSPACE}/tmp/transcript_1.txt" 2>&1 &
   PID1=$!
   # ... rest of parallel download code
   ```

Step 3: Update all agent files to note directories pre-exist

Add to each agent's "Performance Rules" or similar section:

- **Directories pre-exist** - `mkdir` is blocked by hook. All directories (gathered/, tmp/, etc.) are created by init-task.sh.

Why This Matters

  • mkdir on S3 FUSE mounts takes 20-70 seconds per call
  • Old pipeline had 31+ mkdir calls = 10-36 minutes wasted
  • By pre-creating directories and blocking mkdir, we eliminate this latency entirely