Migration Plan: PR #2370 → PR #2375 (DUP-40)

Executive Summary

PR #2375 (DUP-40) introduces a new Claude Agents SDK-based pipeline that runs in parallel with the existing CLI-based pipeline. Our PR #2370 contains critical fixes and performance optimizations for the old CLI pipeline that need to be selectively migrated to the new SDK pipeline.

Key Insight: Not all changes from #2370 apply to #2375. The SDK pipeline has a fundamentally different architecture that makes some changes unnecessary while others remain critical.


Architecture Comparison

| Aspect | Old Pipeline (PR #2370) | New SDK Pipeline (PR #2375) |
|---|---|---|
| Orchestration | CLI-based, file polling | SDK query(), event-driven |
| Agent Location | sandbox/e2b/.claude/agents/ | sandbox/e2b/pipeline/.claude/agents/ |
| Skills Location | /home/user/skills/ | /home/user/pipeline/skills/ |
| Completion Signal | .status/ files | Structured JSON returns |
| Artifact Persistence | Files → S3 sync | Direct database writes |
| State Management | meta.json polling | StateMachine class |
| Event Emission | Observer parsing logs | EventEmitter class |
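To make the completion-signal difference concrete, here is a minimal TypeScript sketch of the two models. It is illustrative only: the names (pollStatusFiles, AgentResult, runAgent, pipelineEvents) are hypothetical and do not correspond to actual classes in either PR.

```typescript
import { readdir } from "node:fs/promises";
import { EventEmitter } from "node:events";

// Old pipeline (sketch): the orchestrator repeatedly lists the S3-mounted
// .status/ directory until one completion marker per expected agent appears.
async function pollStatusFiles(statusDir: string, expected: number): Promise<void> {
  for (;;) {
    const done = (await readdir(statusDir)).filter((name) => name.endsWith(".done"));
    if (done.length >= expected) return;
    await new Promise((resolve) => setTimeout(resolve, 5_000)); // poll every 5s
  }
}

// New SDK pipeline (sketch): each agent run resolves to a structured result,
// and progress is pushed through an EventEmitter instead of being polled.
interface AgentResult {
  agent: string;
  success: boolean;
  artifacts: Record<string, unknown>;
}

const pipelineEvents = new EventEmitter(); // stand-in for DUP-40's EventEmitter class

async function runAgent(agent: string): Promise<AgentResult> {
  pipelineEvents.emit("agent:start", { agent });
  const result: AgentResult = { agent, success: true, artifacts: {} }; // would come from the SDK call
  pipelineEvents.emit("agent:complete", result);
  return result;
}
```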

Migration Priority Matrix

🔴 CRITICAL (Must Apply)

These are bugs/safety issues that affect both pipelines:

1. Gather Agent "DO NOT do video work" Fix

Location: pipeline/.claude/agents/gather.md
Issue: DUP-40's gather.md is missing the critical section preventing gather agents from doing video work themselves.

Apply this section after "Performance Rules":

## DO NOT Do Video Work (CRITICAL)

**You are a GATHER agent, not a video-reviewer agent.**

NEVER:
- Download YouTube transcripts
- Write to `video-insights/` directory
- Analyze video content
- Use youtube-transcript skill

Video work is handled by dedicated `video-reviewer` agents launched in parallel by the orchestrator.
If video-insights files don't exist yet, that's expected - video-reviewers may still be running.

**Just do your gather work and write to `gathered/{slug}.md`.**

Why Critical: Without this, gather agents improvise and do video work, causing duplicate effort and inconsistent results.


2. PreToolUse Hooks for Safety

Create: pipeline/.claude/hooks/validate_bash.py
Create: pipeline/.claude/hooks/validate_read.py
Create: pipeline/.claude/settings.json

These hooks block dangerous operations:

  • validate_bash.py: Blocks mkdir commands (20-70s latency on S3)
  • validate_read.py: Validates sandbox path boundaries

Copy from our branch:

# From PR #2370 branch:
cp sandbox/e2b/.claude/hooks/validate_bash.py pipeline/.claude/hooks/
cp sandbox/e2b/.claude/hooks/validate_read.py pipeline/.claude/hooks/

Create settings.json:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [".claude/hooks/validate_bash.py"]
      },
      {
        "matcher": "Read",
        "hooks": [".claude/hooks/validate_read.py"]
      }
    ]
  }
}

Why Critical: Without hooks, agents can mkdir (causing massive latency) or read outside sandbox.
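Because the hooks simply read a JSON payload on stdin and print a decision (see Appendix A.2/A.3 for the exact fields they parse), they can be smoke-tested outside the sandbox before being registered in settings.json. A rough TypeScript harness, assuming python3 is on PATH and the hook sits at the path shown below:

```typescript
import { spawnSync } from "node:child_process";

// Feed the hook the same stdin shape it parses (tool_name / tool_input,
// per Appendix A.2) and read back its {decision, reason} output.
function runHook(hookPath: string, payload: unknown): { decision?: string; reason?: string } {
  const proc = spawnSync("python3", [hookPath], {
    input: JSON.stringify(payload),
    encoding: "utf8",
  });
  return JSON.parse(proc.stdout.trim());
}

const blocked = runHook("pipeline/.claude/hooks/validate_bash.py", {
  tool_name: "Bash",
  tool_input: { command: "mkdir -p /home/user/task/tmp" },
});
console.log(blocked.decision, blocked.reason); // expect "block" plus the mkdir explanation

const allowed = runHook("pipeline/.claude/hooks/validate_bash.py", {
  tool_name: "Bash",
  tool_input: { command: "ls gathered/" },
});
console.log(allowed.decision); // expect "approve"
```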


🟠 HIGH (Should Apply)

3. Video-Reviewer Parallel Downloads

Location: pipeline/.claude/agents/video-reviewer.md

Current DUP-40: Downloads transcripts sequentially
Our Fix: Downloads both transcripts in parallel using & and wait

Apply to Process section:

## Process

1. Read `candidates.json` for video URLs - **select only the 2 most relevant videos**.

2. **Download transcripts in PARALLEL (MANDATORY)**:

   ```bash
   mkdir -p {TASK_WORKSPACE}/tmp
   
   # Download both transcripts in parallel
   timeout 30 /home/user/.venv/bin/python /home/user/pipeline/skills/youtube-transcript/scripts/download-transcript.py \
     "https://www.youtube.com/watch?v=VIDEO_ID_1" \
     "{TASK_WORKSPACE}/tmp/transcript_1.txt" &
   PID1=$!
   
   timeout 30 /home/user/.venv/bin/python /home/user/pipeline/skills/youtube-transcript/scripts/download-transcript.py \
     "https://www.youtube.com/watch?v=VIDEO_ID_2" \
     "{TASK_WORKSPACE}/tmp/transcript_2.txt" &
   PID2=$!
   
   # Wait for both and check exit codes
   wait $PID1; EXIT1=$?
   wait $PID2; EXIT2=$?
   ```

3. Process whichever transcripts succeeded (handle partial failures).

Impact: 6-12min → 3-6min per video-reviewer (2x faster)


4. TASK_AGENT Identifier for Observer

Location: pipeline/.claude/agents/video-reviewer.md and shortlist-enricher.md

Add to frontmatter or prompt:

## Agent Identification

When starting, emit your agent type for observer tracking:
- Set `TASK_AGENT=video-reviewer` in your first bash command
- This enables per-agent usage tracking and debugging

Why: Without TASK_AGENT, observer can't distinguish which agent generated which usage/events.
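On the observer side, one way to consume the identifier is to parse the [agent:...] prefix that the agents are told to echo (see Appendix A.6) and attach it to usage events. A hedged sketch; the event shape and helper below are assumptions, not the actual observability.ts API:

```typescript
// Hypothetical log line produced by the agent's first bash command (Appendix A.6):
//   [agent:video-reviewer] Starting work on acme-widget
const AGENT_PREFIX = /^\[agent:([a-z0-9-]+)\]/;

interface UsageEvent {
  taskId: string;
  agent?: string; // filled in from TASK_AGENT when the prefix is present
  costUsd: number;
}

function tagUsageEvent(logLine: string, event: UsageEvent): UsageEvent {
  const match = AGENT_PREFIX.exec(logLine);
  // Fall back to "unknown" so per-agent rollups still sum to the task total.
  return { ...event, agent: match ? match[1] : event.agent ?? "unknown" };
}

// Example
const tagged = tagUsageEvent("[agent:video-reviewer] Starting work on acme-widget", {
  taskId: "task-123",
  costUsd: 0.042,
});
console.log(tagged.agent); // "video-reviewer"
```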


🟡 MEDIUM (Recommend Applying)

5. YouTube Transcript Script Fixes

Issue: DUP-40's download-transcript.py uses yt-dlp directly but may have issues with:

  • youtube-transcript-api v0.6.0+ API changes
  • Whisper fallback path
  • Error handling

Compare and merge fixes from our 8-commit fix sequence:

| Commit | Fix | Applies to DUP-40? |
|---|---|---|
| venv Python usage | Use /home/user/.venv/bin/python directly | ✅ Different but related |
| ENOENT log suppression | Don't log expected missing files | ✅ Check observer |
| Node.js for yt-dlp | Add --js-runtimes node flag | ✅ If using yt-dlp |
| youtube-transcript-api v0.6.0+ | Fix API method calls | ❓ DUP-40 uses yt-dlp directly |
| SKIP_WHISPER_FALLBACK | A/B testing env var | ✅ Useful for DUP-40 |

Recommendation: Review DUP-40's download-transcript.py and ensure it handles:

  • Timeout properly (30s default)
  • Fallback to Whisper when captions unavailable
  • Clean error messages for observer

6. Env Var Consistency

Issue: DUP-40 has GROQ_API_KEY as optional; we made it required.

DUP-40 current:

const SANDBOX_ENV_KEYS_OPTIONAL = [
  'GROQ_API_KEY', // For fast inference (agents fall back to Claude if missing)
  ...
]

Our change:

const SANDBOX_ENV_KEYS_REQUIRED = [
  'GROQ_API_KEY',  // Required for cost-efficient inference
  'GOOGLE_CSE_API_KEY',
  'GOOGLE_CSE_CX',
  'DD_API_KEY',
  ...
]

Recommendation:

  • If SDK pipeline uses Groq for schema/video-reviewer: Make GROQ required
  • If SDK pipeline uses direct Haiku: Keep optional (accept higher cost)
  • Keep GOOGLE_CSE_* required if shortlist-enricher needs image search
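Whichever way the GROQ decision goes, the required/optional split can be enforced once at sandbox startup instead of failing mid-pipeline. A minimal sketch, assuming the array names from the snippets above; the validation helper itself is hypothetical, not existing sandbox.ts code:

```typescript
// Keys taken from the snippets above; move GROQ_API_KEY between the two
// arrays depending on the decision for the SDK pipeline.
const SANDBOX_ENV_KEYS_REQUIRED = ["GOOGLE_CSE_API_KEY", "GOOGLE_CSE_CX", "DD_API_KEY"];
const SANDBOX_ENV_KEYS_OPTIONAL = ["GROQ_API_KEY"];

function validateSandboxEnv(env: NodeJS.ProcessEnv = process.env): void {
  const missing = SANDBOX_ENV_KEYS_REQUIRED.filter((key) => !env[key]);
  if (missing.length > 0) {
    // Fail fast before the sandbox boots rather than partway through a task.
    throw new Error(`Missing required sandbox env vars: ${missing.join(", ")}`);
  }
  for (const key of SANDBOX_ENV_KEYS_OPTIONAL) {
    if (!env[key]) {
      console.warn(`Optional sandbox env var ${key} not set; agents will fall back to Claude`);
    }
  }
}
```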

🟢 LOW (May Not Apply)

7. Local Workspace + S3 Sync Architecture

Our change: Local files → bulk S3 sync at end
DUP-40 approach: Direct database writes via ArtifactWriter

Analysis: DUP-40's architecture may not need this optimization because:

  • ArtifactWriter writes directly to database
  • No S3 FUSE mount latency for artifacts
  • Events go via EventEmitter, not file writes

Recommendation:

  • Check if DUP-40 still uses S3 for any file operations
  • If yes, apply sync architecture
  • If no (pure DB writes), skip this change

8. Poll Script Glob Counting

Our change: Replace ls | wc -l with bash glob arrays
DUP-40 approach: Event-driven, no polling

Analysis: SDK pipeline uses EventEmitter and StateMachine, not file polling.

Recommendation: Skip this change for SDK pipeline.


9. Observer Usage Cost Tracking

Our change: Enhanced sandbox-observer.ts with usage cost collection
DUP-40: Has observability.ts and artifact-writer.ts

Check:

  • Does DUP-40's observability.ts track usage costs?
  • Does artifact-writer.ts persist usage_cost_usd?

If not, the API-side fix (researchArtifacts.ts) may still be needed.
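If neither component persists the figure, the write itself is small. A hypothetical sketch of what the artifact-writer.ts side might look like; only the usage_cost_usd column name comes from this plan, everything else (table name, helper shape) is assumed:

```typescript
// Per-agent usage summary as the observer/SDK exposes it (shape assumed).
interface AgentUsage {
  inputTokens: number;
  outputTokens: number;
  costUsd: number; // the value that should land in usage_cost_usd
}

// Hypothetical persistence call; the real ArtifactWriter in DUP-40 may differ.
async function persistUsage(
  db: { execute: (sql: string, params: unknown[]) => Promise<void> },
  taskId: string,
  agent: string,
  usage: AgentUsage,
): Promise<void> {
  await db.execute(
    // usage_cost_usd is the column the API-side fix (researchArtifacts.ts) expects to read.
    "insert into agent_usage (task_id, agent, input_tokens, output_tokens, usage_cost_usd) values ($1, $2, $3, $4, $5)",
    [taskId, agent, usage.inputTokens, usage.outputTokens, usage.costUsd],
  );
}
```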


File-by-File Migration Checklist

Must Create/Copy

| File | Source (PR #2370) | Target (PR #2375) | Priority |
|---|---|---|---|
| validate_bash.py | .claude/hooks/ | pipeline/.claude/hooks/ | 🔴 CRITICAL |
| validate_read.py | .claude/hooks/ | pipeline/.claude/hooks/ | 🔴 CRITICAL |
| settings.json | .claude/settings.json | pipeline/.claude/settings.json | 🔴 CRITICAL |

Must Modify

| File | Changes Needed | Priority |
|---|---|---|
| pipeline/.claude/agents/gather.md | Add "DO NOT do video work" section | 🔴 CRITICAL |
| pipeline/.claude/agents/video-reviewer.md | Add parallel downloads, TASK_AGENT | 🟠 HIGH |
| pipeline/.claude/agents/shortlist-enricher.md | Add TASK_AGENT | 🟠 HIGH |
| sandbox/e2b/sandbox.ts | Review env var requirements | 🟡 MEDIUM |

May Need Review

| File | Check For | Priority |
|---|---|---|
| control-plane/observability.ts | Usage cost tracking | 🟡 MEDIUM |
| control-plane/artifact-writer.ts | usage_cost_usd persistence | 🟡 MEDIUM |
| pipeline/.claude/skills/youtube-transcript/ | Script compatibility | 🟡 MEDIUM |

Implementation Steps

Step 1: Apply Critical Safety Fixes (Day 1)

# From dupe-com repo root, on DUP-40 branch:

# 1. Create hooks directory
mkdir -p packages/researcher/sandbox/e2b/pipeline/.claude/hooks

# 2. Copy hook files from main (already merged from PR #2370)
cp packages/researcher/sandbox/e2b/.claude/hooks/validate_bash.py \
   packages/researcher/sandbox/e2b/pipeline/.claude/hooks/

cp packages/researcher/sandbox/e2b/.claude/hooks/validate_read.py \
   packages/researcher/sandbox/e2b/pipeline/.claude/hooks/

# 3. Create settings.json (see content above)

Step 2: Update Agent Markdown Files (Day 1-2)

  1. Edit gather.md - Add "DO NOT do video work" section
  2. Edit video-reviewer.md - Add parallel downloads and TASK_AGENT
  3. Edit shortlist-enricher.md - Add TASK_AGENT

Step 3: Review SDK Components (Day 2-3)

  1. Check observability.ts for usage cost tracking
  2. Check artifact-writer.ts for cost persistence
  3. Review download-transcript.py for compatibility

Step 4: Test Integration (Day 3-4)

  1. Run a test task with SDK pipeline
  2. Verify gather agents don't do video work
  3. Verify hooks block mkdir commands
  4. Verify usage costs are tracked
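A rough shape for the step 4 verification, assuming the test task's events and artifacts can be fetched after the run; every helper and event name below (fetchTaskEvents, fetchTaskArtifacts, "hook:block", "usage") is hypothetical and would need to be mapped to whatever DUP-40 actually exposes:

```typescript
// Hypothetical accessors - replace with however the SDK pipeline exposes run data.
declare function fetchTaskEvents(
  taskId: string,
): Promise<Array<{ agent: string; type: string; costUsd?: number }>>;
declare function fetchTaskArtifacts(
  taskId: string,
): Promise<Array<{ path: string; writtenBy: string }>>;

async function verifyTestTask(taskId: string): Promise<void> {
  const events = await fetchTaskEvents(taskId);
  const artifacts = await fetchTaskArtifacts(taskId);

  // 2. Gather agents should never have written video insights.
  const gatherVideoWrites = artifacts.filter(
    (a) => a.writtenBy === "gather" && a.path.includes("video-insights/"),
  );
  if (gatherVideoWrites.length > 0) throw new Error("Gather agent did video work");

  // 3. If any agent attempted mkdir, the validate_bash hook should have blocked it.
  const mkdirBlocks = events.filter((e) => e.type === "hook:block");
  console.log(`mkdir/hook blocks observed: ${mkdirBlocks.length}`);

  // 4. Every usage event should carry a cost so per-agent tracking works.
  const untracked = events.filter((e) => e.type === "usage" && e.costUsd === undefined);
  if (untracked.length > 0) throw new Error("Usage events missing cost");
}
```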

Questions for DUP-40 Branch Owner

  1. S3 Usage: Does the SDK pipeline use S3 for any file operations, or is everything direct-to-database?

  2. Transcript Approach: The SDK download-transcript.py uses yt-dlp directly. Has it been tested with videos that have:

    • No captions (requires Whisper fallback)?
    • Rate limiting?
    • Timeout scenarios?
  3. Usage Tracking: Where does the SDK pipeline track agent usage costs? Is it in observability.ts or elsewhere?

  4. Env Var Strategy: Should GROQ_API_KEY be required (for cost efficiency) or optional (with Haiku fallback)?

  5. Hook Registration: Is there a reason hooks aren't registered in the pipeline .claude/settings.json? Was this intentional or an oversight?


Risk Assessment

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Gather agents do video work | HIGH | HIGH | Apply "DO NOT" section immediately |
| mkdir latency in SDK pipeline | MEDIUM | HIGH | Add validate_bash.py hook |
| Missing usage cost tracking | MEDIUM | MEDIUM | Verify observability.ts handles it |
| Transcript download failures | LOW | MEDIUM | Test download-transcript.py thoroughly |

Summary

Must Do (Before Merge):

  1. ✅ Add "DO NOT do video work" to gather.md
  2. ✅ Add PreToolUse hooks (validate_bash.py, validate_read.py, settings.json)
  3. ✅ Add parallel downloads to video-reviewer.md

Should Do (Soon After):

  1. Add TASK_AGENT identifiers for observability
  2. Review env var requirements
  3. Test YouTube transcript edge cases

Can Skip (SDK handles differently):

  1. Local workspace + S3 sync (SDK writes to DB)
  2. Poll script optimizations (SDK is event-driven)
  3. Observer parsing changes (SDK uses EventEmitter)

Generated by analyzing PR #2370 (video-reviewer orchestration + local workspace performance) and PR #2375 (Claude Agents SDK migration)


Appendix A: Code Snippets to Apply

A.1 - gather.md "DO NOT do video work" Section

Add this section after "Performance Rules" in pipeline/.claude/agents/gather.md:

## DO NOT Do Video Work (CRITICAL)

**You are a GATHER agent, not a video-reviewer agent.**

NEVER:
- Download YouTube transcripts
- Write to `video-insights/` directory  
- Analyze video content
- Use youtube-transcript skill
- Reference or wait for video data

Video work is handled by dedicated `video-reviewer` agents launched in parallel by the orchestrator.
If `video-insights/` files don't exist yet, that's expected - video-reviewers may still be running.

**Just do your gather work and write to `gathered/{slug}.md`.**

If you find yourself wanting to process YouTube videos: STOP. That's not your job.
Write what you have and let video-reviewer handle videos.

A.2 - validate_bash.py (Full File)

Create pipeline/.claude/hooks/validate_bash.py:

#!/usr/bin/env python3
"""
PreToolUse hook that blocks mkdir commands in Bash tool calls.

mkdir on S3 FUSE mounts takes 20-70 seconds. All directories are pre-created
by init-task.sh, so mkdir is never needed. This hook provides hard enforcement
rather than relying on agent compliance with markdown instructions.
"""

import json
import re
import sys


def main():
    # Read hook input from stdin
    hook_input = json.load(sys.stdin)
    
    tool_name = hook_input.get("tool_name", "")
    tool_input = hook_input.get("tool_input", {})
    
    # Only check Bash tool
    if tool_name != "Bash":
        # Allow other tools
        print(json.dumps({"decision": "approve"}))
        return
    
    command = tool_input.get("command", "")
    
    # Check for mkdir in command
    # Match: mkdir, mkdir -p, mkdir -m, etc.
    mkdir_pattern = r'\bmkdir\b'
    
    if re.search(mkdir_pattern, command, re.IGNORECASE):
        # Block the command
        result = {
            "decision": "block",
            "reason": (
                "mkdir is blocked for performance. "
                "All directories are pre-created by init-task.sh. "
                "If you need a directory, it should already exist. "
                "Check your path or use an existing directory."
            )
        }
        print(json.dumps(result))
        return
    
    # Allow the command
    print(json.dumps({"decision": "approve"}))


if __name__ == "__main__":
    main()

A.3 - validate_read.py (Full File)

Create pipeline/.claude/hooks/validate_read.py:

#!/usr/bin/env python3
"""
PreToolUse hook that validates Read tool file paths stay within sandbox.
"""

import json
import os
import sys


ALLOWED_PREFIXES = [
    "/home/user/",
    "/tmp/",
]


def main():
    hook_input = json.load(sys.stdin)
    
    tool_name = hook_input.get("tool_name", "")
    tool_input = hook_input.get("tool_input", {})
    
    # Only check Read tool
    if tool_name != "Read":
        print(json.dumps({"decision": "approve"}))
        return
    
    file_path = tool_input.get("file_path", "") or tool_input.get("filePath", "")
    
    if not file_path:
        print(json.dumps({"decision": "approve"}))
        return
    
    # Normalize path
    normalized = os.path.normpath(file_path)
    
    # Check if path is within allowed prefixes
    allowed = any(normalized.startswith(prefix) for prefix in ALLOWED_PREFIXES)
    
    if not allowed:
        result = {
            "decision": "block",
            "reason": (
                f"Read path '{file_path}' is outside allowed directories. "
                f"Allowed: {', '.join(ALLOWED_PREFIXES)}"
            )
        }
        print(json.dumps(result))
        return
    
    print(json.dumps({"decision": "approve"}))


if __name__ == "__main__":
    main()

A.4 - settings.json (Full File)

Create pipeline/.claude/settings.json:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [".claude/hooks/validate_bash.py"]
      },
      {
        "matcher": "Read", 
        "hooks": [".claude/hooks/validate_read.py"]
      }
    ]
  }
}

A.5 - video-reviewer.md Parallel Downloads Section

Replace the transcript download section in pipeline/.claude/agents/video-reviewer.md:

## Process

1. Read `candidates.json` for video URLs - **select only the 2 most relevant videos**.

2. **Download transcripts in PARALLEL (MANDATORY)**:

   All YouTube transcript downloads MUST use the helper script. Download both videos simultaneously:

   ```bash
   mkdir -p {TASK_WORKSPACE}/tmp
   
   # Download transcript 1 in background
   timeout 30 /home/user/.venv/bin/python \
     /home/user/pipeline/skills/youtube-transcript/scripts/download-transcript.py \
     "URL_1" "{TASK_WORKSPACE}/tmp/transcript_1.txt" 2>&1 &
   PID1=$!
   
   # Download transcript 2 in background  
   timeout 30 /home/user/.venv/bin/python \
     /home/user/pipeline/skills/youtube-transcript/scripts/download-transcript.py \
     "URL_2" "{TASK_WORKSPACE}/tmp/transcript_2.txt" 2>&1 &
   PID2=$!
   
   # Wait for both downloads
   wait $PID1
   EXIT1=$?
   wait $PID2
   EXIT2=$?
   
   echo "Download 1 exit: $EXIT1, Download 2 exit: $EXIT2"

   ```

   Handle partial failures: If one download fails, process the successful one. If both fail, return success=false with error reason.

3. Summarize each successful transcript using Groq or Haiku fallback.

4. Write combined insights to video-insights/{product-slug}.md.

5. Clean up: rm -f {TASK_WORKSPACE}/tmp/transcript_*.txt


A.6 - TASK_AGENT Identifier

Add to video-reviewer.md and shortlist-enricher.md frontmatter or first bash command:

## Agent Identification (for Observer)

Your first bash command should set TASK_AGENT for observer tracking:

```bash
export TASK_AGENT=video-reviewer  # or shortlist-enricher
echo "[agent:${TASK_AGENT}] Starting work on {product-slug}"
```

This enables per-agent usage tracking and debugging in the observer.


Appendix B: Quick Reference

Files in Our PR #2370

packages/researcher/sandbox/e2b/
├── .claude/
│   ├── agents/
│   │   ├── gather.md                    # DO NOT do video work
│   │   ├── video-reviewer.md            # Parallel downloads, TASK_AGENT
│   │   ├── schema.md                    # Simplified (no Groq)
│   │   └── ...
│   ├── hooks/
│   │   ├── validate_bash.py             # Block mkdir
│   │   └── validate_read.py             # Path validation
│   ├── settings.json                    # Hook registration
│   └── skills/
│       └── youtube-transcript/
│           ├── SKILL.md                 # Updated docs
│           ├── reference.md             # NEW troubleshooting
│           └── scripts/
│               └── download-transcript.py  # v0.6.0+ fixes
├── scripts/
│   ├── init-task.sh                     # Local workspace (may not apply)
│   ├── sync-to-s3.sh                    # NEW bulk sync (may not apply)
│   ├── sync-phase-to-s3.sh              # NEW phase sync (may not apply)
│   └── poll-*.sh                        # Glob counting (may not apply)
├── sandbox.ts                           # Env var promotion
└── cli.ts                               # Enhanced error handling


Files in DUP-40 That Need Changes

packages/researcher/sandbox/e2b/
├── pipeline/
│   └── .claude/
│       ├── agents/
│       │   ├── gather.md                # ← ADD "DO NOT do video work"
│       │   ├── video-reviewer.md        # ← ADD parallel downloads
│       │   └── shortlist-enricher.md    # ← ADD TASK_AGENT
│       ├── hooks/                       # ← CREATE directory
│       │   ├── validate_bash.py         # ← CREATE
│       │   └── validate_read.py         # ← CREATE
│       └── settings.json                # ← CREATE
└── sandbox.ts                           # ← REVIEW env vars


⚠️ IMPORTANT CORRECTION: mkdir Conflict

Issue Discovered: The migration plan includes validate_bash.py that blocks ALL mkdir commands, but DUP-40's video-reviewer.md contains mkdir -p {TASK_WORKSPACE}/tmp.

This will cause a conflict! The hook will block the mkdir that agents are instructed to run.

Resolution

Step 1: Update init-task.sh to pre-create tmp/

In packages/researcher/sandbox/e2b/scripts/init-task.sh, change line 75:

# FROM:
mkdir -p "$WORKSPACE"/{gathered,summaries,structured,video-insights,repairs}

# TO:
mkdir -p "$WORKSPACE"/{gathered,summaries,structured,video-insights,repairs,tmp}

Step 2: Remove mkdir from agent markdown files

In video-reviewer.md, remove the mkdir line and add a note:

## Process

1. Read `candidates.json` for video URLs.

2. **Download transcripts in PARALLEL**:

   Note: `tmp/` directory is pre-created by init-task.sh. Do NOT run mkdir.

   ```bash
   # tmp/ already exists - download both transcripts in parallel
   timeout 30 /home/user/.venv/bin/python \
     /home/user/pipeline/skills/youtube-transcript/scripts/download-transcript.py \
     "URL_1" "{TASK_WORKSPACE}/tmp/transcript_1.txt" 2>&1 &
   PID1=$!
   # ... rest of parallel download code
   ```

Step 3: Update all agent files to note directories pre-exist

Add to each agent's "Performance Rules" or similar section:

- **Directories pre-exist** - `mkdir` is blocked by hook. All directories (gathered/, tmp/, etc.) are created by init-task.sh.

Why This Matters

  • mkdir on S3 FUSE mounts takes 20-70 seconds per call
  • Old pipeline had 31+ mkdir calls = 10-36 minutes wasted
  • By pre-creating directories and blocking mkdir, we eliminate this latency entirely