PR #2375 (DUP-40) introduces a new Claude Agents SDK-based pipeline that runs in parallel with the existing CLI-based pipeline. Our PR #2370 contains critical fixes and performance optimizations for the old CLI pipeline that need to be selectively migrated to the new SDK pipeline.
**Key Insight**: Not all changes from #2370 apply to #2375. The SDK pipeline has a fundamentally different architecture that makes some changes unnecessary while others remain critical.
| Aspect | Old Pipeline (PR #2370) | New SDK Pipeline (PR #2375) |
|---|---|---|
| Orchestration | CLI-based, file polling | SDK `query()`, event-driven |
| Agent Location | `sandbox/e2b/.claude/agents/` | `sandbox/e2b/pipeline/.claude/agents/` |
| Skills Location | `/home/user/skills/` | `/home/user/pipeline/skills/` |
| Completion Signal | `.status/` files | Structured JSON returns |
| Artifact Persistence | Files → S3 sync | Direct database writes |
| State Management | `meta.json` polling | `StateMachine` class |
| Event Emission | Observer parsing logs | `EventEmitter` class |
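To make the orchestration difference concrete, here is a minimal TypeScript sketch (not taken from either PR) of SDK-style completion handling; it assumes the `query()` async-iterator interface of `@anthropic-ai/claude-agent-sdk`, and the prompt and `cwd` values are placeholders.

```typescript
import { query } from "@anthropic-ai/claude-agent-sdk";

// SDK pipeline style: completion is an event on the query stream,
// not a .status/ file that an orchestrator has to poll for.
async function runAgent(prompt: string): Promise<void> {
  for await (const message of query({
    prompt,
    options: { cwd: "/home/user/pipeline" }, // placeholder working directory
  })) {
    if (message.type === "result") {
      // Structured completion signal - replaces the old pipeline's .status/ file polling.
      console.log("agent finished", message);
    }
  }
}
```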
These are bugs/safety issues that affect both pipelines:
#### 1. gather.md: "DO NOT Do Video Work" Section

**Location**: `pipeline/.claude/agents/gather.md`

**Issue**: DUP-40's `gather.md` is missing the critical section preventing gather agents from doing video work themselves.

**Apply this section after "Performance Rules":**
```markdown
## DO NOT Do Video Work (CRITICAL)

**You are a GATHER agent, not a video-reviewer agent.**

NEVER:
- Download YouTube transcripts
- Write to `video-insights/` directory
- Analyze video content
- Use youtube-transcript skill

Video work is handled by dedicated `video-reviewer` agents launched in parallel by the orchestrator.
If video-insights files don't exist yet, that's expected - video-reviewers may still be running.

**Just do your gather work and write to `gathered/{slug}.md`.**
```

**Why Critical**: Without this, gather agents improvise and do video work, causing duplicate effort and inconsistent results.
#### 2. PreToolUse Hooks (validate_bash.py, validate_read.py, settings.json)

**Create**: `pipeline/.claude/hooks/validate_bash.py`
**Create**: `pipeline/.claude/hooks/validate_read.py`
**Create**: `pipeline/.claude/settings.json`

These hooks block dangerous operations:

- `validate_bash.py`: Blocks `mkdir` commands (20-70s latency on S3)
- `validate_read.py`: Validates sandbox path boundaries

Copy from our branch:

```bash
# From PR #2370 branch:
cp sandbox/e2b/.claude/hooks/validate_bash.py pipeline/.claude/hooks/
cp sandbox/e2b/.claude/hooks/validate_read.py pipeline/.claude/hooks/
```

Create `settings.json`:
```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [".claude/hooks/validate_bash.py"]
      },
      {
        "matcher": "Read",
        "hooks": [".claude/hooks/validate_read.py"]
      }
    ]
  }
}
```

**Why Critical**: Without hooks, agents can `mkdir` (causing massive latency) or read outside the sandbox.
#### 3. video-reviewer.md: Parallel Transcript Downloads

**Location**: `pipeline/.claude/agents/video-reviewer.md`

**Current DUP-40**: Downloads transcripts sequentially
**Our Fix**: Downloads both transcripts in parallel using `&` and `wait`

**Apply to Process section:**
## Process
1. Read `candidates.json` for video URLs - **select only the 2 most relevant videos**.
2. **Download transcripts in PARALLEL (MANDATORY)**:
```bash
mkdir -p {TASK_WORKSPACE}/tmp
# Download both transcripts in parallel
timeout 30 /home/user/.venv/bin/python /home/user/pipeline/skills/youtube-transcript/scripts/download-transcript.py \
"https://www.youtube.com/watch?v=VIDEO_ID_1" \
"{TASK_WORKSPACE}/tmp/transcript_1.txt" &
PID1=$!
timeout 30 /home/user/.venv/bin/python /home/user/pipeline/skills/youtube-transcript/scripts/download-transcript.py \
"https://www.youtube.com/watch?v=VIDEO_ID_2" \
"{TASK_WORKSPACE}/tmp/transcript_2.txt" &
PID2=$!
# Wait for both and check exit codes
wait $PID1; EXIT1=$?
wait $PID2; EXIT2=$?
```

- Process whichever transcripts succeeded (handle partial failures).
**Impact**: 6-12min → 3-6min per video-reviewer (2x faster)
---
#### 4. TASK_AGENT Identifier for Observer
**Location**: `pipeline/.claude/agents/video-reviewer.md` and `shortlist-enricher.md`
**Add to frontmatter or prompt:**
```markdown
## Agent Identification
When starting, emit your agent type for observer tracking:
- Set `TASK_AGENT=video-reviewer` in your first bash command
- This enables per-agent usage tracking and debugging
```

**Why**: Without TASK_AGENT, the observer can't distinguish which agent generated which usage/events.
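As a rough illustration (not code from either PR) of why the identifier matters, an observer could attribute usage to an agent by scanning output lines for the `[agent:...]` marker set above; the helper names and record shape below are hypothetical.

```typescript
// Matches lines like "[agent:video-reviewer] Starting work on {product-slug}".
const AGENT_TAG = /\[agent:([a-z0-9-]+)\]/;

function agentFromLine(line: string): string | undefined {
  return AGENT_TAG.exec(line)?.[1];
}

// Hypothetical usage record: attach the emitting agent to a cost entry.
function attributeUsage(line: string, costUsd: number) {
  return { agent: agentFromLine(line) ?? "unknown", usage_cost_usd: costUsd };
}
```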
#### 5. download-transcript.py Compatibility

**Issue**: DUP-40's `download-transcript.py` uses yt-dlp directly but may have issues with:

- youtube-transcript-api v0.6.0+ API changes
- Whisper fallback path
- Error handling

Compare and merge fixes from our 8-commit fix sequence:
| Commit | Fix | Applies to DUP-40? |
|---|---|---|
| venv Python usage | Use `/home/user/.venv/bin/python` directly | ✅ Different but related |
| ENOENT log suppression | Don't log expected missing files | ✅ Check observer |
| Node.js for yt-dlp | Add `--js-runtimes node` flag | ✅ If using yt-dlp |
| youtube-transcript-api v0.6.0+ | Fix API method calls | ❓ DUP-40 uses yt-dlp directly |
| SKIP_WHISPER_FALLBACK | A/B testing env var | ✅ Useful for DUP-40 |
**Recommendation**: Review DUP-40's `download-transcript.py` and ensure it handles:

- Timeout properly (30s default)
- Fallback to Whisper when captions unavailable
- Clean error messages for observer
#### 6. GROQ_API_KEY and Env Var Requirements

**Issue**: DUP-40 has GROQ_API_KEY as optional; we made it required.

**DUP-40 current:**

```typescript
const SANDBOX_ENV_KEYS_OPTIONAL = [
  'GROQ_API_KEY', // For fast inference (agents fall back to Claude if missing)
  ...
]
```

**Our change:**

```typescript
const SANDBOX_ENV_KEYS_REQUIRED = [
  'GROQ_API_KEY', // Required for cost-efficient inference
  'GOOGLE_CSE_API_KEY',
  'GOOGLE_CSE_CX',
  'DD_API_KEY',
  ...
]
```

**Recommendation** (see the validation sketch after this list):

- If the SDK pipeline uses Groq for schema/video-reviewer: make GROQ_API_KEY required
- If the SDK pipeline uses direct Haiku: keep it optional (accept higher cost)
- Keep GOOGLE_CSE_* required if shortlist-enricher needs image search
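For illustration, a minimal sketch of enforcing that split at sandbox creation time; the array contents mirror the snippets above, but `buildSandboxEnv` and its behavior are assumptions, not DUP-40's actual `sandbox.ts` code.

```typescript
const SANDBOX_ENV_KEYS_REQUIRED = ["GROQ_API_KEY", "GOOGLE_CSE_API_KEY", "GOOGLE_CSE_CX", "DD_API_KEY"];
const SANDBOX_ENV_KEYS_OPTIONAL = ["SKIP_WHISPER_FALLBACK"]; // illustrative only

// Hypothetical helper: fail fast on missing required keys, pass through optional ones if set.
function buildSandboxEnv(source: NodeJS.ProcessEnv = process.env): Record<string, string> {
  const missing = SANDBOX_ENV_KEYS_REQUIRED.filter((key) => !source[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required sandbox env vars: ${missing.join(", ")}`);
  }
  const env: Record<string, string> = {};
  for (const key of [...SANDBOX_ENV_KEYS_REQUIRED, ...SANDBOX_ENV_KEYS_OPTIONAL]) {
    if (source[key]) env[key] = source[key]!;
  }
  return env;
}
```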
**Our change**: Local files → bulk S3 sync at end
**DUP-40 approach**: Direct database writes via ArtifactWriter

**Analysis**: DUP-40's architecture may not need this optimization because:
- ArtifactWriter writes directly to database
- No S3 FUSE mount latency for artifacts
- Events go via EventEmitter, not file writes
**Recommendation**:
- Check if DUP-40 still uses S3 for any file operations
- If yes, apply sync architecture
- If no (pure DB writes), skip this change
**Our change**: Replace `ls | wc -l` with bash glob arrays
**DUP-40 approach**: Event-driven, no polling

**Analysis**: The SDK pipeline uses EventEmitter and StateMachine, not file polling.

**Recommendation**: Skip this change for the SDK pipeline.

**Our change**: Enhanced `sandbox-observer.ts` with usage cost collection
**DUP-40**: Has `observability.ts` and `artifact-writer.ts`
**Check:**

- Does DUP-40's `observability.ts` track usage costs?
- Does `artifact-writer.ts` persist `usage_cost_usd`?

If not, the API-side fix (`researchArtifacts.ts`) may still be needed.
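For orientation only: if `observability.ts` does not already capture costs, persistence could look roughly like the sketch below; the `updateUsageCost` method and the result-message field name are assumptions, not confirmed APIs from DUP-40.

```typescript
// Assumed shapes - adjust to DUP-40's actual observability.ts / artifact-writer.ts interfaces.
interface ResultMessage {
  type: "result";
  total_cost_usd?: number; // assumed field name on the SDK's final result message
}

interface ArtifactWriterLike {
  updateUsageCost(taskId: string, costUsd: number): Promise<void>; // hypothetical method
}

async function recordUsageCost(taskId: string, message: ResultMessage, writer: ArtifactWriterLike) {
  if (message.type === "result" && typeof message.total_cost_usd === "number") {
    await writer.updateUsageCost(taskId, message.total_cost_usd); // persists usage_cost_usd
  }
}
```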
| File | Source (PR #2370) | Target (PR #2375) | Priority |
|---|---|---|---|
| `validate_bash.py` | `.claude/hooks/` | `pipeline/.claude/hooks/` | 🔴 CRITICAL |
| `validate_read.py` | `.claude/hooks/` | `pipeline/.claude/hooks/` | 🔴 CRITICAL |
| `settings.json` | `.claude/settings.json` | `pipeline/.claude/settings.json` | 🔴 CRITICAL |
| File | Changes Needed | Priority |
|---|---|---|
| `pipeline/.claude/agents/gather.md` | Add "DO NOT do video work" section | 🔴 CRITICAL |
| `pipeline/.claude/agents/video-reviewer.md` | Add parallel downloads, TASK_AGENT | 🟠 HIGH |
| `pipeline/.claude/agents/shortlist-enricher.md` | Add TASK_AGENT | 🟠 HIGH |
| `sandbox/e2b/sandbox.ts` | Review env var requirements | 🟡 MEDIUM |
| File | Check For | Priority |
|---|---|---|
| `control-plane/observability.ts` | Usage cost tracking | 🟡 MEDIUM |
| `control-plane/artifact-writer.ts` | `usage_cost_usd` persistence | 🟡 MEDIUM |
| `pipeline/.claude/skills/youtube-transcript/` | Script compatibility | 🟡 MEDIUM |
```bash
# From dupe-com repo root, on DUP-40 branch:

# 1. Create hooks directory
mkdir -p packages/researcher/sandbox/e2b/pipeline/.claude/hooks

# 2. Copy hook files from main (already merged from PR #2370)
cp packages/researcher/sandbox/e2b/.claude/hooks/validate_bash.py \
   packages/researcher/sandbox/e2b/pipeline/.claude/hooks/
cp packages/researcher/sandbox/e2b/.claude/hooks/validate_read.py \
   packages/researcher/sandbox/e2b/pipeline/.claude/hooks/

# 3. Create settings.json (see content above)
```

- Edit `gather.md` - Add "DO NOT do video work" section
- Edit `video-reviewer.md` - Add parallel downloads and TASK_AGENT
- Edit `shortlist-enricher.md` - Add TASK_AGENT
- Check `observability.ts` for usage cost tracking
- Check `artifact-writer.ts` for cost persistence
- Review `download-transcript.py` for compatibility
- Run a test task with SDK pipeline
- Verify gather agents don't do video work
- Verify hooks block mkdir commands (see the smoke-test sketch below)
- Verify usage costs are tracked
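As a quick local smoke test (a sketch, not part of either PR), the `validate_bash.py` hook can be fed a fake PreToolUse payload and should answer with a block decision; the path assumes the target layout shown in the file-copy table above.

```typescript
import { execFileSync } from "node:child_process";

// Simulate the PreToolUse payload Claude Code would send for a mkdir command.
const payload = JSON.stringify({
  tool_name: "Bash",
  tool_input: { command: "mkdir -p /home/user/task/tmp" },
});

const stdout = execFileSync(
  "python3",
  ["packages/researcher/sandbox/e2b/pipeline/.claude/hooks/validate_bash.py"],
  { input: payload, encoding: "utf8" },
);

// Expect: {"decision": "block", "reason": "mkdir is blocked for performance. ..."}
console.log(stdout);
```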
- **S3 Usage**: Does the SDK pipeline use S3 for any file operations, or is everything direct-to-database?
- **Transcript Approach**: The SDK `download-transcript.py` uses `yt-dlp` directly. Has it been tested with videos that have:
  - No captions (requires Whisper fallback)?
  - Rate limiting?
  - Timeout scenarios?
- **Usage Tracking**: Where does the SDK pipeline track agent usage costs? Is it in `observability.ts` or elsewhere?
- **Env Var Strategy**: Should GROQ_API_KEY be required (for cost efficiency) or optional (with Haiku fallback)?
- **Hook Registration**: Is there a reason hooks aren't registered in the pipeline `.claude/settings.json`? Was this intentional or an oversight?
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Gather agents do video work | HIGH | HIGH | Apply "DO NOT" section immediately |
| mkdir latency in SDK pipeline | MEDIUM | HIGH | Add validate_bash.py hook |
| Missing usage cost tracking | MEDIUM | MEDIUM | Verify observability.ts handles it |
| Transcript download failures | LOW | MEDIUM | Test download-transcript.py thoroughly |
**Must Do (Before Merge):**
- ✅ Add "DO NOT do video work" to gather.md
- ✅ Add PreToolUse hooks (validate_bash.py, validate_read.py, settings.json)
- ✅ Add parallel downloads to video-reviewer.md
**Should Do (Soon After):**
- Add TASK_AGENT identifiers for observability
- Review env var requirements
- Test youtube transcript edge cases
**Can Skip (SDK handles differently):**
- Local workspace + S3 sync (SDK writes to DB)
- Poll script optimizations (SDK is event-driven)
- Observer parsing changes (SDK uses EventEmitter)
Generated by analyzing PR #2370 (video-reviewer orchestration + local workspace performance) and PR #2375 (Claude Agents SDK migration)
## Appendix A: Content to Apply

### A.1 - gather.md "DO NOT Do Video Work" Section

Add this section after "Performance Rules" in `pipeline/.claude/agents/gather.md`:

```markdown
## DO NOT Do Video Work (CRITICAL)
**You are a GATHER agent, not a video-reviewer agent.**
NEVER:
- Download YouTube transcripts
- Write to `video-insights/` directory
- Analyze video content
- Use youtube-transcript skill
- Reference or wait for video data
Video work is handled by dedicated `video-reviewer` agents launched in parallel by the orchestrator.
If `video-insights/` files don't exist yet, that's expected - video-reviewers may still be running.
**Just do your gather work and write to `gathered/{slug}.md`.**
If you find yourself wanting to process YouTube videos: STOP. That's not your job.
Write what you have and let video-reviewer handle videos.
```

### A.2 - validate_bash.py

Create `pipeline/.claude/hooks/validate_bash.py`:

```python
#!/usr/bin/env python3
"""
PreToolUse hook that blocks mkdir commands in Bash tool calls.

mkdir on S3 FUSE mounts takes 20-70 seconds. All directories are pre-created
by init-task.sh, so mkdir is never needed. This hook provides hard enforcement
rather than relying on agent compliance with markdown instructions.
"""
import json
import re
import sys


def main():
    # Read hook input from stdin
    hook_input = json.load(sys.stdin)
    tool_name = hook_input.get("tool_name", "")
    tool_input = hook_input.get("tool_input", {})

    # Only check Bash tool
    if tool_name != "Bash":
        # Allow other tools
        print(json.dumps({"decision": "approve"}))
        return

    command = tool_input.get("command", "")

    # Check for mkdir in command
    # Match: mkdir, mkdir -p, mkdir -m, etc.
    mkdir_pattern = r'\bmkdir\b'
    if re.search(mkdir_pattern, command, re.IGNORECASE):
        # Block the command
        result = {
            "decision": "block",
            "reason": (
                "mkdir is blocked for performance. "
                "All directories are pre-created by init-task.sh. "
                "If you need a directory, it should already exist. "
                "Check your path or use an existing directory."
            )
        }
        print(json.dumps(result))
        return

    # Allow the command
    print(json.dumps({"decision": "approve"}))


if __name__ == "__main__":
    main()
```

### A.3 - validate_read.py

Create `pipeline/.claude/hooks/validate_read.py`:

```python
#!/usr/bin/env python3
"""
PreToolUse hook that validates Read tool file paths stay within sandbox.
"""
import json
import os
import sys

ALLOWED_PREFIXES = [
    "/home/user/",
    "/tmp/",
]


def main():
    hook_input = json.load(sys.stdin)
    tool_name = hook_input.get("tool_name", "")
    tool_input = hook_input.get("tool_input", {})

    # Only check Read tool
    if tool_name != "Read":
        print(json.dumps({"decision": "approve"}))
        return

    file_path = tool_input.get("file_path", "") or tool_input.get("filePath", "")
    if not file_path:
        print(json.dumps({"decision": "approve"}))
        return

    # Normalize path
    normalized = os.path.normpath(file_path)

    # Check if path is within allowed prefixes
    allowed = any(normalized.startswith(prefix) for prefix in ALLOWED_PREFIXES)

    if not allowed:
        result = {
            "decision": "block",
            "reason": (
                f"Read path '{file_path}' is outside allowed directories. "
                f"Allowed: {', '.join(ALLOWED_PREFIXES)}"
            )
        }
        print(json.dumps(result))
        return

    print(json.dumps({"decision": "approve"}))


if __name__ == "__main__":
    main()
```

### A.4 - settings.json

Create `pipeline/.claude/settings.json`:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [".claude/hooks/validate_bash.py"]
      },
      {
        "matcher": "Read",
        "hooks": [".claude/hooks/validate_read.py"]
      }
    ]
  }
}
```

### A.5 - video-reviewer.md Parallel Downloads

Replace the transcript download section in `pipeline/.claude/agents/video-reviewer.md`:
## Process
1. Read `candidates.json` for video URLs - **select only the 2 most relevant videos**.
2. **Download transcripts in PARALLEL (MANDATORY)**:
All YouTube transcript downloads MUST use the helper script. Download both videos simultaneously:
```bash
mkdir -p {TASK_WORKSPACE}/tmp
# Download transcript 1 in background
timeout 30 /home/user/.venv/bin/python \
/home/user/pipeline/skills/youtube-transcript/scripts/download-transcript.py \
"URL_1" "{TASK_WORKSPACE}/tmp/transcript_1.txt" 2>&1 &
PID1=$!
# Download transcript 2 in background
timeout 30 /home/user/.venv/bin/python \
/home/user/pipeline/skills/youtube-transcript/scripts/download-transcript.py \
"URL_2" "{TASK_WORKSPACE}/tmp/transcript_2.txt" 2>&1 &
PID2=$!
# Wait for both downloads
wait $PID1
EXIT1=$?
wait $PID2
EXIT2=$?
echo "Download 1 exit: $EXIT1, Download 2 exit: $EXIT2"Handle partial failures: If one download fails, process the successful one.
If both fail, return success=false with error reason.
3. Summarize each successful transcript using Groq or Haiku fallback.
4. Write combined insights to `video-insights/{product-slug}.md`.
5. Clean up: `rm -f {TASK_WORKSPACE}/tmp/transcript_*.txt`
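Since the SDK pipeline signals completion with structured JSON returns rather than `.status/` files, the both-downloads-failed case in step 2 might be reported with a return like the sketch below; the exact field names are an assumption, not a schema from DUP-40.

```typescript
// Hypothetical shape of a video-reviewer structured return when both downloads fail.
const videoReviewerResult = {
  success: false,                              // from "return success=false" above
  error: "both transcript downloads failed",   // illustrative error reason
  videosProcessed: 0,                          // assumed field
};
```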
---
### A.6 - TASK_AGENT Identifier
Add to `video-reviewer.md` and `shortlist-enricher.md` frontmatter or first bash command:

````markdown
## Agent Identification (for Observer)

Your first bash command should set TASK_AGENT for observer tracking:

```bash
export TASK_AGENT=video-reviewer  # or shortlist-enricher
echo "[agent:${TASK_AGENT}] Starting work on {product-slug}"
```

This enables per-agent usage tracking and debugging in the observer.
````
---
## Appendix B: Quick Reference
### Files in Our PR #2370
```
packages/researcher/sandbox/e2b/
├── .claude/
│   ├── agents/
│   │   ├── gather.md                  # DO NOT do video work
│   │   ├── video-reviewer.md          # Parallel downloads, TASK_AGENT
│   │   ├── schema.md                  # Simplified (no Groq)
│   │   └── ...
│   ├── hooks/
│   │   ├── validate_bash.py           # Block mkdir
│   │   └── validate_read.py           # Path validation
│   ├── settings.json                  # Hook registration
│   └── skills/
│       └── youtube-transcript/
│           ├── SKILL.md               # Updated docs
│           ├── reference.md           # NEW troubleshooting
│           └── scripts/
│               └── download-transcript.py  # v0.6.0+ fixes
├── scripts/
│   ├── init-task.sh                   # Local workspace (may not apply)
│   ├── sync-to-s3.sh                  # NEW bulk sync (may not apply)
│   ├── sync-phase-to-s3.sh            # NEW phase sync (may not apply)
│   └── poll-*.sh                      # Glob counting (may not apply)
├── sandbox.ts                         # Env var promotion
└── cli.ts                             # Enhanced error handling
```
### Files in DUP-40 That Need Changes
```
packages/researcher/sandbox/e2b/
├── pipeline/
│   └── .claude/
│       ├── agents/
│       │   ├── gather.md              # ← ADD "DO NOT do video work"
│       │   ├── video-reviewer.md      # ← ADD parallel downloads
│       │   └── shortlist-enricher.md  # ← ADD TASK_AGENT
│       ├── hooks/                     # ← CREATE directory
│       │   ├── validate_bash.py       # ← CREATE
│       │   └── validate_read.py       # ← CREATE
│       └── settings.json              # ← CREATE
└── sandbox.ts                         # ← REVIEW env vars
```
---
## ⚠️ IMPORTANT CORRECTION: mkdir Conflict
**Issue Discovered**: The migration plan includes `validate_bash.py` that blocks ALL mkdir commands, but DUP-40's `video-reviewer.md` contains `mkdir -p {TASK_WORKSPACE}/tmp`.
**This will cause a conflict!** The hook will block the mkdir that agents are instructed to run.
### Resolution
**Step 1: Update `init-task.sh` to pre-create `tmp/`**
In `packages/researcher/sandbox/e2b/scripts/init-task.sh`, change line 75:
```bash
# FROM:
mkdir -p "$WORKSPACE"/{gathered,summaries,structured,video-insights,repairs}
# TO:
mkdir -p "$WORKSPACE"/{gathered,summaries,structured,video-insights,repairs,tmp}
```

**Step 2: Remove mkdir from agent markdown files**

In `video-reviewer.md`, remove the mkdir line and add a note:
## Process
1. Read `candidates.json` for video URLs.
2. **Download transcripts in PARALLEL**:
Note: `tmp/` directory is pre-created by init-task.sh. Do NOT run mkdir.
```bash
# tmp/ already exists - download both transcripts in parallel
timeout 30 /home/user/.venv/bin/python \
/home/user/pipeline/skills/youtube-transcript/scripts/download-transcript.py \
"URL_1" "{TASK_WORKSPACE}/tmp/transcript_1.txt" 2>&1 &
PID1=$!
# ... rest of parallel download code
```

**Step 3: Update all agent files to note directories pre-exist**
Add to each agent's "Performance Rules" or similar section:
```markdown
- **Directories pre-exist** - `mkdir` is blocked by hook. All directories (gathered/, tmp/, etc.) are created by init-task.sh.
```

- `mkdir` on S3 FUSE mounts takes 20-70 seconds per call
- Old pipeline had 31+ mkdir calls = 10-36 minutes wasted
- By pre-creating directories and blocking mkdir, we eliminate this latency entirely