<specification_planning>
- Core system architecture & key workflows
a. CLI entry ➔ parse args (depth,breadth,tool_calls, models, etc.).
b. Instantiate Orchestrator (state machine).
c. Step 0 – cheap plan: use fast model to decide whether direct answer possible; otherwise, prepare sub-queries.
d. Step 1 – breadth retrieval fan-out: fire breadth async search jobs, collect results, persist in cache (SQLite + FTS).
e. Step 2 – rank & prune: fast cross-encoder LLM → top K (≤50) docs retained.
f. Step 3 – depth loop: for i in range(depth) while calls_left & confidence<τ:
• smart model reads synthesis_so_far, gaps → targeted queries.
• run search / specialty tasks.
• update doc store; decrement budgets.
g. Step 4 – synthesis & critique with smart model (optionally k candidates).
h. Step 5 – final output (markdown + bibliography, JSON side-car).
Challenges / clarifications
• Budget accounting across parallel async calls.
• LLM context window: need summarisation & chunking.
• Source deduplication, URL normalisation.
• Citation alignment in generated text (token ↔ doc mapping).
• API key management (openai, anthropic, search provider).
• Resilience to network failures / HTTP 429 – back-off & retry.
• Local vs cloud execution; offline mode.
- Project structure & organisation
• src/cli.py (entry)
• src/core/orchestrator.py
• src/core/budget.py
• src/llm/fast.py, src/llm/smart.py (adapters)
• src/retrieval/search.py (pluggable providers)
• src/retrieval/ranker.py
• src/storage/doc_store.py (SQLite FTS)
• src/synthesis/synthesiser.py
• src/utils/…
Tests/, docs/, examples/
- Feature specifications
• Budget knobs with hard ceilings and soft warnings.
• Multiple search back-ends (SerpAPI, Bing, mcp).
• Iterative depth loops.
• Multiple candidate drafts & adjudication.
• Config file override + CLI override precedence.
• Caching layer to avoid duplicate tool calls.
• Streaming progress bar (rich).
Edge cases: 0 depth, no results, exceeded budget mid-iteration.
- DB schema
SQLite: tables for documents (id, url, title, snippet, content, vector, added_at), searches, budgets, runs. Indexes on url, FTS on content.
- Server actions & integrations
Not a web service, but external APIs: search, OpenAI/Anthropic chat completions. Abstract adapter pattern.
- Design system
CLI colours (rich); docs website (mkdocs) if a future UI is added.
- Component architecture
Pure Python modules; the orchestrator coordinates them.
- Auth & authorisation
API keys via env vars or ~/.deep_research/config.toml; file permissions.
- Data flow & state
Orchestrator context object carries query, budgets, doc_store ref; messages flow through it.
- Payment / billing
Stripe out-of-scope for the CLI; placeholder to support a future SaaS daemon.
- Analytics
Optional: PostHog self-hosted URL; events: run_started, search_called, draft_generated.
- Testing strategy
Unit: budget accounting, search adaptor mocking, ranker scoring.
e2e: fixtures with deterministic llm stub.
Potential risks
• Rapid API changes; mitigate with interface layer.
• LLM hallucination; self-critique + citations.
• Cost overruns; strict tool_call ledger.
Open questions
• Preferred search provider? default to SerpAPI.
• PDF extraction needed now or future plug-in? Provide interface stub.
• Concurrency model: asyncio vs multiprocessing? Choose asyncio.
</specification_planning>
• Purpose: A Python CLI that orchestrates configurable, cost-aware, multi-stage deep research on any user query.
• Value proposition: Combines fast/cheap LLMs for orchestration with smart/expensive LLMs for reasoning, enforcing strict budgets while yielding high-quality, citation-rich reports.
• High-level workflow: Plan → Breadth retrieval → Rank & prune → Depth loops → Synthesis & critique → Final output.
• System architecture:
- CLI (Click) → Orchestrator (state machine)
- LLM Adapters (fast/smart)
- Search Adapters (SerpAPI, Bing, mcp)
- Ranker (fast LLM)
- Document Store (SQLite w/ FTS5 + optional FAISS vector index)
- Synthesiser (smart LLM)
- Budget Ledger
deepresearch/
├─ src/
│ ├─ cli.py
│ ├─ core/
│ │ ├─ orchestrator.py
│ │ ├─ budget.py
│ │ └─ config.py
│ ├─ llm/
│ │ ├─ base.py
│ │ ├─ fast.py
│ │ └─ smart.py
│ ├─ retrieval/
│ │ ├─ search.py
│ │ ├─ ranker.py
│ │ └─ pdf_extractor.py
│ ├─ storage/
│ │ ├─ doc_store.py
│ │ └─ schema.sql
│ ├─ synthesis/
│ │ └─ synthesiser.py
│ └─ utils/
│ ├─ logging.py
│ └─ text.py
├─ tests/
├─ examples/
└─ pyproject.toml
• User story: As a researcher, I run deepresearch "prompt" --depth 3 --breadth 5 ... to obtain a report.
• Implementation steps:
- Parse CLI flags with Click.
- Merge with ~/.deep_research/config.toml defaults.
- Validate budgets; error if any negative or conflicting.
- Instantiate Orchestrator with a RunContext.
Edge cases: missing API key → instruct user; invalid numeric flag → exit code 2.
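A minimal sketch of the entry point, assuming Click; the flag names mirror the knobs above, the model defaults are placeholders, and the Orchestrator/RunContext import paths are assumptions matching the project tree:

import asyncio
import sys

import click

from deepresearch.core.config import RunContext          # assumed module path
from deepresearch.core.orchestrator import Orchestrator  # assumed module path


@click.command()
@click.argument("query")
@click.option("--depth", default=1, type=int, help="Number of depth iterations.")
@click.option("--breadth", default=5, type=int, help="Query variants per breadth fan-out.")
@click.option("--tool-calls", default=20, type=int, help="Hard ceiling on external tool calls.")
@click.option("--model-fast", default="gpt-4o-mini", show_default=True)
@click.option("--model-smart", default="gpt-4o", show_default=True)
def main(query, depth, breadth, tool_calls, model_fast, model_smart):
    """Run a budgeted deep-research session for QUERY."""
    if depth < 0 or breadth < 1 or tool_calls < 1:
        click.echo("Invalid numeric flag", err=True)
        sys.exit(2)  # exit code 2 for invalid numeric flags, per the edge case above
    ctx = RunContext(query=query, depth=depth, breadth=breadth, tool_calls=tool_calls,
                     model_fast=model_fast, model_smart=model_smart)
    asyncio.run(Orchestrator(ctx).run())


if __name__ == "__main__":
    main()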
• Tracks counts for tool_calls, consumed tokens (thinking_budget), and wall-time.
• Each external call passes through ledger.reserve(cost); raises BudgetExceeded when a ceiling would be crossed.
Edge cases: concurrent async calls double-booking → use asyncio.Lock.
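A minimal BudgetLedger sketch under those rules; the category names and BudgetExceeded come from this spec, everything else is illustrative (note the Orchestrator pseudocode later also shows a non-raising reserve() used in a truthiness check; reconciling the two styles is an implementation decision):

import asyncio


class BudgetExceeded(RuntimeError):
    pass


class BudgetLedger:
    """Tracks remaining budget per category; an asyncio.Lock prevents
    concurrent reservations from double-booking the same headroom."""

    def __init__(self, tool_calls: int, thinking_budget: int):
        self._remaining = {"tool_calls": tool_calls, "tokens": thinking_budget}
        self._lock = asyncio.Lock()

    def remaining(self, category: str) -> int:
        return self._remaining[category]

    async def reserve(self, category: str, cost: int = 1) -> None:
        async with self._lock:
            if self._remaining[category] < cost:
                raise BudgetExceeded(f"{category} budget exhausted")
            self._remaining[category] -= cost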
• Fast model prompt template decides:
“Given QUERY, can you answer from prior knowledge without the web? Respond YES/NO and a short answer if YES.”
• If YES, skip to Step 4 with answer as synthesis.
Error handling: ambiguous answer → treat as NO.
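A sketch of the plan prompt and its defensive parse, assuming the fast adapter returns plain text; the wording and helper name are illustrative, and anything ambiguous is treated as NO per the rule above:

PLAN_PROMPT = (
    "Given the query below, can you answer it accurately from prior knowledge "
    "without web research? Reply on the first line with YES or NO; if YES, give "
    "a short answer on the following lines.\n\nQUERY: {query}"
)


def parse_plan_response(text: str) -> tuple[bool, str]:
    """Return (answerable, answer); anything ambiguous counts as NO."""
    lines = text.strip().splitlines()
    if lines and lines[0].strip().upper().startswith("YES"):
        return True, "\n".join(lines[1:]).strip()
    return False, ""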
• Fast model generates breadth query variants (temperature=0.7).
• Launch async search tasks: search_api.search(q, top_k); each returns {url,title,snippet}.
• Persist into doc_store.
Challenges: provider rate limits → back-off exponential (max 3 retries).
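A sketch of that retry policy, assuming an httpx-based provider; the endpoint URL and the organic_results field are provider-specific assumptions to confirm against the chosen backend, while the q/num/api_key params follow the SerpAPI-style call described later in this spec:

import asyncio
import os

import httpx

API_KEY = os.environ.get("SERPAPI_KEY", "")


async def search_with_backoff(query: str, top_k: int = 5, max_retries: int = 3) -> list[dict]:
    """Run one provider search, retrying on HTTP 429 with exponential back-off."""
    async with httpx.AsyncClient(timeout=20) as client:
        for attempt in range(max_retries + 1):
            resp = await client.get(
                "https://serpapi.com/search.json",            # assumed provider endpoint
                params={"q": query, "num": top_k, "api_key": API_KEY},
            )
            if resp.status_code == 429 and attempt < max_retries:
                await asyncio.sleep(2 ** attempt)             # 1s, 2s, 4s
                continue
            resp.raise_for_status()
            return [
                {"url": r.get("link"), "title": r.get("title"), "snippet": r.get("snippet")}
                for r in resp.json().get("organic_results", [])   # field name is an assumption
            ]
    return []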
• For each snippet, call fast cross-encoder LLM: “Score 0-10 relevance to original query.”
• Keep top 50 by score.
• Optionally compute embedding & vector similarity for tie-breakers.
Edge: if <5 docs retrieved → skip prune.
Pseudocode outline:
for i in range(depth):
    gaps = smart_llm.identify_gaps(context_summary, citations)
    if not gaps: break
    queries = smart_llm.formulate_queries(gaps, max_q=3)
    for q in queries:
        if ledger.remaining('tool_calls') < 1: break
        results = search_api.search(q, top_k=5)
        doc_store.upsert(results)
        ledger.decrement('tool_calls')
    update_context()
    if confidence > τ: break
Edge cases: no new docs added → break loop early.
• Option --drafts N (default 1).
• For each draft: smart model prompted with curated sources → produce markdown with [n] citations.
• Self-critique: smart_llm.critique(draft) → list issues → smart_llm.revise(draft, issues).
• If N>1: fast judge scores factuality/coherence and selects best.
Error: missing citation marker → regex pass to fix.
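A sketch of that regex pass, assuming citations appear as bracketed integers [n]; loose markers are normalised and ids that match no stored source are reported so the reviser can be re-prompted (helper name is illustrative):

import re


def normalise_citations(draft: str, valid_ids: set[int]) -> tuple[str, list[int]]:
    """Normalise markers like '[ 3 ]' to '[3]' and list ids with no matching source."""
    draft = re.sub(r"\[\s*(\d+)\s*\]", r"[\1]", draft)
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", draft)}
    dangling = sorted(cited - valid_ids)
    return draft, dangling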
• Save report.md and report.json ({body, citations:[{id,url}]}) under ./runs/{timestamp}/
• Print summary stats: tokens used, tool calls, elapsed time, and estimated cost.
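A sketch of the output step, assuming the report.json shape above ({body, citations:[{id,url}]}) and the ./runs/{timestamp}/ layout; the timestamp format is an assumption:

import json
from datetime import datetime
from pathlib import Path


def save_report(body: str, citations: list[dict]) -> Path:
    """Write report.md and its JSON side-car under ./runs/{timestamp}/."""
    run_dir = Path("runs") / datetime.now().strftime("%Y%m%dT%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "report.md").write_text(body, encoding="utf-8")
    (run_dir / "report.json").write_text(
        json.dumps({"body": body, "citations": citations}, indent=2),
        encoding="utf-8",
    )
    return run_dir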
documents
- id INTEGER PK
- url TEXT UNIQUE NOT NULL
- title TEXT
- snippet TEXT
- content TEXT
- vector BLOB NULL
- added_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
searches
- id INTEGER PK
- query TEXT
- provider TEXT
- run_id TEXT
- executed_at TIMESTAMP
runs
- id TEXT PK (uuid)
- root_query TEXT
- config JSON
- started_at TIMESTAMP
- ended_at TIMESTAMP
- total_tokens INTEGER
- total_tool_calls INTEGER
Indexes:
• UNIQUE(url) on documents.
• FTS5 virtual table documents_fts(content, title, snippet) linked to documents.
• doc_store.upsert(docs) – bulk insert; on conflict update snippet/content.
• doc_store.search_fts(query, k) – SQL: SELECT * FROM documents JOIN documents_fts ... ORDER BY rank LIMIT ?.
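A sketch of schema.sql applied through sqlite3, assuming FTS5 is available in the bundled SQLite build; column names follow the tables above, and the triggers needed to keep an external-content FTS table in sync are left as an implementation detail:

import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id        INTEGER PRIMARY KEY,
    url       TEXT UNIQUE NOT NULL,
    title     TEXT,
    snippet   TEXT,
    content   TEXT,
    vector    BLOB,
    added_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE VIRTUAL TABLE IF NOT EXISTS documents_fts
    USING fts5(content, title, snippet, content='documents', content_rowid='id');
"""


def open_doc_store(path: str = "deepresearch.db") -> sqlite3.Connection:
    """Open (or create) the document store and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn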
• Search API: GET /search (SerpAPI) with params q, num, api_key.
• LLM API: POST /v1/chat/completions (OpenAI) or /v1/complete (Anthropic).
• PDF extraction: pdftotext CLI wrapper or Unstructured lib.
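A sketch of the adapter layer that isolates these external services, assuming abstract base classes; concrete subclasses (OpenAI, Anthropic, SerpAPI, Bing, mcp) would implement the same small surface so providers stay swappable, and the method signatures are assumptions:

from abc import ABC, abstractmethod


class LLMAdapter(ABC):
    """Common surface for the fast and smart chat models."""

    @abstractmethod
    async def chat(self, prompt: str, temperature: float = 0.0) -> str:
        ...


class SearchAdapter(ABC):
    """Common surface for SerpAPI, Bing, mcp and any future backend."""

    @abstractmethod
    async def search(self, query: str, top_k: int = 5) -> list[dict]:
        """Return a list of {url, title, snippet} dicts."""
        ...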
• Colors: primary cyan (#00BCD4), secondary grey (#B0BEC5), error red (#FF5252).
• Typography: monospace font inherits terminal; headings bold.
• Layout: Rich Panels, aligned tables, progress bars.
• ProgressBar(id, total)
• Panel(title, body, style)
• Table(headers, rows, highlight)
Interactive states: spinner (searching), checkmark (completed), warning (budget almost exceeded).
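A sketch of the Rich wiring for those components, assuming a spinner-plus-bar layout; the colour mirrors the palette above and the stage names are illustrative:

from rich.console import Console
from rich.panel import Panel
from rich.progress import BarColumn, Progress, SpinnerColumn, TextColumn

console = Console()
console.print(Panel("deep research run", title="deepresearch", style="cyan"))

with Progress(
    SpinnerColumn(),
    TextColumn("[progress.description]{task.description}"),
    BarColumn(),
    console=console,
) as progress:
    task = progress.add_task("breadth retrieval", total=5)
    for _ in range(5):
        progress.update(task, advance=1)   # advance as each search job completes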
(Not applicable; CLI only)
• State stored in RunContext dataclass.
• Event callbacks (on_search_complete, on_draft_ready).
• Concurrency: asyncio tasks with asyncio.gather.
• API keys read in order: CLI flag → ENV (OPENAI_API_KEY, SERPAPI_KEY) → config file.
• Permission check: config file chmod 600.
• If missing → prompt once (hidden input) and offer to save.
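A sketch of that resolution order, assuming Python 3.11's tomllib; the [keys] table name inside config.toml is an assumption:

import os
import tomllib
from pathlib import Path

CONFIG_PATH = Path.home() / ".deep_research" / "config.toml"


def resolve_api_key(cli_value: str | None, env_var: str, config_key: str) -> str | None:
    """CLI flag wins, then the environment variable, then the config file."""
    if cli_value:
        return cli_value
    if os.environ.get(env_var):
        return os.environ[env_var]
    if CONFIG_PATH.exists():
        with CONFIG_PATH.open("rb") as fh:
            return tomllib.load(fh).get("keys", {}).get(config_key)
    return None

Usage would look like resolve_api_key(flag_value, "OPENAI_API_KEY", "openai"), falling back in the stated order.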
CLI → Orchestrator
↘ search_api → internet
↘ fast_llm → OpenAI/Anthropic
↘ smart_llm → OpenAI/Anthropic
↘ doc_store → SQLite
RunContext mutable object passed down, gathers tokens and docs.
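A sketch of that RunContext, assuming a dataclass; the field list merges the knobs named throughout this spec and is not exhaustive, and the model defaults are placeholders:

from dataclasses import dataclass, field


@dataclass
class RunContext:
    """Mutable per-run state handed from the CLI down to every stage."""
    query: str
    depth: int = 1
    breadth: int = 5
    tool_calls: int = 20
    thinking_budget: int = 50_000
    model_fast: str = "gpt-4o-mini"      # placeholder default
    model_smart: str = "gpt-4o"          # placeholder default
    drafts_n: int = 1
    drafts: list[str] = field(default_factory=list)
    final: str = ""
    doc_store: object | None = None      # set once the SQLite store is opened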
(Currently N/A, but planned)
• Future daemon could expose /pay endpoints; Stripe checkout session; webhook payment_succeeded marks runs.paid = true.
• Optional --analytics flag.
• Events: run_started, search_performed, llm_call, run_completed.
• Identify by anonymised hash of machine id.
• test_budget.py – ensure exception when limit crossed.
• test_ranker.py – deterministic stub returns ordered scores.
• test_orchestrator_skip_web.py – fast model stub answers YES.
• Scenario: “Who is Ada Lovelace?” depth=0 – expect no web calls.
• Scenario: Solid state batteries query depth=1 – expect output file with ≥5 citations.
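A sketch of test_budget.py, assuming pytest and the BudgetLedger interface sketched earlier; the import path and constructor signature are assumptions:

import asyncio

import pytest

from deepresearch.core.budget import BudgetExceeded, BudgetLedger  # assumed module path


def test_reserve_raises_when_limit_crossed():
    async def scenario():
        ledger = BudgetLedger(tool_calls=2, thinking_budget=100)
        await ledger.reserve("tool_calls")
        await ledger.reserve("tool_calls")
        with pytest.raises(BudgetExceeded):
            await ledger.reserve("tool_calls")  # third call crosses the ceiling

    asyncio.run(scenario())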
class Orchestrator:
    def __init__(self, ctx: RunContext):
        self.ctx = ctx
        self.ledger = BudgetLedger(ctx.tool_calls, ctx.thinking_budget)
        self.fast = FastLLM(ctx.model_fast)
        self.smart = SmartLLM(ctx.model_smart)
        self.search = SearchProvider(ctx.search_backend, self.ledger)

    async def run(self):
        if await self._cheap_plan():
            return await self._output_and_exit()
        await self._breadth_retrieval()
        await self._rank_and_prune()
        await self._depth_loop()
        await self._synthesise()
        await self._finalise()

    async def _cheap_plan(self):
        resp = await self.fast.chat(prompt_plan(self.ctx.query))
        self.ledger.count(resp)
        if resp.answerable:
            self.ctx.drafts = [resp.answer]
            return True
        return False

    async def _breadth_retrieval(self):
        variants = await self.fast.chat(prompt_variants(self.ctx.query, self.ctx.breadth))
        tasks = [self.search.search(v) for v in variants]
        for docs in await asyncio.gather(*tasks):
            self.ctx.doc_store.upsert(docs)

    async def _rank_and_prune(self):
        scored = []
        for doc in self.ctx.doc_store.all():
            score = await self.fast.chat(prompt_score(doc, self.ctx.query))
            scored.append((score, doc))
        top = sorted(scored, key=lambda t: t[0], reverse=True)[:50]
        self.ctx.doc_store.keep([d for _, d in top])

    async def _depth_loop(self):
        for i in range(self.ctx.depth):
            gaps = await self.smart.chat(prompt_identify_gaps(self.ctx))
            if not gaps: break
            queries = await self.smart.chat(prompt_queries(gaps))
            for q in queries:
                if not self.ledger.reserve(1): return
                docs = await self.search.search(q)
                self.ctx.doc_store.upsert(docs)
            if self._confidence_high(): break

    async def _synthesise(self):
        drafts = []
        for _ in range(self.ctx.drafts_n):
            draft = await self.smart.chat(prompt_synth(self.ctx))
            critique = await self.smart.chat(prompt_critique(draft))
            draft = await self.smart.chat(prompt_revise(draft, critique))
            drafts.append(draft)
        if len(drafts) == 1:
            self.ctx.final = drafts[0]
        else:
            scores = [await self.fast.chat(prompt_judge(d, self.ctx.query)) for d in drafts]
            self.ctx.final = drafts[scores.index(max(scores))]

    async def _finalise(self):
        save_report(self.ctx.final, self.ctx.doc_store.citations())
        print_summary(self.ledger, self.ctx)

The specification meets all outlined requirements and is ready for hand-off to the code-generation phase.