Thanks! I’ll design a full, modular pipeline that builds a Deep Research system like OpenAI’s but customized to your analyzer stack. It will:
- Use your analyzer output to deeply understand the user's query and tone
- Decompose complex or multi-part prompts into subquestions using QDMR-style logic
- Auto-generate research plans with prioritized search queries, tailored to the topic
- Use Firecrawl/Perplexity APIs to retrieve high-signal results, emphasizing academic sources
- Loop over search + synthesis, revising the plan if the evidence doesn’t satisfy the question
- Generate sectioned Markdown reports with structured citations, tables, graphs where needed
- Embed inline source links with strategies to future-proof them (e.g., reverse RAG or citation mirrors)
I’ll also include architectural notes on where you can inject agents, chain-of-thought reasoning, and validation hooks for factual or citation consistency.
I’ll report back with a full breakdown shortly.
This system is a modular pipeline that transforms any user query – whether a casual question, a multi-part research task, or an academic query – into a comprehensive, sectioned research report. It operates as an autonomous research agent: analyzing the query’s intent and structure, breaking it into manageable sub-tasks, gathering information via search and tools, and synthesizing findings into a well-organized Markdown report with citations. In spirit, it resembles OpenAI’s Deep Research agent, which “autonomously finds, analyzes, and synthesizes information from hundreds of online sources” to produce structured, well-cited reports. Likewise, it draws inspiration from systems like Perplexity AI that perform real-time web searches and return answers “with sources and citations included”. The workflow is fully automated: at each step, the agent can self-refine its plan (e.g. add new sub-questions or revisit earlier steps) to ensure all aspects of the query are thoroughly addressed.
Pipeline Summary: The system comprises several modules operating in sequence:
- Query Analysis: Understand the user’s query (intent, domain, tone, and structure) using advanced NLP analyzers.
- Question Decomposition: Break complex or multi-part queries into focused subquestions (using QDMR-style logic and other semantic parses).
- Research Planning: Devise a plan for answering each subquestion – deciding which search engines, APIs, or tools to use and what information to gather.
- Iterative Search & Retrieval: Execute searches for each subquestion, retrieve and read relevant sources, and dynamically adjust the plan (add/refine subquestions) based on findings.
- Information Synthesis: Aggregate answers from all subquestions and synthesize a comprehensive answer, organizing it into a structured Markdown format (with appropriate sections, lists, tables/figures).
- Citation & Verification: Inline-cite sources for each factual claim and employ verification strategies (like “reverse RAG”) to ensure every statement is backed by reliable evidence (with backup links or stored copies for resilience).
Each of these stages is described in detail below, along with how the query analyzers’ outputs guide the process. The overall design leverages established research in question understanding, multi-hop question answering, and agent architectures to maximize thoroughness and accuracy. For example, OpenAI’s Deep Research model showed that breaking down complex questions and retrieving authoritative information leads to superior performance on hard tasks, and our system follows this principle by design.
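To make the module sequence concrete, here is a minimal orchestration sketch in Python. All function and field names are illustrative assumptions (stubbed out so the skeleton runs end-to-end), not a prescribed API; each placeholder corresponds to a stage detailed in the sections that follow.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    """Working memory shared by all pipeline stages (illustrative)."""
    query: str
    analysis: dict = field(default_factory=dict)      # analyzer outputs: intent, frames, tone, ...
    subquestions: list = field(default_factory=list)  # QDMR-style decomposition
    plan: list = field(default_factory=list)          # ordered steps: {"subq", "tool", "done"}
    notes: list = field(default_factory=list)         # (fact, source_url) pairs collected so far
    report_md: str = ""                               # final Markdown report

# Placeholder stage functions; real versions would call the analyzers, an LLM,
# and the search/crawl tools described in the sections below.
def analyze_query(q):        return {"intent": "informational", "tone": "neutral"}
def decompose(q, analysis):  return [q]  # trivially: treat the query as one subquestion
def make_plan(subqs, a):     return [{"subq": s, "tool": "web_search", "done": False} for s in subqs]
def execute_step(step):      step["done"] = True; return [("<fact>", "<source_url>")]
def synthesize(state):       return "# Report\n\n(placeholder body)"
def verify_citations(state): return state.report_md  # see the verification stage below

def run_pipeline(query: str) -> ResearchState:
    state = ResearchState(query=query)
    state.analysis = analyze_query(state.query)                  # Query Analysis
    state.subquestions = decompose(state.query, state.analysis)  # Question Decomposition
    state.plan = make_plan(state.subquestions, state.analysis)   # Research Planning
    while any(not s["done"] for s in state.plan):                # Iterative Search & Retrieval
        step = next(s for s in state.plan if not s["done"])
        state.notes += execute_step(step)                        # may also append new plan steps
    state.report_md = synthesize(state)                          # Information Synthesis
    state.report_md = verify_citations(state)                    # Citation & Verification
    return state

if __name__ == "__main__":
    print(run_pipeline("How do electric vehicles work?").report_md)
```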
The pipeline begins by analyzing the user’s query in depth, using a suite of NLP analyzers to determine what the user is asking for, how they are asking it, and what kind of answer is expected. This step ensures the agent captures nuances of the request before formulating any plan. Key analyses include:
- Speech Act & Dialogue Act: Determine the overall communicative intent. Is the user asking a question, requesting an explanation or instruction, or seeking debate/argumentation? The ISO 24617-2 dialogue act classification (a standard for tagging communicative functions) can label the query as an Information Request, a Directive, a Commissive, etc. If it’s an info-seeking question (most likely in a research context), the system knows it should provide an informative answer. A directive (“Tell me how to do X”) indicates the answer should be a step-by-step procedure, whereas a question (“What is X?”) calls for an explanatory report. This sets the tone and structure of the final output (e.g. a how-to list vs. an explanatory essay).
- Intent Classification: Identify the user’s underlying intent and goal. Even among questions, the intent could be, for example, comparative (e.g. “Which product is better for my needs?”), analytical (“Why does phenomenon Y happen?”), factual (“When/where did Z occur?”), or opinion-seeking. The system uses intent detection (possibly via large language model categorization or trained classifiers) to figure out if the user expects a summary, a detailed report, a list of pros/cons, a recommendation, etc. This guides both the depth of research and the style (e.g. an intent for “quick facts” yields a concise answer, while an intent for in-depth analysis yields a thorough report).
- Domain and Frame Semantics: Using frame-semantic parsing (e.g. FrameNet-style analysis), the system identifies the domain or scenario implied by the query and any specific roles or slots mentioned. For instance, if the query triggers a Comparison frame (user asks to compare two entities), the parser will extract the items to compare and the criteria. If the query triggers an Explanation frame (“why/how” question), it knows the answer should describe causes or processes. Frame semantics provides a structured representation of the query’s meaning in terms of actor, object, properties, etc., which helps target the right knowledge sources. For example, a query “What are the health benefits of green tea?” might be mapped to a Health_benefit frame with Item=green tea, guiding the agent to focus on medical/nutritional sources. Slot filling ensures no key variable is overlooked – e.g. if a question has parameters like time (“in the last decade”), location, or specific subtopic, these are noted for the search stage.
- Semantic Parsing (SRL and AMR): Semantic Role Labeling (SRL) identifies the subject, predicates, and objects in the query, clarifying who does what to whom. Abstract Meaning Representation (AMR) goes further to produce a structured graph of the query’s content. These deepen understanding of complex sentences. For example, in a question like “What impact did the 2008 financial crisis have on European banking regulations?”, SRL can extract that the user asks for the impact (effect) of [event] on [object]. AMR would represent the relationships explicitly (FinancialCrisis2008 → impact_on → BankingRegulations in Europe). Such representations help the agent generate precise search queries later (ensuring we include terms for both the cause and effect parts, and the context “Europe”). They also help disambiguate meaning – e.g. if a query has pronouns or ambiguous modifiers, the structured parse resolves what refers to what.
- Discourse Structure Analysis (RST): If the user’s query is multi-sentence or contains background context plus a question, discourse parsing (Rhetorical Structure Theory) identifies relationships like Explanation, Elaboration, Contrast, etc. For instance, a user might say: “I tried doing X but got result Y. Why did that happen?” RST would label the first sentence as a background or context and the second as the actual question (perhaps a Cause-Effect relation). Recognizing this prevents the system from ignoring the context. The context might contain presuppositions or constraints that the answer should address. RST also helps in structuring the answer: if the query has multiple parts (e.g. “Explain A. Also, compare A with B.”), discourse analysis can signal that the answer needs separate sections (one for explaining A, another for the comparison). Essentially, this yields a template for the answer’s outline aligned to the query’s rhetorical structure.
- Dialogue Context & QUD: In an interactive setting (if this query were part of a dialog), the system would consider the Question Under Discussion (QUD) and any presuppositions or context from earlier turns. Even for a single-turn query, identifying presuppositions is useful. For example, the query “What are the new features of the latest iPhone?” presupposes that the latest iPhone exists and has new features. The system notes such presuppositions and can plan to verify them (“ensure that the ‘latest iPhone’ is identified correctly and that we have info on its features”). Epistemic modality analysis (does the user sound certain, or are they hedging?) might indicate how the answer should be phrased – e.g. if the user is unsure and asking carefully, the answer might reassure basics before diving into details. Generally, the agent treats presupposed information as something to check during research – if a presupposition might be false or questionable, that becomes a sub-task (to confirm or correct it). The QUD concept helps pinpoint the exact explicit question the user needs answered, especially when the query is phrased indirectly or contains multiple questions.
- Pragmatic Markers & Tone: The query is also examined for politeness markers, formality level, or any emotional cues (like “I’m desperately looking for…”, or “just curious: …”). These pragmatics inform the tone of the answer. The system will mirror the user’s level of formality and concern. For instance, if the user uses slang or a very casual tone, the final report might adopt a slightly more informal style (while still being clear and factual). If the user says “Please provide a detailed explanation,” that direct request indicates a formal, thorough answer is expected. Pragmatic cues also cover things like whether the user is expecting an argumentative or persuasive answer. If the query implicitly asks for an argument (e.g. “Is X better than Y?” or “Should we do Z?”), the system notes that argument structure will be needed – likely presenting multiple viewpoints or pros/cons. The argument structure analyzer can detect if the user is, say, asking to justify a position or just to inform. If an argumentative answer is needed, the final report will have a structure like: Introduction of issue, Arguments for, Arguments against, Conclusion.
After this analysis stage, the system produces a rich Query Representation that includes: the classified speech act (e.g. information query), the extracted intents and frames (domain/topic, type of answer), any sub-questions or sub-parts identified, constraints (time, location, etc.), and style/tone guidelines. For example, the query representation might say: Domain: History; Task: Explanation; User Intent: Detailed analysis; Tone: Formal/academic; Subtasks: (a) Explain event A, (b) Compare interpretations of event A by historians, (c) provide references. This comprehensive understanding is then passed to the planning module.
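As a rough sketch, this query representation could be carried between modules as a plain structured object; the field names below are illustrative assumptions, not a fixed schema.

```python
from dataclasses import dataclass, field

@dataclass
class QueryRepresentation:
    """Illustrative container for the analyzers' combined output."""
    speech_act: str                                       # e.g. an ISO 24617-2-style label such as "information_request"
    intent: str                                           # e.g. "detailed_analysis", "comparison", "how_to"
    domain: str                                           # e.g. "history", "automotive"
    frames: dict = field(default_factory=dict)            # e.g. {"Comparison": {"items": [...], "criteria": [...]}}
    constraints: dict = field(default_factory=dict)       # time span, location, and other slot values
    presuppositions: list = field(default_factory=list)   # claims to verify during research
    tone: str = "neutral"                                 # formality/style guidance for the final report
    subtasks: list = field(default_factory=list)          # seeds for the decomposition stage

example = QueryRepresentation(
    speech_act="information_request",
    intent="detailed_analysis",
    domain="history",
    constraints={"region": "Europe", "period": "post-2008"},
    presuppositions=["the 2008 crisis affected EU banking regulation"],
    tone="formal",
    subtasks=["explain event A", "compare historians' interpretations of A"],
)
```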
If the user’s query is complex or multi-faceted, the system automatically breaks it down into smaller subquestions that are easier to research and answer. This step uses the output of the query analysis – especially the QDMR (Question Decomposition Meaning Representation) and semantic parses – to ensure no aspect of the query is overlooked. In essence, the system asks itself: “What intermediate questions would I need to answer in order to fully address the user’s request?”
We leverage QDMR because it explicitly represents a complex question as an ordered list of steps needed to answer it. QDMR provides a sequence of subqueries expressed in natural language. For example, for a user question like “How did the discovery of penicillin lead to modern antibiotic development?”, a QDMR might produce steps such as: 1) “Identify what penicillin is and who discovered it.” 2) “Find out what happened after penicillin’s discovery in medical research.” 3) “Determine how those events led to the development of modern antibiotics.” These subquestions decompose the reasoning required. Each subquestion can often be answered independently by searching for specific information. Such decomposition is crucial for multi-hop reasoning tasks, as noted by Wolfson et al. (2020) – breaking a complex query into QDMR steps makes it easier to retrieve answers and improves open-domain QA performance.
The system can generate a QDMR representation using either a fine-tuned model or prompt-based parsing of the query. It might also incorporate heuristics from discourse analysis: if the query explicitly contains multiple questions (e.g. “Explain X. Also, how does it compare to Y?”), those become separate top-level subquestions. If the query is a single question that implicitly requires multiple pieces of information (very common in complex “why” or “how” questions), the agent uses QDMR-style Plan-and-Solve reasoning. Recent research has shown that prompting an LLM to first plan out subquestions (forming a QDMR graph or tree of subqueries) and then solve them one-by-one yields more reliable results. In our system, we implement a plan-then-solve strategy: the LLM “brain” first outputs a decomposition (possibly a directed acyclic graph of questions, where some subquestions depend on answers to others). Once this plan is in place, the agent proceeds to answer each subquestion with the help of tools.
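A minimal sketch of prompt-based decomposition follows, assuming a generic text-in/text-out LLM client (`call_llm` is a placeholder, and the JSON output contract is our own illustrative convention, not QDMR's official format).

```python
import json

DECOMPOSE_PROMPT = """You are a research planner. Break the question below into an ordered
list of self-contained subquestions (QDMR-style steps). A later step may refer to the
answer of an earlier one as #1, #2, ...
Return JSON of the form {{"steps": ["...", "..."]}}.

Question: {question}"""

def decompose(question: str, call_llm) -> list:
    """call_llm is any text-in/text-out LLM client (a placeholder, not a specific API)."""
    raw = call_llm(DECOMPOSE_PROMPT.format(question=question))
    return json.loads(raw)["steps"]

# Demo with a canned response standing in for a real model call:
fake_llm = lambda prompt: json.dumps({"steps": [
    "Identify what penicillin is and who discovered it.",
    "Find out what happened after penicillin's discovery in medical research.",
    "Determine how the events from #2 led to modern antibiotic development.",
]})
print(decompose("How did the discovery of penicillin lead to modern antibiotic development?",
                fake_llm))
```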
Notably, the decomposition process is not rigidly one-shot. The system is designed to be adaptive. Initially, it produces a best-guess breakdown (QDMR). Then, as it starts researching, it may realize additional subquestions are needed. This aligns with the Self-Ask prompting strategy: “to answer a complex question, a model needs to break it down into simpler sub-questions, answer them, and then synthesize a final answer”. Our agent effectively “self-asks” follow-up questions if something is unclear. For example, while researching a subquestion, it might encounter unfamiliar terms or claims that raise new questions – the agent can inject these as new subqueries. This makes the system autonomous and exploratory. It doesn’t strictly require the initial user query to list every detail; it will ask itself new questions as needed and find answers, much like a human researcher who discovers new angles during investigation.
In summary, by the end of this stage the system has a set of focused subquestions (from QDMR and any additional self-generated queries) that together cover the full scope of the original query. Each subquestion is formulated to be as specific as possible, which will make the subsequent search more effective. The subquestions also imply an initial structure for the final report – e.g., each might correspond to a section or subsection in the output.
With the query fully understood and broken down, the agent now formulates a research plan. This plan decides what to do, in what order, to gather all necessary information. It’s essentially the agent’s strategy before diving into execution. Planning involves selecting appropriate tools or APIs, choosing search queries, and allocating subquestions to different resources. Key aspects of this planning module include:
- Selecting Resources and Tools: Based on the query domain and subquestions, the system chooses which search engines or databases to query, and which auxiliary tools to employ. For example:
- For general knowledge or current events, a web search engine (via an API like Bing, Google, or a service like Perplexity’s API) will be used. If the query is not time-sensitive, even a Wikipedia lookup might suffice for some subquestions. If it’s a technical programming question, the plan might involve searching StackOverflow or documentation sites.
- For academic or scientific queries, the plan favors scholarly databases (Semantic Scholar, arXiv, PubMed, Google Scholar API if available). The agent might use an academic search tool to find relevant papers or use a site-specific search (e.g. `site:edu` or journal sites).
- If the query involves data or statistics (e.g. “latest GDP of countries X, Y” or “trend in climate data”), the system may plan to use a code execution tool (like a Python environment) to retrieve or calculate results. For instance, it might call an API for data (IMF API for GDP, etc.) or use Python to scrape a table and plot a graph. In OpenAI’s Deep Research, they integrated Python-based data analysis for precisely this reason, and our system does similarly when needed.
- If the relevant information is likely on a specific website or requires crawling (for example, the user asks about information that might be hidden in documentation pages, or a multi-page report), the plan includes using Firecrawl. Firecrawl can crawl a website and return clean text or structured data. The agent might say: “Use Firecrawl on example.com with query X” to gather data from pages that are not indexed by search engines or require clicking through. This is especially useful if, say, the user needs a summary of a large report or a product’s documentation – the agent can crawl the site and then use the LLM to summarize or extract the needed details.
- The agent also prepares for using any specialized APIs or tools. For instance, if analyzing sentiments or performing a calculation, a sentiment analysis API or a calculator tool could be invoked. The planning stage lists these out: e.g., “If subquestion 5 requires sentiment of tweets, use sentiment analysis API on retrieved tweets.”
- Query Formulation for Subquestions: For each subquestion, the agent formulates one or multiple search queries. This often involves rephrasing the subquestion into likely search-engine-friendly terms. The query analysis provides keywords, and the agent may augment them with synonyms or related terms to cast a wide net. The tone/intent analysis also influences this: if the user expects highly technical detail, the search queries might include technical terminology to find scholarly sources; if a more layman explanation is needed, the agent might include terms like “explained” or target sources known for accessible language. For example, given a subquestion “What are the physiological effects of green tea catechins?”, a technical query might be `"green tea catechins epigallocatechin gallate health effects research"`, whereas if the user wanted a simple answer, the agent might also try `"green tea benefits explained simply"`. The agent can generate multiple queries per subquestion, ranging from broad to specific, and plan to try them in sequence if needed (this is a contingency for the retrieval stage).
- Ordering and Dependency Management: The plan establishes an order in which to tackle subquestions, especially if some depend on others. The QDMR step ordering is helpful here – e.g., if subquestion 3 requires the answer from subquestion 1, the agent schedules subquestion 1’s research first. The plan can be represented internally as a directed graph or a list of steps. Modern agent frameworks often allow the LLM to output a structured plan (even in a pseudo-code or JSON format), but a more human-readable approach could be a simple ordered list. The agent might internally maintain a plan like:
  1. Search for background on A (subq1) – Tool: WebSearch.
  2. Search for details on B (subq2) – Tool: WebSearch, then Firecrawl the full report from the source.
  3. Run analysis on data from B (subq2 continued) – Tool: Python (for a graph).
  4. Compare A and B (subq3) using info from steps 1 & 2.
  5. Verify any flagged presupposition (P) – Tool: targeted WebSearch.
  6. Final synthesis.
This ordering ensures all prerequisites are collected before comparative or summary tasks.
- Incorporating Analyzer Outputs: The various analyzer outputs inform the above decisions. For instance, if frame semantics indicated a “Comparison” task, the plan will ensure symmetrical information gathering for each item to compare (to present a balanced report, possibly even planning a table to juxtapose features). If the intent was an argumentation, the plan will explicitly seek counter-arguments and supporting arguments as separate subquestions (“Find arguments supporting X”, “Find arguments against X”) to ensure both sides are researched. If presupposition analysis flagged something uncertain, the plan might include a step “Verify presupposed fact Y” so that any answer doesn’t blindly accept a false premise. In essence, the plan is customized to the query’s needs as understood by the analyzers.
- Example Plan: Suppose the user asked: “I’m considering buying an electric car. Could you explain how electric vehicles work, the pros and cons compared to hybrid cars, and what to consider before buying one?” After analysis, the system might plan:
- Subq1: “How do electric vehicles work?” – search for EV mechanism, maybe gather an illustrative graphic or technical explanation.
- Subq2: “What are the pros and cons of EVs vs hybrids?” – search for comparisons (likely plan to create a table of pros/cons).
- Subq3: “What factors to consider before buying an EV?” – search for buying guides or expert advice.
- Tools: Use general web search for all; if a good comparison article is found, use Firecrawl to get the full text for pros/cons; possibly use a calculator or external API if cost of ownership calculations are needed (depending on what info comes up).
- The plan might also note: tone should be explanatory but accessible (user didn’t use highly technical language), so gather a mix of technical facts and simplified explanations. It will aim to gather data like battery life, costs (for a possible small table), etc.
- Order: First get how EVs work (background section), then pros/cons (this likely forms a major section), then considerations (likely bullet points advice section).
- Memory for Plan Tracking: The system maintains a working memory of the plan and progress. As the agent executes searches, it will mark subquestions as answered or pending. Modern agent designs, such as the one described by Lilian Weng (2023), highlight the importance of memory (both short-term and long-term) and planning modules alongside the LLM “brain”. In our design, after initial planning, the plan (and partial results) can be stored in the agent’s memory context so that the LLM can refer back to it at any time. This helps in iterative adjustments—if a later step fails or yields new subquestions, the agent can update this plan in memory.
At this stage, the system effectively has a “to-do list” for research, with each item specifying a query and a tool. It is ready to move to the execution phase, where it will carry out this plan, step by step, interacting with search engines and other tools.
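One possible shape for this “to-do list” is a list of step records with explicit dependencies; the fields and example values below are illustrative assumptions, not the system's required schema.

```python
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    id: int
    subquestion: str
    tool: str                                        # "web_search", "firecrawl", "python", ...
    queries: list = field(default_factory=list)      # candidate search strings, broad to specific
    depends_on: list = field(default_factory=list)   # ids of prerequisite steps
    status: str = "pending"                          # "pending" | "in_progress" | "done"
    notes: list = field(default_factory=list)        # (snippet, source_url) collected for this step

plan = [
    PlanStep(1, "How do electric vehicles work?", "web_search",
             queries=["how do electric vehicles work", "EV drivetrain explained"]),
    PlanStep(2, "What are the pros and cons of EVs vs hybrids?", "web_search",
             queries=["EV vs hybrid pros cons comparison"]),
    PlanStep(3, "What should a buyer consider before choosing an EV?", "web_search",
             queries=["electric car buying guide considerations"], depends_on=[1, 2]),
]

def next_runnable(plan):
    """Return the first pending step whose prerequisites are all done."""
    done = {s.id for s in plan if s.status == "done"}
    return next((s for s in plan if s.status == "pending" and set(s.depends_on) <= done), None)
```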
Now the agent enters the search execution loop, where it carries out the plan formulated above. This phase is where the system actively gathers content from external sources to answer each subquestion. The key characteristics of this module are iteration, adaptivity, and thoroughness. The agent behaves much like an autonomous web researcher: formulating queries, reading results, and refining its approach until each subquestion is satisfactorily answered and evidence is collected.
Search Query Execution: For each subquestion, the agent sends the prepared search query to the chosen search API (or other data source). The system parses the search results (e.g. a list of snippets and URLs from a web search). Rather than stopping at the first result, the agent will typically scan the top N results (N might be around 5–10, depending on the need for coverage) to identify which ones look relevant and high-quality. It considers factors like the source (a `.edu` or `.gov` site might be more authoritative, an established publication vs. a random forum), the snippet content matching the information needed, and the date (especially if up-to-date info is needed). If the query is time-sensitive (e.g. “latest research in 2025 on X”), the agent filters for recent results.
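A rough sketch of how this result triage could be scored heuristically, assuming each search hit is a dict with `url`, `snippet`, and optional `date` fields (the shape of the search API's response is an assumption):

```python
from datetime import date
from urllib.parse import urlparse

PREFERRED_SUFFIXES = (".edu", ".gov")   # crude authority heuristic, illustrative only

def score_result(result: dict, keywords: list, want_recent: bool = False) -> float:
    """Score one search hit; `result` is assumed to be {'url': ..., 'snippet': ..., 'date': date|None}."""
    host = urlparse(result["url"]).netloc
    score = 2.0 if host.endswith(PREFERRED_SUFFIXES) else 0.0              # favor authoritative domains
    score += sum(k.lower() in result["snippet"].lower() for k in keywords)  # snippet relevance
    if want_recent and result.get("date"):
        age_years = (date.today() - result["date"]).days / 365
        score += max(0.0, 2.0 - age_years)                                  # freshness bonus
    return score

def pick_top(results, keywords, n=5, want_recent=False):
    """Keep the N most promising hits to fetch and read in full."""
    return sorted(results, key=lambda r: score_result(r, keywords, want_recent), reverse=True)[:n]
```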
Clicking and Scraping Content: Once promising results are identified, the agent “clicks” them — i.e., it uses an HTTP client or browser tool to fetch the full content of the page. The content is then either summarized or analyzed by the LLM, or it might be passed to specialized parsers if it’s in a known format (like PDF or JSON data). For web pages, the agent may use Firecrawl if the page is part of a larger site that needs crawling. For example, if result #3 is a homepage that doesn’t directly answer the question but likely contains links to the info, Firecrawl can recursively fetch sub-links and return a compiled text. The agent will be mindful of not going overboard (the plan guides how many pages to fetch).
Information Extraction and Note-Taking: As each page or document is retrieved, the agent searches within it for the specific answers to the subquestion at hand. It might employ keyword search on the page or just read sequentially. Modern LLM-based agents often use a Read-Analyze loop: they read a chunk of text and then the LLM is asked (in a prompt) something like “Does this contain information relevant to subquestion X? If so, extract the key facts/citations.” The agent effectively takes notes – storing snippets of text with metadata about the source and which subquestion it addresses. For instance, if the subquestion was “pros and cons of EV vs hybrid”, and an article enumerates those, the agent will extract each pro/con point (maybe in bullet form) along with the source URL or title for citation. This becomes raw material for the final synthesis.
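A minimal note-taking sketch, assuming a placeholder `call_llm` client and a page already fetched as plain text; the prompt wording and the `FACT:` output convention are illustrative, not part of any specific framework.

```python
EXTRACT_PROMPT = """Subquestion: {subq}
Source URL: {url}
Text chunk:
---
{chunk}
---

Does this chunk contain information relevant to the subquestion? If yes, list each
relevant fact on its own line prefixed with "FACT:"; otherwise reply "NONE"."""

def take_notes(subq, url, page_text, call_llm, chunk_size=3000):
    """Chunk a fetched page and keep (fact, source_url) pairs the LLM extracts."""
    notes = []
    for i in range(0, len(page_text), chunk_size):
        reply = call_llm(EXTRACT_PROMPT.format(subq=subq, url=url,
                                               chunk=page_text[i:i + chunk_size]))
        notes += [(line[len("FACT:"):].strip(), url)
                  for line in reply.splitlines() if line.startswith("FACT:")]
    return notes
```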
Multi-Turn Interaction and Refinement: The agent may not get everything it needs in one shot. Here’s where the iterative aspect shines:
- After an initial search, the agent evaluates if the subquestion is fully answered. If not, it can reformulate the query or try an alternative approach. This is informed by what was found (or not found). For example, if none of the results directly answered “subquestion 2”, the agent might broaden the search terms or use a different phrasing. It could even split the subquestion further: e.g., if searching “effects of X on Y” yields nothing obvious, the agent might search “X mechanism” and “Y outcomes” separately.
- The agent uses success/failure signals. A success might be “found a source that directly answers with clear data or explanation.” Failure might be “results are off-topic or too shallow.” In case of failure, the agent can consult its memory or chain-of-thought to adjust. It might recall the intent/tone: perhaps it needs to search more technically if the initial query was too generic.
- The system may also leverage the LLM’s ability to generate new subqueries (self-ask). For example, if the user’s question is open-ended, the agent can spontaneously pose a follow-up: “Do I need background on topic Q to answer this?” If yes, that follow-up becomes a new subquestion and the agent will search for it. This autonomy in asking new questions ensures complete coverage. It’s essentially the agent performing retrieval-augmented reasoning: querying when it realizes there’s a gap in its knowledge. Perplexity AI’s “Pro Search” mode does something similar by “asking follow-up questions to refine the search for complex queries”. Our agent mimics this by interrogating the topic from different angles until satisfied.
Cross-Checking and Fact Validation: As information comes in, the agent cross-checks critical facts across multiple sources. If Source A says “XYZ”, the agent might try to find Source B that confirms it. This is important for reliability. It may explicitly issue a search query like “XYZ [some key phrase] confirmation” if a fact is crucial or controversial. The plan may have marked certain answers for verification (especially if the query had implicit assumptions). By doing this, the agent reduces the chance of including a one-off claim that isn’t well-supported. In practice, this could be done by comparing notes: if two sources agree, great; if they conflict (e.g. two sources give different statistics for the same metric), the agent flags this and perhaps formulates a clarifying subquestion: “Why do sources differ on this? Which is more trustworthy or recent?” Then it searches for an explanation or the most authoritative data (maybe an official report to resolve the discrepancy). This behavior ensures the final report can either reconcile differences or at least note them with proper attribution.
Loop Termination: The search loop for a given subquestion ends when the agent has gathered sufficient information. “Sufficient” means:
- The subquestion can be answered with confidence, with evidence from reliable sources.
- The agent has multiple citations if appropriate (especially for important claims).
- No major aspect of the subquestion remains unanswered.
The agent then marks that subquestion as completed in its plan, storing the collected info. It proceeds to the next subquestion, and repeats the search/retrieve process. The system continues this until all planned subquestions are handled.
Importantly, the agent is allowed to revise the plan on the fly. If during retrieval a new subquestion was added or an existing one changed, the plan updates. For instance, a user query might have been decomposed into subquestions A, B, C. While answering A, the agent realizes there’s also a D to ask. It adds D, perhaps to be done after C (or immediately, if D is a prerequisite for others). This flexibility makes the research process robust. By the end of all iterations, the agent will have a pool of gathered content: facts, figures, quotes, and explanations, each tied to sources.
Technical note: Under the hood, this iterative process can be orchestrated by an agent framework (like the ReAct paradigm where the model alternates between reasoning and acting). There might be a loop where the LLM looks at the plan and notes, decides an “action” (e.g. SEARCH, READ, ASK), executes it via a tool, gets observation, and updates its knowledge. This continues until a stopping condition (all questions answered) is met. Such frameworks have been demonstrated by systems like WebGPT, which “interacts multiple times with a web browser” and uses retrieved results to craft a final answer.
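For illustration, a stripped-down ReAct-style loop might look like the sketch below; the JSON action format, the tool names, and the `call_llm`/`tools` interfaces are assumptions, and a production agent would add error handling and memory management.

```python
import json

def react_loop(task, call_llm, tools, max_steps=10):
    """Minimal ReAct-style loop: the LLM alternates reasoning with tool use.
    `tools` maps action names (e.g. "SEARCH", "READ") to callables; the JSON
    action format and tool names are assumptions for this sketch."""
    transcript = "Task: " + task + "\n"
    instruction = ('\nRespond with JSON only: '
                   '{"thought": "...", "action": "SEARCH" or "READ" or "FINISH", "input": "..."}')
    for _ in range(max_steps):
        step = json.loads(call_llm(transcript + instruction))
        transcript += '\nThought: {0}\nAction: {1}({2})'.format(
            step["thought"], step["action"], step["input"])
        if step["action"] == "FINISH":
            return step["input"]                               # final answer / summary of notes
        observation = tools[step["action"]](step["input"])
        transcript += "\nObservation: " + str(observation)[:1000]  # truncate to bound context size
    return transcript                                           # step budget exhausted; return the trace
```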
After the retrieval stage, the agent now has a collection of factual information and insights relevant to the user’s query. The next step is to synthesize this raw material into a coherent, structured report in Markdown format. This stage is akin to writing a mini research paper or an in-depth article: the agent must organize the content logically, present it clearly, and ensure the user’s question is fully answered. Several sub-tasks happen here: organizing the outline, drafting the text for each section (using the info collected), inserting tables/graphs if helpful, and integrating citations.
Structuring the Report: The system uses both the initial query analysis and the subquestion breakdown as a blueprint for the report’s structure. Typically, the final Markdown report will include:
- An Introduction: a brief paragraph rephrasing the question and previewing what will be covered (this helps ensure the answer is contextualized). For example, it might say “In this report, we explore X, including an explanation of Y, a comparison of Z, and considerations regarding Q.”
- A series of Sections (with appropriate headings) corresponding to major subquestions or themes. We use Markdown headings (`##`, `###`, etc.) to clearly demarcate these. For instance, if subquestions were about “Mechanism of X”, “Advantages and Disadvantages”, “Considerations”, each becomes a section. The headings are phrased descriptively (e.g. “How Electric Vehicles Work”, “Pros and Cons of EVs vs Hybrids”, “Practical Considerations for EV Buyers”). Using clear headings allows the user to scan the document and quickly find specific parts of the answer, fulfilling the need for easy navigation of key points.
- Subsections if needed: within a large section, the agent might further organize content. E.g. under “Pros and Cons”, it could have `### Pros` and `### Cons` as sub-headings, or use a list format for each.
- A Conclusion or Summary (if the query benefits from it): for argumentative or decision questions, a conclusion section can weigh the findings; for explanatory ones, a short recap reinforces the answer.
- Possibly a References section at the end (though since we provide inline citations, an explicit reference list might not be necessary in the output the user sees—footnote-style citations suffice as per the inline linking approach).
The rhetorical structure determined earlier helps here: e.g., if the argument structure analyzer indicated the answer should present both sides of an issue and then conclude, the agent ensures the sections reflect that (one for each side, one for conclusion). If the discourse analysis indicated multiple questions, each becomes a section. Essentially, the outline is built to mirror the query’s demands.
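As a simple illustration, the outline can be generated mechanically from the answered subquestions; the helper below is a sketch with made-up section names, not the system's actual writer.

```python
def build_outline(title, sections, add_conclusion=True):
    """Assemble a Markdown skeleton from (heading, content placeholders) pairs."""
    lines = [f"# {title}", "", "_Introduction: restate the question and preview the sections._", ""]
    for heading, items in sections:
        lines += [f"## {heading}", ""]
        lines += [f"- {item}" for item in items] + [""]
    if add_conclusion:
        lines += ["## Conclusion", "", "_Summarize findings and answer the original question._", ""]
    return "\n".join(lines)

print(build_outline(
    "Electric Vehicles: A Buyer's Overview",
    [("How Electric Vehicles Work", ["battery and motor basics", "charging"]),
     ("Pros and Cons of EVs vs Hybrids", ["comparison table goes here"]),
     ("Practical Considerations for EV Buyers", ["range", "charging infrastructure", "total cost"])],
))
```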
Composing Paragraphs with Citations: With the outline set, the agent drafts each section. It uses the notes and facts gathered, guided by the user’s requested depth and tone. The writing process for each section involves:
- Selecting relevant facts: The agent decides which pieces of information from its notes to include. It will prioritize key facts that directly answer the question and provide necessary context or explanation. Less relevant details are dropped to keep the answer focused.
- Paraphrasing and integrating: The agent paraphrases the content from sources in a unified voice (the voice appropriate for the user). It avoids copying large verbatim text (except maybe short definitions or specific terms, which it will quote if needed). By paraphrasing, it ensures the content flows logically and isn’t just disjointed snippets.
- Ensuring coherence: The agent adds connecting sentences and transitions between points. For example, it might start a section with a topic sentence introducing what will be discussed, then present facts (with citations), and perhaps end with a brief concluding sentence that ties back to the main query.
- Tone and clarity: It adjusts the phrasing to match the user’s level. If the analysis indicated the user is non-expert, the agent may include brief explanations of technical terms (or use simpler terms). If the user expected a formal report, the language stays objective and academic. Pragmatic analysis (politeness, etc.) is also reflected here – usually answers are politely and helpfully phrased by default.
- Markdown formatting: The agent uses Markdown elements to enhance readability:
- Headings (`#`, `##`, etc.) as mentioned for sections.
- Bullet points or numbered lists when enumerating items or steps. For instance, if answering “What should I consider before buying an EV?”, it might list considerations as bullets (`- Range and battery life`, `- Availability of charging infrastructure`, etc.) for clarity. Lists make key takeaways stand out.
- Tables for structured data or comparisons. If the query involves comparing features (as in our EV vs hybrid example), a table could concisely show a side-by-side comparison (e.g. columns for EV and Hybrid, rows for various criteria like fuel cost, maintenance, emissions). The agent will create a Markdown table if it improves understanding. It gathers the data for the table from sources during retrieval. For example, if two different sources gave specs for EVs and hybrids, the agent can tabulate them. It will cite sources for each data point either in the table caption or within the text around it.
- Graphs or Figures if applicable. While a pure text interface might not show graphs directly, the agent can describe them or embed an image (more on embedding in a moment). If numeric data over time was found, the agent might even decide to generate a quick plot using its Python tool and then embed that image. For instance, if the question was about the “trend of CO2 emissions in the last 20 years”, it could create a line chart and include it. However, it will ensure any figure has a caption explaining it and is cited (since it’s derived from data). Alternatively, it might rely on external graphs available in sources: e.g., if a source had a relevant chart, the agent could fetch that image and embed it, provided it can attribute it.
- Inline emphasis using Markdown italics or bold if needed for key terms, though this is usually minimal in a research-report-style answer.
Throughout the drafting, the agent inserts citations in-line for any fact, statistic, or claim that is not common knowledge. Our citation format is the bracketed reference style (e.g. 【source†Lx-Ly】) that was requested. For every piece of information drawn from a source, the agent appends a citation right after the sentence (or clause). This not only gives credit but also allows the user to verify the info. For example, a sentence might read: According to a DeepMind study, an LLM called GopherCite was able to back up all its factual claims with evidence from the web – here the citation points to the source where that claim is documented. The agent keeps track of which source each piece of info came from to do this accurately.
Citation Integration and Tracking: It’s worth emphasizing how the agent manages citations during writing (this also leads into the next section on robust citation strategies). The system has stored metadata like: Fact X -> Source Y (URL, and perhaps quote). When writing Fact X, it will cite Source Y. The format `【source†Ln-Lm】` is used, where `source` is an identifier linking to a bibliography entry or directly to the URL, and `Ln-Lm` indicates the line numbers of the source supporting that fact. The agent ensures that these line numbers correspond to the specific snippet or quote for precision. If the source was a PDF or article where line numbers are not straightforward, it might cite a page or section number if available (or a timestamp for a video, etc.). For web pages (like blog posts or HTML content), line ranges can refer to the text as extracted. The system can also store the exact snippet from the page to serve as the citation reference. By preserving these references, the final answer can be audited by the user.
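A minimal sketch of the citation store, assuming facts map to a source record with URL, line range, and the exact supporting snippet (the field names and the example URL are placeholders):

```python
from dataclasses import dataclass

@dataclass
class Citation:
    source_id: int        # index into the stored bibliography / source snapshots
    url: str
    line_start: int       # line range of the supporting snippet, as extracted
    line_end: int
    snippet: str          # exact supporting text, kept for auditing and archiving

    def marker(self) -> str:
        return f"【{self.source_id}†L{self.line_start}-L{self.line_end}】"

# Facts noted during retrieval map to the citation that supports them (placeholder values).
fact_to_citation = {
    "EVs have zero tailpipe emissions": Citation(
        source_id=7, url="https://example.org/ev-basics",
        line_start=42, line_end=44,
        snippet="Battery-electric vehicles produce no tailpipe emissions..."),
}

claim = "EVs have zero tailpipe emissions"
print("Electric vehicles produce no tailpipe emissions" + fact_to_citation[claim].marker() + ".")
```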
Inclusion of Tables/Graphs: If the answer warrants, the agent includes any prepared tables or images at appropriate points. For images, the system uses the Markdown image embedding syntax with a caption. For example, it might embed an architecture diagram to illustrate a point. (We must ensure any embedded image is cited properly as well). Consider that we have an architecture figure of an LLM-agent:
Figure: An overview of an LLM-powered autonomous agent architecture (adapted from Lilian Weng, 2023). The agent (center) is supported by a Planning module for subgoal decomposition and self-reflection, a Memory module (short-term and long-term) for storing information, and various Tool interfaces (web search, code execution, etc.) to interact with external data.
In this figure, the key components (planning, memory, tools) are depicted around the agent’s core, reinforcing our system’s design philosophy. Including such a figure in the final report can help the user visually understand the pipeline architecture being described.
For tables, the agent writes them in Markdown like:
| Feature | Electric Vehicles (EV) | Hybrid Vehicles |
|--------------------|------------------------------|------------------------|
| Fuel Type | Electricity (battery) | Gasoline + Electricity |
| Emissions | Zero tailpipe emissions【source†Lx-Ly】 | Lower than gas-only, but some emissions【source†Lx-Ly】 |
| Range per fill | ~250 miles per charge【...】 | ~500 miles per tank【...】 |
| Refuel Time | ~30 min (fast charge)【...】 | ~5 min (gas pump)【...】 |
And so on. Each cell that contains specific data gets a citation. The agent either uses footnotes for the entire table or inline citations as shown, depending on clarity.
Ensuring All Query Aspects Are Covered: As it writes, the agent cross-references the original query (and its structured representation) to ensure completeness. It may tick off each element of the query that has been addressed. If it notices any part missing (perhaps a subquestion got lost or new info suggests another aspect), it can still do a quick retrieval in this phase. For example, maybe while writing the conclusion it realizes a particular statistic would bolster the answer – the agent can quickly search for that stat and then include it with a citation. This is part of the autonomous behavior: even in synthesis, it can loop back to retrieval if needed (though ideally most info was gathered already).
Finally, the Markdown report is reviewed (by the agent itself) for coherence and correctness. The agent might do a “read-through” with an instruction like: “Check the draft for any unanswered questions or logical gaps.” If the LLM notices something like “Oops, we mentioned X but never explained Y fully,” it can address it (either using existing notes or doing a last-minute search).
At the end of this stage, we have a polished, well-structured Markdown document that directly answers the user’s query, complete with headings, paragraphs, lists, tables, and embedded images/graphs as appropriate. Each factual statement in the document has an accompanying citation to its source, which leads us to ensuring those citations are robust.
Providing sources is crucial for trustworthiness, so our system implements robust citation tracking and verification mechanisms. The goals are twofold: (a) make sure every significant claim in the answer can be traced to a reliable source (to avoid hallucinations), and (b) ensure the citations remain valid and accessible over time (resilience via backups).
Inline Source Linking: As described, the answer uses inline citations (e.g. footnote-style links like 【source†Lx-Ly】). During synthesis, the system kept track of which source material contributed to each part of the answer. This mapping is maintained in a citation database. The agent ensures that each citation link is inserted right after the claim it supports, rather than just a bibliography at the end. This way, the user can click the link immediately to see the origin of that statement. The format includes the line numbers from the source document, which provide context when the user views it. This approach was inspired by systems like Perplexity and WebGPT that emphasize source transparency; indeed, WebGPT was designed to “cite evidence to back up its responses” in an interactive browsing setting.
Complete Attribution (No Unsupported Statements): The system’s policy is that no factual assertion should go uncited unless it’s truly common knowledge or logically deduced from cited facts. To enforce this, after drafting the answer, the agent performs a self-check: it goes through each sentence and asks “Does this need a source? Do we have one attached?”. This is akin to an attribution audit. Any sentence without a citation triggers the agent to either attach one (if it was inadvertently missing) or to justify that it’s a summary of something previously cited. For example, if multiple sentences in a paragraph all draw from the same source, the agent might cite only the first one and last one; during the audit it might decide to add one more citation in the middle for clarity. This way, the final answer is densely referenced.
To further ensure nothing is unsupported, we can use a “reverse RAG” approach – effectively a verification pass after generation. In a traditional RAG (Retrieval-Augmented Generation) pipeline, retrieval precedes generation to feed the model information. In reverse RAG, after generating an answer (or draft), we treat the answer’s claims as queries to see if we can retrieve sources to back them. Mayo Clinic’s recent work on “reverse RAG” exemplifies this: their system focuses on “confirming that each piece of generated information can be traced back to a legitimate source,” flipping the emphasis from just retrieving for generation to verifying after generation. In our system, a similar method is applied: the agent can take each factual claim in the draft and run a quick search to find corroborating evidence if the source is missing or weak. If it finds a better source, it can cite that or even replace the info from the original draft with the new, verified info. This verification-first mindset ensures high accuracy. In practice, since we already retrieved sources during research, reverse searching is mainly a safeguard for any straggling claims.
Automated Fact-Checking (RARR): We also draw on ideas from the RARR approach (Retrofit Attribution using Research and Revision). RARR is a framework that “automatically finds attribution for the output of any text generation model and post-edits the output to fix unsupported content”. In our case, the agent itself is generating with retrieval, so the output should already be supported, but to be extra sure, the agent can do a mini-RARR: if any sentence cannot find a backing source, the agent either removes it or revises it to align with what sources say. For example, if the draft said “X is the fastest growing Y in 2023” but our sources aren’t explicit, the agent would search that claim. If it finds it’s unsupported or maybe only true for 2021, it will correct the statement or at least qualify it (“X was among the fastest growing Y as of 2021【source】, but data for 2023 is not yet available.”). This post-editing for factuality step is crucial to avoid subtle inaccuracies.
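A sketch of such a verification pass, assuming generic `search` and `call_llm` callables (the verdict labels and the cutoff of three evidence snippets are arbitrary choices for illustration):

```python
def verify_claims(claims, search, call_llm):
    """Post-generation verification pass (reverse-RAG flavored sketch): for each drafted
    claim, retrieve evidence and ask the LLM whether it is supported. `search` and
    `call_llm` are assumed callables, not a specific API."""
    audit = []
    for claim in claims:
        evidence = search(claim)[:3]                     # top hits for the claim itself
        verdict = call_llm(
            "Claim: " + claim + "\nEvidence:\n" +
            "\n".join(e["snippet"] for e in evidence) +
            "\nAnswer with one word: SUPPORTED, CONTRADICTED, or NOT_FOUND."
        ).strip().upper()
        audit.append({"claim": claim, "verdict": verdict,
                      "sources": [e["url"] for e in evidence]})
    return audit   # a later step revises or removes anything not marked SUPPORTED
```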
Citation Format and Resilience: The system uses a standardized citation format for consistency. Each citation includes enough detail to locate the source (line numbers, etc.). To make citations resilient:
- The agent preferentially cites stable sources when possible. For academic info, DOIs or official publications are cited (instead of a random blog summarizing them). For news, a well-known outlet is preferred over a less known one. This doesn’t mean it ignores niche sources (often niche blogs or posts can have valuable info), but if the same info is available from a more permanent source, that’s chosen.
- Archived link fallback: For every URL the agent decides to cite, it can automatically retrieve a cached or archived version. It might use services like the Internet Archive’s Wayback Machine or an internal archiver. The idea is to store a snapshot of the content as of the time of access. Then, in the citation, it could link to the archived version (or keep it internally such that if the user clicks and the original is dead, it can present the archived content). This protects against link rot. Alternatively, the agent might present both the original and an archived link. For example, a reference might be stored as: (Original: URL1, Archived: URL2). If URL1 fails, URL2 can be provided.
- Another strategy is to quote the relevant snippet in the answer (some systems do this to ensure the user sees the evidence immediately). Our format shows line numbers, but the agent could also explicitly include short quotes from the source if warranted (especially for definitions or direct statements). By quoting, even if the link breaks, the key info is already in the answer itself (with attribution).
- The agent could maintain a local vector database of retrieved documents. This means even if the link breaks later, the system has the content indexed and could answer follow-up questions about it if needed. It’s part of long-term memory – not directly exposed to the user, but it underpins the resiliency. In context, it means the agent isn’t just copying from sources blindly; it has “remembered” them in a structured form.
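As one concrete option for the archived-link fallback, the Internet Archive exposes an availability endpoint that can be queried for an existing snapshot; the sketch below assumes that endpoint's publicly documented JSON shape and should be treated as illustrative rather than definitive.

```python
import requests

WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"

def archived_copy(url: str):
    """Look up an existing Wayback Machine snapshot for `url`; returns the archived
    URL or None. Endpoint and response shape reflect the API as documented at the
    time of writing; re-check before relying on it."""
    resp = requests.get(WAYBACK_AVAILABILITY, params={"url": url}, timeout=10)
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

def citation_links(url: str) -> dict:
    """Store both the original link and an archived fallback for a cited source."""
    return {"original": url, "archived": archived_copy(url)}
```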
Reverse Lookup for Citations: Another clever method (akin to reverse RAG) is source verification via cross-check. The agent, when citing, might do a quick sanity check: it could take the citation snippet and search it to ensure that snippet indeed appears on that page, or that the page is the true origin. This guards against any mismatches (for instance, if the wrong URL was attached by mistake). It’s a final consistency check before presenting the report.
Robustness to Multi-Turn Dialogue: If the user asks follow-up questions referencing the answer (like “Can you show the source for that statistic?”), the system can easily pull up the cited source thanks to the structured citation tracking. The use of line numbers and archived copies means it can even quote the source context if needed in the next turn.
Incorporating these citation practices makes the system’s output trustworthy and transparent. The user sees not just what information was found, but where it came from. This fosters trust and allows verification, aligning with the best practices suggested in recent research about grounded NLG. For example, DeepMind’s GopherCite model emphasized backing up “all of its factual claims with evidence from the web” and quoting those sources – our agent follows the same principle rigorously. And in cases like healthcare or legal advice, such verification is paramount; the Mayo Clinic’s verification-first (reverse RAG) approach essentially “traces every piece of information back to its original source” to eliminate unsupported content.
References to Literature & Components: Throughout the design, we have implicitly leveraged ideas from academic literature and existing systems:
- The decomposition approach is grounded in QDMR research and has been enhanced by methods like self-ask prompting and Plan-and-Solve QDMR prompting, which have shown improved performance on complex reasoning tasks.
- The iterative retrieval and reasoning loop takes cues from the ReAct framework and WebGPT, ensuring the LLM “thinks” about when to search and what to do with the results. It’s also similar to how Perplexity’s Pro search can ask follow-ups.
- For the agent architecture, we referenced the concept of an LLM-centric agent with tools, planning, memory as described by Weng (2023). The modular approach here could be implemented with open-source frameworks such as LangChain or HuggingFace’s transformers agent which allow defining tool use and planning. Indeed, HuggingFace’s open deep research project suggests using a “CodeAgent” to structure the agent’s reasoning in code form for efficiency. That insight (from Wang et al. 2024) showed code-based actions can reduce the steps needed by ~30%, hinting that an implementation of our pipeline could benefit from a code-based planner (for example, writing Python pseudocode for the plan and executing it).
- Open-source tools: Firecrawl (for crawling web content) is an integral piece of our design for gathering data beyond basic search. Other tools that could be integrated include Browser-like APIs, PDF parsers, and the OpenAI Code Interpreter (now Advanced Data Analysis) for executing Python – which parallels how OpenAI’s Deep Research can do Python-based analysis.
- For citation integrity, we discussed reverse RAG (as reported by VentureBeat/Mayo Clinic) and RARR, both cutting-edge 2023 approaches to solving hallucination via post-hoc verification. Our design is aligned with these, showing it’s informed by the latest research in aligning LLM outputs with source material.
In conclusion, this autonomous deep research system marries comprehensive language understanding with strategic planning and tool use to deliver thorough answers. It continuously verifies its work and presents results in a structured, reader-friendly format. By breaking queries into sub-tasks, searching widely (and wisely), and grounding every statement in citations, it achieves a high level of completeness and reliability – essentially functioning as a tireless research analyst that can handle any query from open-ended academic questions to practical how-tos. The modular design means each component (analysis, retrieval, synthesis, verification) can be improved or expanded independently, and the use of proven techniques from the literature ensures the system stands on the shoulders of prior research agents and methodologies. Each answer it produces is not just an answer, but a mini-report that the user can trust and further explore through the provided sources.
Sources: The references in brackets (e.g.【5】,【10】,【29】) correspond to literature and resources that underpin the design choices. These include academic papers (for QDMR, argumentation, etc.), blog posts describing advanced systems (OpenAI, DeepMind, etc.), and documentation of tools (Firecrawl, etc.), demonstrating that the system’s architecture is built upon established knowledge and open-source capabilities.