@medeirosinacio
Last active April 13, 2026 16:37

🔍 web-research

AI agent skill for grounded web research — verified sources, tier classification, and anti-hallucination guardrails.


Use this skill whenever the correct answer depends on external evidence — not on the model's internal knowledge. Best practices, benchmarks, how big companies do X, technical references, papers, tech comparisons, trends.


When to activate

| Intent | Example requests |
|---|---|
| Best practices | "best practices for X", "what's the right way to do X" |
| Industry standards | "what does the market use", "industry standard for X" |
| How big companies do it | "how Netflix does X", "how Spotify/Google/Amazon handle X" |
| References & papers | "references on X", "papers about X", "articles on X" |
| Benchmarks & comparisons | "X vs Y", "is X worth it?", "when to use X vs Y" |
| Real examples | "real-world examples of X", "which companies use X in production" |
| Trends | "trends in X", "what changed in X in 2024" |
| Academic research | "papers on X", "what does academia say about X" |

Also activate on implicit questions where the answer depends on evidence:

  • "Should I migrate to X?" → needs benchmarks and real cases
  • "Which database should I use here?" → needs verified comparisons
  • "Is this a good practice?" → needs cross-referencing sources

Do not activate for: direct implementation questions ("how do I do X in Python?"), specific code debugging, or simple conceptual explanations that don't require external evidence.


Execution flow

1. TRANSLATE query to English (if in PT-BR)
   - Run searches in BOTH languages in parallel
   - EN: global technical corpus, papers, official docs
   - PT-BR: local context, Brazilian community
   - If output format is unclear, ask before searching

2. CHOOSE the right strategy:
   - Know the source? → webfetch directly to canonical URL
   - Need to discover sources? → Playwright → Google → webfetch the best results
   - Academic topic? → arXiv search + Scholar
   - Code patterns? → GitHub search
   - Quick alternative without browser? → Brave Search

3. EXECUTE at least 2 different sources
   - If webfetch truncates (100KB limit), try in order:
     1. Sub-URL of relevant section or anchor (/docs/, #section)
     2. Save via curl to temp file and read after:
        curl -sL "https://example.com/article" -o /tmp/web_research_tmp.html
        cat /tmp/web_research_tmp.html
     3. Playwright + snapshot (extracts only visible text, no KB limit)
   - Partial content beats nothing — only discard if empty, error, or irrelevant

4. VALIDATE each source before using:
   - Access real content via webfetch — never use only the search result title
   - Check: is the domain recognizable? does the author have credibility? does the date make sense?
   - Discard: AI farms, SEO spam, generic tutorials without authorship, inconsistent dates

5. SYNTHESIZE ranked by reliability:
   - Tier 1: official spec, peer-reviewed paper, official docs
   - Tier 2: company engineering blog with real case, conference talk
   - Tier 3: personal blog from recognized reference
   - Tier 4: Reddit, Stack Overflow (specific points, not for patterns)
   - Cross-reference 3+ Tier 1/2 sources before asserting as established

6. PRESENT with clickable links + source date
   - Output must be readable for humans — not for LLMs to process
   - Translate EN findings to PT in the final response
   - Cite original URL even if synthesis is in PT
   - Mark own analysis/commentary with Note: before the paragraph
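
The cross-referencing rule in step 5 can be sketched as a simple check. This is a minimal illustration; `isEstablished` and the numeric tier encoding are hypothetical names, not part of the skill:

```javascript
// Minimal sketch of step 5's cross-reference rule: a claim counts as
// "established" only when 3+ independent Tier 1/2 sources support it.
// Tiers are encoded as the numbers 1-4, matching the list above.
const isEstablished = (tiers) => tiers.filter((t) => t <= 2).length >= 3;
```

Usage: `isEstablished([1, 2, 2])` passes; `isEstablished([1, 2, 4])` does not, because only two sources are Tier 1/2.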

Anti-hallucination guardrails

These rules apply to every execution, no exceptions.

  • Never invent URLs. Every cited URL must have been accessed via webfetch or Playwright in the current session. Do not cite URLs from memory.
  • Never invent source content. If webfetch returned an error or insufficient content, mark the source as ❌ and exclude it from the synthesis.
  • Never invent dates. If the page doesn't show a publication date, the field stays blank. Never estimate or infer dates from article context.
  • Never attribute a claim to a source without having read it. If only the title was seen in the search results and the page wasn't accessed, don't cite that source as if the content was read.
  • If you didn't find enough sources — say so. An honest response with 2 verified sources beats a complete response with made-up sources.
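
The "never invent URLs" rule can be enforced mechanically. A sketch with hypothetical names (`uncitable` is not part of the skill): track the URLs actually fetched this session and reject any citation outside that set:

```javascript
// Sketch: only URLs fetched via webfetch/Playwright this session may be cited.
// Returns the offending URLs so the agent can drop or re-fetch them.
const uncitable = (citedUrls, fetchedUrls) => {
  const fetched = new Set(fetchedUrls);
  return citedUrls.filter((u) => !fetched.has(u));
};
```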

Citation format

Inline citation at the end of the item, with clickable link:

- Define a specific role: "helpful assistant" doesn't work, "test engineer writing tests for React" works. ([GitHub Blog](URL), Nov/2025)

Table at the end (always include):

## Sources consulted

| # | Source | Tier | Date | Status |
|---|---|---|---|---|
| 1 | [Name](URL) | Tier 1 | Month/Year | ✅ accessed |
| 2 | [Name](URL) | Tier 2 | | ⚠️ partial |
| 3 | [Name](URL) | Tier 3 | | ❌ 403 |

What works and what doesn't

| Method | Status | Purpose |
|---|---|---|
| Direct URL + webfetch | ✅ Always prefer | Accesses real content when the URL is known and the site doesn't block |
| Direct URL + Playwright | ✅ Fallback for blocked sites | Use when webfetch returns 403, CAPTCHA, empty page, or truncated content |
| Google + Playwright | ✅ Source discovery | Google requires JavaScript to render results — Playwright uses a real browser; collects URLs from pages 1 and 2 |
| Brave Search + webfetch | ✅ Quick alternative | Discovers sources without a browser — clean links, no CAPTCHA |
| Bing + webfetch | ✅ Works w/ caveat | Real URL is Base64-encoded in the u= param — decode before use |
| arXiv + webfetch | ✅ Papers | /abs/{id} for abstract; /html/{id} for full content when available |
| Google + webfetch | ❌ Always blocked | JS redirect — doesn't work without a real browser |
| DuckDuckGo + webfetch | ❌ CAPTCHA | Visual CAPTCHA, no workaround |
| PDFs via webfetch | ❌ Binary | Look for an equivalent HTML version |
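
The Bing caveat above can be handled in a few lines. A sketch: the `a1` prefix on the u= param is observed behavior, not documented by Bing, so treat it as an assumption:

```javascript
// Sketch: recover the real destination from a Bing result link.
// Bing Base64-encodes it in the u= query param, prefixed with "a1"
// (prefix observed in practice — an assumption, not a documented API).
const bingRealUrl = (bingUrl) => {
  const u = new URL(bingUrl).searchParams.get('u') || '';
  const b64 = u.startsWith('a1') ? u.slice(2) : u;
  return Buffer.from(b64, 'base64').toString('utf-8');
};
```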

Google extraction script (Playwright)

async (page) => {
  const results = await page.evaluate(() => {
    const items = [];
    document.querySelectorAll('a[href]').forEach(a => {
      const h3 = a.querySelector('h3');
      if (!h3) return;
      let url = a.href;
      // Google wraps results in /url?q=<real_url> — extract the real destination
      if (url.includes('google.com/url')) {
        try { url = new URL(url).searchParams.get('q') || url; } catch (_) {}
      }
      if (url && !url.includes('google.com') && !url.startsWith('/'))
        items.push({ title: h3.innerText.trim(), url });
    });
    return items.slice(0, 15);
  });
  return JSON.stringify(results, null, 2);
}

High-signal sources — access directly without relying on SERP

Engineering blogs

# Global
netflixtechblog.com       # Netflix
engineering.atspotify.com # Spotify
dropbox.tech              # Dropbox
engineering.fb.com        # Meta
aws.amazon.com/blogs/aws  # AWS
shopify.engineering       # Shopify
uber.com/blog/engineering # Uber
confluent.io/blog         # Confluent/Kafka
martin.kleppmann.com      # Distributed systems

# Brazil
engineering.nubank.com.br

Papers & academic research

| Source | Domain | Note |
|---|---|---|
| arXiv | arxiv.org | /abs/{id} for abstract, /html/{id} for full — never /pdf/ (binary) |
| Semantic Scholar | semanticscholar.org | Citation graph, open-access links |
| Papers with Code | paperswithcode.com | ML papers + implementations |
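
The two arXiv URL forms can be built from a paper id. A sketch (`arxivUrls` is a hypothetical helper; the id in the test is just an example):

```javascript
// Sketch: the two arXiv URL forms from the table above, built from a paper id.
const arxivUrls = (id) => ({
  abs: `https://arxiv.org/abs/${id}`,   // abstract page
  html: `https://arxiv.org/html/${id}`, // full text when available — never /pdf/ (binary)
});
```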

Standards, RFCs & specs

rfc-editor.org/rfc/rfcXXXX
learn.microsoft.com/azure/architecture
aws.amazon.com/architecture
landscape.cncf.io

Known gotchas

| Gotcha | Behavior | Fix |
|---|---|---|
| PDF via webfetch | Returns unreadable binary | Look for /html/ version |
| Bing links | Real URL is Base64 in u= param | Decode: base64.decode(u_param) |
| Product page ≠ content | Main URL has index, not real content | Explore sub-URLs: /tips/, /resources/, /docs/ |
| 100KB truncation | Long pages cut off mid-content | 1) Try sub-URL or anchor; 2) Playwright snapshot; 3) Use what you got and mark ⚠️ partial |
| Single source | Not verifiable | Cross-reference 3+ independent Tier 1/2 sources |

Query templates

Best practices for X

"{technology}" "best practices" site:infoq.com OR site:martinfowler.com
intitle:"lessons learned" "{technology}" after:2022-01-01

How big companies do X

site:netflixtechblog.com OR site:engineering.atspotify.com "{X}"
"{X}" "how we" OR "at scale" site:medium.com/engineering
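
Site-restricted templates like the ones above can be assembled programmatically. A sketch; `siteQuery` is a hypothetical helper, not part of the skill:

```javascript
// Sketch: build a site-restricted query string like the templates above.
const siteQuery = (term, sites) =>
  `"${term}" ` + sites.map((s) => `site:${s}`).join(' OR ');
```

Usage: `siteQuery('event sourcing', ['netflixtechblog.com', 'engineering.atspotify.com'])` yields `"event sourcing" site:netflixtechblog.com OR site:engineering.atspotify.com`.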

Academic research

https://arxiv.org/search/?searchtype=all&query={X}&start=0
https://scholar.google.com/scholar?q={X}&as_ylo=2022
"{X}" survey intitle:survey site:arxiv.org
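
The arXiv search template above needs the query URL-encoded before use. A sketch (`arxivSearchUrl` is a hypothetical helper; `URLSearchParams` encodes spaces as `+`):

```javascript
// Sketch: build the arXiv search URL from the template above, with proper
// URL-encoding of the query string.
const arxivSearchUrl = (query, start = 0) =>
  'https://arxiv.org/search/?' +
  new URLSearchParams({ searchtype: 'all', query, start: String(start) });
```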

Postmortems & real cases

intitle:postmortem OR intitle:"incident report" "{tech}" site:github.com
site:sre.google "{pattern}"
