AI agent skill for grounded web research — verified sources, tier classification, and anti-hallucination guardrails.
Use this skill whenever the correct answer depends on external evidence rather than the model's internal knowledge: best practices, benchmarks, how big companies do X, technical references, papers, tech comparisons, trends.
| Intent | Example requests |
|---|---|
| Best practices | "best practices for X", "what's the right way to do X" |
| Industry standards | "what does the market use", "industry standard for X" |
| How big companies do it | "how Netflix does X", "how Spotify/Google/Amazon handle X" |
| References & papers | "references on X", "papers about X", "articles on X" |
| Benchmarks & comparisons | "X vs Y", "is X worth it?", "when to use X vs Y" |
| Real examples | "real-world examples of X", "which companies use X in production" |
| Trends | "trends in X", "what changed in X in 2024" |
| Academic research | "papers on X", "what does academia say about X" |
Also activate on implicit questions where the answer depends on evidence:
- "Should I migrate to X?" → needs benchmarks and real cases
- "Which database should I use here?" → needs verified comparisons
- "Is this a good practice?" → needs cross-referencing sources
Do not activate for: direct implementation questions ("how do I do X in Python?"), specific code debugging, or simple conceptual explanations that don't require external evidence.
1. TRANSLATE query to English (if in PT-BR)
- Run searches in BOTH languages in parallel
- EN: global technical corpus, papers, official docs
- PT-BR: local context, Brazilian community
- If output format is unclear, ask before searching
2. CHOOSE the right strategy:
- Know the source? → webfetch directly to canonical URL
- Need to discover sources? → Playwright → Google → webfetch the best results
- Academic topic? → arXiv search + Scholar
- Code patterns? → GitHub search
- Quick alternative without browser? → Brave Search
3. EXECUTE the search across at least 2 different sources
- If webfetch truncates (100KB limit), try in order:
1. Sub-URL of relevant section or anchor (/docs/, #section)
2. Save via curl to temp file and read after:
curl -sL "https://example.com/article" -o /tmp/web_research_tmp.html
cat /tmp/web_research_tmp.html
3. Playwright + snapshot (extracts only visible text, no KB limit)
- Partial content beats nothing — only discard if empty, error, or irrelevant
4. VALIDATE each source before using:
- Access real content via webfetch — never use only the search result title
- Check: is the domain recognizable? does the author have credibility? does the date make sense?
- Discard: AI farms, SEO spam, generic tutorials without authorship, inconsistent dates
5. SYNTHESIZE ranked by reliability:
- Tier 1: official spec, peer-reviewed paper, official docs
- Tier 2: company engineering blog with real case, conference talk
- Tier 3: personal blog from recognized reference
- Tier 4: Reddit, Stack Overflow (specific points, not for patterns)
- Cross-reference 3+ Tier 1/2 sources before asserting as established
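The tier ranking above can be sketched as a simple domain lookup. The domain-to-tier map is illustrative, not exhaustive; unknown domains default to the lowest tier, and "established" requires 3+ independent Tier 1/2 domains:

```javascript
// Illustrative map; extend with domains you actually trust.
const TIER_BY_DOMAIN = {
  "arxiv.org": 1,
  "rfc-editor.org": 1,
  "netflixtechblog.com": 2,
  "engineering.atspotify.com": 2,
  "martin.kleppmann.com": 3,
  "stackoverflow.com": 4,
};

function hostOf(url) {
  return new URL(url).hostname.replace(/^www\./, "");
}

function tierOf(url) {
  return TIER_BY_DOMAIN[hostOf(url)] ?? 4; // unknown domain: lowest tier
}

// A claim counts as established only when 3+ independent
// Tier 1/2 domains agree.
function isEstablished(urls) {
  const strong = new Set(urls.filter((u) => tierOf(u) <= 2).map(hostOf));
  return strong.size >= 3;
}
```

Note the deduplication by hostname: three articles from the same engineering blog still count as one source.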
6. PRESENT with clickable links + source date
- Output must be readable for humans — not for LLMs to process
- Translate EN findings to PT in the final response
- Cite original URL even if synthesis is in PT
- Mark own analysis/commentary with Note: before the paragraph
These rules apply to every execution, no exceptions.
- Never invent URLs. Every cited URL must have been accessed via webfetch or Playwright in the current session. Do not cite URLs from memory.
- Never invent source content. If webfetch returned an error or insufficient content, mark the source as ❌ and exclude it from the synthesis.
- Never invent dates. If the page doesn't show a publication date, the field stays —. Never estimate or infer dates from article context.
- Never attribute a claim to a source without having read it. If only the title was seen in the search results and the page wasn't accessed, don't cite that source as if the content was read.
- If you didn't find enough sources — say so. An honest response with 2 verified sources beats a complete response with made-up sources.
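These rules can be enforced mechanically. A sketch of a per-source record (field names are my own, not from any library) that refuses never-fetched URLs and never infers a date:

```javascript
// A URL is citable only if it was actually fetched this session;
// the date stays null unless the page itself showed one.
function makeSourceRecord({ url, fetchedThisSession, content, shownDate }) {
  if (!fetchedThisSession) {
    throw new Error(`refusing to cite a URL never fetched: ${url}`);
  }
  const ok = typeof content === "string" && content.length > 0;
  return {
    url,
    date: shownDate ?? null, // never estimated or inferred
    status: ok ? "accessed" : "error",
    citable: ok, // error/empty sources are excluded from synthesis
  };
}
```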
Inline citation at the end of the item, with clickable link:
- Define a specific role: "helpful assistant" doesn't work, "test engineer writing tests for React" works. ([GitHub Blog](URL), Nov/2025)
Table at the end (always include):
## Sources consulted
| # | Source | Tier | Date | Status |
|---|---|---|---|---|
| 1 | [Name](URL) | Tier 1 | Month/Year | ✅ accessed |
| 2 | [Name](URL) | Tier 2 | — | ⚠️ partial |
| 3 | [Name](URL) | Tier 3 | — | ❌ 403 |

| Method | Status | Purpose |
|---|---|---|
| Direct URL + webfetch | ✅ Always prefer | Accesses real content when the URL is known and the site doesn't block |
| Direct URL + Playwright | ✅ Fallback for blocked sites | Use when webfetch returns 403, CAPTCHA, empty page, or truncated content |
| Google + Playwright | ✅ Source discovery | Google requires JavaScript to render results — Playwright uses a real browser; collects URLs from pages 1 and 2 |
| Brave Search + webfetch | ✅ Quick alternative | Discovers sources without browser — clean links, no CAPTCHA |
| Bing + webfetch | ✅ Works w/ caveat | Real URL is Base64-encoded in the u= param — decode before use |
| arXiv + webfetch | ✅ Papers | /abs/{id} for abstract; /html/{id} for full content when available |
| Google + webfetch | ❌ Always blocked | JS redirect — doesn't work without a real browser |
| DuckDuckGo + webfetch | ❌ CAPTCHA | Visual CAPTCHA, no workaround |
| PDFs via webfetch | ❌ Binary | Look for equivalent HTML version |
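The Bing decoding mentioned above, sketched in Node. The "a1" prefix and base64url encoding are observed behavior, not documented by Bing, and may change:

```javascript
// Extract the real destination from a Bing redirect link.
function decodeBingUrl(bingLink) {
  const u = new URL(bingLink).searchParams.get("u") || "";
  const payload = u.startsWith("a1") ? u.slice(2) : u; // drop observed "a1" prefix
  return Buffer.from(payload, "base64url").toString("utf8"); // Node >= 15.7
}
```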
Playwright snippet to collect organic result links from a Google results page (used by the Google + Playwright method above):
async (page) => {
const results = await page.evaluate(() => {
const items = [];
document.querySelectorAll('a[href]').forEach(a => {
const h3 = a.querySelector('h3');
if (!h3) return;
let url = a.href;
// Google wraps results in /url?q=<real_url> — extract the real destination
if (url.includes('google.com/url')) {
try { url = new URL(url).searchParams.get('q') || url; } catch (_) {}
}
if (url && !url.includes('google.com') && !url.startsWith('/'))
items.push({ title: h3.innerText.trim(), url });
});
return items.slice(0, 15);
});
return JSON.stringify(results, null, 2);
}

# Global
netflixtechblog.com # Netflix
engineering.atspotify.com # Spotify
dropbox.tech # Dropbox
engineering.fb.com # Meta
aws.amazon.com/blogs/aws # AWS
shopify.engineering # Shopify
uber.com/blog/engineering # Uber
confluent.io/blog # Confluent/Kafka
martin.kleppmann.com # Distributed systems
# Brazil
engineering.nubank.com.br
| Source | Domain | Note |
|---|---|---|
| arXiv | arxiv.org | /abs/{id} for abstract, /html/{id} for full — never /pdf/ (binary) |
| Semantic Scholar | semanticscholar.org | Citation graph, open access links |
| Papers with Code | paperswithcode.com | ML papers + implementations |
rfc-editor.org/rfc/rfcXXXX # IETF RFCs
learn.microsoft.com/azure/architecture # Azure Architecture Center
aws.amazon.com/architecture # AWS Architecture Center
landscape.cncf.io # CNCF Landscape
| Gotcha | Behavior | Fix |
|---|---|---|
| PDF via webfetch | Returns unreadable binary | Look for /html/ version |
| Bing links | Real URL is Base64url-encoded in u= param | Decode the u= param before fetching |
| Product page ≠ content | Main URL has index, not real content | Explore sub-URLs: /tips/, /resources/, /docs/ |
| 100KB truncation | Long pages cut off mid-content | 1) Try sub-URL or anchor; 2) Playwright snapshot; 3) Use what you got and mark |
| Single source | Not verifiable | Cross-reference 3+ independent Tier 1/2 sources |
"{technology}" "best practices" site:infoq.com OR site:martinfowler.com
intitle:"lessons learned" "{technology}" after:2022-01-01
site:netflixtechblog.com OR site:engineering.atspotify.com "{X}"
"{X}" "how we" OR "at scale" site:medium.com/engineering
https://arxiv.org/search/?searchtype=all&query={X}&start=0
https://scholar.google.com/scholar?q={X}&as_ylo=2022
"{X}" survey intitle:survey site:arxiv.org
intitle:postmortem OR intitle:"incident report" "{tech}" site:github.com
site:sre.google "{pattern}"
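The arXiv and Scholar URL templates above can be filled safely with encodeURIComponent, which keeps multi-word queries valid; a sketch, with parameter names matching the templates:

```javascript
// Fill the search URL templates; {X} becomes an encoded query string.
function arxivSearchUrl(query, start = 0) {
  return `https://arxiv.org/search/?searchtype=all&query=${encodeURIComponent(query)}&start=${start}`;
}

function scholarUrl(query, sinceYear = 2022) {
  return `https://scholar.google.com/scholar?q=${encodeURIComponent(query)}&as_ylo=${sinceYear}`;
}
```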