Reference to data: https://geo-data-viewer-652465950700.europe-west1.run.app/dashboard
What do we consider as is_cited = 1?
In the thesis I measured drift between cited URLs and the SERP top-10, but I also had a separate section on drift between cited links (used inline within the response) and additional links (the side menu GPT provides for more links); see the image below.
I'd personally combine both here, since they were seen as relevant enough to go into GPT's context. My rationale is that across different executions of a prompt, links from the additional section can actually flip to being used as cited.
This is not relevant at all for Gemini, as it doesn't have the "additional" section GPT has.
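To make the two candidate definitions concrete, here is a minimal sketch of the labelling rule. The `link_type` field and its values (`"cited"`, `"additional"`, `"serp_only"`) are hypothetical names for illustration, not the actual columns in the dataset.

```python
def is_cited(link_type: str, include_additional: bool = True) -> int:
    """Return 1 if the URL counts as cited under the chosen definition.

    link_type is a hypothetical field: "cited" (used inline in the response),
    "additional" (ChatGPT side menu), or anything else for SERP-only URLs.
    """
    positive = {"cited", "additional"} if include_additional else {"cited"}
    return int(link_type in positive)


# Toy rows, invented purely for illustration.
rows = [
    {"url": "a.com", "link_type": "cited"},
    {"url": "b.com", "link_type": "additional"},
    {"url": "c.com", "link_type": "serp_only"},
]

combined = [is_cited(r["link_type"]) for r in rows]
strict = [is_cited(r["link_type"], include_additional=False) for r in rows]
print(combined, strict)
```

Running the regression once per definition would show how sensitive the findings are to this choice. For Gemini the two definitions coincide, since there is no additional section.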
ChatGPT issues multiple search queries per prompt (fan-outs, ~2 per run most of the time). A URL can appear in the top-10 of more than one fan-out within the same run, but the model still makes one citation decision per run, regardless of how many fan-outs surfaced the URL. So in the raw data a single citation decision shows up as multiple rows with different rank values, which the regression would treat as independent observations.
One solution could be to use the minimum rank of the cited URL in the SERP across all fan-outs, but that feels to me like we'd be making an assumption there. Alternatively, we could keep both fan-outs as different rows, but I suppose that could also have other implications? Definitely need some ideas on this :)
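The min-rank option could look roughly like the sketch below, which collapses the fan-out rows into one row per (run, URL). The field names (`run_id`, `fanout_id`, `url`, `rank`, `is_cited`) are assumptions about the raw data layout, and the extra `n_fanouts` column is a hypothetical addition that records how many fan-outs surfaced the URL.

```python
from collections import defaultdict


def collapse_fanouts(rows):
    """Collapse fan-out rows to one row per (run_id, url).

    Keeps the best (minimum) rank across fan-outs, the number of distinct
    fan-outs that surfaced the URL, and the citation outcome, which is
    assumed to be constant within a run.
    """
    grouped = defaultdict(list)
    for r in rows:
        grouped[(r["run_id"], r["url"])].append(r)

    collapsed = []
    for (run_id, url), group in grouped.items():
        collapsed.append({
            "run_id": run_id,
            "url": url,
            "min_rank": min(g["rank"] for g in group),
            "n_fanouts": len({g["fanout_id"] for g in group}),
            "is_cited": max(g["is_cited"] for g in group),
        })
    return collapsed


# Toy data: a.com is surfaced by both fan-outs of run 1, b.com by one.
raw = [
    {"run_id": 1, "fanout_id": 1, "url": "a.com", "rank": 3, "is_cited": 1},
    {"run_id": 1, "fanout_id": 2, "url": "a.com", "rank": 7, "is_cited": 1},
    {"run_id": 1, "fanout_id": 1, "url": "b.com", "rank": 5, "is_cited": 0},
]
result = collapse_fanouts(raw)
for row in result:
    print(row)
```

Keeping `n_fanouts` alongside `min_rank` means the information dropped by taking the minimum is not lost entirely, so the min-rank assumption can be checked rather than taken on faith.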
I didn't know this one would affect the regression, but it was raised to me by AI and it could also affect a lot of the findings. As my thesis findings show, top ranks carry disproportionate citation weight: moving from rank 1 to 2 has a much bigger effect than moving from 9 to 10. Is this something we'd need to address in the regression logic?
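One common way to let a regression reflect this kind of diminishing effect is to enter log(rank) instead of raw rank. The sketch below only illustrates what that transform does to the step sizes; whether log is the right functional form for this data is an assumption, not something the thesis confirms.

```python
import math


def log_rank(rank: int) -> float:
    """Natural log of a 1-based rank (rank 1 maps to 0.0)."""
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return math.log(rank)


# With raw rank, both steps are exactly 1 unit. With log(rank), the step at
# the top of the SERP is several times larger than the step at the bottom.
step_top = log_rank(2) - log_rank(1)
step_bottom = log_rank(10) - log_rank(9)
print(f"1 -> 2:  {step_top:.3f}")
print(f"9 -> 10: {step_bottom:.3f}")
```

Under this transform a single coefficient on `log_rank` implies a larger change in citation odds between ranks 1 and 2 than between ranks 9 and 10, which matches the direction of the thesis finding.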
I was considering using just the top-10 SERP results for both Google and Bing. But in the thesis I actually had to go pretty deep (top 200 ranks) on Bing's SERP results to get a reasonable pool (thesis §2.5.4 for the full explanation), so using top-10 for both would make the results quite disproportionate (8 pages of results on Bing barely match Google's top 10 in overlap count).
I'm attaching the Bing and Google overlap results from the thesis below for context.
Bing pool options to match Google's coverage:
| Pool | Business coverage | Plus coverage |
|---|---|---|
| Bing — Page 1 only | 17% | 5% |
| Bing — Pages 1–8 | 62% | 48% |
| Bing — Pages 1–16 (full scrape) | 81% | 68% |
| Google — Rank 1–10 | — | 46% |
| Google — Full scrape | — | 65% |
We can expand Bing to Pages 1–8 and keep Google at top-10. Coverage becomes comparable (~71–77% Bing vs 62–73% Google). Rank scale asymmetry (1–80 vs 1–10) would be handled this way. But the overall pool size for the Bing models would also be way bigger, so I'd definitely like some guidance here.
**Gemini — Google**
We discussed last week that we could put both rank_Bing and rank_Google in one combined GPT Plus model (as we suspect it might be reading from both engines), because ~65% of Plus-cited URLs appear in both. But I'm not sure how we'd address this at all, especially given the Bing vs Google issue I mention above.
Would Bing search results have to be normalized so they are comparable to Google?
If Plus reads from both engines but we can't tell which one drove a given citation, would combining both ranks mix signals?
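On the normalization question, one simple option is to rescale each engine's rank to its position within that engine's pool, so that both end up on a 0 to 1 scale. The sketch below assumes a Google pool of 10 and a Bing pool of 80 (pages 1–8 at 10 results per page); the function name and constants are hypothetical.

```python
def rank_percentile(rank: int, pool_size: int) -> float:
    """Map rank 1..pool_size onto 0.0 (top of pool) .. 1.0 (bottom of pool)."""
    if pool_size < 2:
        raise ValueError("pool_size must be at least 2")
    if not 1 <= rank <= pool_size:
        raise ValueError("rank must lie within the pool")
    return (rank - 1) / (pool_size - 1)


GOOGLE_POOL = 10  # top-10
BING_POOL = 80    # pages 1-8, assuming 10 results per page

# Top and bottom of each pool land on the same normalized values.
print(rank_percentile(1, GOOGLE_POOL), rank_percentile(1, BING_POOL))
print(rank_percentile(10, GOOGLE_POOL), rank_percentile(80, BING_POOL))
```

Note that this treats Bing rank 80 as equivalent to Google rank 10, which is itself an assumption. It makes the scales comparable, but it does not answer which engine drove a given citation.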
I guess this part is not really open questions, but I have some takes on what should and shouldn't be considered as IVs. If anything doesn't make sense or needs changing, feel free to flag it.
I think we need different IVs for listicles and also for different LLM versions, as their behavior can be quite different.
Listicle models — GPT Business × Bing, GPT Plus × Bing, GPT Plus × Google (8 IVs): rank, has_tables, has_numbered_lists, has_pros_cons, has_clear_authorship, is_vendor_owned, freshness_cue_strength, log_words
Listicle model — Gemini × Google (7 IVs):
Same IVs as the other listicle models, minus freshness_cue_strength. Gemini's fan-out queries explicitly inject year tokens (thesis §3.2.2), which pre-filters the SERP pool for recency and leaves almost no within-pool variance.
Product-page models — all four splits (5 IVs): rank, has_tables, has_numbered_lists, freshness_cue_strength, log_words
has_pros_cons dropped: it doesn't appear most of the time on product pages.
has_bullet_points dropped, as per our earlier discussion.
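The IV sets above can be written down as a small lookup so the counts are easy to sanity-check. This is a sketch only: the split labels and the `ivs_for` helper are hypothetical, and "all four splits" is assumed to mean the three GPT splits plus Gemini × Google.

```python
LISTICLE_IVS = [
    "rank", "has_tables", "has_numbered_lists", "has_pros_cons",
    "has_clear_authorship", "is_vendor_owned", "freshness_cue_strength",
    "log_words",
]
PRODUCT_IVS = [
    "rank", "has_tables", "has_numbered_lists",
    "freshness_cue_strength", "log_words",
]
SPLITS = [
    "GPT Business × Bing", "GPT Plus × Bing",
    "GPT Plus × Google", "Gemini × Google",
]


def ivs_for(page_type: str, split: str) -> list:
    """Return the IV list for a page type and model/engine split."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split}")
    if page_type == "product":
        return list(PRODUCT_IVS)
    if page_type == "listicle":
        if split == "Gemini × Google":
            # freshness_cue_strength is dropped for the Gemini listicle model.
            return [iv for iv in LISTICLE_IVS if iv != "freshness_cue_strength"]
        return list(LISTICLE_IVS)
    raise ValueError(f"unknown page type: {page_type}")


for split in SPLITS:
    print(split, len(ivs_for("listicle", split)), len(ivs_for("product", split)))
```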