@Saccarab
Last active April 20, 2026 05:24
reference to data https://geo-data-viewer-652465950700.europe-west1.run.app/dashboard

open questions on logistic regression methodology

1. DV operational definition

What do we count as is_cited = 1?

In the thesis I measured drift between cited URLs and the SERP top-10, but I also had a separate section on drift between cited links (used inline within the response) and additional links (the side menu GPT provides for more links); see the image below.

I'd personally combine both here, since both were judged relevant enough to go into GPT's context. My rationale is that across different executions of the same prompt, links from the additional section can flip to being cited.

This is not relevant for Gemini, since it doesn't have the "additional" section GPT has.

[Image: gpt_ss]
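The two candidate definitions can be written down as a small helper. This is only a sketch: the `source` labels ("cited", "additional", "serp_only") and the `combine_additional` flag are hypothetical names, not columns from the actual dataset.

```python
def is_cited(source: str, combine_additional: bool = True) -> int:
    """Return 1 if a URL counts as cited under the chosen DV definition.

    `source` is a hypothetical label for where the URL showed up in a run:
    "cited" (inline in the response), "additional" (GPT side menu),
    or anything else (only present in the SERP pool).
    """
    positive = {"cited", "additional"} if combine_additional else {"cited"}
    return 1 if source in positive else 0


# Gemini has no "additional" section, so both definitions agree there.
rows = ["cited", "additional", "serp_only"]
combined = [is_cited(s) for s in rows]
strict = [is_cited(s, combine_additional=False) for s in rows]
```

Keeping the choice behind a flag would make it cheap to run the regression under both definitions and compare.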

2. Fan-out deduplication

ChatGPT issues multiple search queries per prompt (fan-outs, ~2 per run most of the time). A URL can appear in the top-10 of more than one fan-out within the same run. But the model still makes one citation decision per run regardless of how many fan-outs surfaced the URL. So in the raw data a single citation decision shows up as multiple rows with different rank values, which the regression would treat as independent observations.

One solution could be to use the minimum rank of the URL across all fan-outs in the run, but that feels to me like we'd be making an assumption (that only the best rank matters). The alternative is to keep each fan-out as its own row, but I suppose that has other implications, since those rows would not be independent. Definitely need some ideas on this :)
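To make the min-rank option concrete, the raw rows can be collapsed to one row per (run, URL) while keeping the number of fan-outs that surfaced the URL as an extra column, so that information is not thrown away. A minimal sketch, assuming hypothetical field names (`run_id`, `url`, `rank`, `is_cited`) that may not match the real export:

```python
from collections import defaultdict


def collapse_fanouts(rows):
    """Collapse fan-out rows to one observation per (run_id, url).

    Keeps the best (minimum) rank and counts how many fan-outs
    surfaced the URL. Field names are assumptions for illustration.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["run_id"], row["url"])].append(row)

    collapsed = []
    for (run_id, url), group in grouped.items():
        collapsed.append({
            "run_id": run_id,
            "url": url,
            "min_rank": min(r["rank"] for r in group),
            "n_fanouts": len(group),
            # One citation decision per run, so the label is shared.
            "is_cited": max(r["is_cited"] for r in group),
        })
    return collapsed


raw = [
    {"run_id": 1, "url": "a.com", "rank": 3, "is_cited": 1},
    {"run_id": 1, "url": "a.com", "rank": 7, "is_cited": 1},
    {"run_id": 1, "url": "b.com", "rank": 2, "is_cited": 0},
]
obs = collapse_fanouts(raw)
```

If the fan-out rows are kept separate instead, clustering standard errors by run is one common way to account for the dependence, though whether that fits here is part of the open question.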

3. Non-linear rank effect

I didn't know this one would affect the regression, but it was raised to me by AI and it could affect a lot of the findings. As my thesis findings show, top ranks carry disproportionate citation weight: moving from rank 1 to 2 has a much bigger effect than moving from 9 to 10. Is this something we'd need to address in the regression specification?
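For context on why this matters: with rank entered linearly, a logistic regression assumes every one-step change in rank shifts the log-odds by the same amount. A log transform of rank is one common way to relax that (offered here as an option, not something the thesis prescribes). The sketch below only compares the step sizes the two encodings imply:

```python
import math


def step_sizes(transform, ranks=range(1, 10)):
    """Change in the transformed predictor when moving from rank r to r + 1."""
    return {r: transform(r + 1) - transform(r) for r in ranks}


linear = step_sizes(lambda r: float(r))
logged = step_sizes(math.log)

# Linear rank: every step has the same size (1.0).
# Log rank: the 1 -> 2 step is log(2), the 9 -> 10 step is log(10/9).
ratio = logged[1] / logged[9]
```

Under the log encoding the 1 to 2 step is about 6.6 times the size of the 9 to 10 step. Rank dummies or a spline would be less restrictive alternatives if the log shape turns out not to fit.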

4. Bing vs Google rank coverage

I was considering using just the Top-10 SERP results for both Google and Bing. But in the thesis I had to go pretty deep (Top 200 ranks) into Bing's SERP results to get a reasonable pool (see thesis §2.5.4 for the full explanation), so using Top-10 for both would make the results quite disproportionate (8 pages of Bing results barely match Google's top 10 in overlap count).

I'm attaching the Bing and Google overlap results from the thesis below for context.

Bing pool options to match Google's coverage:

Pool | Business coverage | Plus coverage
Bing, Page 1 only | 17% | 5%
Bing, Pages 1–8 | 62% | 48%
Bing, Pages 1–16 (full scrape) | 81% | 68%
Google, Rank 1–10 | 46%
Google, Full scrape | 65%

We could expand Bing to Pages 1–8 and keep Google at top-10. Coverage then becomes comparable (~71–77% Bing vs 62–73% Google). The rank scale asymmetry (1–80 vs 1–10) would be handled this way. But the overall pool size for the Bing models would also be much bigger, so I'd definitely like some guidance here.

[Images: business-bing, plus-bing, business-google, plus-google; Gemini (Google): gemini-google]

5. GPT Plus: cross-engine ranks

We discussed last week that we could put both rank_Bing and rank_Google into one combined GPT Plus model (as we suspect it might be reading from both engines), because ~65% of Plus-cited URLs appear in both. But I'm not sure how we'd handle this, especially given the Bing vs Google coverage issue I mention above.

Would Bing search results have to be normalized so they are comparable to Google?

If Plus reads from both engines but we can't tell which one drove a given citation, would combining both ranks mix signals?
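If normalization turns out to be needed, one simple option is to rescale each engine's rank to the 0–1 range by its own pool depth, so that Bing ranks 1–80 and Google ranks 1–10 sit on the same scale. This is a sketch of that idea only; the function name is made up, and the depths assume the Pages 1–8 / top-10 pools discussed in section 4.

```python
def normalized_rank(rank, pool_depth):
    """Map rank 1..pool_depth onto 0.0 (top) .. 1.0 (bottom of the pool)."""
    if not 1 <= rank <= pool_depth:
        raise ValueError(f"rank {rank} outside pool of depth {pool_depth}")
    if pool_depth == 1:
        return 0.0
    return (rank - 1) / (pool_depth - 1)


BING_DEPTH = 80    # Pages 1-8, assuming 10 results per page
GOOGLE_DEPTH = 10  # top-10

bing_mid = normalized_rank(40, BING_DEPTH)
google_mid = normalized_rank(5, GOOGLE_DEPTH)
```

Rescaling makes the two scales comparable, but it also places Bing rank 8 between Google ranks 1 and 2, which is itself an assumption worth questioning.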

6. Feature sets

This part isn't really a set of open questions, but I have some takes on what shouldn't be considered as an IV. If anything doesn't make sense or needs changing, feel free to flag it.

I think we need different IVs for listicles and product pages, and also for different LLM versions, as their behavior can be quite different.

Listicle models — GPT Business × Bing, GPT Plus × Bing, GPT Plus × Google (8 IVs): rank, has_tables, has_numbered_lists, has_pros_cons, has_clear_authorship, is_vendor_owned, freshness_cue_strength, log_words

Listicle model — Gemini × Google (7 IVs): Dropped freshness_cue_strength — Gemini's fan-out queries explicitly inject year tokens (thesis §3.2.2), which pre-filters the SERP pool for recency and leaves almost no within-pool variance.

Product-page models — all four splits (5 IVs): rank, has_tables, has_numbered_lists, freshness_cue_strength, log_words

has_pros_cons dropped: it doesn't appear most of the time on product pages.

has_bullet_points dropped, as per our earlier discussion.
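For reference, the feature sets above can be written out as a lookup table, which makes the IV counts easy to check. The dictionary keys and variable names are illustrative shorthand; the feature names are the ones listed above.

```python
LISTICLE_IVS = [
    "rank", "has_tables", "has_numbered_lists", "has_pros_cons",
    "has_clear_authorship", "is_vendor_owned",
    "freshness_cue_strength", "log_words",
]

PRODUCT_PAGE_IVS = [
    "rank", "has_tables", "has_numbered_lists",
    "freshness_cue_strength", "log_words",
]

SPLITS = ["gpt_business_bing", "gpt_plus_bing", "gpt_plus_google", "gemini_google"]

FEATURE_SETS = {}
for split in SPLITS:
    listicle = list(LISTICLE_IVS)
    if split == "gemini_google":
        # Dropped for Gemini: year tokens in fan-out queries leave little variance.
        listicle.remove("freshness_cue_strength")
    FEATURE_SETS[("listicle", split)] = listicle
    FEATURE_SETS[("product_page", split)] = list(PRODUCT_PAGE_IVS)
```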
