AI Models 2026: Context Windows, Compacting Hell, and Chinese Models 30x Cheaper (Claude vs GPT vs Gemini vs DeepSeek vs Qwen)

AI Models in 2026: Context Windows, Quality, and the Rise of Chinese Challengers

As you read this, somewhere a developer is screaming at their monitor: "Why did you forget what I told you 5 minutes ago?!"


🎭 TL;DR for the Impatient

The AI landscape has fundamentally shifted:

  • Gemini 3 Pro: context-window champion at 10M tokens (~7.5 million words!)
  • Claude: coding leader (82% SWE-bench), but plagued by compacting issues
  • Chinese models (DeepSeek, Qwen): now at frontier level, but 10-30x cheaper
  • Everyone lies about real context window size: models typically break at 65-80% of advertised capacity

And yes, Sam Altman publicly admitted OpenAI "ruined" GPT-5.2's text quality. Anthropic confirmed bugs in Claude. Gemini regularly crashes with 503 errors. Welcome to 2026.


πŸ“ Context Windows: Marketing vs Reality

What They Promise vs What We Get

You know that moment when you've spent an hour explaining your project architecture to Claude, and it suddenly asks "What's our stack again?" Congratulations: you've met the real context window.

AIMultiple tested 22 models and discovered an uncomfortable truth: most break significantly earlier than advertised. A model with "200K tokens" typically becomes unreliable around ~130K. And it's not a gradual degradation: quality drops suddenly, without warning.

| Model | Advertised | Actually Works | NIAH Accuracy |
|---|---|---|---|
| Gemini 3 Pro | 10M tokens | ~10M (record!) | High |
| Gemini 2.5 Pro | 2M tokens | ~2M (99.7% at 1M) | >99.7% |
| Claude 4 Sonnet | 200K (1M beta) | <5% degradation | >99% |
| GPT-5 | 400K input | 400K | Variable |
| DeepSeek V3.2 | 128-164K | 128K | Strong |
| Qwen3-Max | 258K-1M | 256K native | Near-perfect |
| Mistral Large 3 | 256K | 256K | Stable |
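What an NIAH ("needle in a haystack") test actually does is easy to sketch. Below is a minimal, illustrative harness: `fake_model` is a toy stand-in you would replace with a real API call, and the exact-substring scoring is deliberately simple.

```python
def build_haystack(needle: str, filler: str, total_chars: int, depth: float) -> str:
    """Embed `needle` at a relative `depth` (0.0 = start, 1.0 = end) of filler text."""
    body = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(len(body) * depth)
    return body[:pos] + " " + needle + " " + body[pos:]

def niah_score(model_answer, needle: str, question: str, haystack: str) -> bool:
    """Ask the model the question over the haystack; score by exact-substring match."""
    reply = model_answer(haystack + "\n\n" + question)
    return needle.split(":")[-1].strip() in reply

# Toy stand-in "model" that just searches the text -- replace with a real API call.
def fake_model(prompt: str) -> str:
    for segment in prompt.split("."):
        if "magic number" in segment:
            return segment
    return "not found"

needle = "The magic number is: 7481"
hay = build_haystack(needle, "Grass grows. Wind blows. ", 50_000, depth=0.5)
print(niah_score(fake_model, needle, "What is the magic number?", hay))
```

Real harnesses sweep `depth` and `total_chars` across a grid; the "breaks at ~65% of advertised capacity" finding comes from exactly this kind of sweep.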

Google leads in context capacity. Gemini 1.5 Pro achieved 99.2% retrieval accuracy at 10 million tokens, and the January 2026 release of Gemini 3 Pro brought 10M tokens to production: the largest available context window on the market.

Anthropic expanded Claude to 1M tokens in January 2026 for tier 4+ organizations. Requests over 200K are charged at 2x input and 1.5x output rates. Claude shows less than 5% degradation across its full window, making it one of the most stable performers.


🧠 The "Lost in the Middle" Problem: Why All Models Get Dumber Mid-Context

Stanford and University of Washington researchers discovered a fundamental architectural flaw: all transformer models show a U-shaped performance curve.

What this means in practice:

  • Information at the beginning is remembered well (primacy bias)
  • Information at the end is also remembered (recency bias)
  • Information in the middle suffers a 30%+ performance drop

This isn't a bug in a specific model. It's tied to how positional encodings like Rotary Position Embedding (RoPE), used in nearly all modern LLMs, represent token positions.
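The defining property of RoPE can be demonstrated in a few lines. The sketch below (illustrative, not any model's actual implementation) rotates each pair of vector components by an angle proportional to position, and shows that the resulting attention score depends only on the relative offset between query and key positions, not their absolute locations:

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotary position embedding: rotate each (even, odd) pair by pos * theta_i."""
    out = []
    for i in range(0, len(vec), 2):
        theta = base ** (-i / len(vec))   # per-pair rotation frequency
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Score depends only on the offset: positions (3, 1) and (10, 8) give the same value.
s1 = dot(rope(q, 3), rope(k, 1))
s2 = dot(rope(q, 10), rope(k, 8))
print(abs(s1 - s2) < 1e-9)  # True
```

The relative-position property is what lets RoPE models extrapolate to long contexts at all; the U-shaped recall curve is an emergent failure on top of it, not something the encoding directly prescribes.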

The RULER benchmark from NVIDIA showed shocking results: despite 99%+ scores on simple "needle in a haystack" tests, all models degrade sharply on more complex long-context tasks. Only half the models claiming 32K+ context actually maintain quality at that length.


🔥 Claude's Compacting Problems: Pain in Real-Time

"Compacting our conversation so we can keep chatting..." β€” and then everything goes to hell

If you use Claude, you've definitely seen this message. And probably screamed at your screen afterwards.

What is compacting? When context approaches the limit, Claude "compacts" the conversation, replacing the full history with a brief summary. Sounds reasonable. In practice, it's often catastrophic.
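The general idea can be sketched like this (an illustration of the concept, not Anthropic's actual pipeline; `summarize` here is a trivial stand-in for a model-generated summary, and the 4-chars-per-token count is a crude approximation):

```python
def count_tokens(text: str) -> int:
    # Crude approximation: ~4 characters per token.
    return max(1, len(text) // 4)

def summarize(messages):
    # Stand-in for an LLM-written summary: keep the first sentence of each message.
    points = [m["content"].split(".")[0] for m in messages]
    return "Summary of earlier conversation: " + "; ".join(points)

def compact(history, limit=200, keep_recent=2):
    """If history exceeds `limit` tokens, replace all but the last
    `keep_recent` messages with a single summary message."""
    total = sum(count_tokens(m["content"]) for m in history)
    if total <= limit:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [{"role": "system", "content": summarize(old)}] + recent

history = [{"role": "user", "content": f"Message {i}. " + "Details. " * 30}
           for i in range(6)]
compacted = compact(history)
print(len(history), "->", len(compacted))  # 6 -> 3
```

The failure mode is visible even in this toy: everything not captured by `summarize` is gone for good, and if the summary misses the "narrative thread," no amount of later prompting can recover it.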

Real User Complaints (GitHub Issues, January-February 2026):

🔴 Issue #18482 (26 👍): "Compaction failed unexpectedly"

After the first compaction event, affected conversations enter a broken state: prompts submit briefly, then revert to draft. No execution occurs. This is backend conversation-state corruption related to the compaction pipeline. Issue began around the Jan 15, 2026 compaction incident.

🔴 Issue #18866 (61 👍): "Auto-compact not triggering despite being marked as fixed"

Auto-compact doesn't work. Anthropic flagged this as fixed on January 15, but the issue persists. When context window fills up, one of two things happens: Messages get bounced back to the input box with no error (most common), or a "limit reached" error appears. This happens even when the context shouldn't be anywhere near the 200k token limit.

🔴 Issue #22729 (opened hours before this writing): "Error compacting conversation" causes Session Freeze

Session: ~2,674 messages, 13+ hours of active use. Severity: CRITICAL. Session becomes unusable. No graceful way to save work or exit. Forces user to terminate Claude Code process. Loss of session context.

🔴 Issue #6004: "Infinite compaction loop"

Claude Code v1.0.83 is stuck in an infinite loop attempting to "compact" the conversation. Consequently, I'm getting "Approaching Opus usage limit" a lot faster than what I'm used to for the past 3 months.

What the Community Says:

"There are few things in life that can kill the vibes like this." β€” Du'An Lightfoot, developer

You and Claude were tag teaming building an amazing new app feature when suddenly, Claude can't remember what you discussed five minutes ago. Do you push through? Try to salvage the session? Nope! Here's what I do: /quit. Then immediately after enter claude again to begin a fresh new session.

"All was fine at first, but then it started to forget things... it's just silly to have to tell it 20 times to do the same thing over and over again." β€” Reddit user

"One user on Medium described it as the 'compacting trap.' They'd hit the compact button to be efficient, only to find that Claude had lost the entire narrative thread of the project, leading to a mess of disconnected modules that wouldn't even compile."

Why This Happens Technically:

Claude uses a single shared context buffer with no separation between short-term, long-term, or profile memory. As tokens accumulate, old ones are simply dropped via a sliding window. There is no memory between chats; every conversation starts fresh.

This is a fundamental architectural limitation, and compacting is an attempt to work around it. An attempt that often fails.
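The sliding-window behavior itself fits in a few lines (again an illustrative sketch, not Claude's actual internals):

```python
from collections import deque

def sliding_window(max_tokens=8):
    """A single shared buffer: when full, the oldest tokens are silently dropped."""
    return deque(maxlen=max_tokens)

buf = sliding_window()
for tok in "you told me the stack is Django plus Postgres five minutes ago".split():
    buf.append(tok)

print(list(buf))  # the earliest words have been silently dropped
```

Note what's missing: no warning, no summary, no record that anything was evicted. That silence is exactly why the model can ask "What's our stack again?" with a straight face.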


📊 Benchmarks: Who's Actually Best?

Coding: Claude Still King (When It Works)

| Benchmark | Leader | Score | Runner-up |
|---|---|---|---|
| SWE-bench Verified | Claude Sonnet 4.5 | 82.0% | Claude Opus 4.5 (80.9%) |
| GPQA Diamond | GPT-5.2 | 92.4% | Gemini 3 Pro (91.9%) |
| AIME 2025 | GPT-5.2 / Gemini 3 | 100% | Tied |
| Arena Elo | Gemini 3 Pro | 1501 | GPT-5.2 (~1480) |
| HumanEval | Qwen2.5-Coder-32B | ~92% | Mistral Large 3 (90-92%) |

Claude dominates agentic coding. Claude Sonnet 4.5 leads SWE-bench Verified at 82.0% on real GitHub issue resolution, not synthetic tests: a significant lead over GPT-5.2 (80.0%) and Gemini 3 Pro (76.2%).

GPT-5.2 leads in pure reasoning, scoring 92.4% on GPQA Diamond (graduate-level science questions). Both GPT-5.2 and Gemini 3 Pro achieved a perfect 100% on AIME 2025 (mathematics).

Gemini 3 Pro broke the Arena. First model to breach the 1500 Elo barrier on LMSYS Chatbot Arena, based on over 6 million real user votes.


🇨🇳 Chinese Models: This Is Serious Now

DeepSeek V3.2: Gold Medals and 30x Lower Prices

DeepSeek V3.2 won gold medals at IMO 2025 and IOI 2025, and took second place at the ICPC World Finals. Performance matches or exceeds top Western models.

Pricing (per million tokens):

| Model | Input | Output |
|---|---|---|
| GPT-4o | $3 | $10 |
| Claude Opus | $15 | $75 |
| DeepSeek V3.2 | $0.27 | $1.10 |

Yes, you read that right. At comparable quality, DeepSeek is roughly 10x cheaper than GPT-4o and over 50x cheaper than Claude Opus.
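Plugging the table's numbers into a quick workload calculation makes the gap concrete (prices per million tokens as listed above; the workload figures are made up for illustration):

```python
# Price per million tokens (input, output), from the table above.
prices = {
    "GPT-4o": (3.00, 10.00),
    "Claude Opus": (15.00, 75.00),
    "DeepSeek V3.2": (0.27, 1.10),
}

def monthly_cost(model, input_mtok, output_mtok):
    """Cost in dollars for a workload measured in millions of tokens."""
    inp, out = prices[model]
    return input_mtok * inp + output_mtok * out

# Example workload: 100M input tokens, 20M output tokens per month.
for model in prices:
    print(f"{model}: ${monthly_cost(model, 100, 20):,.2f}")
# Roughly $49 vs $500 vs $3,000 -- about a 10x and 60x gap.
```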

Training cost for V3 was only $5.576 million, pennies compared to Western competitors.

Technical innovations:

  • Multi-Head Latent Attention (MLA): 50-70% memory savings
  • DeepSeek Sparse Attention (DSA): ~70% inference cost reduction for long contexts

Qwen3-Max: Third Place Globally, Beats GPT-5-Chat

Alibaba's Qwen3-Max ranked third globally on LMArena, surpassing GPT-5-Chat. The Qwen3-Max-Thinking variant achieves 100% accuracy on AIME25 and HMMT, matching or exceeding Gemini 3 Pro and GPT-5.2.

The family includes:

  • Dense models from 0.6B to 32B parameters
  • MoE models up to 235B total (22B active)
  • Support for 119 languages
  • Apache 2.0 license: full commercial freedom

What Else Is Coming from China:

  • Kimi K2.5 from Moonshot AI: "most powerful open-weights model" with Agent Swarm (100 parallel subagents)
  • ERNIE 5.0 from Baidu: 2.4 trillion parameters, omnimodal, GPT-5 parity on several benchmarks, $0.85/M tokens

😱 Confessions and Degradation: What the Companies Say

Sam Altman Admitted: "We Ruined GPT-5.2"

At a January 2026 developer meeting, OpenAI's CEO stated:

"I think we just messed up."

The company focused on intelligence, coding, and reasoning, but "due to limited bandwidth" neglected text quality. He promised future 5.x versions will write "much better than 4.5" and announced plans to make GPT-5.2 level intelligence 100x cheaper by end of 2027.

Anthropic Confirmed Bugs in Claude

In September 2025, Anthropic acknowledged:

"We've received reports... that Claude and Claude Code users have been experiencing inconsistent responses. We opened investigations into a number of bugs causing degraded output quality on several of our models for some users."

Two bugs were fixed in Claude Sonnet 4 and Claude Haiku 3.5. Investigation into Claude Opus 4.1 continues.

Important: Anthropic explicitly denied intentionally degrading models due to load: "We never reduce model quality due to demand, time of day, or server load."

Independent tracker Margin Lab detected a 4.1% performance decline in Claude Code over 30 days based on 655 evaluations.

ChatGPT Also Loses Memory

February 2025 saw a massive ChatGPT memory collapse affecting thousands of users, "wiping years of user data" according to community reports.

Users report that after 30-50 messages, GPT-4o "often forgets earlier parts" of the conversation. Critical data loss continues: "Conversations with long history are progressively breaking: Parts of the dialogue disappear, sometimes entire hours."


⚡ Service Reliability: Everyone Goes Down

| Service | Recent Performance | Key Issues |
|---|---|---|
| Claude | 152+ incidents since Oct 2025 | Opus 4.5 errors, compacting failures |
| Gemini | 65+ incidents since Jun 2025 | Frequent 503 errors, API instability |
| ChatGPT | Memory failures, peak throttling | Feb 2025 data loss event |
| DeepSeek | Variable, hidden throttling | 30-min timeouts, queue deprioritization |
| Mistral | Most stable | ~0.1% gibberish rate |

Gemini API is particularly unstable. Developer forums say it plainly: "Over the past few days, the Gemini API... has become nearly unusable" and "I am losing money and clients... falling back to GPT for now." Peak hours (12:00-16:00 Madrid Time) show the worst 503 rates.


🎯 Practical Recommendations

By Use Case:

| Task | Best Choice | Why |
|---|---|---|
| Maximum context | Gemini 2.5/3 Pro | 2-10M tokens, >99% retrieval |
| Coding & development | Claude Sonnet 4.5 | 82% SWE-bench, best for agents |
| Budget projects | DeepSeek V3.2 | Frontier quality at $0.27/M |
| Reasoning | GPT-5.2 or Gemini 3 Pro | 92%+ GPQA, 100% AIME |
| Reliability | Mistral Large 3 | Minimum incidents |

If You Use Claude:

  1. Don't rely on auto-compact: make manual checkpoints at ~70% context usage
  2. Use CLAUDE.md: a file in the project root that Claude reads automatically
  3. Start fresh sessions: better to restart often than fight corrupted state
  4. Watch GitHub Issues: that's where problems surface first
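A CLAUDE.md doesn't need to be elaborate. Something like the following (contents purely illustrative) already saves a lot of re-explaining after a restart:

```markdown
# Project notes for Claude

## Stack
- Django 5 + PostgreSQL 16, frontend in React/TypeScript

## Conventions
- New endpoints go in api/v2/; never modify api/v1/
- Run the test suite before proposing a commit
```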

If You're Considering Chinese Alternatives:

  • DeepSeek: excellent for coding, but possible hidden throttling under high load
  • Qwen: Apache 2.0 license allows self-hosting, no vendor lock-in
  • Consider latency: servers in China may add delays for your region
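A practical upside: these vendors expose OpenAI-compatible chat endpoints, so switching is mostly a matter of changing a base URL and a model name. A minimal sketch of the shared request payload (model names here are illustrative; check each vendor's docs for current identifiers):

```python
import json

# An OpenAI-compatible chat-completion payload works unchanged against any
# vendor exposing that schema -- only the base URL and model name differ.
def chat_request(model: str, user_message: str, max_tokens: int = 512) -> str:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return json.dumps(payload)

# Swap models without touching calling code (model names are illustrative):
for model in ("deepseek-chat", "qwen3-max"):
    body = chat_request(model, "Summarize this repo's architecture.")
    print(model, len(body))
```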

📚 Sources


GitHub Issues (Claude compacting):

  • #18482: Compaction corrupts conversation state
  • #18866: Auto-compact not triggering
  • #22729: Error compacting causes session freeze
  • #17808: Compacting conversation twice



Data collected: February 3, 2026

P.S. If you made it this far, congratulations: your context window is clearly larger than some models'. 😄
