Why LLMs Often Make "Locally Smart, Globally Wrong" Engineering Decisions

One of the core problems in LLM-driven software development is not that models are simply "bad at coding." It is that they are often extremely strong at local optimization inside the current task frame, while being much weaker at preserving system intent, architectural boundaries, and domain-level coherence.

This creates a very specific failure mode.

An LLM can often:

  • continue a design direction very convincingly,
  • find subtle edge cases,
  • reverse-engineer unfamiliar code quickly,
  • detect rare bugs or strange technical interactions,
  • produce a patch that looks internally consistent.

At the same time, it may completely miss things that would look obvious to a human who understands the system:

  • that the proposed change silently alters the principle the module is based on,
  • that a similar mechanism already exists elsewhere in the codebase,
  • that the problem should be solved one level higher in the abstraction hierarchy,
  • that the patch introduces conceptual duplication,
  • that the current implementation is a historical artifact and should not be imitated,
  • that the task itself has been framed incorrectly.

This is why LLMs can sometimes discover extremely sophisticated technical issues while failing to notice something simpler but more important, such as "this subsystem already solves the same class of problem" or "this change breaks the architecture's intended symmetry."

Local intelligence vs. system intelligence

A useful way to think about the current generation of LLMs is this:

They are often excellent at reasoning over artifacts and much weaker at reasoning over intent.

Artifacts include:

  • source code,
  • APIs,
  • logs,
  • test failures,
  • call graphs,
  • configuration,
  • naming patterns,
  • repeated implementation structures.

Intent includes:

  • why the system is designed the way it is,
  • which constraints are fundamental and which are accidental,
  • where the real abstraction boundaries are,
  • what kind of precedent a new mechanism creates,
  • whether a local solution is aligned with the long-term shape of the system.

Humans with domain understanding are often much better at intent. LLMs are often much better at artifact-level pattern recognition. This difference explains a large share of the "why is it doing such clever nonsense?" feeling that appears in AI-assisted development.

An important nuance: this gap is not purely a limitation of model intelligence. It is partly a consequence of technical constraints — particularly context window limits and the "lost-in-the-middle" effect, where models struggle to attend equally to all parts of a long input. Even with 1M+ token windows available in 2025-era models, research shows that attention quality degrades with input length. A typical enterprise monorepo can span millions of tokens — far more than any current context window. This means that when a model "misses" a relevant subsystem, it may literally not have that subsystem in its working context, rather than failing to reason about it.
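
To make the scale concrete, here is a rough way to measure how a repository compares to a context window. This is a minimal sketch, assuming tiktoken's `cl100k_base` encoding as a stand-in for whatever tokenizer a given model actually uses; the file-type filter is an arbitrary illustration.

```python
# Rough order-of-magnitude estimate of how many tokens a repository would
# occupy in a model's context window. Uses tiktoken's cl100k_base encoding
# as a proxy; real models tokenize differently.
from pathlib import Path

import tiktoken

SOURCE_SUFFIXES = {".py", ".ts", ".go", ".java", ".rs", ".md"}  # arbitrary subset

def estimate_repo_tokens(root: str) -> int:
    enc = tiktoken.get_encoding("cl100k_base")
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in SOURCE_SUFFIXES:
            total += len(enc.encode(path.read_text(errors="ignore")))
    return total

if __name__ == "__main__":
    tokens = estimate_repo_tokens(".")
    print(f"~{tokens:,} tokens ({tokens / 1_000_000:.2f}x a 1M-token window)")
```

Run against a large monorepo, a script like this routinely reports several multiples of even the largest available window, which is the point: the model is not choosing to ignore the rest of the system; the rest of the system is not there.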

Why models tend to solve the problem in front of them instead of re-framing it

LLMs trained with RLHF (Reinforcement Learning from Human Feedback) are optimized to be:

  • helpful,
  • cooperative,
  • responsive,
  • decisive,
  • productive within the current conversational frame.

This optimization creates a well-documented tendency known as sycophancy — the model's bias toward producing responses that match user expectations rather than challenging them. Research published at ICLR 2024 by Anthropic found that five state-of-the-art AI assistants consistently exhibit sycophantic behavior across varied text-generation tasks. A 2025 study in npj Digital Medicine showed models complying with illogical medical requests at rates of up to 100%, prioritizing helpfulness over logical consistency.

As a result, models are biased toward behavior like this:

  1. Accept the task as given.
  2. Infer a plausible local model of the code.
  3. Move forward quickly.
  4. Produce a concrete solution.
  5. Minimize hesitation and friction.

In real engineering, however, the most valuable move is often the opposite:

  • stop,
  • question the framing,
  • reject the implied solution,
  • point out missing context,
  • ask whether the problem should be solved elsewhere,
  • notice that the proposed implementation violates an architectural principle.

Current models are often under-optimized for this kind of behavior. They are rewarded more for "making progress" than for saying "this is the wrong level at which to make progress."

That makes them feel less like co-architects and more like hyper-capable executors.

It is worth noting that this is an active area of research. Techniques like DPO (Direct Preference Optimization) and constitutional AI methods are being used to specifically penalize sycophantic responses while preserving helpfulness. Additionally, "thinking" or "reasoning" model variants (such as Claude's extended thinking, DeepSeek R1, and OpenAI's o-series) are specifically designed to reason before acting, partially addressing the "rushing forward" tendency. These mitigations are real but incomplete — the fundamental tension between helpfulness and skepticism remains.
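
For readers unfamiliar with DPO, the objective itself is compact enough to show. Below is a minimal PyTorch sketch, assuming you already have summed log-probabilities of a preferred and a dispreferred response under both the policy being trained and a frozen reference model; the tensor names are illustrative, not from any specific library.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # How much the policy prefers the chosen response over the rejected one,
    # measured relative to the frozen reference model.
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Maximizing the scaled margin difference favors the human-preferred
    # response without letting the policy drift far from the reference.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

The preference pairs carry the signal: if annotators consistently prefer responses that push back on a flawed framing, this objective rewards that skepticism directly rather than rewarding agreement.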

Why reverse engineering often replaces higher-level thinking

A common pattern in LLM-assisted coding looks like this:

  1. The model inspects existing code.
  2. It infers how the current system appears to work.
  3. It builds a patch consistent with that observed pattern.
  4. It returns a solution that fits the local surface structure.

What is often missing is the next question:

  • Should the system work this way at all?
  • Is this pattern a principled design or just accumulated historical residue?
  • Is copying this pattern reinforcing a mistake rather than extending a design?
  • Are we preserving a constraint or merely preserving inertia?

This is one of the biggest practical limitations of using LLMs as coding agents. They tend to respect the surface of the current implementation more than the underlying truth of the design. Research on cognitive biases in LLM-assisted development confirms that such biased decisions can cascade through entire codebases and create compounding technical debt. A 2026 analysis identified five specific anti-patterns that LLMs actively reintroduce into codebases, including over-commenting and pattern replication from training data.

A strong engineer does not just read the codebase. They classify parts of it into:

  • essential structure,
  • accidental complexity,
  • legacy constraints,
  • copied patterns that should not spread further,
  • abstractions that are missing but should exist.

LLMs are still inconsistent at this classification, though agent-based approaches with explicit planning phases — such as Self-Planning — are beginning to address this by forcing models to decompose problems and consider system structure before generating code.
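
A minimal sketch of that plan-then-implement pattern is below. `call_llm` is a placeholder for whatever chat-completion client you use, and the prompts are illustrative, not taken from the Self-Planning paper.

```python
def plan_then_implement(task: str, call_llm) -> str:
    # Phase 1: force a structural analysis before any code exists.
    plan = call_llm(
        "Before writing any code, produce a numbered plan for the task below.\n"
        "Explicitly classify the relevant existing code as essential structure,\n"
        "accidental complexity, or legacy constraint, and state at which\n"
        "abstraction level the change belongs.\n\n"
        f"Task: {task}"
    )
    # Phase 2: generation is conditioned on the model's own structural
    # analysis, not just on the raw task description.
    return call_llm(f"Plan:\n{plan}\n\nNow implement this plan for: {task}")
```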

Why they catch rare edge cases but miss obvious architectural conflicts

This apparent contradiction is in fact quite natural.

LLMs can be very good at:

  • enumerating hidden execution paths,
  • spotting complex combinatorial interactions,
  • identifying unusual failure scenarios,
  • tracing subtle consequences through code.

These are often pattern-recognition-heavy tasks.

But they may still miss:

  • that the change introduces a second version of the same concept,
  • that the module is being reused in another context with a different operating principle,
  • that a local optimization undermines a global invariant,
  • that the proposed fix belongs in a shared abstraction rather than a new local implementation.

This is not a paradox. It is a consequence of the type of intelligence being applied.

Finding a rare bug is often a matter of deep local analysis. Recognizing that a solution is conceptually wrong is often a matter of structural judgment.

Those are not the same skill.

The real failure mode: locally valid, globally distorting

The most dangerous LLM outputs are not the obviously broken ones. Those are relatively easy to catch.

The dangerous outputs are the ones that are:

  • technically competent,
  • internally consistent,
  • plausible,
  • cleanly implemented,
  • well explained,
  • and still wrong at the system level.

This happens when the model produces a patch that satisfies the immediate task while quietly distorting the larger design.

Examples include:

  • introducing a new mechanism where an existing one should have been generalized,
  • embedding domain logic into infrastructure code because it was locally convenient,
  • replicating behavior instead of consolidating it,
  • preserving an implementation detail that should have been abstracted away,
  • changing behavior that other modules rely on implicitly,
  • solving a symptom instead of the structural cause.

These outputs are especially dangerous because they often look better than messy human code. They can appear cleaner, faster, and more elegant while actually pushing the system in the wrong direction. Survey data shows that 76% of developers believe AI-generated code needs refactoring — suggesting that the gap between "looks correct" and "is architecturally sound" is widely recognized in practice.
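
A contrived example of what this looks like in code. All names here (`utils.retry`, `sync_orders`) are invented for illustration; the point is the shape of the mistake, not any specific library.

```python
import time

# Suppose the codebase already has a shared, tested retry decorator:
#
#     from utils import retry
#
#     @retry(attempts=3, backoff=2.0)
#     def fetch_inventory(): ...
#
# A model asked to "make order sync resilient" may produce the patch below.
# It is internally consistent, cleanly implemented, and passes review in
# isolation, yet it introduces a second version of a concept the system
# already has:

def sync_orders(client, attempts: int = 3) -> None:
    for attempt in range(attempts):
        try:
            client.push_orders()
            return
        except ConnectionError:
            time.sleep(2 ** attempt)  # re-implements utils.retry, locally
    raise RuntimeError("order sync failed after retries")

# The structurally correct move is reuse, not a new loop:
#
#     @retry(attempts=3, backoff=2.0)
#     def sync_orders(client): client.push_orders()
```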

Why human oversight is still essential

In an LLM-heavy engineering workflow, the human role is not merely:

  • writing prompts,
  • reviewing syntax,
  • approving patches,
  • fixing occasional bugs.

The deeper role is to act as:

  • the keeper of architectural intent,
  • the holder of domain boundaries,
  • the detector of false local optima,
  • the person who recognizes when the task should be reframed,
  • the person who knows what must not be duplicated,
  • the person who can tell the difference between "clever implementation" and "correct system move."

This is why someone with strong domain understanding can often instantly see that an LLM-produced change is wrong, even when the model appears highly intelligent and technically impressive.

The model may be operating correctly within the visible frame. The human is often the only one holding the invisible frame.

What this means for AI-native engineering teams

Teams that rely heavily on LLMs need to stop thinking of the model as "a faster programmer" and start thinking of it as something closer to:

  • an aggressive local optimizer,
  • a high-bandwidth implementation engine,
  • a reverse-engineering assistant,
  • a pattern detector with weak native architectural skepticism.

If that is true, then the surrounding process has to compensate.

A useful workflow should force the model to address questions like:

  • What concept is being changed here?
  • Is this a new mechanism or a specialization of an existing one?
  • Which modules depend on the current behavior?
  • Are we preserving a principle or just copying a pattern?
  • Is there already another subsystem solving the same class of problem?
  • Does this belong at a higher abstraction layer?
  • What architectural invariant could this change violate?
  • What part of the existing code should be treated as accidental rather than authoritative?

Without such pressure, the default tendency of many models is to produce a solution that is quickly useful rather than deeply correct.
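
One way to apply that pressure is to make the checklist an explicit, mandatory review pass over every model-generated patch. A sketch follows, with `call_llm` again a placeholder for your model client and the question list abridged from the prose above.

```python
REVIEW_QUESTIONS = [
    "What concept is being changed here?",
    "Is this a new mechanism or a specialization of an existing one?",
    "Which modules depend on the current behavior?",
    "Is there already another subsystem solving the same class of problem?",
    "Does this belong at a higher abstraction layer?",
    "What architectural invariant could this change violate?",
]

def architectural_review(diff: str, call_llm) -> str:
    questions = "\n".join(f"- {q}" for q in REVIEW_QUESTIONS)
    # The prompt makes rejection an acceptable outcome, counteracting the
    # bias toward approving and moving forward.
    return call_llm(
        "Answer each question about the patch explicitly. If any answer\n"
        "suggests the change sits at the wrong level, recommend reframing\n"
        f"the task instead of approving.\n\nQuestions:\n{questions}\n\n"
        f"Patch:\n{diff}"
    )
```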

Practical mitigations are emerging. Retrieval-augmented generation (RAG) can supply models with design documentation, ADRs, and architectural context that isn't in the immediate code. Multi-agent architectures can separate planning from execution. Structured prompting frameworks like CLAUDE.md files and system instructions can encode architectural invariants directly. These approaches don't eliminate the fundamental tension, but they meaningfully reduce the gap between local and global reasoning.
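
As a concrete illustration of the RAG point, the sketch below retrieves the architecture decision records (ADRs) most relevant to a task and prepends them to the prompt. Plain bag-of-words cosine similarity keeps it dependency-free; a real system would use an embedding model and a vector store. The `docs/adr/*.md` layout is an assumption.

```python
import math
from collections import Counter
from pathlib import Path

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def relevant_adrs(task: str, adr_dir: str = "docs/adr", k: int = 3) -> list[str]:
    # Rank ADR documents by word overlap with the task description.
    query = Counter(task.lower().split())
    docs = [p.read_text() for p in Path(adr_dir).glob("*.md")]
    ranked = sorted(docs, key=lambda d: cosine(query, Counter(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(task: str) -> str:
    # Design intent travels with the task instead of being left implicit.
    context = "\n\n".join(relevant_adrs(task))
    return f"Architectural context:\n{context}\n\nTask: {task}"
```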

The core tension

The central tension in LLM-assisted development is this:

Models are optimized to move the task forward. Engineering often requires knowing when not to move forward inside the current frame.

That gap explains a large share of why current coding agents can feel both brilliant and strangely shallow at the same time.

They can be astonishingly strong at local reasoning while remaining unreliable custodians of system meaning.

However, this gap is narrowing. On SWE-bench Verified, top models have progressed from under 30% in 2024 to over 80% by late 2025. At the same time, the more challenging SWE-bench Pro — which tests against unseen codebases — shows that even the best models still resolve only ~23% of issues. This disparity itself validates the article's core thesis: models perform well on familiar patterns but struggle significantly when confronted with novel architectural contexts.

Final takeaway

The key limitation of current LLMs in software development is not merely that they make mistakes. It is that they often make the wrong kind of mistakes:

  • not random,
  • not incompetent,
  • but structurally misaligned.

They tend to be strongest exactly where modern codebases already generate too much accidental complexity: local implementation detail, patch construction, reverse engineering, and artifact-level reasoning.

They tend to be weakest where mature engineering judgment matters most: abstraction choice, system intent, conceptual reuse, architectural restraint, and recognizing when the task itself has been framed incorrectly.

That is why the human role in an AI-heavy team does not disappear. It shifts upward.

The human is no longer only the person who can write the code. The human becomes the person who can preserve the shape of the system while the model tries to optimize within it.

This framing is likely to remain valid for the foreseeable future, even as models improve. The gap between local and global reasoning is narrowing, but the fundamental asymmetry — that local reasoning scales more easily than architectural judgment — appears to be a deep property of how current LLMs work, not merely an engineering limitation that will be solved by larger context windows or better training data.
