Drafted: 2026-04-10
The previous brainstorms (ai-value-p2p-protocol-brainstorm.md and future-of-work-ai-blockchain-p2p-brainstorm.md) explored who captures value when AI does the work, and how a P2P protocol could route that value back to contributing humans.
This note addresses a question that comes before value distribution: who gets to use AI at all?
If a handful of companies control inference capacity, they become gatekeepers. The question of how to share AI's economic upside becomes irrelevant for anyone who cannot access AI in the first place. Access sovereignty is the foundation on which everything else depends.
As of early 2026, the supply chain for frontier AI looks roughly like this:
| Layer | Dominant players | Concentration level |
|---|---|---|
| Chip design | NVIDIA (GPU), Broadcom/Google (custom TPU/ASIC) | Very high |
| Chip fabrication | TSMC, Samsung | Extreme (TSMC ~90% of advanced nodes) |
| High-bandwidth memory | SK Hynix, Samsung, Micron | High (3 players) |
| Cloud compute | AWS, Azure, GCP | High (3 players hold ~65% of global cloud) |
| Frontier model training | OpenAI, Google, Anthropic, Meta, xAI, DeepSeek | Moderate but narrowing at the top |
| Frontier model serving | OpenAI, Google, Anthropic, xAI | High |
| Open-weight model release | Meta (Llama), Mistral, Alibaba (Qwen), DeepSeek, Google (Gemma) | Moderate and growing |
Every layer has significant concentration, but the critical chokepoints are:
- Fabrication: TSMC in Taiwan fabricates nearly all cutting-edge AI chips. A single geopolitical event could disrupt the entire global supply.
- GPU design: NVIDIA holds a dominant position in training and inference accelerators. CUDA is the de facto standard. Alternatives exist (AMD ROCm, Intel Gaudi, custom ASICs) but the ecosystem gap is wide.
- Frontier model weights: only a few organizations can afford the $100M+ training runs that produce state-of-the-art models. Those weights are proprietary assets.
- Inference serving at scale: running frontier models requires thousands of GPUs in coordinated clusters. Only hyperscalers and well-funded startups can do this.
Concentration is not inherently bad. The problem appears when concentrated suppliers have reasons and mechanisms to restrict access selectively.
Those reasons already exist:
- Geopolitics: the US has imposed export controls on advanced AI chips to China. This is the first "AI embargo" but probably not the last. If AI becomes a strategic asset like nuclear technology or advanced weapons systems, more countries may face restrictions.
- Commercial priority: when demand exceeds supply, providers allocate capacity to their highest-paying customers first. Enterprise tiers with reserved capacity are already standard. During peak demand, individual users and small companies experience degraded service.
- Regulatory compliance: the EU AI Act and similar regulations create compliance costs that providers may choose to avoid by withdrawing from certain markets or restricting certain use cases.
- Policy and values alignment: providers already restrict model behavior through content policies, use-case restrictions, and terms of service. These decisions are made unilaterally.
- Strategic leverage: if a country or industry becomes dependent on a single provider's AI infrastructure, that provider (and its home government) gains leverage over the dependent party.
The concern is not that access will be cut off tomorrow for most users. The concern is a gradual tightening:
- Frontier models become increasingly expensive to train and serve.
- A small number of providers control the best models.
- Those providers allocate capacity first to enterprise customers, strategic partners, and politically aligned jurisdictions.
- Non-strategic users (individuals, small businesses, non-allied countries, controversial use cases) face a combination of: higher prices, lower rate limits, older model versions, restricted features, or outright unavailability.
- Over time, this creates a two-tier AI world: those with reliable access to frontier intelligence, and those without.
This is not science fiction. It is how many strategic technologies already work (advanced semiconductors, satellite imagery, financial infrastructure, encryption export controls).
AI concentration has a compounding quality that earlier technology monopolies did not:
- Intelligence is general-purpose. Unlike a specialized tool, an AI model can improve productivity across nearly every domain. Losing access to frontier AI means falling behind in research, business, education, healthcare, law, engineering, and governance simultaneously.
- Models improve faster than institutions adapt. By the time a country or organization builds domestic AI capacity, the frontier may have moved further ahead.
- Data advantages compound. Providers with more users generate more data, which improves models, which attracts more users. This flywheel is difficult to break from outside.
- Switching costs are real. Fine-tuned models, prompt libraries, workflow integrations, and institutional knowledge all create lock-in to specific providers.
Here is a more detailed map of the specific ways a user, company, or country can lose meaningful AI access.
A government restricts or bans AI service exports to specific countries. This has already happened with chips. It could extend to model APIs, cloud-hosted inference, or even open-weight model downloads.
Current examples: US chip export controls on China. OpenAI API availability varies by country. Some providers do not serve sanctioned jurisdictions.
Future risk: "AI embargoes" where model access is restricted as a foreign policy tool, similar to financial sanctions. Countries without bilateral AI agreements may be deprioritized.
Providers allocate scarce inference capacity to their highest-value customers. Enterprise contracts with guaranteed throughput and reserved capacity become the norm. Everyone else gets best-effort service.
Current examples: OpenAI, Anthropic, and Google all offer enterprise tiers with priority access. During high-demand periods, free and lower-tier users experience rate limiting and degraded performance.
Future risk: frontier model access becomes effectively enterprise-only. The "free tier" or "developer tier" serves older, smaller models. The gap between what paying enterprises can access and what individuals can access widens significantly.
AI regulation (EU AI Act, potential US federal regulation, sector-specific rules) creates compliance burdens. Providers may withdraw from markets where compliance costs exceed revenue, or restrict use cases to avoid liability.
Current examples: some AI features are unavailable in the EU due to GDPR or AI Act requirements. Providers sometimes delay launches in regulated jurisdictions.
Future risk: regulatory fragmentation creates a patchwork where different capabilities are available in different places. Small providers cannot afford multi-jurisdiction compliance.
Each provider makes unilateral decisions about what its models will and will not do. These decisions reflect the values, legal exposure, and strategic priorities of the company. Users have no voice in these choices.
Current examples: content restrictions, use-case prohibitions, safety filters that vary by provider. A use case allowed by one provider may be blocked by another.
Future risk: as liability pressure increases, providers may become more conservative. Legitimate but sensitive use cases (security research, medical decision support, legal analysis, journalism) may be restricted preemptively.
Users who build workflows, fine-tune models, or accumulate context with a single provider become dependent on that provider. Migrating is expensive and sometimes technically impossible.
Current examples: fine-tuned models on OpenAI cannot be exported. Conversation histories and context are stored on provider servers. Prompt engineering optimized for one model may not transfer well.
Future risk: deeper integration (custom models, agent frameworks, memory systems) creates stronger lock-in. Providers may deliberately increase switching costs.
Physical events (natural disasters, geopolitical conflicts, supply chain disruptions) can affect chip supply, data center availability, and network connectivity.
Current examples: GPU shortages affected AI deployment timelines in 2023-2024. Data center power constraints are already limiting expansion.
Future risk: a major disruption to TSMC, a conflict in a key submarine cable route, or energy supply constraints could create sudden AI capacity shortages affecting entire regions.
As AI becomes more capable, access to the best models may simply become too expensive for individuals and small organizations. If frontier inference costs $1-10 per complex query, and competitive advantage requires frequent use, a cost barrier emerges.
Current examples: GPT-4 class models cost significantly more per token than GPT-3.5 class. The best coding and reasoning models require the most expensive tiers.
Future risk: the cost of frontier intelligence may rise as models become larger and more capable. Universal access to the best AI could become economically impractical without subsidy or redistribution.
The response to concentration risk is not a single solution. It is a stack of increasingly ambitious capabilities that a user, community, or country can build toward. Each layer provides value on its own and reduces dependence on centralized providers.
Goal: ensure that no single provider or format can lock you out.
This is the lowest-cost, highest-impact first step. It requires no special hardware, no P2P network, and no new protocol. It is a discipline.
What it means in practice:
- Maintain local copies of open-weight models that cover your core use cases (general reasoning, code generation, summarization, translation, domain-specific tasks).
- Standardize on portable model formats: GGUF for CPU/hybrid inference, safetensors for GPU inference, ONNX for cross-platform deployment.
- Keep versioned model checksums and download sources. If a model is removed from Hugging Face or a government restricts downloads, you should already have it.
- Abstract all model calls behind a local API gateway. Your applications should never call a specific provider's API directly. They should call your gateway, which routes to whatever backend is available.
- Maintain prompt and workflow compatibility across at least two model families.
Key open-weight model families as of early 2026:
| Family | Origin | Strengths | License |
|---|---|---|---|
| Llama 4 | Meta | Strong general reasoning, large community, broad fine-tune ecosystem | Llama license (permissive with usage limits) |
| Mistral / Mixtral | Mistral AI (France/EU) | Efficient architectures, good multilingual, strong coding | Apache 2.0 for smaller models |
| Qwen 2.5 / Qwen 3 | Alibaba (China) | Competitive performance, strong multilingual, large model range | Apache 2.0 / Qwen license |
| DeepSeek V3 / R1 | DeepSeek (China) | Strong reasoning, cost-efficient training, open weights | MIT-like |
| Gemma 2 / 3 | Good quality at small sizes, well-documented | Gemma license | |
| Command R+ | Cohere (Canada) | Strong retrieval and enterprise tasks | CC-BY-NC or commercial |
| Phi-4 | Microsoft | Strong reasoning at small parameter counts | MIT |
Why this matters: even without any other layer, model portability means you are never fully dependent on a single provider. If OpenAI restricts your access, you can fall back to local Llama inference within hours, not weeks.
Goal: run useful models entirely on hardware you own or control.
This is the sovereignty floor. If you can run inference locally, no external party can prevent you from using AI. Quality may be lower than frontier cloud models, but capability is nonzero and improving rapidly.
Current state of local inference (early 2026):
- Consumer laptops with 16-32 GB RAM can run 7B-14B parameter models at usable speeds using llama.cpp with GGUF quantization.
- Workstations with 64-128 GB RAM can run 70B parameter models with acceptable throughput.
- A single consumer GPU (RTX 4090 / 5090 with 24 GB VRAM) can run 13B-30B models at good speed, or 70B models with partial offloading.
- Multi-GPU workstations (2-4 GPUs) can run 70B+ models at production quality.
- Apple Silicon (M2/M3/M4 Ultra with 192 GB unified memory) can run very large models with good throughput due to high memory bandwidth.
Key software stack for local inference:
llama.cpp/llama-server: C++ inference engine, runs on CPU, GPU (CUDA, Metal, Vulkan, ROCm), highly optimized. The most portable option.ollama: user-friendly wrapper around llama.cpp, manages model downloads and serving. Good entry point for non-technical users.vLLM: high-performance GPU inference server, best for multi-user or production-like setups. Supports PagedAttention for efficient memory use.text-generation-inference(TGI): Hugging Face's inference server, production-oriented.exllamav2: optimized for GPTQ/EXL2 quantized models on NVIDIA GPUs, very fast.MLX: Apple's framework for efficient inference on Apple Silicon.
Fine-tuning locally:
- LoRA and QLoRA allow fine-tuning of large models on consumer GPUs. A 70B model can be fine-tuned on a single 24 GB GPU using 4-bit QLoRA.
unsloth,axolotl, and Hugging Facetrlare the main tools.- This means a domain specialist can adapt an open model to their needs without cloud access.
What local inference cannot do today:
- Match frontier model quality on the hardest reasoning and coding tasks. The gap between a local 70B model and a frontier model (GPT-5 class, Claude Opus class) is real, though shrinking.
- Serve many concurrent users. Local hardware is best for personal or small-team use.
- Train models from scratch. Pre-training requires cluster-scale compute.
Key trade-off: local inference gives you full sovereignty at the cost of lower peak capability. For many real tasks (drafting, summarization, code assistance, translation, structured data extraction), local models are already good enough.
Goal: pool consumer hardware across many participants to run models too large for any single node.
This is the layer that transforms individual sovereignty into collective capability. A single user might only run a 14B model locally. A hundred users pooling their hardware could run a 405B model.
How distributed inference works:
A large model is split into layers or blocks. Each participating node loads a subset of the model's layers into its memory. When a user submits a prompt, the computation is passed sequentially (or in parallel where the architecture allows) through nodes holding different layers. The result is returned to the requester.
Existing projects:
- Petals: the most mature example. An open-source system where volunteers host segments of large models (originally BLOOM 176B, now supports Llama and others). Clients connect to a swarm of servers, each holding a few layers. Inference latency is higher than centralized, but the system works.
- Hivemind: the underlying library Petals is built on. Provides decentralized training and inference primitives over the internet.
- Distributed llama.cpp experiments: various community efforts to split llama.cpp inference across multiple machines on a LAN or WAN.
Technical challenges:
- Latency: sequential layer execution across network hops adds latency. For interactive chat, this may be acceptable. For latency-sensitive applications, it is not.
- Bandwidth: activations must be transmitted between nodes at each layer boundary. For large models, this can require significant bandwidth.
- Trust: if nodes are untrusted, a malicious node could corrupt activations or eavesdrop on inputs. Mitigations include redundant computation, TEE-based execution, and homomorphic encryption (still impractical for full inference).
- Reliability: consumer nodes go offline unpredictably. The swarm needs mechanisms for rerouting, replication, and graceful degradation.
- Incentives: without economic reward, volunteers may not contribute reliably. This is where the ValueTorrent protocol becomes relevant.
What this layer enables:
- Running frontier-scale models (100B+ parameters) without depending on any centralized provider.
- Geographic distribution of inference, which can provide resilience against regional restrictions.
- A credible "community cloud" alternative for AI inference.
What it does not solve:
- Training new models (requires synchronized, high-bandwidth compute).
- Matching the throughput and latency of a well-provisioned data center.
- Fully private inference (activations flow through third-party nodes).
Goal: collectively build domain-specific models that outperform general-purpose models in their niche, without centralizing data.
This layer recognizes that raw model size is not everything. A well-tuned 14B model with domain-specific training data can outperform a general 400B model on specific tasks. Communities of practitioners can create this specialization without relying on a centralized provider.
How federated fine-tuning works:
- A community agrees on a base model (for example, Llama 4 70B).
- Each participant fine-tunes the model on their local data using LoRA adapters.
- The LoRA adapter weights (small, typically 0.1-1% of the base model's parameters) are shared with a coordination server or P2P network.
- Adapters are aggregated (averaged, merged, or selected) to produce a community adapter that reflects the collective knowledge.
- The base model + community adapter is distributed back to participants.
Data never leaves any participant's machine. Only small adapter weights are exchanged.
Why this is powerful for sovereignty:
- Professional communities (lawyers, doctors, accountants, engineers) hold deep domain knowledge in their local data.
- Centralized providers cannot access this data, so they cannot build equivalent specialized models.
- A community of 100 tax accountants collectively fine-tuning a model on their anonymized case data could produce a tax-specific model superior to anything a general provider offers.
- This creates a local knowledge advantage that is resilient to access restrictions on frontier general-purpose models.
Technical approaches:
- Federated averaging (FedAvg): the classic approach. Each participant trains locally, shares gradients or weights, and a coordinator averages them. Well-studied but can struggle with heterogeneous data.
- LoRA merging: participants share LoRA adapters trained on different data slices. Adapters are merged using task arithmetic or TIES merging. Simpler than full federated learning and works well in practice.
- Mixture of LoRA experts: instead of merging, keep multiple adapters and route queries to the most relevant one. This preserves specialization while benefiting from diversity.
- Evaluation-driven selection: participants submit adapters, a shared evaluation suite scores them, and only the best are kept or merged. This is more like collaborative model development than traditional federated learning.
Shared evaluation infrastructure:
A critical piece that is often missing. For federated specialization to work, the community needs:
- Agreed-upon evaluation benchmarks for their domain.
- A test suite that is held out from all participants' training data.
- Transparent scoring and ranking.
- This evaluation layer could itself be a shared asset, connecting to the "machine teaching marketplace" concept from the earlier brainstorms.
Risks:
- Data poisoning by malicious participants.
- Free-riding (using the community model without contributing).
- Fragmentation if the community cannot agree on base models or evaluation criteria.
Goal: build physical computing infrastructure that is not dependent on any single hardware vendor, cloud provider, or jurisdiction.
This is the most ambitious and most expensive layer. It is relevant mainly for countries, large organizations, and well-funded cooperatives. But even at a smaller scale, parts of this layer are actionable.
Components:
NVIDIA's dominance in AI accelerators is a concentration risk. Alternatives are emerging:
- AMD Instinct (MI300X, MI350): competitive performance, improving ROCm software stack. The most credible alternative for training and inference today.
- Intel Gaudi 3: competitive price/performance for inference, weaker ecosystem.
- Google TPU: excellent for JAX/TensorFlow workloads, but only available through Google Cloud. Not sovereignty-enhancing for external users.
- Custom ASICs: Cerebras, Groq, SambaNova offer specialized inference hardware. These provide performance advantages but create new vendor dependencies.
- RISC-V AI accelerators: still early, but RISC-V's open ISA means no single company controls the instruction set. Projects like Esperanto Technologies and various Chinese efforts are developing RISC-V AI chips.
- Neuromorphic and analog compute: very early research (Intel Loihi, IBM NorthPole). Could eventually provide ultra-efficient inference for specific workloads.
The strategic move is to ensure that software stacks (inference engines, training frameworks) support multiple hardware backends, reducing dependence on any single chip vendor.
Instead of renting from hyperscalers, organizations can build or co-invest in compute infrastructure:
- National AI compute initiatives: several countries (France, UK, Japan, Saudi Arabia, UAE, India) are building sovereign AI compute clusters. These provide a foundation for domestic AI capability.
- University and research consortia: shared GPU clusters managed by academic institutions, available to researchers and startups.
- Hardware cooperatives: groups of companies or individuals pool resources to purchase and operate AI hardware. This is the "community broadband" model applied to compute.
- Colocation with ownership: instead of renting cloud instances, organizations buy hardware and colocate it in neutral data centers. This provides cost advantages for sustained workloads and eliminates the risk of cloud provider restrictions.
AI compute is energy-intensive. Dependence on imported energy creates another vector of vulnerability.
- Local renewable energy: solar, wind, or hydro power directly connected to compute facilities reduces dependence on energy markets and geopolitical energy dynamics.
- Energy storage: batteries or other storage systems allow compute to continue during grid disruptions.
- Heat reuse: AI compute generates significant waste heat. Co-locating compute facilities with buildings, greenhouses, or industrial processes that need heat improves economic viability and community support.
This is the "full stack" version of sovereignty: you own the silicon, the energy, and the software. No external party can cut you off.
Realistic assessment: full Layer 4 sovereignty is expensive and only practical for nations or large organizations. But elements of it (using AMD instead of only NVIDIA, colocating owned hardware, combining compute with local energy) are actionable at smaller scales.
The layers above describe capabilities. This section proposes a specific system that integrates them into something buildable: a local-first AI gateway that provides resilient, sovereign access to AI inference.
The Inference Mesh is a piece of software that runs on the user's hardware and mediates all AI model interactions. It sits between applications and model providers, routing each request to the best available backend based on capability, cost, latency, privacy requirements, and sovereignty constraints.
The key properties are:
- Local-first: the gateway always works, even if all external providers are unreachable. Local models handle requests at reduced quality rather than failing.
- Multi-backend: the gateway can route to local models, P2P swarm nodes, and multiple cloud APIs simultaneously.
- Privacy-aware: the user can classify queries by sensitivity level, and the router respects those classifications (sensitive queries never leave local hardware).
- Sovereignty-scored: each backend has a sovereignty score reflecting how much external dependency it introduces. The user can set policies like "never route to providers in jurisdiction X" or "prefer local inference unless quality gap exceeds threshold Y."
- Provider-agnostic: applications built on top of the gateway use a standard API (OpenAI-compatible or a superset). Switching or losing a backend requires zero application changes.
+---------------------------------------------------------+
| User's Applications |
| (IDE, chat UI, agents, scripts, automation) |
+---------------------------------------------------------+
| standard API (OpenAI-compatible)
v
+---------------------------------------------------------+
| Inference Mesh Gateway (runs locally) |
| |
| +---------------------------------------------------+ |
| | Request Classifier | |
| | - task type (code, chat, summarize, embed, ...) | |
| | - complexity estimate | |
| | - privacy/sensitivity level | |
| | - latency requirement | |
| +---------------------------------------------------+ |
| | |
| v |
| +---------------------------------------------------+ |
| | Smart Router | |
| | - sovereignty policy engine | |
| | - cost/quality/latency optimizer | |
| | - provider health monitor | |
| | - failover controller | |
| +---------------------------------------------------+ |
| | | | | |
| v v v v |
| +----------+ +----------+ +----------+ +----------+ |
| | Local | | P2P | | Cloud | | Cloud | |
| | Models | | Swarm | | API A | | API B | |
| | (ollama, | | (Petals, | | (OpenAI) | | (Anthr.) | |
| | llama. | | custom | | | | | |
| | cpp) | | mesh) | | | | | |
| +----------+ +----------+ +----------+ +----------+ |
| |
| +---------------------------------------------------+ |
| | Model Cache & Registry | |
| | - local model storage (GGUF, safetensors) | |
| | - auto-download and quantization pipeline | |
| | - model capability index | |
| | - version tracking and checksums | |
| +---------------------------------------------------+ |
| |
| +---------------------------------------------------+ |
| | Context Store (local) | |
| | - conversation history | |
| | - user preferences and system prompts | |
| | - RAG indexes and embeddings | |
| | - fine-tune adapters (LoRA) | |
| +---------------------------------------------------+ |
+---------------------------------------------------------+
Before routing, the gateway analyzes each incoming request:
- Task type detection: is this a code generation request, a conversational query, a summarization task, an embedding request, a structured extraction task? Different tasks have different quality thresholds and different models excel at them.
- Complexity estimation: a simple "translate this sentence" can be handled by a small local model. A "design a distributed system architecture" query may need a frontier model. The classifier estimates complexity to determine the minimum capable backend.
- Privacy classification: the user can tag queries with sensitivity levels (public, internal, confidential, restricted). The router enforces that queries above a threshold never leave local hardware. Automatic classification can also be applied based on patterns (presence of names, financial data, medical information, credentials).
- Latency requirement: interactive chat needs low latency. Batch processing (document summarization, code review of a large codebase) can tolerate higher latency and use cheaper or more sovereign backends.
The router is the core decision engine. For each classified request, it selects a backend using a scoring function:
score(backend) = w_quality * quality_estimate
+ w_cost * (1 - normalized_cost)
+ w_latency * (1 - normalized_latency)
+ w_sovereignty * sovereignty_score
+ w_privacy * privacy_compliance
The weights are user-configurable. A sovereignty-maximizing user sets w_sovereignty high and accepts lower quality. A cost-minimizing user sets w_cost high. A privacy-focused user sets w_privacy as a hard constraint (non-compliant backends score zero).
Sovereignty scoring for backends:
| Backend type | Sovereignty score | Rationale |
|---|---|---|
| Local model on owned hardware | 1.0 | No external dependency |
| Local model on rented/managed hardware | 0.8 | Hardware dependency, but model and data stay local |
| P2P swarm (trusted community) | 0.7 | Distributed but some trust required |
| P2P swarm (open/untrusted) | 0.5 | Availability risk, privacy concerns |
| Cloud API (domestic provider, no data retention) | 0.4 | Jurisdictional alignment, but still external |
| Cloud API (foreign provider, no data retention) | 0.2 | Foreign jurisdiction, policy risk |
| Cloud API (foreign provider, data retained) | 0.1 | Maximum external dependency |
Failover logic:
The router continuously monitors backend health (latency, error rate, availability). If a provider becomes unavailable or degrades:
- Requests are immediately rerouted to the next-best backend.
- If all cloud APIs fail, the P2P swarm is attempted.
- If the P2P swarm fails, local models handle all requests.
- The user is notified of the degradation, but service never stops entirely.
This means the system degrades gracefully rather than failing hard. A user who loses access to OpenAI and Anthropic simultaneously still has a working AI assistant, just one running on local hardware.
The gateway maintains a local repository of model files:
- Automatic downloading: when a new open-weight model is released and matches the user's needs, the gateway can download and quantize it automatically.
- Quantization pipeline: models are stored in multiple quantization levels (Q4_K_M, Q5_K_M, Q8_0) to match different hardware profiles.
- Capability indexing: each cached model has metadata describing what it can do (supported languages, context window, strengths by task type, benchmark scores).
- Integrity verification: all model files are checksummed. If a file is corrupted or tampered with, the gateway detects it and re-downloads from an alternative source.
- P2P model distribution: model files can be shared through a BitTorrent-like protocol, reducing dependence on centralized hosting (Hugging Face, etc.).
All user context is stored locally:
- Conversation history: every interaction is logged locally. No provider retains your conversation data.
- System prompts and preferences: your customizations are local files, portable across backends.
- RAG indexes: local vector databases (using tools like
qdrant,chromadb, orsqlite-vss) store embeddings of your documents for retrieval-augmented generation. - LoRA adapters: if you have fine-tuned adapters for specific tasks, they are stored locally and applied to local models automatically.
This means your AI "memory" and customization never depend on a provider. If you switch from OpenAI to a local Llama model, your context and preferences follow.
The gateway's P2P backend needs a lightweight protocol for discovering and using distributed inference capacity:
- Peer discovery: nodes announce their available models, hardware specs, and capacity to a DHT or gossip network.
- Capability matching: the gateway's router queries the P2P network for nodes that can serve the required model or a compatible alternative.
- Session establishment: the gateway establishes a connection to one or more peers, negotiates encryption, and begins streaming inference.
- Payment (optional): if an incentive layer exists (see ValueTorrent connection below), micro-payments are settled per request or per token.
- Verification: the gateway can optionally send the same request to multiple peers and compare outputs to detect malicious or faulty nodes.
A user installs the Inference Mesh gateway on their machine. The setup experience might be:
- Install the gateway (a single binary or container).
- The gateway auto-detects local hardware (GPU model, VRAM, RAM, CPU cores).
- It recommends and downloads appropriate open-weight models for the hardware profile.
- The user optionally configures cloud API keys for providers they subscribe to.
- The user optionally joins a P2P swarm (community, organizational, or public).
- Applications connect to
localhost:8080(or similar) using the OpenAI-compatible API.
From that point on, AI just works. If the user's cloud provider goes down, has an outage, or blocks their country, the experience degrades in quality but never disappears. The user has a sovereign AI floor that nobody can take away.
These components exist as mature or near-mature open-source projects. An MVP could be assembled largely by integrating and orchestrating existing tools.
1. Local inference backend
Already production-quality. ollama provides a one-command setup for local model serving. llama.cpp's llama-server provides an OpenAI-compatible HTTP API. vLLM provides high-throughput serving for GPU-rich setups.
All of these expose OpenAI-compatible endpoints, which means any application that works with the OpenAI API can be pointed at local inference with minimal changes.
2. Multi-provider API gateway
Several open-source projects already provide this:
- LiteLLM: Python library and proxy server that provides a unified interface to 100+ LLM providers. Supports load balancing, fallback, and cost tracking.
- OpenRouter: a commercial API that routes across multiple providers (not self-hosted, so not sovereign, but demonstrates the pattern).
- Portkey: gateway with fallback, caching, and observability across providers.
An MVP could start with LiteLLM as the routing layer, extended with sovereignty scoring and privacy classification.
3. Model cache and quantization pipeline
huggingface_hubCLI for downloading models.llama.cpp'squantizetool for GGUF conversion.AutoGPTQ,AutoAWQfor GPU-optimized quantization.- A simple script that monitors model releases, downloads recommended models, and quantizes them for local hardware.
4. Privacy-aware request classification
A lightweight local classifier (could itself be a small local model like Phi-4-mini) that analyzes incoming prompts and tags them with sensitivity levels. Rules-based classification (regex for PII patterns, keywords for financial data) can supplement model-based classification.
5. Provider health monitoring
Simple HTTP health checks, latency measurement, and error rate tracking against each configured backend. Trigger failover when a backend exceeds latency thresholds or returns errors above a rate.
What the MVP looks like: a locally-running service that exposes an OpenAI-compatible API, manages a cache of local models, routes to local or cloud backends based on simple policies, and fails over automatically. A developer installs it, configures their API keys, and points their tools at localhost. Everything else is transparent.
Estimated effort: a small team (2-3 engineers) could build a usable MVP in 2-3 months by integrating existing components. The novel work is mainly the sovereignty-scoring router and the unified configuration experience.
These require more original engineering but build on existing foundations.
1. P2P inference swarm integration
Build on the Petals architecture or implement a simpler variant:
- Nodes advertise available models and hardware via a DHT.
- The gateway discovers peers and establishes inference sessions.
- Start with a trusted-community model (explicit peer lists, invite-only) before moving to open discovery.
- Payment integration via simple stablecoin transfers or Lightning Network for per-request settlement.
The Petals codebase (built on the Hivemind library) provides a starting point, but may need significant modification for production use, reliability, and incentive compatibility.
2. Federated fine-tuning pipeline
- A coordination service that manages a federated LoRA training round.
- Participants fine-tune locally using
unslothortrl, then upload adapter weights to a coordinator. - The coordinator merges adapters (using task arithmetic or TIES merging) and distributes the result.
- An evaluation suite validates that the merged adapter improves on the base model.
This could be built as a CLI tool and a lightweight server. The hardest part is designing the evaluation framework and handling heterogeneous data quality.
3. Portable model evaluation framework
- A shared benchmark and scoring system that communities can use to evaluate models and adapters for their specific domains.
- Connects to the "machine teaching marketplace" idea: evaluation suites are valuable shared assets.
- Could start as a simple framework (JSON-defined test cases, automated scoring, leaderboard) and grow more sophisticated.
4. Sovereignty scoring and policy engine
- A more sophisticated version of the simple scoring function described above.
- Tracks jurisdictional dependencies (where is the provider incorporated? where are the data centers? which laws apply?).
- Allows organizations to define compliance policies ("no data to non-EU providers," "no inference on models with restrictive licenses").
- Could integrate with threat intelligence feeds to detect emerging access restrictions.
These are technically possible but require significant research, community building, or infrastructure investment.
1. Full distributed inference mesh with incentive layer
The complete vision: a decentralized network where anyone can contribute compute, earn rewards, and access inference. This is the intersection of the Inference Mesh (this document) and the ValueTorrent protocol (previous brainstorm).
Requires solving:
- Reliable incentive design (game theory, tokenomics or stablecoin settlement).
- Compute verification (ensuring nodes actually perform inference correctly).
- Privacy in distributed inference (activations flowing through untrusted nodes).
- Network economics (whether the cost of coordination overhead is less than the cost of centralized inference).
2. Encrypted inference on untrusted nodes
- Fully Homomorphic Encryption (FHE): allows computation on encrypted data. Theoretically solves the privacy problem in distributed inference. Practically, FHE adds enormous computational overhead (1000x+ slowdown). Active research is reducing this gap, but production use for full LLM inference is likely years away.
- Multi-Party Computation (MPC): distributes computation across multiple parties such that no single party sees the full input. More practical than FHE for some operations, but still adds significant overhead.
- Trusted Execution Environments (TEEs): hardware-based isolation (Intel SGX, AMD SEV, ARM TrustZone). Practical today for some workloads. The trust model shifts from trusting the node operator to trusting the hardware manufacturer. Not a perfect solution, but a pragmatic one.
The most realistic near-term approach is probably TEE-based inference on untrusted nodes, with FHE and MPC as longer-term upgrades.
3. Decentralized model training
Training new models from scratch in a decentralized way remains very hard due to:
- The need for high-bandwidth, low-latency interconnects between GPUs during training.
- The massive cost of pre-training runs.
- The difficulty of coordinating hundreds of unreliable consumer nodes for weeks-long training.
Approaches like DiLoCo (distributed local optimization) and other communication-efficient training methods are being researched. They may eventually enable community-driven pre-training, but this is the farthest horizon.
4. Hardware cooperative coordination
- A platform for organizing collective hardware purchases.
- Shared colocation management.
- Tooling for multi-tenant GPU cluster management (SLURM, Kubernetes with GPU scheduling).
- Economic coordination (cost sharing, usage metering, revenue splitting).
This is more of an organizational and legal challenge than a technical one. The technology for shared compute management exists. The coordination mechanisms need to be designed.
5. Model preservation and archival network
- A decentralized, censorship-resistant archive of open-weight models.
- Ensures that if a model is removed from Hugging Face or a hosting provider, copies persist in the network.
- Could use IPFS, BitTorrent, or a custom content-addressed storage protocol.
- Important because the "open" in open-weight models is only meaningful if the weights remain accessible.
The Inference Mesh described in this document and the ValueTorrent protocol described in the previous brainstorm are complementary systems. They address different but interdependent problems.
The Inference Mesh answers: how do I get reliable AI inference regardless of what providers do?
It is concerned with:
- Routing requests to available backends.
- Maintaining local model caches.
- Discovering P2P inference capacity.
- Ensuring privacy and sovereignty.
- Failing over gracefully.
ValueTorrent answers: how do contributors get fairly paid for the assets they provide to AI production?
It is concerned with:
- Tracking who contributed what (compute, knowledge, review, arbitration).
- Issuing contribution receipts.
- Settling micro-payments and royalties.
- Managing reputation and trust.
- Routing value through a royalty DAG.
User / Application
|
v
+------------------+
| Inference Mesh | <-- access sovereignty
| Gateway | (this document)
+------------------+
|
+------------+------------+
| | |
v v v
Local P2P Swarm Cloud APIs
Models (fallback)
|
v
+------------------+
| ValueTorrent | <-- economic layer
| Protocol | (previous brainstorm)
+------------------+
|
+------------+------------+
| | |
v v v
Compute Knowledge Review
Payments Royalties Payments
The relationship is:
- The Inference Mesh provides the demand side: it routes inference requests to available capacity, including P2P swarm nodes.
- ValueTorrent provides the supply side incentives: it ensures that people who contribute compute, knowledge, or review capacity to the P2P swarm are fairly compensated.
- Without the Inference Mesh, ValueTorrent has no reliable way to reach users. Users stuck on centralized APIs have no reason to use P2P inference.
- Without ValueTorrent, the Inference Mesh's P2P layer lacks sustainable incentives. Volunteers may contribute initially, but long-term reliability requires economic reward.
- Local inference is the sovereignty floor: even without a P2P network or economic layer, a user with local models has a minimum viable AI capability. This is the foundation that makes the whole stack credible.
The practical build order is:
- First: build the Inference Mesh gateway (local inference + multi-provider routing + failover). This is useful immediately, requires no P2P network, and creates a user base.
- Second: add P2P inference swarm integration. Start with trusted communities, LAN-based clusters, or organizational networks.
- Third: integrate the ValueTorrent economic layer to incentivize P2P contribution at scale.
- Fourth: extend ValueTorrent beyond compute to include knowledge royalties, review payments, and cooperative ownership, as described in the earlier brainstorm.
Each step is independently valuable. The full vision is the composition of all four.
These principles should guide any technical work that follows from this brainstorm.
The system must work offline or under restriction. A user with no internet connection and no API keys should still have a functioning AI assistant. Quality will be lower, but capability is nonzero. This is the non-negotiable floor.
Losing access to frontier APIs reduces quality but does not break functionality. Losing access to the P2P swarm reduces capacity but local models still work. Losing internet entirely still leaves local inference. Each layer of external dependency is optional.
No one provider, hardware vendor, jurisdiction, or hosting service can block all access. This applies to model hosting (not only Hugging Face), chip supply (not only NVIDIA), cloud compute (not only AWS), and even the gateway software itself (open source, multiple implementations).
Start with a useful local tool that happens to support multiple backends. Then add P2P capabilities. Then add economic incentives. Then add federated training and sovereign infrastructure. Each stage is useful on its own. The system does not need to be fully decentralized from day one to be valuable.
Portable model formats (GGUF, safetensors, ONNX). Open APIs (OpenAI-compatible). Open protocols for P2P discovery and inference. Open-source gateway software. No proprietary lock-in at any layer. A user should be able to replace any component with an alternative implementation.
Sensitive data should never leave local hardware unless the user explicitly chooses to send it externally. Privacy classification should be automatic by default and user-overridable. The system should assume data is private until proven otherwise.
Support multiple hardware backends. Do not build for NVIDIA only. Design abstractions that work across CUDA, ROCm (AMD), Metal (Apple), Vulkan (cross-platform), and CPU. This is both a sovereignty principle (reducing vendor dependency) and a practical one (users have diverse hardware).
The Inference Mesh should be designed from the start to integrate with an economic layer (ValueTorrent or similar). Even if the economic layer is not built yet, the gateway should expose the hooks needed: contribution tracking, provider scoring, payment channel integration, reputation signals.
- What is the minimum viable gateway that is genuinely useful? Is it just LiteLLM with a local model backend and some sovereignty-aware routing?
- How should the request classifier work? A small local model, a rules engine, or a hybrid?
- What is the right granularity for sovereignty scoring? Per-provider? Per-request? Per-data-type?
- Should the gateway be a standalone binary, a Docker container, a library, or a browser extension?
- How do we handle streaming responses across heterogeneous backends with different latency profiles?
- Is Petals/Hivemind the right foundation, or should a simpler protocol be designed from scratch?
- What is the minimum swarm size needed for reliable inference of a 70B+ model?
- How should latency be managed when inference passes through multiple network hops?
- What trust model is practical? Fully trustless (expensive verification), trusted-community (social trust), or TEE-based (hardware trust)?
- Can speculative decoding or other inference optimizations reduce the latency penalty of distributed execution?
- Which professional communities are most likely to adopt federated fine-tuning early?
- How should evaluation suites be designed to resist gaming and ensure genuine quality improvement?
- What is the right balance between LoRA merging, mixture of experts, and evaluation-based selection?
- How do you prevent free-riding in a federated training community?
- Should the MVP Inference Mesh include any payment mechanism, or should that be deferred to the ValueTorrent integration?
- How do existing micropayment rails (Lightning Network, L2 stablecoins, Stripe) compare in practicality for per-request settlement?
- What minimum economic incentive is sufficient to get reliable P2P inference contributors?
- Who is the first user? A developer who wants local AI resilience? A company in a restricted jurisdiction? A privacy-conscious organization?
- What is the "aha moment" that makes someone install the gateway? Probably not sovereignty as an abstract principle. More likely: cost savings, privacy, offline capability, or provider-independent workflows.
- Is the right distribution channel a CLI tool, a desktop app, an IDE plugin, or a library?
- How fast is the quality gap between local/open models and frontier models closing? If it closes to "good enough" for 80% of tasks within 2 years, the sovereignty argument becomes much stronger.
- Will governments actively support or hinder sovereign AI infrastructure? National compute clusters are a positive signal. Potential restrictions on open model distribution are a negative one.
- Is the biggest risk actually not restriction, but rather economic dependency? Even if no one blocks access, if the entire economy depends on 3 API providers, the structural vulnerability is the same.
- Could a credible Inference Mesh with P2P capability become a bargaining chip that keeps centralized providers honest? The threat of exit may matter more than actual exit.
AI access is concentrating in a few providers who control model weights, inference infrastructure, and the policies governing use. This concentration creates risks: geopolitical blocking, commercial deprioritization, regulatory restriction, vendor lock-in, and economic exclusion.
The technical response is a sovereignty stack:
- Layer 0: keep local copies of open-weight models in portable formats.
- Layer 1: run inference on hardware you own.
- Layer 2: pool hardware with trusted peers for larger models.
- Layer 3: collectively fine-tune specialized models without centralizing data.
- Layer 4: build or co-own physical compute infrastructure.
The system that ties it together is an Inference Mesh: a local-first gateway that routes AI requests across local models, P2P swarms, and cloud APIs, failing over gracefully and scoring each backend for sovereignty, cost, quality, and privacy.
This Inference Mesh is the transport layer. The ValueTorrent protocol from the earlier brainstorm is the economic layer. Together, they form a complete stack for sovereign, incentive-compatible, decentralized AI access.
The starting point is not ambitious infrastructure. It is a simple, useful tool: a local gateway that makes your AI workflow resilient to any single provider going away. Everything else builds from there.