The Era of Distributed Compute (Part II)
In Part I, we covered how startups are supplying the purpose-built chips, edge hardware, and software stacks required to serve AI at scale. We discussed how the next phase of the compute value chain will be defined less by who can buy the most GPUs, and more by who can deliver the lowest-cost, highest-throughput inference within increasingly constrained power and deployment environments.
In Part II, we tackle both ends of the compute barbell: how inference efficiency will power an explosion of edge applications, and how NVIDIA’s stranglehold on the programming environment for chips may be loosening.
A quick refresher on the value chain:
Efficient Inference Enables an Explosion of Edge Applications
As the marginal returns from scaling laws shrink and inference efficiency improves, the cost of computing at the edge falls. Venture investors are already placing bets on edge deployment. We mapped 100 of the most important early-stage companies within the above value chain and noted that over $20BN has gone towards edge applications, robotics, and autonomous systems over the past three years. This makes it the second-largest capital-raising category behind the model builders themselves.
More efficient hardware also opens up the possibility of more powerful edge computing in physical AI, making it possible to run sophisticated models on battery-powered or resource-constrained hardware. A huge variety of industries can (and already do) benefit from this. Three main benefits are worth highlighting.
Ultra-Low Latency
Processing on the device allows for nearly instantaneous analysis and action, reducing latency caused by a connection to an external datacenter. For example, autonomous vehicles must make life-or-death decisions (like braking for a pedestrian) in milliseconds. They cannot afford the time it takes to send camera data to a cloud server and wait for a response.
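To make the latency argument concrete, here is a rough back-of-the-envelope sketch of how far a vehicle travels while it waits for a decision. The speed and both latency figures are illustrative assumptions chosen for the arithmetic, not measurements from any real system.

```python
# Illustrative latency budget. Every number below is an assumption, not a measurement.

def distance_travelled_m(speed_kmh: float, latency_ms: float) -> float:
    """Distance (in metres) a vehicle covers while waiting latency_ms for a decision."""
    speed_m_per_s = speed_kmh * 1000 / 3600
    return speed_m_per_s * latency_ms / 1000

SPEED_KMH = 50.0             # assumed urban driving speed
ON_DEVICE_MS = 20.0          # assumed on-device inference time
CLOUD_ROUND_TRIP_MS = 150.0  # assumed network round trip plus remote inference

edge_m = distance_travelled_m(SPEED_KMH, ON_DEVICE_MS)
cloud_m = distance_travelled_m(SPEED_KMH, CLOUD_ROUND_TRIP_MS)

print(f"on-device: {edge_m:.2f} m travelled before a decision")
print(f"cloud:     {cloud_m:.2f} m travelled before a decision")
```

Under these assumed numbers the vehicle covers roughly 0.3 m before an on-device decision and roughly 2 m before a cloud decision. A real budget would also need to account for sensor capture and actuation time.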
Improved Reliability
Edge computing ensures critical systems function even when internet connectivity is spotty, jammed, or nonexistent. For example, military drones operating in remote or contested environments can use onboard AI to identify targets and navigate terrain without relying on a vulnerable link to a command center.
Security
Keeping sensitive data on the local device minimizes the risk of leaks during transmission and storage. For example, wearable medical monitors can analyze patient heart rhythms or blood glucose levels locally to detect abnormalities, ensuring that highly regulated, private health data remains on the patient’s device rather than constantly streaming to a cloud server.
Paired with more powerful hardware, intelligent software on the edge can also be deployed to sharpen sensors and boost accuracy. At the highest level, a sensor’s hardware is simply a device that collects raw data (radar returns, images, waveforms, etc.), and the real “detection” happens when software runs an algorithm over that data to predict whether something of interest is present.
As edge hardware becomes more capable, it can run smarter AI models locally, but accuracy gains typically come from the combination: better data acquisition hardware to capture higher-fidelity signals, better on-device compute to handle high-bandwidth streams, and smarter algorithms that denoise, fuse, and classify those signals in real time.
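As a toy illustration of the split between raw data and software-side "detection", the sketch below smooths a noisy stream and flags samples above a threshold. The sample values, window size, and threshold are made up for illustration, and a production system would likely use a learned model rather than a fixed threshold.

```python
from statistics import mean

def moving_average(samples, window=3):
    """Denoise by averaging each sample with the samples just before it."""
    smoothed = []
    for i in range(len(samples)):
        start = max(0, i - window + 1)
        smoothed.append(mean(samples[start:i + 1]))
    return smoothed

def detect(samples, threshold, window=3):
    """Return the indices where the smoothed signal exceeds the threshold."""
    smoothed = moving_average(samples, window)
    return [i for i, value in enumerate(smoothed) if value > threshold]

# Hypothetical raw sensor readings: one isolated spike (index 3)
# followed by a sustained rise (indices 6 to 8).
raw = [0.1, 0.2, 0.1, 0.9, 0.2, 0.1, 1.0, 1.1, 1.2, 0.2]
hits = detect(raw, threshold=0.7)
print(hits)
```

With these values the isolated spike is averaged away and only the sustained rise is flagged. Better sensors improve `raw`, better compute allows a larger stream, and better algorithms replace `detect` with something smarter.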
Scaling laws still apply at these levels. A larger network of sensors, or more powerful sensors that each provide more data, can be paired with more powerful inference hardware to run more accurate models and deliver better results. In this environment, companies that leverage both the sensing hardware and the software layer are best positioned to capture value.
Loosening the CUDA Stranglehold, Unlocking More Developer Innovation
Arguably more important than the models themselves is the software that controls their development and deployment. For over a decade, NVIDIA’s CUDA (Compute Unified Device Architecture) has been the unshakeable bedrock of AI. It sits between the researcher’s abstract Python instructions and the physical silicon, acting as the ‘skeleton’ of the entire software stack. Because every major library (PyTorch, TensorFlow) was optimized for CUDA first, NVIDIA built a self-reinforcing loop: developers learned CUDA, built tools for CUDA, and bought chips that ran CUDA.
It is incredibly hard for competitors to escape because most AI researchers and engineers have built their entire workflows and legacy code on CUDA-specific foundations. Hardware startups cannot simply build faster or cheaper chips; they must also make them usable. This means replicating thousands of optimized functions and guaranteeing compatibility with frameworks that are already fine-tuned for NVIDIA. The moat is not just the code itself, but the millions of developer hours invested in it, making switching costs prohibitively high.
However, cracks are starting to appear. While CUDA remains dominant for training the largest frontier models, the inference market is diverging. Abstraction layers like OpenAI’s Triton and increasingly mature compilers (like XLA) allow code to be written once and deployed across different architectures. These tools act as universal translators, compiling high-level code down to efficient machine code for a range of accelerators, not just NVIDIA’s GPUs. This decoupling is critical: if a developer can write high-level Python code that compiles just as efficiently to an AMD MI400 or a Google TPU as it does to an NVIDIA H200, the CUDA moat begins to erode.
Hyperscalers are accelerating this erosion by developing their own vertically integrated stacks. By designing proprietary AI silicon (like Microsoft’s Maia and Google’s TPUs) and coupling it with platform services (AWS Bedrock, Azure AI Foundry), they are actively decoupling from NVIDIA’s margins and ecosystem lock-in. Owning the entire vertical stack, from the chips up to the models and platforms, lets them optimize performance and cost, securing their role as the primary engines for AI development rather than just utility providers.
This shift is creating a more mature AI toolchain: the set of software used to build, test, and deploy models. As with hardware, the focus has shifted from creation (training) tools to operational tools (running models safely and efficiently). We are seeing a consolidation of inference engines, software layers that are indifferent to the underlying model. These tools automatically handle complex tasks like continuous batching (a technique to maximize GPU utilization) and quantization (shrinking models to fit on smaller chips). This standardization is vital because it allows companies to swap out the underlying model (e.g., switching from GPT-5 to Llama 4 to DeepSeek) without rewriting their entire application code.
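For readers curious what quantization looks like mechanically, here is a minimal sketch of one common variant, symmetric 8-bit quantization of a single weight list. It is a simplified illustration with made-up weights. It is not how any particular inference engine implements it, and real engines typically quantize per channel or per block and handle activations as well.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]

# Hypothetical weights, chosen only to keep the arithmetic easy to follow.
weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)          # integers that fit in one byte each
print(max_error)  # rounding error introduced by quantization
```

Each weight now needs one byte instead of the four used by a 32-bit float, which is where the memory saving comes from. The cost is a small rounding error, bounded here by half of `scale`.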
As the hardware layer diversifies, the software layer must become more flexible. The era of one-size-fits-all coding is ending as specialized inference chips (like those from Groq or Cerebras) enter the market. New memory-focused architectures are emerging to support massive context windows, requiring libraries that prioritize moving data efficiently rather than just crunching numbers. This fragmentation means open-source libraries are becoming increasingly modular. Developers can no longer assume they are running on a standard GPU; instead, they rely on smart intermediate compilers to translate their code for whatever hardware is available.
Incumbents are not standing still
The AI market is undergoing a radical shift, moving from a high-cost experimentation phase into a diversified, efficiency-driven one. Model builders are currently in a “race to the bottom” regarding inference costs. As older models become “good enough” for most commercial tasks, fewer customers are willing to pay premium prices for marginal performance gains.
These cost differences are massive. A prime example is the pricing disparity within OpenAI’s own lineup. At the time of writing, GPT-4o costs $2.50 per million input tokens, while the flagship reasoning model, o1, costs $15.00, or 6x as much. For routine tasks like summarization or basic coding, the incremental quality of o1 is rarely worth the premium. While exceptions exist for complex reasoning or hard sciences, the vast majority of enterprise volume settles on the most cost-effective option.
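To see how that per-token gap scales with volume, the short sketch below applies the two input-token prices quoted above to a monthly workload. The workload size is an assumption for illustration, and the calculation ignores output tokens, which are billed separately.

```python
# Input-token prices in USD per million tokens, as quoted in the text above.
PRICE_PER_MILLION_INPUT_TOKENS = {"GPT-4o": 2.50, "o1": 15.00}

# Assumed workload: 2 billion input tokens per month (illustrative only).
MONTHLY_INPUT_TOKENS = 2_000_000_000

def monthly_input_cost(model: str, tokens: int) -> float:
    """Monthly spend on input tokens only, in USD."""
    return PRICE_PER_MILLION_INPUT_TOKENS[model] * tokens / 1_000_000

gpt4o_cost = monthly_input_cost("GPT-4o", MONTHLY_INPUT_TOKENS)
o1_cost = monthly_input_cost("o1", MONTHLY_INPUT_TOKENS)
ratio = o1_cost / gpt4o_cost

print(f"GPT-4o: ${gpt4o_cost:,.0f} per month")
print(f"o1:     ${o1_cost:,.0f} per month")
print(f"ratio:  {ratio:.0f}x")
```

At this assumed volume the same workload costs $5,000 per month on GPT-4o and $30,000 on o1, a $25,000 monthly difference before output tokens are counted.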
To capture this volume, builders are releasing dedicated lightweight models. Grok 4 Fast, for instance, targets the sub-$0.50 price point, offering GPT-4 Turbo-class capabilities at a cost structure that makes always-on applications financially viable. This commoditization forces even premium providers to compress their margins, as the premium for being the “smartest” becomes harder to justify against “good enough” alternatives.
This pressure is compounded by the explosion of Chinese open-weight models. By late 2025, models like DeepSeek V3 and Alibaba’s Qwen series captured nearly 30% of global open-source AI usage, up from just 1% the previous year. The value proposition is hard to ignore: near parity in performance with top-tier US models (like Claude Sonnet or Gemini) at 1/10th the inference cost, or free if self-hosted. Developers are aggressively capitalizing on this arbitrage, with tools like Cursor exploring variations run exclusively on these cheaper architectures. Ultimately, this abundance forces all model builders to compete on utility and ecosystem lock-in, rather than just raw intelligence.
Final Thoughts
A few main takeaways emerge. The AI landscape is shifting from a phase of unchecked training expenditure to one defined by inference efficiency and economic viability. As scaling laws face diminishing returns, the value proposition is moving away from generic model intelligence, which is rapidly commoditizing, toward the specialized silicon and software stacks required to serve these models at scale. The loosening of NVIDIA’s stranglehold and the rise of hyperscaler vertical integration confirm that the next winners will be defined by their ability to optimize performance per dollar rather than just raw compute power.
Consequently, the investment lens must widen beyond the datacenter to the edge and physical AI applications that this cheaper inference unlocks. Capital is flowing toward companies addressing critical power constraints and those building the agnostic infrastructure that allows developers to bypass legacy lock-in and take advantage of novel hardware architectures. Looking forward, the next phase of value creation may be defined less by the models themselves, and more by the hardware and systems that bridge the gap between theoretical intelligence and reliable, real-world utility.




