The Era of Distributed Compute (Part I)
We have previously discussed the AI investment supercycle at length (see our 2025 Annual Letter). Companies developing, serving, and leveraging artificial intelligence are scaling at unprecedented rates, commanding historic valuations, and capturing capital with an intensity that has no parallel. Outside of the major AI labs, most of these dollars are flowing into cloud infrastructure giants like Microsoft, Amazon, and Google, who either buy chips from NVIDIA or develop and deploy in-house silicon (Google TPU, AWS Trainium). The behemoths appear insurmountable; with every novel solution we see along the “compute stack,” the inevitable question becomes: why won’t “so and so giant” do this?
NVIDIA’s acquihire and licensing deal with Groq highlights one potential response: even the most dominant general-purpose platforms need specialized technology to cover the full AI lifecycle. We are moving to an era where “compute” is distributed and optimized more evenly throughout the value chain. Data centers and networking software become more built-to-purpose, more inference happens at the edge, and chips become more specialized for specific use cases.
To get a sense of where we are headed, we first need to understand the different categories of the “compute value chain” today and highlight the themes that will alter their shape for years to come.
Vast majority of dollars going to hardware
Most of the funding is still going towards model builders, who then spend that money on compute hardware. Although this space is still dominated by established players, energy and manufacturing bottlenecks have rewarded startups that can provide faster and/or more efficient chips. Additionally, a whole new set of ‘edge’ companies is beginning to gain traction: hardware companies that create ‘Physical AI’, or physical products that leverage AI, whether drones, robots, autonomous vehicles, or new sensors. Waabi’s autonomous truck is a good example of this, as it deploys and hosts AI locally in the truck’s chassis rather than relying on a wireless connection to a datacenter.
Most of the capex being invested (both in distributed compute altogether, as well as AI hardware specifically) continues to go to established players such as NVIDIA, AMD, and Broadcom, although we are starting to see startups carve out market share in the AI computing space, either by directly competing/partnering with giants or carving out niches. Groq, Cerebras, and Tenstorrent are notable examples.
How inference is changing the landscape
Historically, the majority of compute demand has come from massive training runs on increasingly large parameter size models, which created better models. Demand is now shifting away from training towards inference due to a few key factors:
Scaling law saturation
The massive compute buildout driving AI progress is likely to slow due to diminishing returns on hardware gains. Models like GPT-3 and GPT-4 were able to achieve better performance largely by increasing the amount of data, parameters, and compute used in their training runs. If these incremental performance improvements no longer match the exponential cost of building and operating ever-larger datacenters, the economics of huge training runs are no longer sustainable. In fact, model builders have effectively run out of new data, as high-quality, publicly available human-generated text has largely been consumed for training existing models. Instead, larger breakthroughs in the fundamental architectures behind these models are needed.
Inference vs training tradeoff
Research demonstrates that inference-time compute can often substitute for training compute more cost-effectively, a dynamic that only accelerates as inference chips improve. This shift is driving the popularity of ‘thinking’ models, which perform internal reasoning and planning sequences before generating a final answer.
The relationship between training and inference is shifting. While training represents a massive upfront fixed cost, inference is a variable cost that scales with usage. Historically, training dominated the balance sheet, but successful models reach a crossover point where cumulative inference costs exceed the initial training spend. With ChatGPT recently processing nearly 2.5 billion queries daily and Anthropic’s Claude serving tens of billions of API calls per month, we have entered the era where inference costs are outpacing training costs. Consequently, the market opportunity is flipping from building models to serving them efficiently.
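The crossover point described above is simple arithmetic, and a minimal sketch makes it concrete. The function name and every figure below (training cost, cost per query, daily query volume) are hypothetical placeholders chosen for illustration, not reported numbers for any real model.

```python
def crossover_days(training_cost_usd: float,
                   cost_per_query_usd: float,
                   queries_per_day: float) -> float:
    """Days of serving until cumulative inference spend equals training spend.

    Assumes a constant cost per query and constant daily volume, which is a
    simplification: in practice both change over a model's lifetime.
    """
    daily_inference_cost = cost_per_query_usd * queries_per_day
    if daily_inference_cost <= 0:
        raise ValueError("daily inference cost must be positive")
    return training_cost_usd / daily_inference_cost


# Hypothetical inputs: a $100M training run, $0.002 per query,
# and 50 million queries per day.
days = crossover_days(100_000_000, 0.002, 50_000_000)
print(f"Inference spend overtakes training spend after about {days:.0f} days")
```

Under these assumed inputs the crossover arrives after roughly 1,000 days; at ten times the query volume it would arrive in about 100 days, which is why heavily used models reach it quickly.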
Lead time constraints
Moving from renting small GPU clusters to building huge data centers requires years, and lead times are only getting longer as demand continues to outpace supply. Hooking up datacenters to the power grid can take years on its own. This lengthening timeline adds a mechanical bottleneck to training compute scaling.
Lagging supply of NVIDIA chips has encouraged new players to enter the space and forced adaptation among operators. Hyperscalers such as Google, Amazon, and Microsoft have successfully developed and launched their own specialized silicon to deploy within their own datacenters and reduce reliance on external vendors.
Chip Manufacturing Startups are Finding a Wedge
Startups are building purpose-built ASICs (application-specific integrated circuits, essentially chips designed to do one thing very well) to accelerate AI training and inference workloads, as opposed to general-purpose GPUs, which are less efficient for these tasks. Some companies have been especially successful at this specialization and are already being deployed. NVIDIA’s $20B ‘acquilicensing’ deal with Groq, a startup focusing on fast and cheap inference, is a notable example.
However, successfully challenging incumbents requires more than just speed. When evaluating AI inference hardware, the most critical metrics extend beyond theoretical limits to focus on actual bottlenecks and costs. Memory bandwidth is often the single most important technical factor. Inference workloads (especially LLMs) are memory-bound, so high bandwidth is required to feed data to compute units fast enough to prevent them from sitting idle. This technical capability directly influences the hardware’s throughput (tokens per second) and latency, which together measure how much volume the system can process and how quickly it responds to real-time inputs.
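To see why memory bandwidth matters so much, consider a common back-of-envelope bound: when generating one token at a time, the chip must read roughly all of the model's weights from memory for each token, so bandwidth divided by model size caps single-stream throughput. The sketch below encodes that rule of thumb. It is an approximation under stated assumptions (it ignores batching, KV-cache traffic, and compute limits), and the model size and bandwidth figures are illustrative, not the specifications of any particular chip.

```python
def max_decode_tokens_per_second(params_billion: float,
                                 bytes_per_param: float,
                                 bandwidth_gb_per_s: float) -> float:
    """Rough upper bound on single-stream decode speed for a memory-bound model.

    Assumes every weight is read from memory once per generated token.
    """
    model_size_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    return bandwidth_gb_per_s / model_size_gb


# Illustrative: a 70B-parameter model on a chip with 2,000 GB/s of bandwidth.
fp16 = max_decode_tokens_per_second(70, 2.0, 2000)   # 16-bit weights
int4 = max_decode_tokens_per_second(70, 0.5, 2000)   # 4-bit weights
print(f"16-bit weights: ~{fp16:.1f} tokens/s, 4-bit weights: ~{int4:.1f} tokens/s")
```

The bound rises only when bandwidth increases or when fewer bytes have to move per token, which is why architectures that keep weights close to the compute units are attractive.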
From a business perspective, the deciding metric is performance per dollar. While raw speed is important, companies deploying models via cloud GPUs prioritize the cost per 1 million tokens to ensure economic viability. Ultimately, a chip’s success depends on its ability to balance high memory bandwidth for sustained throughput with a cost structure that makes large-scale deployment affordable.
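Cost per 1 million tokens follows directly from what the hardware costs per hour and how many tokens it sustains per second. The sketch below shows that conversion; the hourly price and throughput values are hypothetical and do not reflect any vendor's actual pricing.

```python
def cost_per_million_tokens(hourly_cost_usd: float,
                            tokens_per_second: float) -> float:
    """Dollars spent to generate one million tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical comparison: a cheaper, slower accelerator versus a pricier,
# faster one. Neither figure is a real quote.
slow = cost_per_million_tokens(hourly_cost_usd=2.0, tokens_per_second=1000)
fast = cost_per_million_tokens(hourly_cost_usd=3.0, tokens_per_second=3000)
print(f"slower chip: ${slow:.3f} per 1M tokens, faster chip: ${fast:.3f} per 1M tokens")
```

In this assumed comparison the more expensive chip is still cheaper per token, because its throughput advantage outweighs its higher hourly price. That is the sense in which performance per dollar, rather than raw speed or sticker price, decides the outcome.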
Some companies in this space have chosen distinct architectures to optimize for the metrics above, often sacrificing flexibility to achieve breakthroughs in efficiency or cost.
Companies like Mythic differentiate by attacking the memory bandwidth bottleneck. By using analog compute-in-memory, they store model weights directly within the compute units, eliminating the need to constantly move data back and forth from external memory. This architecture removes the primary bottleneck for inference, allowing them to deliver high performance at just 3–4 watts, ideal for battery-constrained edge devices where traditional GPUs are too power-hungry.
Better performance and efficiency on paper is the first step. Integrating these chips into actual servers and systems to realize those theoretical gains is the second. We are currently between the two stages: many chips and architectures have been designed and proven at small scale, but wider commercial adoption is only beginning to take off. Progress on this second step can largely be measured by the overall maturity of the system and by how well it can be integrated into existing software stacks and accepted by the developer community.
System Maturity
Commercial viability is measured by the progression from initial prototype silicon to large-scale deployment, and the current landscape spans the full spectrum. For the datacenter contenders, the next critical milestone is moving beyond pilot projects (1–8 systems) to shipping multi-rack configurations, verifying that power, cooling, and networking remain stable at production scale.
Software Stack
The software stack acts as the critical infrastructure layer that translates user code into instructions the hardware can execute. A mature stack supports high-level frameworks like PyTorch and TensorFlow with “plug-and-play” capability. This is a key advantage for companies that abstract their FPGA hardware behind standard APIs and libraries. In contrast, less mature stacks require manual tuning or custom operator development, significantly slowing (and discouraging) adoption regardless of the hardware’s theoretical speed.
Tackling power constraints
One common theme that merits attention is an emphasis on power efficiency. Datacenters are running into a “power wall” as they struggle to get electricity from already strained grids. Companies like Meta and Microsoft are investing in dedicated power generation facilities for their data centers. Other novel solutions are being proposed to meet surging energy demand, creating a whole new market. Modern nuclear and geothermal energy are both promising technologies that can help meet the increase in 24/7 energy demand, with the latter being especially attractive to datacenters.
One can envision a (near) future where the limiting variable shifts from total hardware availability to power availability. Raw compute volume becomes irrelevant if the power envelope is exceeded. The relevant question would then be how much compute can fit within a certain power envelope. With more efficient hardware, datacenters of the same size can be far more powerful without using more electricity. This allows new chips to further lower the price of per-token inference, a trend necessary for model builders’ profitability timelines, continued practical scaling, and competition with Chinese and open-source models.
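The question of how much compute fits within a power envelope can be sketched as a small calculation. Everything here is hypothetical: the `Chip` type, the two example chips, and the 1 MW budget are illustrative assumptions, and the model ignores cooling, networking, and other facility overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    name: str
    watts: int               # power draw per chip
    tokens_per_second: int   # sustained inference throughput per chip


def throughput_within_envelope(chip: Chip, power_budget_watts: int) -> int:
    """Total tokens/s achievable when chip count is limited only by power."""
    chips_that_fit = power_budget_watts // chip.watts
    return chips_that_fit * chip.tokens_per_second


# Two made-up chips: one faster per chip, one more efficient per watt.
fast_chip = Chip("fast-but-hungry", watts=700, tokens_per_second=1000)
lean_chip = Chip("slower-but-efficient", watts=350, tokens_per_second=800)

budget = 1_000_000  # a hypothetical 1 MW envelope
for chip in (fast_chip, lean_chip):
    total = throughput_within_envelope(chip, budget)
    print(f"{chip.name}: {total:,} tokens/s within {budget:,} W")
```

With these assumed figures, the chip that is slower individually delivers more total throughput once power is the binding constraint, because nearly twice as many units fit inside the same envelope.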
Part I Summary:
In sum, the AI investment supercycle is evolving from a race to build ever-larger training clusters into a broader, more nuanced contest over where compute lives and how efficiently it can be delivered. As inference becomes the dominant cost center and scaling laws begin to saturate, the industry’s center of gravity shifts from brute-force general-purpose infrastructure toward specialized, power-aware systems optimized for real-world deployment.
This transition creates meaningful wedges for startups—not necessarily by displacing hyperscalers, but by supplying the purpose-built chips, edge hardware, and software stacks required to serve AI at scale. Ultimately, the next phase of the compute value chain will be defined less by who can buy the most GPUs, and more by who can deliver the lowest-cost, highest-throughput inference within increasingly constrained power and deployment environments.
—
Tune in for Part II as we tackle what inference means for an explosion of edge applications and how, on the other end of the compute spectrum, NVIDIA’s stranglehold on the developer environment may be loosening.



