We’re thrilled to announce that key members of the Ensemble AI team are joining Cloudflare, bringing their expertise to help us advance our AI infrastructure efforts and make it simpler for developers to run powerful AI models at scale with greater efficiency.
Founded in 2023 in San Francisco, Ensemble AI has dedicated its efforts over the past few years to tackling one of the most pressing challenges in artificial intelligence: making large models serve faster, leaner, and more affordably — all without compromising on output quality. The team pioneered novel strategies in model compression and efficient inference, purpose-built to slash the memory demands, computing requirements, and deployment complexity associated with large language models and multimodal architectures.
As AI becomes central to modern application development, the cost-effectiveness of inference matters more than ever. Models continue to grow, workloads are becoming more dynamic, and customers expect AI to be globally distributed, fast, dependable, and reasonably priced. Integrating the Ensemble AI team into Cloudflare strengthens our capacity to deliver on that vision.
Bringing Ensemble’s expertise on board
Ensemble AI’s team has concentrated on preserving the internal structure of modern AI models while driving down the cost of operating them. Rather than framing model efficiency purely as a quantization or hardware challenge, Ensemble developed new foundational model building blocks that make neural networks more compact and performant at the architectural level.
A cornerstone of this effort is NdLinear, a drop-in replacement for conventional linear layers in transformer models that processes multidimensional activations directly, without discarding structural information through flattening. This lets models retain meaningful dimensions such as attention heads, feature channels, spatial coordinates, and other structured representations, while using fewer parameters and less computation. Ensemble also created NdLinear-LoRA, an efficient adaptation technique designed to minimize the number of trainable parameters required when fine-tuning large models.
These methods complement other performance techniques, including quantization and vector quantization. Together, they point toward a future where developers can run highly capable AI models with far less memory, compute, and cost.
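To give a rough sense of why operating on structured activations can save parameters, here is a minimal sketch in plain Python. It is not Ensemble's implementation: it assumes a simple scheme in which each axis of a 2D activation gets its own small weight matrix, and compares that with a dense layer applied to the flattened input. All function names and shapes are hypothetical and chosen only for illustration.

```python
from math import prod


def flat_linear_params(in_shape, out_shape):
    """Weights in a dense layer applied after flattening the input (biases ignored)."""
    return prod(in_shape) * prod(out_shape)


def modewise_linear_params(in_shape, out_shape):
    """Weights when each axis gets its own small matrix (an assumed scheme, biases ignored)."""
    return sum(i * o for i, o in zip(in_shape, out_shape))


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a nested-list matrix."""
    return [list(col) for col in zip(*m)]


def modewise_apply_2d(x, w_rows, w_cols):
    """Map x (d1 x d2) to (h1 x h2) as w_rows^T @ x @ w_cols, without flattening x."""
    return matmul(matmul(transpose(w_rows), x), w_cols)


if __name__ == "__main__":
    in_shape, out_shape = (8, 16), (8, 16)
    print(flat_linear_params(in_shape, out_shape))      # 16384
    print(modewise_linear_params(in_shape, out_shape))  # 320
```

In this toy setting the per-axis version needs 320 weights where the flattened dense layer needs 16,384. The trade-off is that a per-axis map is a more constrained function than a full dense layer, so the savings seen in a real model depend on how well the structure fits the data.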
Boosting the efficiency of AI inference
Cloudflare Workers AI provides developers with serverless, GPU-powered inference running across Cloudflare’s worldwide network. As developers build increasingly AI-native applications, the ability to serve models efficiently becomes a foundational aspect of the platform’s value.
One of the most significant obstacles to scaling AI-powered applications is the cost of inference. Each improvement in model compactness, memory consumption, throughput, and GPU utilization makes AI more attainable for developers and more cost-efficient for end users. This matters even more as AI workloads grow beyond basic text generation and expand into agentic systems, multimodal models, personalization engines, fine-tuning pipelines, retrieval systems, and reinforcement learning.
We are significantly deepening our investment in the core machine learning competencies needed to make Workers AI faster, more adaptable, and more economical. This builds upon our existing work in the inference efficiency space, including our Infire inference engine, tensor compression techniques such as Unweight, and our platform for running extra-large language models. The team will concentrate on enhancing the economics of serving large language models and other sophisticated AI architectures, with a strong focus on model efficiency, GPU utilization, and scalable deployment.
Engineering infrastructure for the next wave of AI workloads
AI infrastructure is entering a new era. Developers need more than access to models: they need infrastructure that can run those models reliably, cost-effectively, and close to their users. They need the freedom to experiment with different model sizes, fine-tuning strategies, and deployment patterns without being constrained by prohibitive costs or operational overhead.
Cloudflare is exceptionally well-positioned to address these challenges. Our global network footprint, developer-first platform, and serverless architecture provide the ideal foundation for bringing AI closer to where applications already operate. The Workers AI Machine Learning Engineering team will help us refine the efficiency layer that underpins the entire developer experience.
By merging Cloudflare’s worldwide infrastructure with Ensemble’s deep expertise in model compression and efficient architectures, we can continue building a platform where developers deploy AI applications at lower cost, with superior performance, and with minimal operational complexity.
Together, we’ll keep building the infrastructure that makes AI more efficient, more accessible, and more practical for developers everywhere. Our mission is straightforward: empower developers to run powerful AI workloads at a global scale while continuously improving the economics of inference across the Cloudflare platform. If you’d like to be part of this journey, take a look at our careers page.



