
AI & LLM

Jul 21, 2025


The rise of on-device AI and the return of data ownership

Discover how on-device AI is reshaping the tech landscape by prioritizing privacy, speed, and user control, marking a powerful shift toward true data ownership and away from cloud dependency.

For years, developers defaulted to the cloud not because it was ideal, but because it was the only viable path to intelligent products.

Architectures leaned on hosted models, remote APIs, and token-based pipelines by necessity.

Our shift to on-device AI wasn’t about offline mode or privacy checkboxes. It was about reclaiming inference. We moved the full stack, from classification through generation, onto the device. No gateways. No orchestration. No dependency on someone else’s infrastructure.

Instead of relying on massive general-purpose LLMs, we built a mesh of task-specific nano-models: fast, lightweight, and precise. Think reflexes, not reasoning.

The result?

A pipeline that resolves in <150ms, eliminates token costs, and removes third-party callouts entirely. Our temporal intent classifier now outperforms GPT-4, Gemini Flash, and LLaMA-3 3B by up to 16% in weighted F1 – at 55× the throughput.
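In case the metric is unfamiliar, weighted F1 is simply the per-class F1 score averaged by each class’s support. Here is a minimal sketch with scikit-learn; the labels are made up, and this is not our benchmark harness:

```python
# Illustrative only: "weighted F1" averages per-class F1 by class support.
# These labels are hypothetical; this is not our benchmark harness.
from sklearn.metrics import f1_score

y_true = ["past", "future", "none", "past", "future", "none", "past"]
y_pred = ["past", "future", "none", "none", "future", "none", "past"]

score = f1_score(y_true, y_pred, average="weighted")
print(f"weighted F1: {score:.3f}")
```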

More than just speed, we gained reliability, determinism, and a system we could trust under real-world load.


What we built, and why it works

The architecture began with a simple reframing: large models weren’t giving us more power – they were introducing more surface area. 

So we broke the problem down into composable, traceable tasks. 

Is this query temporal? What span does it reference? Is it past-facing or forward-looking? Should the system retrieve, plan, remind, or summarize?

Each of these questions became its own model target, backed by fine-tuned, distilled networks trained on our own labeled data. We used proven open-weight foundations – LLaMA, Mistral, Phi-2 – and pushed them through aggressive quantization pipelines: 4- and 8-bit variants, fused operations, low-rank adapters. 

Each model sits in the 20M–80M parameter range, designed not to generalize broadly, but to execute specifically and correctly.
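To give a feel for what that setup looks like, here is a minimal sketch of a 4-bit base model with low-rank adapters, using Hugging Face transformers, bitsandbytes, and peft. The model name, target modules, and hyperparameters are placeholders; our actual distillation and fusion pipeline goes further, but the pattern is the same.

```python
# Sketch of a 4-bit base model with low-rank adapters for task-specific fine-tuning.
# Model name, target modules, and hyperparameters are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights
    bnb_4bit_quant_type="nf4",              # normal-float quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",                      # placeholder open-weight foundation
    quantization_config=bnb_config,
    device_map="auto",
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # placeholder: attention projections only
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()          # only a tiny fraction of weights are trainable
```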

What emerged is a microservice-like inference system – a pipeline of nano-models that hand off structured data from one phase to the next. 

This approach gives us full visibility into behavior, reliable fallbacks, and efficient hardware usage, all without relying on a single generalist model to do it all.
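Here is a simplified sketch of that hand-off pattern. The stage names, fields, and keyword heuristics below are stand-ins for the real nano-models; the point is that each hop passes structured data to the next, so every decision stays inspectable.

```python
# Simplified sketch of nano-models handing structured results down a pipeline.
# The keyword checks are stand-ins for small fine-tuned classifiers; the stage
# and field names are illustrative, not our real interfaces.
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueryContext:
    text: str
    is_temporal: Optional[bool] = None  # stage 1: temporal intent
    direction: Optional[str] = None     # stage 2: "past" or "future"
    action: Optional[str] = None        # stage 3: retrieve / remind / summarize

def classify_temporal(ctx: QueryContext) -> QueryContext:
    # Stand-in for a ~20M-80M parameter temporal-intent classifier.
    ctx.is_temporal = any(w in ctx.text.lower() for w in ("yesterday", "tomorrow", "last week"))
    return ctx

def classify_direction(ctx: QueryContext) -> QueryContext:
    if ctx.is_temporal:
        ctx.direction = "future" if "tomorrow" in ctx.text.lower() else "past"
    return ctx

def route_action(ctx: QueryContext) -> QueryContext:
    # Downstream stages consume structured fields, not raw model output.
    ctx.action = "remind" if ctx.direction == "future" else "retrieve"
    return ctx

def run_pipeline(text: str) -> QueryContext:
    ctx = QueryContext(text=text)
    for stage in (classify_temporal, classify_direction, route_action):
        ctx = stage(ctx)                # each hop is small, fast, and traceable
    return ctx

print(run_pipeline("What did I work on yesterday?"))
```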


The industry shift toward on-device AI

As we re-architected our own stack, it became clear we weren’t alone. The broader market was converging on the same realization: cloud-first AI has real limits in cost, compliance, and user experience.

That shift is now showing up in hardware. Apple's A18 chip features a 16-core Neural Engine built to run LLMs directly on the iPhone.

Qualcomm’s latest Snapdragon platforms deliver over 10 TOPS of AI performance on-device. Microsoft’s Copilot+ PCs ship with dedicated NPUs optimized for local generative workloads. Even Chromebooks now include tensor accelerators.

This isn’t a niche optimization; it’s a full-spectrum realignment of where AI lives. From inference latency to data privacy to long-term cost structure, the industry is moving away from centralized intelligence and toward a new default: local-first by design.

And it’s not just hardware engineers pushing the message. As Clément Delangue, CEO of Hugging Face, asked in a widely shared 2025 post:

Everyone is talking about how we need more AI data centers... why is no one talking about on-device AI?
Running AI on your device:
– Free
– Faster & more energy efficient
– 100% privacy and control (you don’t send your data to an API)

That level of clarity resonated with the developer community because it reframes the on-device trend not as a constraint, but as a better foundation.


So, how does on-device AI actually work?

At its core, on-device AI means that models run where the user is, not in a remote data center, but directly on the device’s CPU, GPU, or NPU. There are no API calls to the cloud, no token streams crossing the network, no orchestration layers mediating the request. Inference happens locally, in memory, with all the data and context staying on-device by default.

[Illustration: how on-device models work]

If you're looking for a deep dive, there's no better technical breakdown than this: A System’s View of On-Device Foundation Models. It lays out the constraints, tradeoffs, and mechanics in a way that actually respects the problem space.
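To make the mechanics concrete, here is what fully local inference can look like using llama-cpp-python with a quantized GGUF file, one common way to run a model entirely in-process. The model path is a placeholder, and this is an illustration of the pattern, not our engine:

```python
# Local, in-process inference: no API calls, no tokens leaving the machine.
# Illustrative only; the model path is a placeholder for any quantized GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/phi-2.Q4_K_M.gguf",  # quantized weights stored on disk
    n_ctx=2048,                                # context window
    n_gpu_layers=-1,                           # offload layers to the GPU if available
)

out = llm(
    "Summarize: the meeting moved to Thursday at 10am.",
    max_tokens=64,
    temperature=0.2,
)
print(out["choices"][0]["text"])
```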

At Pieces, we took this philosophy and built an inference engine around it: a composable stack of nano-models, each handling a scoped task like temporal classification, summarization, or memory retrieval.

Everything runs locally. No cloud fallback required. Just fast, explainable intelligence built to live on your machine.

Pieces allows you to run full LLM interactions offline. If you're on an Apple Silicon Mac or a Windows machine with a supported GPU, you can download an on-device model through our Ollama integration and keep working even without internet access. 

You can even switch models mid-conversation, from a cloud provider like Claude or Gemini to a local model, without losing chat history or context.
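As a sketch of what the local leg of that conversation looks like, here is a chat request against an Ollama-served model over its default local HTTP API, carrying an existing message history along. This illustrates the general mechanism, not our integration, and the model name is simply whatever you have pulled locally:

```python
# Chatting with a locally served model over Ollama's HTTP API (default port 11434).
# Run `ollama pull llama3` (or any model you prefer) before going offline.
import requests

history = [
    {"role": "user", "content": "Remind me what we decided about the caching layer."},
    {"role": "assistant", "content": "You chose a write-through cache with a 5 minute TTL."},
    {"role": "user", "content": "Draft a short summary of that decision."},
]

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "llama3", "messages": history, "stream": False},
    timeout=120,
)
print(resp.json()["message"]["content"])
```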

It’s especially useful in environments with unreliable connectivity (think flights or remote work) or strict data governance (think financial services and regulated industries).

Developers at large organizations use Pieces to stay compliant without compromising on capability.

The same architecture powers how we enrich saved code. 

Our Pieces Drive uses LLMs to annotate snippets with tags, descriptions, links, and suggested queries, all of which can be generated offline, securely, and privately using your selected local model.

Whether you’re disconnected or working inside a privacy-restricted workspace, Pieces stays intelligent without ever sending your code out of your environment.
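As an illustration of the kind of offline enrichment this enables, here is a sketch that asks a locally served model to return tags and a description for a snippet as JSON. The prompt, model, and schema are illustrative, not the Pieces Drive internals:

```python
# Sketch of offline snippet enrichment: ask a local model for tags and a description
# as JSON. Prompt, model name, and schema are illustrative, not Pieces Drive internals.
import json
import requests

snippet = "async function fetchUser(id) { return (await fetch(`/api/users/${id}`)).json(); }"

prompt = (
    "Return JSON with keys 'tags' (list of strings) and 'description' (one sentence) "
    f"for this code snippet:\n{snippet}"
)

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": prompt, "format": "json", "stream": False},
    timeout=120,
)
metadata = json.loads(resp.json()["response"])
print(metadata["tags"], metadata["description"])
```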

Download your model before you go offline; after that, the entire AI stack is yours to run, anytime, anywhere.


Privacy is the main architectural choice

Much of the conversation around AI privacy still assumes that systems need to “respect” user data, and that compliance is a layer to be added on top.

On-device AI inverts that premise. Instead of asking how to protect cloud-bound data, it eliminates the surface area entirely.

No personal data leaves the device unless explicitly permitted. No tokens are exchanged. No inference logs are stored by third parties. The entire model stack, from input to response, runs locally, auditable from end to end.

This shift also reshapes how enterprises approach regulatory pressure. GDPR, CCPA, and other privacy frameworks aren’t edge cases – they’re becoming global defaults.

By moving inference to the device, companies reduce legal exposure, simplify compliance, and reclaim architectural control.

But data protection isn't just a compliance story anymore; it’s becoming a systems debate. When Elon Musk tweeted in March 2025,

xAI and X’s futures are intertwined… Today, we officially take the step to combine the data, models, compute, distribution and talent.

it was more than a business move. It was a signal. The most valuable training data in the world, real human expression, is now fully integrated into AI pipelines. This merger raised new questions: Who owns the data behind generative AI? What does consent mean when public content is treated as a training corpus?

Former OpenAI researcher Suchir Balaji framed it succinctly in what became a widely shared final post:

Fair use seems like a pretty implausible defense for a lot of generative AI products, for the basic reason that they can create substitutes that compete with the data they’re trained on.

That tension between open models and owned content is at the core of modern AI system design. On-device AI isn’t a way to sidestep that conversation. It’s a way to build responsibly within it.


From cloud costs to fixed compute

The economic implications are as significant as the technical ones.

Traditional cloud-first AI stacks carry hidden operational risk: variable token costs, unpredictable latency, and runaway cloud compute spend. Every new user is a multiplier on someone else’s infrastructure until the bill arrives.

On-device AI reverses that calculus. Inference becomes free at the margin. Costs become architectural rather than usage-based. Performance scales with user hardware, not centralized server farms.
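A toy cost model makes the contrast concrete. Every number below is a hypothetical input, not a real price list; the point is the shape of the curves, not the values:

```python
# Toy cost model: usage-priced cloud inference vs. local inference.
# Every number below is a hypothetical assumption, not a real price list.
def cloud_monthly_cost(users: int, tokens_per_user: int, usd_per_1k_tokens: float) -> float:
    # Scales with usage: every new user and every extra token grows the bill.
    return users * tokens_per_user / 1_000 * usd_per_1k_tokens

def local_marginal_cost() -> float:
    # Inference runs on hardware the user already owns, so the marginal cost of
    # serving one more request is effectively zero; spend shifts to design time.
    return 0.0

for users in (1_000, 10_000, 100_000):
    cloud = cloud_monthly_cost(users, tokens_per_user=200_000, usd_per_1k_tokens=0.002)
    print(f"{users:>7} users: cloud ~ ${cloud:,.0f}/mo, local marginal ~ ${local_marginal_cost():,.2f}")
```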

For enterprise buyers, this means lower operational costs, reduced network strain, and infrastructure that scales linearly without spiking unpredictably. It also means compliance becomes easier, not harder, as systems consolidate around local execution.

But the cost equation goes beyond finance and into energy. And here, the numbers tell their own story.

A recent back-of-the-envelope analysis compared the per-token energy cost of three AI configurations: a small model like FLAN-T5 running on CPU; a 70B model like LLaMA-2 on a GPU cluster; and a GPT-4-class mixture-of-experts model running across an H100 setup. The results were staggering.

| Model Class | Hardware | Energy per Token | kWh per 1M Tokens (with PUE) | CO₂ per 1M Tokens |
| --- | --- | --- | --- | --- |
| 77M (small) | CPU | ~1 mJ | 0.00034 kWh | 0.14 g CO₂ |
| 70B (LLaMA) | 20× A100s | 3–4 J | 1.0–1.3 kWh | 400–530 g CO₂ |
| 1T+ (GPT-4-class) | H100 cluster | 8–12 J (est.) | 2.2–3.3 kWh | 900–1300 g CO₂ |

In other words, the jump from a CPU-optimized 77M model to a cloud-deployed 70B model increases per-token energy by three orders of magnitude. When you zoom out to system-level design, this isn't a matter of marginal gains – it’s about carbon footprint, scalability, and total infrastructure impact.

And that’s before you factor in training, cooling, or networking.
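The kWh column follows from straightforward unit conversion. As a quick check of the small-model and 70B rows (the PUE of roughly 1.2 for data-center overhead is our assumption):

```python
# Quick unit-conversion check of the kWh column above.
# The PUE of ~1.2 (data-center overhead) is our assumption to reproduce the figures.
JOULES_PER_KWH = 3.6e6

def kwh_per_million_tokens(joules_per_token: float, pue: float = 1.2) -> float:
    return joules_per_token * 1_000_000 * pue / JOULES_PER_KWH

print(kwh_per_million_tokens(0.001))  # 77M model at ~1 mJ/token -> ~0.00033 kWh
print(kwh_per_million_tokens(3.5))    # 70B model at ~3.5 J/token -> ~1.17 kWh
```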

By running inference locally, we eliminate not just cost volatility but also a substantial portion of the environmental footprint. When multiplied across millions of users, the impact becomes systemic, and this is where the industry is headed, whether we acknowledge it or not.


Use cases that actually make sense

Not every AI problem should be solved on-device. But for the kinds of high-frequency, low-complexity tasks users rely on daily, local wins, every time.

Think: voice transcription, image enhancement, language translation, memory recall, meeting summarization, keyboard suggestions. These aren’t tasks that require 70B parameters. They’re tasks that require speed, specificity, and trust.

And now, they also require energy proportionality. When inference is measured not only in latency or dollars but also in watts and grams of CO₂, local execution becomes not just a technical preference but a climate-responsible design choice.

In every one of these domains, we've seen local models outperform their cloud counterparts not just in latency, but in task precision, interpretability, and user satisfaction.


Where is this going?

The conversation about on-device AI is not about catching up with the cloud. It’s about moving forward on a different path. One where intelligence is embedded, not streamed. Where memory is local. Where privacy is the default. Where cost doesn’t scale with usage. And where systems are designed to be understood, not abstracted away.

As you think about where to build and how to deploy AI responsibly, we’d encourage you to ask some simple questions:

  • Do I need 175B parameters to identify a URL in a note?

  • Is cloud inference necessary for tagging a calendar item?

  • Am I optimizing for abstraction or for actual user outcomes?

Because for us, the moment we reframed these as systems questions, the path forward became obvious.

We didn’t just optimize a pipeline. We rethought the foundation.

And we’re building from here.

Talk to us if you’re curious how Pieces works on-device.
