Aug 13, 2025

How does gpt-oss compare to Gemma 3n architecture?

Inside our ML team’s week-long debate on OpenAI’s newly open-sourced GPT-OSS models versus Google’s Gemma 3n architecture, from kernels and quantization tricks to efficiency, multimodality, and the quiet arrival of local AI’s future.

One topic hasn’t left our ML engineering chat all week: how does GPT-OSS compare to the Gemma 3n architecture?

It started when OpenAI open-sourced some of its models. Then they dropped optimized kernels for them too.

Not even a day later, Ollama was already updating their docs: “Get up and running with GPT-OSS, DeepSeek-R1, Gemma 3 and other models.” … that was quick.
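
If you want to poke at the models yourself, here’s a minimal sketch using Ollama’s Python client. It assumes Ollama is running locally and that gpt-oss:20b matches a tag in your ollama list output; adjust if yours differs.

```python
# pip install ollama -- assumes a local Ollama server with gpt-oss pulled,
# e.g. via `ollama pull gpt-oss:20b` (tag name per Ollama's gpt-oss docs).
import ollama

response = ollama.chat(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Explain MXFP4 in two sentences."}],
)
print(response["message"]["content"])
```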

Naturally, we’ve been digging in: running tests, reading kernels, swapping notes. Inevitably, the conversation turned into a running comparison: GPT-OSS vs Google’s Gemma 3n architecture.


Two very different takes on the future of local AI

The differences aren’t just technical footnotes; they read like two competing philosophies.

GPT-OSS is a Mixture-of-Experts behemoth, trained natively in MXFP4 so the 120B model runs on a single H100 GPU and the 20B variant fits in 16 GB of memory.
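
A quick back-of-envelope shows why those numbers work out, assuming roughly 4.25 effective bits per weight in MXFP4 (4-bit values plus one shared 8-bit scale per 32-element block) and ignoring activations, KV cache, and any layers kept at higher precision:

```python
# Rough weight-memory math for MXFP4 checkpoints (assumptions above).
bits_per_param = 4 + 8 / 32  # 4-bit E2M1 values + 8-bit scale per 32-block = 4.25

for name, params in [("gpt-oss-120b", 120e9), ("gpt-oss-20b", 20e9)]:
    gib = params * bits_per_param / 8 / 2**30
    print(f"{name}: ~{gib:.0f} GiB of weights")

# -> ~59 GiB for the 120B (fits an 80 GB H100) and ~10 GiB for the 20B,
#    which leaves headroom inside a 16 GB budget.
```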

Gemma 3n is built on a Matryoshka Transformer (MatFormer), where every layer is active but the weights are nested, so you can pull out smaller, fully functional sub-models on the fly.
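
To make the nesting concrete, here’s a minimal PyTorch sketch of a Matryoshka-style feed-forward layer, where a smaller sub-model is just a prefix slice of the full weights. This is an illustration of the idea, not Gemma 3n’s actual implementation:

```python
import torch
import torch.nn as nn

class NestedFFN(nn.Module):
    """Matryoshka-style FFN: smaller sub-models reuse a prefix slice of
    the full hidden dimension, so one weight matrix serves several sizes."""
    def __init__(self, d_model: int = 512, d_hidden: int = 2048):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor, fraction: float = 1.0) -> torch.Tensor:
        h = int(self.w_in.out_features * fraction)
        # Use only the first h hidden units: slice rows of w_in and the
        # matching columns of w_out. fraction=1.0 is the full model.
        hidden = torch.relu(x @ self.w_in.weight[:h].T + self.w_in.bias[:h])
        return hidden @ self.w_out.weight[:, :h].T + self.w_out.bias

ffn = NestedFFN()
x = torch.randn(1, 512)
full = ffn(x, fraction=1.0)    # full-width sub-model
small = ffn(x, fraction=0.25)  # extracted quarter-width sub-model, same weights
```

The payoff is that extraction costs nothing at inference time: pick the fraction that fits your device and run.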

Gemma 3n is multimodal (text, images, audio, video), with an efficiency edge that comes from years of Google optimizing architectures and algorithms. GPT-OSS is text-only, leaning into raw size and hardware-aware scaling.

Benchmarks? Too early for real open-source head-to-heads. But the early signals are telling:

  • GPT-OSS: scale pushed to the hardware’s edge, hitting ~3,000 tok/s on Cerebras systems.

  • Gemma 3n: efficiency so tight you can get sub-0.5 s responses for 256 tokens on a laptop (see the quick normalization below).
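
To put those two numbers on the same axis, a quick normalization (assuming the laptop figure is pure generation time for 256 tokens, which likely flatters it a little):

```python
# Back-of-envelope: normalize both signals to tokens per second.
cerebras_tok_s = 3000        # reported gpt-oss-120b throughput on Cerebras
gemma_tok_s = 256 / 0.5      # 256 tokens in under 0.5 s -> at least 512 tok/s

print(f"gpt-oss-120b on Cerebras: ~{cerebras_tok_s} tok/s")
print(f"Gemma 3n on a laptop:    >={gemma_tok_s:.0f} tok/s")
```

Different hardware classes entirely, which is exactly the point: one number is a datacenter record, the other is a consumer device.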

Beyond our own on-the-ground testing, the headline numbers keep coming: Cerebras says it has deployed gpt-oss-120b on its hardware and is setting throughput records at around 3,000 tok/s.


From kernel drops to production-ready

It’s not about picking a winner right now, it’s about noticing what’s happening. For years, “local AI” felt like a someday conversation: too slow, too heavy, too fringe. Now, billion-parameter models are running on single GPUs. Quantization tricks are making local inference not just possible, but fast. Multimodal models are slipping into workflows without breaking them.
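
As one example of the kind of trick involved, here’s a minimal numpy sketch of the microscaling idea behind MXFP4: blocks of 32 values share a single power-of-two scale, and each value snaps to a 4-bit (E2M1) grid. Real MXFP4 packs these into bits and handles edge cases; this only shows the round-trip math:

```python
import numpy as np

# Signed E2M1 grid: the 4-bit float values MXFP4 can represent.
_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-_POS[::-1], _POS])

def mxfp4_roundtrip(block: np.ndarray) -> np.ndarray:
    """Quantize a 32-value block to MXFP4-style values and dequantize."""
    amax = max(np.abs(block).max(), 1e-12)
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0))   # shared power-of-two scale
    scaled = block / scale
    # Snap each scaled value to the nearest representable FP4 number.
    idx = np.abs(scaled[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return FP4_GRID[idx] * scale

w = np.random.randn(32).astype(np.float32)
w_hat = mxfp4_roundtrip(w)
print("max abs error:", np.abs(w - w_hat).max())
```

At ~4.25 bits per weight instead of 16, that’s roughly a 3.8x cut in weight memory, which is most of the story behind single-GPU 120B inference.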

These aren’t lab demos anymore. They’re becoming part of the toolbelt — the kind of thing you stop thinking of as “AI” and just start using.

This week’s GPT-OSS vs Gemma 3n back-and-forth is just one example of how fast the landscape is shifting. One release triggers another, kernels get patched, docs are updated, and suddenly the thing you were waiting for is already on your desk. What used to be a roadmap item is now something you can run today, in production, on local hardware, and without breaking your workflow.
