AI & LLM

Aug 13, 2025

How does gpt-oss compare to Gemma 3n architecture?

Inside our ML team’s week-long debate on OpenAI’s newly open-sourced gpt-oss models versus Google’s Gemma 3n architecture, from kernels and quantization tricks to efficiency, multimodality, and the quiet arrival of local AI’s future.

One topic hasn’t left our ML engineering chat all week: how does gpt-oss compare to the Gemma 3n architecture?

It started when OpenAI open-sourced its gpt-oss models. Then they dropped optimized kernels for them too.

Not even a day later, Ollama was already updating their docs: “Get up and running with GPT-OSS, DeepSeek-R1, Gemma 3 and other models.” … that was quick.
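If you want to kick the tires yourself, Ollama serves a local HTTP API on port 11434. Here’s a minimal sketch, assuming you’ve already pulled a gpt-oss tag (we use `gpt-oss:20b` below; check `ollama list` for what’s actually on your machine):

```python
# Minimal sketch: query a locally served gpt-oss model through Ollama's HTTP API.
# Assumes the Ollama daemon is running and gpt-oss:20b has been pulled.
import json
import urllib.request

payload = {
    "model": "gpt-oss:20b",  # swap for gpt-oss:120b if your hardware allows
    "prompt": "In two sentences: MoE vs. dense transformers?",
    "stream": False,         # one JSON object back instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",  # Ollama's default local endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```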

Naturally, we’ve been digging in: running tests, reading kernels, swapping notes. Inevitably, the conversation turned into a running comparison: gpt-oss vs. Google’s Gemma 3n architecture.


Two very different takes on the future of local AI

The differences aren’t just technical footnotes; they read like two competing philosophies.

gpt-oss is a Mixture-of-Experts behemoth, trained with native MXFP4 quantization so that the 120B model runs on a single 80GB H100 GPU and the 20B fits in 16GB of memory.
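The memory math is why that works. MXFP4 stores weights as 4-bit values in blocks of 32 that share a single 8-bit scale, which comes out to roughly 4.25 bits per parameter. A back-of-the-envelope sketch (parameter counts are approximate, and real checkpoints keep some tensors, like embeddings, in higher precision, so actual files run larger):

```python
# Back-of-the-envelope: why MXFP4 lets a ~120B-parameter MoE fit on one 80GB H100.
# Approximations for illustration, not exact checkpoint sizes.

def mxfp4_gib(params: float, block_size: int = 32) -> float:
    """Weight footprint in GiB at 4 bits/element plus one shared
    8-bit scale per block: 4 + 8/32 = 4.25 bits per parameter."""
    bits_per_param = 4 + 8 / block_size
    return params * bits_per_param / 8 / 2**30

for name, params in [("gpt-oss-120b", 117e9), ("gpt-oss-20b", 21e9)]:
    print(f"{name}: ~{mxfp4_gib(params):.0f} GiB in MXFP4 "
          f"vs ~{params * 2 / 2**30:.0f} GiB in BF16")
```

At roughly 58 GiB of weights for the 120B, there’s headroom left on an 80GB card for the KV cache, which is the whole trick.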

Gemma 3n is built on the MatFormer (Matryoshka Transformer) architecture: every layer is active, but the network is nested, so you can pull out smaller, fully functional sub-models on the fly.
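To make “nested” concrete, here’s a toy sketch of the Matryoshka idea on a single feed-forward layer: a smaller sub-model simply uses a prefix of the hidden dimension, so one set of trained weights contains several usable model sizes. This is our illustration of the general trick, not Gemma 3n’s actual implementation:

```python
# Toy Matryoshka (MatFormer-style) FFN: smaller sub-models reuse a *prefix*
# of the full hidden dimension, so the same weights serve several sizes.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 64, 256                      # full-size dimensions
W_in = rng.standard_normal((d_model, d_ff))  # one shared set of weights
W_out = rng.standard_normal((d_ff, d_model))

def ffn(x: np.ndarray, frac: float) -> np.ndarray:
    """Run the FFN using only the first `frac` of the hidden units."""
    m = int(d_ff * frac)                     # width of the nested sub-network
    h = np.maximum(x @ W_in[:, :m], 0.0)     # ReLU over the prefix slice
    return h @ W_out[:m, :]

x = rng.standard_normal((1, d_model))
for frac in (0.25, 0.5, 1.0):                # pull smaller models from the same weights
    print(f"{int(d_ff * frac):>3} hidden units -> output {ffn(x, frac).shape}")
```

Every slice produces a valid output of the same shape; training is what makes the smaller prefixes genuinely useful on their own.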

Gemma 3n is multimodal (text, images, audio, video), with an efficiency edge that comes from years of Google optimizing architectures and algorithms. gpt-oss is text-only, leaning into raw size and hardware-aware scaling.

Benchmarks? Too early for real open-source head-to-heads. But the early signals are telling:

  • gpt-oss: scale pushed to the edge of the hardware, hitting ~3,000 tok/s on Cerebras.

  • Gemma 3n: efficiency so tight you can get sub-0.5s responses for 256 tokens on a laptop.
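A quick sanity check on what those two claims imply side by side, ignoring prompt processing and time-to-first-token:

```python
# Rough arithmetic on the headline numbers above (decode speed only).
cerebras_tok_s = 3000                  # reported gpt-oss-120b rate on Cerebras
laptop_tokens, laptop_secs = 256, 0.5  # "256 tokens in under 0.5s on a laptop"

print(f"Cerebras: a {laptop_tokens}-token reply in ~{laptop_tokens / cerebras_tok_s:.2f}s")
print(f"The laptop claim implies >= {laptop_tokens / laptop_secs:.0f} tok/s from Gemma 3n")
```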

And beyond the on-the-ground feel from our own testing, the vendor numbers back it up: Cerebras reports it has deployed gpt-oss-120b on its hardware and is setting throughput records at ~3,000 tok/s.


From kernel drops to production-ready

It’s not about picking a winner right now; it’s about noticing what’s happening. For years, “local AI” felt like a someday conversation: too slow, too heavy, too fringe. Now, billion-parameter models are running on single GPUs. Quantization tricks are making local inference not just possible, but fast. Multimodal models are slipping into workflows without breaking them.

These aren’t lab demos anymore. They’re becoming part of the toolbelt — the kind of thing you stop thinking of as “AI” and just start using.

This week’s gpt-oss vs. Gemma 3n back-and-forth is just one example of how fast the landscape is shifting. One release triggers another, kernels get patched, docs are updated, and suddenly the thing you were waiting for is already on your desk. What used to be a roadmap item is now something you can run today, in production, on local hardware, and without breaking your workflow.
