
Product Updates

Jan 23, 2025


New for 2025 – more local models in Pieces, including Qwen Coder and Phi-4

Pieces has added more local models to power the copilot and long-term memory, including Qwen 2.5 Coder and Phi-4.

I wrote recently about our internal upgrade, changing how Pieces handles local models to use Ollama.

One reason for this change was to make it easier and quicker for us to bring new local models to Pieces. 

I’m excited to announce that we’ve just released our first batch of new models, including Qwen 2.5 Coder, a model that a lot of folks have asked for.
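Pieces manages downloading and running these models for you, but if you want a feel for what a model like Qwen 2.5 Coder can do on its own, here's a minimal sketch using Ollama's Python client (the `ollama` package). This is not how Pieces talks to Ollama internally – the model tag and prompt are purely illustrative, and it assumes you have Ollama installed and the model already pulled.

```python
# Minimal sketch, assuming the Ollama app is installed and running locally,
# and the model has already been pulled (e.g. `ollama pull qwen2.5-coder:7b`).
import ollama

response = ollama.chat(
    model="qwen2.5-coder:7b",  # illustrative tag; pick a size your machine can handle
    messages=[
        {"role": "user", "content": "Write a Python function that reverses a string."},
    ],
)

print(response["message"]["content"])
```

Inside Pieces you don't need any of this – the point is simply that these are ordinary Ollama models, so anything in the catalog below can also be explored directly.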


The local models available through Pieces

We’ve extended the model catalog to include new and updated models from Google, IBM, Meta, Microsoft, and Mistral, as well as adding models from Qwen and StarCoder.

Here’s our complete set.


Google

| Model Name | Parameters | Description | Ollama Model Page |
| --- | --- | --- | --- |
| Gemma 2 | 2B, 9B, 27B | Google Gemma 2 is a high-performing and efficient model, featuring a brand new architecture designed for class-leading performance and efficiency. | Gemma 2 |
| Gemma 1.1 | 2B, 7B | Gemma 1.1 is a new open model developed by Google and its DeepMind team. It's inspired by Gemini models at Google. | Gemma |
| CodeGemma | 7B | CodeGemma is a collection of powerful, lightweight models that can perform a variety of coding tasks like fill-in-the-middle code completion, code generation, natural language understanding, mathematical reasoning, and instruction following. | CodeGemma |


IBM

| Model Name | Parameters | Description | Ollama Model Page |
| --- | --- | --- | --- |
| Granite Code | 3B (2K context window), 3B (128K context window), 8B, 20B, 34B | Granite Code is a family of decoder-only code models designed for code generation tasks. | Granite Code |
| Granite 3.1 Dense | 2B, 8B | The IBM Granite 3.1 dense models are text-only dense LLMs trained on over 12 trillion tokens of data, demonstrating significant improvements over their predecessors in performance and speed in IBM's initial testing. They are designed to support tool-based use cases and retrieval augmented generation (RAG), streamlining code generation, translation, and bug fixing. | Granite 3.1 Dense |
| Granite 3 Dense | 2B, 8B | The IBM Granite 3 dense models are text-only dense LLMs trained on over 12 trillion tokens of data, demonstrating significant improvements over their predecessors in performance and speed in IBM's initial testing. Granite-8B-Instruct now rivals Llama 3.1 8B-Instruct across both the OpenLLM Leaderboard v1 and v2 benchmarks. | Granite 3 Dense |
| Granite 3 MoE | 1B, 3B | The IBM Granite 3 MoE models are long-context mixture of experts (MoE) models designed for low-latency usage. Trained on over 10 trillion tokens of data, they are ideal for deployment in on-device applications or situations requiring instantaneous inference. | Granite 3 MoE |


Meta

| Model Name | Parameters | Description | Ollama Model Page |
| --- | --- | --- | --- |
| Llama 3.2 | 1B, 3B | The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models (text in/text out). The Llama 3.2 instruction-tuned, text-only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks, and outperform many of the available open source and closed chat models on common industry benchmarks. | Llama 3.2 |
| Llama 3 | 8B | Meta Llama 3 is a family of state-of-the-art models developed by Meta. The Llama 3 instruction-tuned models are fine-tuned and optimized for dialogue/chat use cases and outperform many of the available open-source chat models on common benchmarks. | Llama 3 |
| Llama 2 | 7B, 13B | Llama 2 is released by Meta Platforms, Inc. The model is trained on 2 trillion tokens and supports a context length of 4,096 by default. The Llama 2 Chat models are fine-tuned on over 1 million human annotations and are made for chat. | Llama 2 |
| CodeLlama | 7B, 13B, 34B | Code Llama is a model for generating and discussing code, built on top of Llama 2. It's designed to make workflows faster and more efficient for developers and to make it easier for people to learn how to code. It can generate both code and natural language about code, and supports many of the most popular programming languages used today, including Python, C++, Java, PHP, TypeScript (JavaScript), C#, Bash, and more. | CodeLlama |


Microsoft

| Model Name | Parameters | Description | Ollama Model Page |
| --- | --- | --- | --- |
| Phi-4 | 14B | Phi-4 is a 14B parameter, state-of-the-art open model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. | Phi-4 |
| Phi-3.5 Mini | 3.8B | Phi-3.5-mini is a lightweight, state-of-the-art open model built upon the datasets used for Phi-3: synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data. | Phi-3.5 |
| Phi-3 | 3.8B (Mini), 14B (Medium) | Phi-3 is a family of open AI models developed by Microsoft, available with 4K and 128K token context windows. | Phi-3 |
| Phi-2 | 2.7B | Phi-2 is a small language model capable of common-sense reasoning and language understanding. It showcases “state-of-the-art performance” among language models with fewer than 13 billion parameters. | Phi-2 |


Mistral

| Model Name | Parameters | Description | Ollama Model Page |
| --- | --- | --- | --- |
| Mixtral 8x7B | 8x7B | The Mixtral large language models (LLMs) are a set of pretrained generative sparse mixture-of-experts models. | Mixtral |
| Mistral | 7B | Mistral is a 7B parameter model, distributed under the Apache license. It is available in both instruct (instruction following) and text completion variants. | Mistral |


Qwen

| Model Name | Parameters | Description | Ollama Model Page |
| --- | --- | --- | --- |
| QwQ Preview | 32B | QwQ is an experimental research model focused on advancing AI reasoning capabilities. | QwQ |
| Qwen 2.5 Coder | 0.5B, 1.5B, 3B, 7B, 14B, 32B | The latest series of Code-Specific Qwen models, with significant improvements in code generation, code reasoning, and code fixing. | Qwen 2.5 Coder |


StarCoder

| Model Name | Parameters | Description | Ollama Model Page |
| --- | --- | --- | --- |
| StarCoder 2 | 15B | StarCoder2 is the next generation of transparently trained open code LLMs. | StarCoder 2 |


Why use local models?

If you’ve not come across local models before, let’s take a moment to dive into them. Put simply, these are LLMs that run on your device instead of in the cloud.

I’m a huge fan of local models. It feels like they are getting more powerful every week, and the gap between the quality of results from a model you can run on your local machine and a model that runs in the cloud on racks of hardware worth hundreds of thousands of dollars is getting smaller and smaller.

What are the benefits of running AI locally?

To me, there are three big advantages to local models.

  • AI Governance – Companies are putting restrictions on AI usage to stop private data being sent to cloud-based LLM providers. Some companies enforce local models only, so no private data leaves your device.

  • Offline use – Local models run entirely on your device, so once a model is downloaded you can keep using it without an internet connection.

  • Environmental impact – AI has a huge power need, leading to large amounts of carbon emissions contributing to climate change. On-device AI has a much lower power need, making it greener.


What are the disadvantages of local models?

Local models are great but not perfect. The disadvantages are:

  • You need a powerful computer – To run a local model you need a reasonably beefy machine. For every billion parameters you will need a GB of RAM on top of what your system is already using. For example, if you want to run a 7B parameter model then you will need at least 8GB of RAM, if not more depending on what else you are running (there's a rough sizing sketch after this list). Ollama takes advantage of GPUs, such as those from NVIDIA, or the GPU cores built into Apple Silicon processors.

  • These models run slower – Local models run slower than cloud models. This is down to the hardware – no matter what GPU you are running, it will be less powerful than a rack of dedicated AI server-grade GPUs.

  • The results may not be as good – In general, the results from a local model will not be as good as those from a cloud model. Local models are smaller, meaning they have encoded less information, although they are getting better all the time. This is especially true of local models trained for a specific task: because their corpus of information is smaller, they don't need to be as large. For example, models that are trained to code need little knowledge of tourist attractions in Paris!
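If you want to sanity-check whether a model will fit before you download it, here's a rough back-of-the-envelope sketch using the one-GB-per-billion-parameters rule of thumb above. The model list and the system overhead figure are just illustrative assumptions; real usage depends on how the model is packaged and what else your machine is running.

```python
# Rough back-of-the-envelope RAM check, using the "about 1 GB of RAM per
# billion parameters, on top of what your system is already using" rule of
# thumb from above. These are estimates, not measurements.

models_in_billions = {
    "Qwen 2.5 Coder 7B": 7,
    "Phi-4 14B": 14,
    "Gemma 2 27B": 27,
}

system_usage_gb = 8  # assumed RAM already in use by the OS and other apps

for name, params_b in models_in_billions.items():
    model_gb = params_b  # ~1 GB per billion parameters
    print(f"{name}: ~{model_gb} GB for the model, ~{model_gb + system_usage_gb} GB total")
```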


Take these models for a spin!

Pieces supports local models the same way it supports cloud models – everything you can do in Pieces with Claude, for example, you can do locally using Llama, Qwen, or Phi-4. Pieces manages everything for you, so you can enable a local model with a couple of clicks. Once you have a local model downloaded and activated, you can use it offline.

Give these new models a spin, and let me know your thoughts on X, Bluesky, LinkedIn, or our Discord. And if you are not using Pieces yet, give it a try for free!
