Pieces now powered by Ollama for enhanced local model integration
Chat with LLMs offline or in secure environments using Pieces. Seamlessly switch between cloud and local models, now powered by Ollama.
We’re excited to announce that running local models with Pieces is now powered by Ollama!
Pieces is the only AI developer assistant that has a deep managed integration with local models, allowing you to chat with an LLM offline, or in privacy or security-focused environments.
This deep integration is completely seamless for you as a user, you can select the model of your choice, from the massive cloud models like Gemini, Claude, or GPT-4o, then with a couple of clicks, change to an on-device model running locally with access to all the context you would want, from file and folders of code to the Pieces Long-Term Memory.
You can even do this mid-conversation, starting a discussion with Claude, then switching to Llama if you go offline, such as on a plane, and continuing the same conversation with the same context.
And now, we’re excited to announce that the Pieces local models are even better, powered by Ollama!
Improved GPU integration
We’ve heard from a number of users that they haven’t had the best experience using their GPU with local models.
Most of the time it works, but there have been a few users who have just been unable to run on their GPU. By using Ollama, all these hiccups should go away.
Anyone with a supported GPU will immediately get the benefit of faster chats with local LLMs, and as Ollama adds support for more backends, these will be available to Pieces as soon as Ollama is updated.
Using a GPU makes local models substantially faster, with a reduced impact on your system.
All the hard work is offloaded, so it reduces the impact on your system RAM and CPU keeping your machine more responsive.
To get an idea of the speed, here’s the same question running on Llama 3, using just a CPU,
and using an NVIDIA.
In this video, the left view is the Pieces Desktop app using just a CPU, and on the right using an NVIDIA GeForce RTX 2050 Ti Laptop GPU. The question is asked, and at about 4 seconds in the copilot starts work.
For the GPU, the response starts streaming back after 1 second. For the CPU, the response takes over 11 seconds to start streaming, and the speed at which the tokens are coming is noticeably slower.
More efficient use of CPU and VRAM
Local LLMs are large, hence the name of large language models. Even the local ones, sometimes known as SLMs, or small language models, for example, phi-3-mini, are still pretty big.
For the best performance, the model needs to be entirely loaded into the VRAM of the GPU. However, what if the model is bigger than the available VRAM?
Ollama can handle this automatically, moving parts of the model to system RAM as needed.
Your GPU works with the model components that are in VRAM, and your CPU handles those in system RAM, as this is generally faster than swapping memory in and out.
Before Ollama, we had to do these calculations ourselves, and may not have always got it right.
Now with our Ollama integration, you’ll get the most out of the available VRAM, ensuring your models run as fast as possible.
Faster time to new models
Models are being released all the time, and it feels like an almost weekly request to support the latest shiny new models.
By leveraging Ollama, we can bring these models to you faster, thanks to a consistent prompting interface, and ease of deployment.
The effort on our side to validate the model with the context that we send is reduced, so expect our model catalog to grow rapidly. We’ve already got plans to drop some new models very soon, so watch this space, but please reach out to us if you have requests for specific models!
Oh, and a sneak peek – we’ve had folks asking about bringing their own local models for environments where their AI governance rules only allow for local models. Watch this space…
What does this mean for you as a user?
When you upgrade to Pieces 11, there will be an additional automated step in the upgrade process to install Ollama locally if you don’t already have it running.
Once installed, you will then need to re-download any of the models that you were using. Pieces will automatically manage Ollama for you, spinning it up when needed, allowing you to enjoy their cute llama in your taskbar or menu.
That’s it, you’ll then be able to take advantage of better GPU acceleration, and better use of your system resources.
What’s next?
If you use local models, then Pieces is by far the best option for your developer copilot, bringing features like Long-Term Memory and custom context to models that can run completely offline.
With our new version, powered by Ollama, we’ve not only made Pieces faster, but we’re ready to add more of the models you care about to boost your productivity as a developer.
I can’t wait to see what you build powered by Pieces.