IBM outperforms OpenAI? What 50 LLM tests revealed
50 LLMs benchmarked for real dev tasks: see how IBM outperformed OpenAI and what it means for the future of AI tooling.
There is a huge range of LLMs (Large Language Models) available these days, with the big names in foundation models being OpenAI, Google, and Anthropic.
What a lot of LLM users don’t realize is that there are a lot more models out there from companies like Microsoft, IBM, and Meta. These can be as good as, if not better than, the big-name models.
In this article, we'll build an LLM leaderboard and look at the rankings to see how the various models stack up against each other.
Pieces offers a range of LLMs for you to use from providers like OpenAI, Microsoft, IBM, Google, and more, with both cloud and on-device models available.
I decided to put most of them through their paces by running a series of LLM benchmarks to see which models were the best.
👀🔥 The results were eye-opening, to say the least, with some fantastic showings from models you may not expect, and some big names not doing so well on the AI leaderboard.
Read on for my results and conclusions.
Which is the best model to use?
When you interact with the Pieces Copilot, behind the scenes, your prompts and context are sent to a Large Language Model, or LLM. This includes the chat history, any files or folders added as context, and, of course, the relevant information extracted from the Long-Term Memory, which uses retrieval-augmented generation to enhance the model's responses.
Pieces offers a range of LLMs to choose from, including some of the latest cloud-hosted models from OpenAI, Anthropic, and Google.
It also offers fully managed on-device SLMs, or small language models, that are run completely locally from a wide range of providers, such as Microsoft, IBM, Google, Meta, and more.
This diverse selection includes both proprietary and open-source models, catering to various needs and preferences.
We often get asked questions like "Which is the best model to use?".
The answer is very subjective, as models have different styles of output. Some are more verbose, some have better knowledge of particular topics.
➡️ Local models offer privacy at the expense of speed, and require powerful hardware.
⬅️ Cloud models need an internet connection, and so on.
Factors like cost-efficiency and suitability for enterprise applications also play a role in model selection.
So I thought it would be fun to build some automated testing for all the models and score them based on how well they answered a range of questions, as well as how quickly they ran. This LLM comparison would help identify the top LLMs across various criteria.
Which one is the winner? Keep reading to find out!

To test each LLM, I built a small test harness using the Pieces C# SDK. This harness would run a range of tests multiple times, then report an average score for the accuracy of the results, along with the average time to first token and the average time to complete the response.
These AI benchmarks were designed to evaluate various aspects of model performance.
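As a rough illustration of how those timings can be captured, here is a minimal C# sketch. This is not the actual harness code: the `IAsyncEnumerable<string>` token stream is a stand-in for whatever streaming API the SDK exposes, but the idea of timing the first token and the complete response is the same.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;
using System.Threading.Tasks;

// Minimal sketch (not the real harness): capture time to first token and
// total response time from any streamed LLM response.
public record TimingResult(TimeSpan TimeToFirstToken, TimeSpan TotalTime, string Response);

public static class ResponseTimer
{
    public static async Task<TimingResult> MeasureAsync(IAsyncEnumerable<string> tokenStream)
    {
        var stopwatch = Stopwatch.StartNew();
        TimeSpan? firstToken = null;
        var buffer = new StringBuilder();

        await foreach (var token in tokenStream)
        {
            firstToken ??= stopwatch.Elapsed; // record only the first token's arrival
            buffer.Append(token);
        }

        stopwatch.Stop();
        return new TimingResult(firstToken ?? stopwatch.Elapsed, stopwatch.Elapsed, buffer.ToString());
    }
}
```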
The test scenarios
I built tests for the following scenarios:
Converting JSON to a markdown table
Reading about a NuGet package in a browser and answering questions on it
Summarizing an email chain
Suggesting code changes based on reading a GitHub issue in a browser and using a folder of code
Extracting information from a Reddit conversation in a browser
Providing a fix for a warning in VS Code
Apart from the conversion of JSON data to markdown, these tests all used the Pieces Long-Term Memory, leveraging retrieval-augmented generation capabilities.
The test setup
All tests were run on my M3 MacBook Air, with 24GB RAM.
I tested all the cloud LLMs that Pieces supports, along with any local LLM that has 15B parameters or fewer. Anything larger than that is too big to run on my computer, so I was unable to test Granite Code 34B or QwQ 32B, for example.
💡This limitation highlights the importance of considering parameter count and hardware requirements when selecting models for specific use cases.
To run each scenario, the test harness would do the following (there's a rough code sketch at the end of this section):
Clear my Pieces LTM so every test starts from the same place
Download the local model if needed (cue Jim deleting a load of things to make space)
Run the scenario, capturing the time to the first token and the time to get the complete response
Evaluate the response by looking for certain specific text in the output
Repeat the process 5 times and get an average
This is not a perfect test – it would take a human to truly evaluate the response, but this doesn’t scale and is very subjective, hence looking for specific text in the output.
This test also took a long time to run. All cloud models could be evaluated in an hour or so. For the on-device models, it took an entire weekend!
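For the curious, here is a hedged sketch of that loop in C#, reusing the `TimingResult` record from the earlier snippet. The delegates are hypothetical stand-ins rather than real Pieces C# SDK calls, and the model download step is omitted, but the scoring idea is the one described above: count how many expected phrases appear in the response, repeat 5 times, and average.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical scenario runner – the delegates stand in for the real SDK calls.
public static class ScenarioRunner
{
    public static async Task<(double Accuracy, TimeSpan AvgFirstToken, TimeSpan AvgTotal)> RunAsync(
        Func<Task> clearLongTermMemory,        // step 1: reset the LTM so every run starts clean
        Func<Task<TimingResult>> runScenario,  // step 3: run the prompt, capturing timings
        string[] expectedPhrases,              // step 4: text a good answer should contain
        int runs = 5)                          // step 5: repeat and average
    {
        var results = new List<(double Score, TimingResult Timing)>();

        for (var i = 0; i < runs; i++)
        {
            await clearLongTermMemory();
            var timing = await runScenario();

            // Accuracy = fraction of expected phrases that appear in the response
            var score = expectedPhrases.Count(p =>
                timing.Response.Contains(p, StringComparison.OrdinalIgnoreCase))
                / (double)expectedPhrases.Length;

            results.Add((score, timing));
        }

        return (
            results.Average(r => r.Score),
            TimeSpan.FromSeconds(results.Average(r => r.Timing.TimeToFirstToken.TotalSeconds)),
            TimeSpan.FromSeconds(results.Average(r => r.Timing.TotalTime.TotalSeconds)));
    }
}
```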
The models under test
This is the complete list of all 50 models tested, including some of the latest LLM releases:
| Cloud Model Providers | Cloud Models | Local Model Providers | Local Models |
| --- | --- | --- | --- |
| Anthropic | Claude 3 Haiku Chat | Google | Code Gemma 1.1 7B |
| | Claude 3 Opus Chat | | Gemma 1.1 2B |
| | Claude 3 Sonnet Chat | | Gemma 1.1 7B |
| | Claude 3.5 Sonnet Chat | | Gemma 2 2B |
| Google | Codey (PaLM2) Chat | | Gemma 2 9B |
| | PaLM2 Chat | IBM | Granite 3 Dense 2B |
| | Gemini Chat | | Granite 3 Dense 8B |
| | Gemini-1.5 Flash Chat | | Granite 3 MOE 1B |
| | Gemini-1.5 Pro Chat | | Granite 3 MOE 3B |
| | Gemini-2.0 Flash Experimental Chat | | Granite 3.1 Dense 2B |
| OpenAI | GPT-3.5-turbo Chat | | Granite 3.1 Dense 8B |
| | GPT-4 Chat | | Granite 3B Code |
| | GPT-4 Turbo Chat | | Granite 3B 128k Code |
| | GPT-4o Chat | | Granite 8B Code |
| | GPT-4o Mini Chat | Meta | CodeLlama 7B Chat |
| | | | CodeLlama 13B Chat |
| | | | Llama 2 7B Chat |
| | | | Llama 2 13B Chat |
| | | | Llama 3 8B |
| | | | Llama 3.2 1B Chat |
| | | | Llama 3.2 3B Chat |
| | | Microsoft | Phi-2 Chat |
| | | | Phi-3 Mini 4k |
| | | | Phi-3 Mini 128k |
| | | | Phi-3 Medium 4k |
| | | | Phi-3 Medium 128k |
| | | | Phi-3.5 Mini |
| | | | Phi-4 |
| | | Mistral | Mistral 7B Chat |
| | | Qwen | Qwen 2.5 Coder 0.5B |
| | | | Qwen 2.5 Coder 1.5B |
| | | | Qwen 2.5 Coder 3B |
| | | | Qwen 2.5 Coder 7B |
| | | | Qwen 2.5 Coder 14B |
| | | StarCoder | StarCoder-2 15B |
Why do different models give different results?
Before I give the results, I do want to take a moment to look at why different models will give different results.

We can’t always send the same data to each model
From the Pieces side, running the same prompt with the same context available will mean that we have the same data to send to the model.
But we can’t always send the same data.
The differences are:
The system prompt – we need to use a different system prompt for different models to guide each one most effectively. Reasoning models need different system prompts than non-reasoning models to get the best results.
The context size – when we package up data from the LTM, we are limited by the context window size of the model. If you are using a model with a 4K token context window, then we can only send around 4,000 tokens. This limits the context from the LTM that we can send, compared to a model with a 128K token context window, where we can send 32x as many tokens (see the sketch below).
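To make the context-size point concrete, here is an illustrative sketch of packing LTM snippets into a token budget. The 4-characters-per-token estimate is only a rule of thumb, and the names here are mine, not the Pieces SDK's.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: trim Long-Term Memory snippets to fit a model's context budget.
public static class ContextPacker
{
    // Rough heuristic – real tokenisation is model-specific.
    private static int EstimateTokens(string text) => text.Length / 4;

    public static List<string> Pack(
        IEnumerable<string> snippets,       // most relevant snippets first
        int contextWindowTokens,            // e.g. 4_096 vs 128_000
        int reservedForPromptAndAnswer)     // leave room for the prompt and the reply
    {
        var budget = contextWindowTokens - reservedForPromptAndAnswer;
        var packed = new List<string>();

        foreach (var snippet in snippets)
        {
            var cost = EstimateTokens(snippet);
            if (cost > budget) break;       // a 4K window fills up ~32x sooner than a 128K one
            packed.Add(snippet);
            budget -= cost;
        }
        return packed;
    }
}
```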
Models are trained on different data
The training data for each model is different, with newer models having newer training data.
Some models are trained on limited sets of data; for example, code models are trained more on code than other information (so great with Python, not so great with the history of French literature).
Some models are even trained on textbooks or synthetic textbooks.
Models work differently
Each type and generation of model works differently as the model creators tune and improve the models.
Different model sizes affect the model's capability
The more parameters, the more capable the model is (in theory). Expect the same model with fewer parameters to not be as good as one with more. The largest LLMs often demonstrate superior performance on complex tasks.
Local model speed is dependent on hardware
The number of parameters and the context size of a local model determines how much hardware is needed – if it needs too much VRAM for the GPU it’s running on, it will partly drop to CPU, which is very, very slow.
Larger models will run very slowly on the hardware I have but may be faster on machines with more VRAM. This affects token processing speed and overall model performance.
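As a back-of-the-envelope guide (my own assumptions, not a vendor formula), you can estimate the memory a local model needs from its parameter count and quantization:

```csharp
using System;

// Rough heuristic only: bytes per parameter depends on quantization, and the
// KV cache grows with context length, so treat this as a rule of thumb.
double EstimateMemoryGb(
    double parametersBillions,
    double bytesPerParameter = 0.5, // ~4-bit quantization
    double overheadGb = 2)          // runtime + KV cache allowance (assumed)
    => parametersBillions * bytesPerParameter + overheadGb;

// e.g. a 14B model at ~4-bit: 14 * 0.5 + 2 ≈ 9 GB, before the OS and other
// apps take their share of RAM or VRAM.
Console.WriteLine($"{EstimateMemoryGb(14):F1} GB");
```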
The results
So, which model won?
Well, ‘won’ is a subjective measure – these models were tested for accuracy and performance, so the results depend on what you are most interested in.
I’ve split the results into 4 categories: the most accurate, the fastest to first token, the fastest response, and the best overall.

The most accurate – Phi-4: 82%

Runners-up: GPT-4o (78%), Granite 3.1 Dense 8B (78%)
The most accurate model was Microsoft’s Phi-4. This is a very capable model that can run on your local device. Yup, the best wasn’t a huge cloud model running on racks of NVIDIA GPUs; it was a model you can download at less than 10GB and run on a reasonably capable machine with 20GB of RAM or VRAM.
It’s interesting to see 2 local models in the top 3 – with Granite 3.1 Dense 8B from IBM coming in barely behind GPT-4o, a massive cloud LLM from OpenAI. Having an 8B parameter local model from IBM outdo most of the models from OpenAI, and all the models from Anthropic and Google was very interesting to see.
This demonstrates that some of the best LLM models are not necessarily the largest or most well-known.
The fastest to first token – Claude 3 Opus: 2.2s

Runners-up: Gemini 2.0 Flash Experimental (2.4s), Gemini 1.5 Flash (2.5s)
Claude 3 Opus is blisteringly fast, bringing back the first token in 2.2s, making it slightly faster than all the other cloud models.
In general, though, the cloud models were fast, with the slowest cloud model, GPT-4 Chat, returning its first token only 0.9s slower than Claude 3 Opus. That speed also came with good accuracy, at 72%.
Local models were noticeably slower.
The fastest local model was Code Gemma 1.1 7B at 7s to first token, but with a terrible accuracy of 5%.
Although Phi-4 is the most accurate, it was also one of the slowest, coming in as the 3rd slowest in my testing at just over 1min 42s.
To be fair to the model, it is a 14B parameter model, so it is pushing the limits of my hardware and would do much better on a device with a massive GPU.
The fastest to a complete response – Gemini 1.5 Flash: 1.6s

Runners-up: Gemini 2.0 Flash (1.7s), PaLM2 (1.9s)
Gemini 1.5 Flash came third for the time to first token but brought back the complete result the fastest.
The response times of the cloud models covered a wider range than the times to first token, with the slowest being GPT-4 Turbo Chat at 11.8s.
Local models again were slower.

The fastest local model to a full response was Granite 3 MOE 1B. Being a small model, hence the speed, it returned the response in a respectable 4.5s, but with only 13% accuracy. Phi-4, our most accurate model, was again very slow at over 2min 14s.
The overall winner
Working out the overall winner is hard – a lot of it depends on what is the most important to you. Would you sacrifice speed for accuracy, for example?
To work out the winners, I decided that accuracy was more important – so my not-very-scientific method (sketched in code below) was:
Score the models from 50-1 in order of accuracy, time to first token, and total response time.
Add these scores up, but double the accuracy score as this is the most important component.
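In code, that scoring looks something like the sketch below, with a hypothetical result type rather than the harness's real one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical result type – not the harness's real one.
public record ModelResult(string Name, double Accuracy, double TimeToFirstTokenSeconds, double TotalTimeSeconds);

public static class OverallScorer
{
    // Rank score: best model gets a score equal to the number of models (50 here), worst gets 1.
    private static Dictionary<string, int> RankScores(
        IReadOnlyList<ModelResult> models, Func<ModelResult, double> metric, bool higherIsBetter)
    {
        var ordered = higherIsBetter
            ? models.OrderByDescending(metric).ToList()
            : models.OrderBy(metric).ToList();

        return ordered
            .Select((m, i) => (m.Name, Score: ordered.Count - i))
            .ToDictionary(x => x.Name, x => x.Score);
    }

    public static IEnumerable<(string Name, int Score)> Score(IReadOnlyList<ModelResult> models)
    {
        var accuracy = RankScores(models, m => m.Accuracy, higherIsBetter: true);
        var firstToken = RankScores(models, m => m.TimeToFirstTokenSeconds, higherIsBetter: false);
        var total = RankScores(models, m => m.TotalTimeSeconds, higherIsBetter: false);

        // Accuracy counts double, as it's the most important component.
        return models
            .Select(m => (m.Name, Score: 2 * accuracy[m.Name] + firstToken[m.Name] + total[m.Name]))
            .OrderByDescending(x => x.Score);
    }
}
```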

GPT-4o from OpenAI
This came second for accuracy, which played a big part in it coming first.
For performance, it was towards the top of the pack: 14th for time to first token, and 11th for time to the complete response.
The runners-up are:
GPT-4o mini and PaLM2.
GPT-4o mini makes sense – it is smaller than GPT-4o, making it slightly faster, and although that speed sacrifices some quality, its accuracy was not far behind, at 77% versus 78% for GPT-4o.
PaLM2 makes a surprising appearance here, though.
This is an older model from Google that has been deprecated since October 2024, yet it outdoes the more modern Google Gemini models. It is fast and has an accuracy of 73%.
So, what model should you use?
Despite all this, the answer is “it depends”.
If you want privacy and have a powerful machine, then Phi-4 might be your best bet.
If you need speed, then try the Gemini models. You’ll also get different answers based on your use case – for example, some developers swear by Qwen Coder for its coding capabilities, while others prefer Claude 3.5 Sonnet.
I was surprised not to see Llama 3.2 coming in very high. It’s a capable model, but in my testing it was outclassed, coming in 19th out of the 50 models.
My personal favorites are Granite 3.1 Dense 8B for on-device (I love the results from Phi-4, but it’s too slow on my machine).
For the cloud, GPT-4o mini gives me a nice balance of speed and accuracy. So, wins for IBM and OpenAI.
Both are very good with the Pieces Long-Term Memory, as well as day-to-day tasks. And keep an eye out for new models; we have more coming all the time, including models with improved multimodal capabilities and enhanced reasoning abilities.
Your best bet, however, is to try them all and see which one works for you.
Download Pieces today if you don’t already have it, and put the models through their paces – especially leveraging LTM-2.
And please share your thoughts on your favorite model with us at Pieces on X, Bluesky, LinkedIn, or our Discord.
Disclaimer: This comparison of popular LLMs highlights the rapid evolution of natural language processing technology. With new models like DeepSeek-R1, Grok 3, and Llama 3.3 constantly emerging, the landscape is changing fast. Whether you're exploring open-source options under Apache 2.0 or proprietary solutions for enterprise use, the LLM space is dynamic and diverse. Be sure to keep an eye on our updates, as we'll continue tracking the latest advancements and shifts in this fast-moving field.
