
Feb 5, 2025

Equip yourself to stay competitive at work with the best LLM for coding

Discover the best LLM for coding - whether you’re generating code or just asking questions, understanding cloud vs local LLMs can make you more effective.


The LLM space is growing rapidly, with new or updated models appearing almost weekly. As developers, we’ve embraced LLMs to help us code faster, letting the LLM generate the code it can write so that we can focus on the code only we humans can write. But which is the best LLM for coding? How do we decide which one to use? Do we use a cloud model or a local model?

In this post, I’ll put a number of LLMs through their paces with a set of coding challenges to see how they fare – all through Pieces, of course. One of the big upsides of Pieces is that you can choose from a huge range of models, and even switch mid-conversation.

The models

We have 3 models competing to be the best – all cloud models. I’m not comparing offline AI here, as you get different results from cloud LLMs compared to local SLMs. Our competitors are:

| Model             | Provider  | Cloud or Local | Notes                                                                                                                                                |
|-------------------|-----------|----------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
| Gemini 2 Flash    | Google    | Cloud          | The latest experimental model from Google, with low latency and enhanced performance, designed to power agentic experiences.                          |
| GPT-4o mini       | OpenAI    | Cloud          | A cost-efficient, low-latency model from OpenAI that outperforms GPT-4, and is orders of magnitude cheaper than previous OpenAI models.               |
| Claude 3.5 Sonnet | Anthropic | Cloud          | The latest model from Anthropic, topping the benchmarks when it was released. A popular model for code generation for a time, with almost double the success rate of previous Anthropic models. |


The challenges

To test these models, I’ve devised a set of challenges that mimic real-world developer scenarios – some covering boilerplate code generation, others focusing on solving harder problems. I’ll be approaching these in C#, to test how well the models do with a popular language used across a wide range of enterprise companies (and it’s my favourite language as well).

One point to note – I am testing the same prompt with all the models. You may get better results with different prompts, by iterating on your prompts, or by using prompt chaining to fix issues from the first prompt. Some of these are generic prompts; others are specific to the Pieces Long-Term Memory.

Challenge 1 – SDK generation from an OpenAPI spec

A fairly regular task is interacting with an API. 

A good API should have an OpenAPI spec, defining the endpoints, what verbs are supported, the format of the data you send and receive, and so on. Whilst there are tools to generate SDKs from OpenAPI specs, the advantage of using an LLM is that you can have more control over the generation – using specific HTTP libraries, ignoring endpoints, or conforming to certain naming standards. 

This is a tricky challenge, as the models need to understand a lot of JSON and create a reasonable amount of code – a task that models with larger context windows will probably handle better.

I’m going to use the Swagger PetStore – this is a canonical example for SDK generation, with a small API surface area. This defines an API with 3 groups of endpoints defined by tags - for pets, a store, and users. Each endpoint sends or receives a JSON body defined as a schema.

In Pieces, for each model I’ll create a new Pieces Copilot chat, adding the OpenAPI spec as file context.

Prompt:
This is a swagger spec for the petstore API. I need a C# SDK for this. Provide all the code for the SDK, including:

- Classes that represent all the schemas, with correctly cased properties using attributes to define the JSON property names

- One class per tag that provides access to all the API endpoints. Name the functions using the operation IDs, using the correct casing for C#

- Handling all errors correctly

Gemini 2 Flash

The LLM generated 4 code files – one with all the models for the schemas, then one for each of the 3 groups of endpoints.

The code wasn’t perfect:

  • Quite a few endpoints, across all 3 tags, did not have code generated for them.

  • The code uses Newtonsoft.Json, not System.Text.Json, which is the newer, preferred JSON library (a cleaned-up model sketch follows this list).

  • No instructions were provided for installing NuGet packages – you need to install Microsoft.AspNet.WebApi.Client and Newtonsoft.Json.

  • No namespaces.

  • No sample code showing how to run this code.

  • In the endpoint methods, the URL fragment starts with a leading slash. The design here is to specify a base URL when you create an HttpClient that is passed to the API (C# likes to share an instance of HttpClient), with the fragment appended. However, the fragment must not start with a leading slash, otherwise the route is not resolved correctly. This means all the methods have to be modified before the code is run. It’s a particularly nasty bug, as the error that comes back is a 404 not found, so it’s hard to track down.

  • Once these issues were resolved, I was able to generate sample code that worked, but I would still need to generate the remaining endpoints.
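To make the first few fixes concrete, here’s a minimal sketch of what one of the schema models could look like once cleaned up – switched to System.Text.Json attributes and wrapped in a namespace. The property names come from the PetStore spec; the namespace is just an assumption for the example.

```csharp
using System.Collections.Generic;
using System.Text.Json.Serialization;

// Hypothetical namespace - the generated code had none.
namespace PetStore.Sdk.Models;

// The Pet schema from the PetStore spec, with C#-cased properties
// mapped to the camelCase JSON names via System.Text.Json attributes.
public class Pet
{
    [JsonPropertyName("id")]
    public long Id { get; set; }

    [JsonPropertyName("name")]
    public string Name { get; set; } = string.Empty;

    [JsonPropertyName("photoUrls")]
    public List<string> PhotoUrls { get; set; } = new();

    [JsonPropertyName("status")]
    public string? Status { get; set; }
}
```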

GPT-4o mini

The LLM generated 2 code files – one with all the models, and one with all the APIs. It also provided example code to call a method on the API.

The code wasn’t perfect:

  • The code uses Newtonsoft.Json, not System.Text.Json, which is the newer, preferred JSON library.

  • No instructions were provided for installing NuGet packages – you need to install Microsoft.AspNet.WebApi.Client and Newtonsoft.Json.

  • No namespaces.

  • The code called the PostAsJsonAsync method on the HttpClient without the relevant using directive (see the sketch after this list for the likely candidates).

  • As with Gemini, the endpoint fragments have a leading slash. In addition, the example code has a base URL without a trailing slash, which is required for the code to work.

  • Once these issues were fixed, the sample code provided ran fine, accessing the API correctly. I was also able to run a few other methods on the SDK without issues.
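Which using directive is missing depends on which JSON stack the generated code targets – a hedged sketch, since the post doesn’t reproduce the full generated file:

```csharp
// If the code targets the modern System.Text.Json stack, the
// PostAsJsonAsync extension method (built into .NET 5+) lives here:
using System.Net.Http.Json;

// If it targets Newtonsoft.Json via the Microsoft.AspNet.WebApi.Client
// package, its PostAsJsonAsync and ReadAsAsync extensions live in the
// System.Net.Http namespace itself, so this directive plus the package
// install is enough:
using System.Net.Http;
```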

Claude 3.5 Sonnet

Here the LLM’s output was pretty incomplete. It started by suggesting that I use a tool like the OpenAPI Generator.

Then it provided only examples – the model class for just one schema, a single API class with just one endpoint, and an example that called the one generated method.

The one area where this model did well is that it provided namespaces – unlike the previous 2 models.

Of the little code that was provided:

  • It needed a NuGet package that wasn’t mentioned in the using directives, and no instructions on installing it were provided, leaving you to guess which one is needed.

  • The same leading/trailing slash issue was present.

  • The code assumed you would fill in the missing bits – for example, a method that returned a Pet without Pet actually being defined anywhere.

This code could not be run without a lot of work.

The winner

The winner here is GPT-4o mini. The results were not perfect – probably a 7 out of 10 – but the generated code was the most complete and took the least amount of effort to run.

Challenge 2 – Find a bug

In the code generated by Gemini and GPT-4o mini in the previous challenge, there was a nasty bug where none of the API endpoints would work.

Both LLMs suggested an API class where an HttpClient was created up front and passed to the API so that it could be shared. This is a C# best practice. The base URL needed to be set when creating the HttpClient.

Again, a best practice, as that way the API can be repointed to different URLs, such as a staging or beta channel. The bug, however, was an annoying quirk in how the URLs were provided. The base URL is set on the HttpClient, then the fragment is set inside each API endpoint:

```csharp
// When creating the client:
var httpClient = new HttpClient { BaseAddress = new Uri("https://petstore.swagger.io/v2") };

// Inside each endpoint method:
var response = await _httpClient.PostAsJsonAsync("/pet", pet);
```

In the generated code, the BaseAddress doesn’t have a trailing slash, and the fragment passed to PostAsJsonAsync has a leading slash. This unfortunately doesn’t work, and gives a 404 instead of a more helpful error. The correct way is to have a trailing slash on the BaseAddress and no leading slash on the fragment passed to PostAsJsonAsync, as the sketch below shows.
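Here’s a minimal sketch of the corrected setup. Note this uses the System.Net.Http.Json extensions built into modern .NET rather than the Newtonsoft-based ones from the generated code, and reuses the Pet model from earlier:

```csharp
using System.Net.Http.Json; // PostAsJsonAsync over System.Text.Json, built into .NET 5+

// Trailing slash on the base address...
var httpClient = new HttpClient { BaseAddress = new Uri("https://petstore.swagger.io/v2/") };

var pet = new Pet { Name = "Doggie", PhotoUrls = new List<string> { "http://example.com/dog.jpg" } };

// ...and no leading slash on the fragment, so the request resolves to
// https://petstore.swagger.io/v2/pet rather than https://petstore.swagger.io/pet.
var response = await httpClient.PostAsJsonAsync("pet", pet);
response.EnsureSuccessStatusCode();
```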

Let’s see if the LLMs can catch this bug using the code generated by GPT-4o mini. I’ll use a prompt containing a very cut-down version of the code with the error.

Prompt: The following code returns a 404 not found. The endpoint exists with the specified verb, and I can call it with curl. What might be causing this error? Consider the way this is being called in C#.

```csharp
var httpClient = new HttpClient { BaseAddress = new Uri("https://petstore.swagger.io/v2") };
var petApi = new PetApi(httpClient);
var newPet = new Pet { Name = "Doggie", PhotoUrls = new List<string> { "http://example.com/dog.jpg" } };
var pet = await petApi.AddPet(newPet);

public class PetApi
{
    private readonly HttpClient _httpClient;

    public PetApi(HttpClient httpClient)
    {
        _httpClient = httpClient;
    }

    public async Task<Pet> AddPet(Pet pet)
    {
        var response = await _httpClient.PostAsJsonAsync("/pet", pet);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsAsync<Pet>();
    }
}
```

Gemini 2 Flash

Gemini failed to detect the bug – it suggested that I set the Content-Type header to application/json. A sensible-sounding suggestion, except that this header is set automatically under the hood by the PostAsJsonAsync method. It failed to find the trailing/leading slash issue.

GPT-4o mini

GPT-4o mini, the winner in the last challenge, also failed. It gave more suggestions – including the header that Gemini suggested – along with checking the URL, looking for network issues, and checking authentication. All helpful hints, but not the actual cause.

Claude 3.5 Sonnet

This was our last hope for detecting the bug – and it also failed. Once again, the content type was suggested, and it also suggested installing the Newtonsoft.Json NuGet package. Helpful for making code compile, but this code is already compiling and running to give the 404, so a bizarre suggestion.

The winner

In this case, the winner was Stack Overflow. This just shows that sometimes there are problems an LLM can’t solve.


Challenge 3 – What was I just reading?

In the last challenge, we saw that sometimes LLMs just can’t answer the questions you have. They may well have the relevant information, but if it lives in only one Stack Overflow answer, the prompt you would need to surface it would be so specific that it would be hard to write successfully.

However, as developers, we often research issues or code that we need, then unfortunately forget what we researched when we need it again a few days/weeks/months later. This is where Long-Term Memory is such a powerful feature.

So imagine the scenario – you’ve hit this bug with a 404 from your API, and you remember that you hit the same bug a while ago and Stack Overflow helped you, but you can’t remember the fix.

Let’s see how well our 3 LLMs do with the Pieces Long-Term Memory.

Prompt: What was I reading about the httpclient baseaddress in stack overflow?

Gemini 2 Flash

Gemini gave me a great answer – a deep link to the question I had read, plus a summary: include a trailing slash in the BaseAddress but no leading slash in the URL fragment. It also provided a simple, 4-line code sample to show this, lifted from the Stack Overflow post.

GPT-4o mini

GPT-4o mini was good, but not as good as Gemini. It mentioned the article by name, described it, gave correct information on the requirements for leading and trailing slashes, and gave the same code example from Stack Overflow. It didn’t provide the deep link, but a follow-up question explicitly asking for the link surfaced it.

Claude 3.5 Sonnet

Claude was the least helpful LLM in this case. It just provided the deep link and no other information.

The winner

The winner here was Gemini 2 Flash. It provided all the information needed and the deep link with a single prompt.


Challenge 4 – Knowledge of new features

The tech space is constantly moving, with new tools and frameworks being released all the time, and updates to the more popular tools coming on a regular cadence. 

Whilst this is great for those of us who keep a close eye on the tech ecosystem, it’s less good for LLMs – these are trained very irregularly, with new models being trained on new data, and old models not being re-trained.

A good test for a model is to ask about something new to see how much it knows – that way you know whether you are getting the latest answers.

Let’s see what the LLMs know about the latest version of .NET, the underlying platform that C# uses. .NET 9 was released on Nov 12th, 2024, so two and a half months before this post.

Prompt: Summarize all the new features in .NET 9.

Gemini 2 Flash

Gemini 2 Flash doesn’t know anything about .NET 9 – it tells me that it is still under development and not yet released. Despite this being unhelpful, at least Gemini isn’t making anything up.

GPT-4o mini

GPT-4o mini was more than happy to tell me the features in .NET 9, based on its knowledge cutoff of October 2023 – over a year before .NET 9 was released, and 4 months before the first preview was announced. The feature list is questionable – new libraries, new APIs, generic information like that. This is less helpful than Gemini, as it is pure hallucination based on features announced in .NET 8.

Claude 3.5 Sonnet

Like GPT-4o mini, Claude was also very happy to provide me with the wrong information. It gave similar generic features released in an earlier version of .NET, again driven by hallucination.

The winner

Despite none of the LLMs being able to provide the right answers, having not been trained on recent enough data, Gemini 2 Flash is the winner here, as it categorically tells me that it lacks this information instead of making things up. This is the right thing to do – I can then go and do the research myself, and later rely on the Pieces Long-Term Memory to provide relevant context to the LLM.


Which LLM wins?

Out of these 4 challenges, the scores are:

Gemini 2 Flash – 2 wins

GPT-4o mini – 1 win

Claude 3.5 Sonnet - 0 wins

So on paper, with these 4 challenges, Gemini 2 Flash is the winner. However, it is important to note that this is not an extensive test – different LLMs work better for different programming languages and for different ways of prompting.

My recommendation to you is to try them all and find what works best for your situation, and also don’t be afraid to switch LLM to get better results.

One of the key features of Pieces is the LLM selection – you can choose from a wide range to find the one you like best, based on your preferences or your organization’s AI governance rules. You can also switch LLMs as needed – if you don’t like the responses, switch mid-conversation!

If you have a favourite model you use in Pieces, let me know on X, Bluesky, LinkedIn, or our Discord.
