Local large language models (LLMs) and their growing traction
Explore why local LLMs are gaining traction for secure, private, and cost-effective AI development workflows.
While AI is fascinating and advancing at a steady pace week over week, a large share of consumers still consider it risky and do not trust it.
As many as 52% of U.S. adults say they are concerned about AI becoming embedded in their daily lives.
The concern is natural: AI models are trained on data, which increases the likelihood of personal and sensitive information being leaked. In fact, many big companies either do not allow their developers to use AI to generate code (one of the most common use cases) or have a long list of guidelines to follow before using AI tools, and one of the main reasons behind this is security.
Most AI tools come with a caution that they might use your data for future training. For those who deal with sensitive data, one way to solve this is to switch to small language models (SLMs) or to run large language models locally, as security, enhanced privacy, and control are the foundations of local LLMs.
But what is a local LLM?
Local large language models are AI language models that can be run directly on a user's personal device, such as a laptop, desktop computer, or smartphone, rather than relying on cloud-based servers. They stand out because of features like privacy, offline availability, and lower operating costs, and are often used by developers for complex structured tasks.
In this article, I cover the fundamentals of local large language models, some common examples, tools that support local models, and how you can use local LLMs for complex structured tasks.
Why Local LLMs?
If you are an engineering leader or a decision maker, some of the key considerations to keep in mind while choosing a new technology are business needs, scalability, long-term impact, and cost-effectiveness. Cloud-hosted AI is great, but it also comes at a cost, in dollars, in security concerns, and even in compliance. This is where local LLMs shine.
While security is the most common reason developers choose to host LLMs locally, it also helps reduce inference costs compared to cloud-hosted APIs and infrastructure. Since you are not dependent on any third-party cloud provider, vendor lock-in also does not exist. What really stands out for me is that the models are always accessible, independent of network connectivity, and that keeping data on your own machines makes it easier to stay compliant with HIPAA and GDPR, since the data stays with you.
If you are looking for free local large language models to start with, you can use tools like Ollama and LM Studio that let you download and run open-source models such as Llama 2, Llama 3, Mistral, and Phi-3-mini directly on your computer.
Local vs Cloud LLMs
Both cloud and local LLMs come with their own sets of pros and cons. The choice should depend on your use case, organizational priorities, and the criteria that matter to you: scalability, security, and cost.
Recently, I came across a really cool open-source project called “local deep researcher” by LangChain, which uses local LLMs to build a fully local web research and report-writing assistant.
And then there’s this tweet from a dev who is using local LLMs to code on a flight (so useful for those long flights when you have a lot of work to finish).

I have seen devs and businesses use local LLMs for a variety of tasks, both complicated and simple ones.
Personally, a hybrid approach is more favorable for me.
📑 When I need to work with sensitive data (like revenue reports or medical records), I choose a local LLM, but when I want something done quickly (like summarizing a blog post or helping me research), I choose a cloud-based LLM.
While choosing tools to suit my use case, I look for ones that allow me to switch between cloud and local models. Pieces is one such example, and if you want to know more about other AI tools, you can read this blog post.
But how do you, as a decision maker, make a choice?
To help you with that, here’s a detailed table highlighting the key differences between the two:
Feature | Local LLMs | Cloud-Based LLMs |
---|---|---|
Data Privacy | Full control over your data. | Data is sent to external servers, which can create compliance risks (GDPR, SOC 2, HIPAA). |
Security | More secure, since it does not depend on third-party services. | Security depends on the cloud provider and can introduce vulnerabilities. |
Cost | Upfront hardware cost to host the models. | Usually pay-as-you-go; costs can grow as usage increases. |
Scalability | Needs additional hardware to scale. | Easy to scale, as compute is available on demand. |
Latency & Performance | Low latency, since processing happens locally. | Network latency from API calls and rate limits. |
Customization & Fine-Tuning | Full control over model fine-tuning and domain-specific optimizations. | Limited customization; fine-tuning options depend on the provider. |
Offline Availability | Works without an internet connection. | Requires internet connectivity for inference and API access. |
How can I run a large language model locally?
From the explanations above, running large language models locally might seem like a tough job, but with the right tools, it isn’t.
Below are some methods I have personally tried:
Using tools with built-in support for running LLMs locally
For my day-to-day tasks, I use Pieces, mainly because it offers all the features I need.
For example, I can use its copilot for research, turn on the long-term memory feature so it remembers what I have been working on, quickly access resources and solutions, and use all of that as context to generate more accurate responses.
All of this comes with support for local models built-in.
It comes with Ollama activated and ready to use; you can then simply download the models of your choice and run them locally.

Using LLM – a command-line utility
For Python developers who might have thought, "Which Python library provides tools for working with large language models locally?" your search ends here! LLM is a CLI utility and Python library for interacting with Large Language Models, both via remote APIs and locally.
If you are on a Mac and have Homebrew installed, all you need to do is run `brew install llm`; Windows users can use the default command for installing Python libraries: `pip install llm`.
LLM defaults to using OpenAI models, but you can use plugins to run other models locally.
There are also plugins for GPT4All, Llama, the MLC project, and MPT-30B, as well as additional remote models. You can run `llm models list` to see all available models, both remote ones and those you have installed, including brief info about each.
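Since LLM is also a Python library, you can call the same models from code instead of the shell. Here's a minimal sketch, assuming you have installed the `llm` package plus a plugin for local models such as `llm-gpt4all`; the model id below is just an example, so swap in whatever `llm models list` shows on your machine:

```python
import llm  # pip install llm (plus a plugin such as llm-gpt4all for local models)

# Example model id; run `llm models list` to see what is installed locally.
model = llm.get_model("orca-mini-3b-gguf2-q4_0")

# Prompt the local model and print its reply.
response = model.prompt("Explain in one sentence why local LLMs help with data privacy.")
print(response.text())
```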

Using tools like Ollama
Ollama is a free, open-source tool that lets you run LLMs locally.
You can simply download it and then use `ollama run modelname` to run a model of your choice. You can find the list of all available models in their model library.
If you would like to learn more, here’s a comprehensive guide to local large language models with Ollama.
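Once Ollama is running, it also exposes a local HTTP API (on port 11434 by default), so you can call your models from code without any data leaving your machine. Here's a minimal Python sketch, assuming you have already pulled a model such as llama3 with `ollama pull llama3`:

```python
import requests

# Ollama's local REST API endpoint for text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"

payload = {
    "model": "llama3",  # any model you have pulled locally
    "prompt": "Summarize the trade-offs between local and cloud LLMs in two sentences.",
    "stream": False,    # return one JSON object instead of a token stream
}

resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["response"])
```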

Things to keep in mind when running LLMs locally
Running LLMs locally comes with many benefits, but you should also be aware of the edge cases and follow some best practices to get the most out of it.
Below are some hot tips on running local large language models sourced from personal experience, ML researchers, and Reddit forums:
Know which model to choose
Choosing the right model usually depends on your use case, cost, and metrics like parameter count.
Here’s an article comparing some of the most commonly used LLMs on challenges such as SDK generation and bug finding; it can help you decide on the best LLM to run either locally or in the cloud.
Cost
If you want to run models for free, choose an open-source model; LLM benchmarks are a good way to compare the options.
I recommend the Hugging Face LLM Leaderboard; it will help you decide on the best local large language model for your use case.
Optimal model parameters
Models usually come in different parameter counts (1B, 7B, etc.), which influences both performance and hardware requirements.
For example, 1B models are lightweight, suitable for general-purpose tasks like text generation and basic reasoning, and can run efficiently on systems with 16GB of RAM.
Params
Choose models depending on parameter size, as it can impact accuracy. For embedding models, refer to the Hugging Face leaderboard, which compares 100+ text embedding models across 1,000+ languages.
Hardware considerations
Since your models run locally, you need hardware that can handle the models of your choice.
Models with fewer parameters require a minimal hardware setup, and requirements increase with parameter size (32B models will have the highest accuracy but will also be the most resource-intensive).
CPU considerations
It is good to have a multi-core CPU with high clock speeds to handle data preprocessing, I/O operations, and parallel computation.
GPU requirements
GPUs are the most important component for running LLMs. To run models like GPT or BERT locally, you need a GPU with high VRAM capacity and a large number of CUDA cores.
RAM requirements
Along with GPU, RAM is also an important component. Usually 64 GB DDR4/DDR5 is good enough for running large models.
Here’s an informative article on hardware requirements that can teach you more about running ML models locally in production.
Software optimization
Once the hardware and model are sorted, you will want to focus on performance and on reducing resource utilization.
This is where quantized models (4-bit, 8-bit) help, as they use far less VRAM while retaining most of the accuracy.
For efficient fine-tuning you can use techniques like LoRA, and to speed up inference you can use tools like TensorRT.
To further improve performance, keep your GPU drivers and CUDA toolkit up to date.
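To get a rough sense of what a given model needs before you download it, a handy rule of thumb is parameter count × bytes per parameter, plus some headroom for the KV cache and activations. The sketch below is only a back-of-the-envelope estimate (the 20% overhead figure is an assumption, and real usage varies by runtime, quantization format, and context length):

```python
def estimate_memory_gb(params_billions: float, bits: int = 4, overhead: float = 0.2) -> float:
    """Rough memory estimate for model weights at a given quantization level.

    params_billions: model size, e.g. 7 for a 7B model
    bits: quantization level (16 = fp16, 8 = int8, 4 = 4-bit)
    overhead: extra headroom for KV cache/activations (assumed 20% here)
    """
    weight_gb = params_billions * 1e9 * (bits / 8) / 1e9  # bytes -> GB
    return weight_gb * (1 + overhead)

# Compare a few common sizes at 4-bit vs. full fp16 precision.
for size in (1, 7, 13, 32):
    print(f"{size}B @ 4-bit ~= {estimate_memory_gb(size, bits=4):.1f} GB, "
          f"@ 16-bit ~= {estimate_memory_gb(size, bits=16):.1f} GB")
```

For instance, this puts a 7B model at roughly 4 GB in 4-bit form versus about 17 GB in fp16, which is why quantized models are the usual choice on consumer hardware.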
How to use local LLMs
We have talked about what local LLMs are, how to run them, and how they work. Now it is time to see how we can integrate them into our workflows to be more productive.
As a developer, some of the key use cases of local large language models for me are:
Coding
I know it might sound like a clichéd example, but when you write enterprise code or work with sensitive data, cloud-based LLMs are not the right choice.
So, running your own models, or using tools that support running local LLMs, is a good idea.
Here’s an example of me using Pieces within VSCode and running the Gemma 2 model locally:

Knowledge transfer
Whenever you leave a company or join a new one, you are either expected to transfer knowledge to someone else or you are the one receiving it.
Usually, it is a lot of information to process in a few days and is mostly scattered across documents, tools like Jira, and codebases.
This data is also confidential, so you cannot use a cloud-based LLM. This is where you can use self-hosted LLMs to index codebases and documentation.
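As a minimal illustration of that idea, here's a sketch that builds a tiny local search index over internal docs using Ollama's embeddings endpoint. The model name and the `internal_docs` folder are placeholders, and the brute-force cosine search is just for demonstration; a real setup would use a vector database:

```python
import math
import pathlib
import requests

EMBED_URL = "http://localhost:11434/api/embeddings"  # local Ollama endpoint
EMBED_MODEL = "nomic-embed-text"                     # example embedding model pulled locally

def embed(text: str) -> list[float]:
    """Get an embedding vector for a piece of text from the local model."""
    resp = requests.post(EMBED_URL, json={"model": EMBED_MODEL, "prompt": text})
    resp.raise_for_status()
    return resp.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Index every markdown doc in a hypothetical internal docs folder; nothing leaves your machine.
index = [
    {"path": str(p), "text": p.read_text(), "vector": embed(p.read_text())}
    for p in pathlib.Path("internal_docs").glob("*.md")
]

query = embed("How do we deploy the billing service?")
best = max(index, key=lambda doc: cosine(query, doc["vector"]))
print("Most relevant doc:", best["path"])
```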
Automating context-aware suggestions
While coding, we often look at various resources – documentation, Stack Overflow, codebases, and internal notes. It can be hard to remember where we found a solution, and sometimes it involves private data.
This is where features like long-term memory or context awareness can help, as they remember the resources you access and use them as context.
Such tools can also run the models locally, so the AI responses are not only accurate for your needs but also secure, since you are hosting the models yourself.
Here's an example of how I used Pieces Copilot to help build a GitHub Action.
It referred to the resources I read using its long-term memory feature and used the Gemma 2 model to provide the solution:

Go local & reduce costs
The article covered all the details you need to know to run LLMs locally.
From one developer to another, while both local LLMs and cloud-hosted LLMs have their advantages, a hybrid setup works best for our workflows.
I personally like having the ability to choose between local models and cloud-hosted models.
For example, if I am working with my company’s secret data, I prefer using a local model.
However, if I need to understand a few blocks of public code, I prefer cloud-hosted models because they are simpler and faster.
But if you are a decision maker and have to choose between the two, especially in an enterprise, I would advise you to choose local LLMs based on these factors:
Cost saving over time – While there might be an initial setup cost to run models locally (though using it with tools like Pieces does not require any fancy hardware; I am using it on a 16GB MacBook Pro M1), it reduces the recurring cost of cloud subscriptions, which can also increase significantly with rising demand.
Regulatory compliance – You must be aware that Meta decided in 2024 not to release updated AI models in the EU because of strict regulations. Using local LLMs mitigates this.
Data control & privacy – Since you are not dependent on third-party cloud providers, you do not have to worry about sensitive data being leaked.
Change in external policies – Cloud-based AI providers frequently update pricing models, governance rules, and data policies, so depending on them can create issues in the future.
As a decision-maker, you will also need to consider the total cost of ownership. Here are some resources to help you with that:
A paper by Dell on how on-premises solutions can lead to a 38% to 88% reduction in costs.
Costs and benefits of your own LLM.
A Comparative Analysis of Total Cost of Ownership for Domain-Adapted Large Language Models versus State-of-the-art Counterparts in Chip Design Coding Assistance, a paper hosted on arXiv (Cornell University).
