AI & LLM

Feb 24, 2025
Why companies are turning to small language models (SLMs)

Discover what Small Language Models (SLMs) are and how they can revolutionize AI adoption. Learn how SLMs offer efficiency, privacy, and cost-effective AI solutions for businesses.

The image features a visual graphic of Small Language Models (SLMs) highlighting key benefits such as improved data privacy, lower costs, and offline accessibility. Icons of AI chips, cloud servers, and security shields illustrate the impact of SLMs on business strategy and AI governance.

Since ChatGPT hit the mainstream, the usage of generative AI products has grown at an exponential rate. We are now at the point where AI is no longer confined to the realm of computer science but is firmly in the hands of everyone, regardless of their technical background.

Whilst these tools are hugely beneficial for employee productivity, they are also proving to be quite the headache for corporate leadership, who have to balance employees' desire to use AI with security and privacy requirements.

One solution to this problem is Small Language Models, or SLMs – generative AI models optimized to run locally on your device. In this post, I'll cover what SLMs are, and the pros and cons of using them in the workplace.

The Plateau of Large Language Models (LLMs)

Image source: objectbox.io

LLM has become a well-known term when we think of AI. We refer to chatting with an LLM, writing apps against an LLM, and so on. LLM stands for Large Language Model – these are very large AI models, trained on large amounts of data, that take input and give output in human-readable language.


How large is ‘large’?

By large, I am referring to models that need dedicated racks of hardware to run. Model size is measured in parameters – the number of learned values that are loaded into the model. For LLMs these values are typically 4-byte numbers, and LLMs have hundreds of billions, if not trillions, of parameters.

For example, GPT-4o is rumored to be a 1.8 trillion parameter model, which at 4 bytes per parameter needs 7.2 terabytes – and that is just for the parameters, before you account for the software that runs the model and the working memory it needs. All of this has to be stored somewhere and loaded into memory to process your prompts.

To run these LLMs you need dedicated hardware – racks of interconnected top-of-the-line GPUs (graphics processing units) at tens of thousands of dollars each. This means you need to send your prompt, and any associated data, to the cloud – either hosted by a third party or hosted in your own (very expensive) cloud infrastructure.


What are small language models (SLMs)?

The alternative to large language models is Small Language Models, or SLMs. These are the same kind of model as an LLM, but smaller – they have fewer parameters, in the single or double-digit billions, and use smaller numeric types to store these parameters.

For example, you can get a 3 billion parameter model using 4-bit integers, weighing in at only 1.5 gigabytes instead of the 7.2 terabytes of GPT-4o. 
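
If you want to sanity-check these numbers yourself, the arithmetic is just parameters multiplied by bytes per parameter. Here's a quick back-of-the-envelope sketch in Python (the GPT-4o figure is the rumored one from above, not a confirmed number):

```python
# Back-of-the-envelope model memory: parameters x bytes per parameter.
def model_size_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate size of a model's parameters in gigabytes."""
    bytes_per_param = bits_per_param / 8
    return params_billions * bytes_per_param  # billions of params x bytes = GB

# Rumored GPT-4o: 1.8 trillion parameters at 4 bytes (32 bits) each.
print(f"GPT-4o (rumored): {model_size_gb(1800, 32) / 1000:.1f} TB")  # 7.2 TB

# A 3B parameter SLM quantized to 4-bit integers.
print(f"3B SLM at 4-bit: {model_size_gb(3, 4):.1f} GB")  # 1.5 GB
```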

Suddenly this is small enough to run on a personal device like a desktop or laptop. These are also small enough for you to run on the GPU inside that personal computer, rather than relying on a rack of dedicated hardware.

💡 When you are not using the GPU for all the mathematical calculations needed to render a 3D game, you can instead use its accelerated computation capabilities to process AI models like SLMs.

Although these models run on personal computers, it is fair to say that their performance may not be close to that of cloud models, simply because of the hardware. Your top-of-the-range home GPU will never be as fast as a rack of server-class GPUs, each costing 5x what your entire computer did.

But that’s not to say they are slow – the performance can be very acceptable with the right model and the right hardware.

Although these models are small, they are mighty! Researchers have been employing a wide range of techniques to ensure that they are very capable models, sometimes on par with the cloud-based LLMs.

One such technique is training a model on just a specific domain, like coding. It's amazing how many parameters you don't need when your model is trained only on code, and not on everything on the internet, from code to opera to world history.

Examples of SLMs

There is a huge range of SLMs out there – not only foundation models (the base models created by researchers and AI companies) but also fine-tuned variants created by researchers or hobbyists from these foundation models. 

HuggingFace is a great resource for free and open source models, with literally hundreds of thousands of models available that you can download and use.
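
As a sketch of how easy it is to get hold of one of these models, here's how you might download one with HuggingFace's official huggingface_hub Python library. The model ID here is just an example – any open model repository on the site works the same way:

```python
# A minimal sketch of downloading an open model from HuggingFace
# (pip install huggingface_hub). The repo_id is an example – swap in
# whichever model you want from huggingface.co.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="microsoft/Phi-3-mini-4k-instruct")
print(f"Model files downloaded to: {local_path}")
```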

Source: Lu et al., 2024

Some of the top SLMs are:

  • Llama by Meta – The latest Llama models from Meta come in a range of sizes, from 1B parameters up to 70B. These are fast and powerful general-purpose models that can handle text and vision, and are used as the basis of a lot of fine-tuned models.

  • Phi by Microsoft – The Microsoft Phi models are designed to provide correct information as much as possible; they were trained on textbooks and textbook-quality data, rather than the general internet, where the information can be blatantly wrong.

  • Qwen Coder by Alibaba – The Qwen Coder model is trained to help coders, rather than being a general-purpose model. This means that it's terrible at recipes, but very good at code, matching the coding capabilities of LLMs like GPT-4o.

To use these models, you will need some kind of framework or application to run them. One ideal solution is Pieces, where we support a huge range of SLMs that you can use to interact with both the Pieces Long-Term Memory, as well as your code projects.


What about medium language models?

The line between SLMs and LLMs can be blurred. There are a few references out there to MLMs, medium language models, but really the term isn’t popularly used. 

A good distinction (and this is a personal take as there are no hard and fast rules) is that an SLM can be run on a personal computer (albeit sometimes with a powerful GPU), and an LLM needs the cloud to run. 

But there are no real ‘rules’ as to what makes an SLM vs an LLM. 

For example, the Llama models from Meta are generally considered SLMs, but the 70B parameter version is bordering on an LLM and needs either a cloud setup or a ridiculously powerful device like an NVIDIA Digits.


How can you use SLMs?

There are many ways to use SLMs, but the easiest is using Pieces! We support a wide range of SLMs that run completely offline, powered by Ollama.

Wherever you are using the Pieces copilot, such as in developer environments like VS Code, or in our flagship desktop app, you can use one of many models, and even switch models mid-conversation to help find the one that is right for you.
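
If you want to see what this looks like under the hood, here is a minimal sketch of chatting with a local SLM through Ollama's Python package, including switching models mid-conversation. It works because the conversation is just a list of messages you can replay against any model. The model names are examples – this assumes Ollama is installed and the models have been pulled, e.g. with `ollama pull llama3.2`:

```python
# A minimal sketch of chatting with local SLMs via Ollama (pip install ollama).
# Assumes the Ollama service is running and the models have been pulled.
import ollama

messages = [{"role": "user", "content": "What is a small language model?"}]
response = ollama.chat(model="llama3.2", messages=messages)
print(response["message"]["content"])

# Switching models mid-conversation: append the history so far and send it
# to a different model – here Phi, purely as an example.
messages.append(response["message"])
messages.append({"role": "user", "content": "Now summarize that for a CEO."})
response = ollama.chat(model="phi3", messages=messages)
print(response["message"]["content"])
```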


Managing AI governance with SLMs

As a leader in an organization, one of the biggest reasons for caring about SLMs is to support your AI governance policies. 

AI governance is the set of processes, practices, and guardrails you put in place to ensure that AI tools are used responsibly, ethically, and in line with both external regulations and internal policies on data privacy and corporate IP. It's an extension of existing policies that takes into consideration the different landscape of AI.

A visual representation of key features in AI governance, showcasing compliance, data privacy, risk management, transparency, and ethical AI usage in enterprise environments

As a very basic example, think of extending a basic privacy rule of “don’t send customer data outside of the organization” to add “and this includes ChatGPT”.

As these AI tools become part of our day-to-day toolkit, it is very easy for knowledge workers to simply not think about the consequences of sending corporate data to ChatGPT – after all, it's just like putting it in a spreadsheet, right? Well, no.

Why should you care about AI governance?

Back in the early days of ChatGPT, employees at Samsung leaked private source code and the contents of a confidential meeting to ChatGPT.

At the time, OpenAI's terms and conditions said it would train its models on anything submitted to ChatGPT – essentially, the model would be trained on this private information from Samsung, which could then be extracted from later model iterations with the right prompt.

Online tools are starting to offer privacy controls, such as the OpenAI data controls, but these rely on the users configuring these correctly. One incorrect configuration and OpenAI has your private data. 

And even if you have these controls enabled correctly, there may be regulatory requirements banning you from sending confidential information outside of your organization. 

Imagine the fines and reputational damage of the press finding out you leaked patients' private medical data to Meta, or the individual education plans of kids with special education needs to Google Gemini!

There are also trust issues as LLMs become available from different countries.

We live in a time where the risk of cyber espionage and other crimes is high, sometimes led by nation-states. As different countries release LLMs, in some cases in close collaboration with local governments, can these models be trusted?

We are already seeing models that limit information based on government propaganda, and even bills proposed by overly paranoid government members to fine or imprison anyone who uses models from certain countries.

Have we got to the point that a nation-state could hack another through LLM responses, such as recommending software with backdoors? 

It is not outside the realms of possibility, and there is research into this area, such as this paper introducing CodeBreaker, an LLM-assisted backdoor attack on code completion models. 

Certainly, it is likely that information sent to cloud LLMs is being harvested – whether it is from nation states essentially spying, or advertising tech companies harvesting your data, the same as they have been doing with your online activities.


How can SLMs help?

The biggest benefit of SLMs is that you can run them locally, without any data leaving your machine, let alone your corporate network.

These models run disconnected from the internet. Models are not code; they are neural network architectures and parameters. They don't execute anything, so they do not have the capability to run arbitrary code or connect to external systems.

The code you use to run them is typically an open-source framework like Llama.cpp or Ollama, which you can either audit yourself to ensure it is doing nothing nefarious, or rely on the wisdom of the crowd – these frameworks wouldn't be used so widely if there was anything untoward.

You can even write your own code to run these models using open-source libraries that are vetted and used by large technology organizations.
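
As a sketch of what "your own code" can look like, here is a minimal offline example using llama-cpp-python, the Python bindings for Llama.cpp. The model path is an example – point it at any GGUF model file you have already downloaded; nothing here touches the network at inference time:

```python
# A minimal sketch of fully offline inference with llama-cpp-python
# (pip install llama-cpp-python). The model path below is an example.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/phi-3-mini-4k-instruct-q4.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU if one is available
)
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "List three benefits of local AI."}]
)
print(output["choices"][0]["message"]["content"])
```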

This means that an employee using an SLM would not be breaking any policies or regulations regarding sensitive, private, or customer data that they are allowed to access and process locally – this is no different from a compliance perspective than using Excel or running a local Python script.

Shadow IT

Shadow IT is the term given to IT tools and services used by employees at an organization that are not approved or managed by the internal IT teams. These are tools that employees generally use to be more productive – and can include using software in ways that the company does not allow (such as complex financial modeling on shared spreadsheets that are not tracked and controlled), or employees purchasing software themselves, such as a ChatGPT subscription.

Shadow IT goes against IT policies, and the use of unapproved AI services in particular goes against AI governance.

So why do employees use shadow IT?

It’s not to be malicious – it’s because employees have a job to do and they have found a better tool to do it, often after fighting with the process to get these tools through the correct channels.

What well-paid employee wouldn’t spend $20 a month of their own money to make their job easier if the company won’t provide the tools?

There are two solutions to shadow IT – the carrot and the stick. The stick method is to block and punish: blocking access to unapproved systems and bringing disciplinary action against those who use them.

This is not ideal – you are losing great workers, and decreasing morale.

It is better to consider the carrot method – understand why employees are using shadow IT in the first place, and provide them with the right tools. 

A timeline graphic illustrating key milestones in educating about Shadow IT, highlighting risks, best practices, and company policies to manage unauthorized software use in the workplace

If the reason is the tool got declined because of cost, then chances are the software they are using is cheaper than the costs of firing them and finding a replacement. If the reason is AI governance, then the best thing to do is to provide them with the tools they want to use in a compliant way – for example giving them access to SLMs to run locally.


How can SLMs not help with AI governance?

One important area of AI governance is understanding that the AI can be wrong: it can provide incorrect information, or hallucinate and literally make things up – such as the lawyer who asked ChatGPT for prior cases, only for it to invent them. It is up to you as the user to ensure that you validate the results.

Due to the smaller amount of information encoded in an SLM, there is a higher chance it will be less useful for the task at hand compared to a cloud model. Your employees need to understand this and be even more critical of the responses.


Help employees who have limited internet access

SLMs running on device are great for AI governance, and they are also great for employees who have limited internet access. 

This could be for a number of reasons – they travel a lot and airplane WiFi can be questionable, or they are in remote areas with limited or expensive access to the internet, such as oil rigs, remote mining areas, or disaster zones.

If you have employees who need to work from such constrained environments, SLMs are a great solution. Whilst they have good connectivity, they can download the SLMs they need, then run them when they are offline.


Hardware requirements to run SLMs on device

As mentioned earlier in this article, SLMs can benefit from specific hardware in your computer to run faster. This can mean using a GPU, or even an NPU – a neural processing unit, a dedicated chip for AI workloads.

GPUs and NPUs are optimized to run math very quickly, which is what is needed for AI.

VRAM

GPUs and NPUs are measured not only on their speed but also on VRAM – video RAM, the amount of additional memory provided on the device. This memory is needed to hold the SLMs, so the larger the model, the more VRAM you need.

Most frameworks for running models are able to shift the models in and out of system RAM using the CPU if the GPU doesn’t have enough, but this can be a big performance hit.

As a general rule of thumb, for models with 4-bit quantization, you need as many GB of VRAM as billions of parameters in the model – so if you have a 3B parameter model you need at least 3GB of VRAM. This is just a rule of thumb and depends on a range of factors such as the quantization and type of model.
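
That rule of thumb is easy to express as a quick estimator. In this sketch, 4-bit weights take 0.5 GB per billion parameters, and the doubling to cover the KV cache, activations, and framework overhead is an assumption chosen to match the rule above, not a precise model:

```python
# The VRAM rule of thumb as a rough estimator: at 4-bit quantization,
# roughly 1 GB of VRAM per billion parameters.
def estimated_vram_gb(params_billions: float) -> float:
    weights_gb = params_billions * 0.5  # 4 bits = 0.5 bytes per parameter
    # Doubling the weights roughly covers the KV cache, activations, and
    # framework overhead – an assumed factor that matches the rule of thumb.
    return weights_gb * 2

for size in (1, 3, 7, 14, 70):
    print(f"{size}B parameters: ~{estimated_vram_gb(size):.0f} GB VRAM")
```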

Hardware choices

There are a range of hardware options for devices with GPUs and NPUs, with unfortunately conflicting standards, so hardware choice is dependent on what models you want to run, and how you want to run them.

The main GPU/NPU families are:

NVIDIA 

NVIDIA is the top GPU manufacturer – they even claim to have invented the GPU, having released the first off-the-shelf GPU you could add to a PC in 1999, though the term actually originated with Sony in 1994. NVIDIA GPUs power the majority of the AI in the cloud, making them a $3T company. NVIDIA has GPUs for desktop and laptop devices, as well as dedicated hardware such as the NVIDIA Digits.

Generally, NVIDIA hardware has the most support and compatibility as they have focused very heavily on the developer ecosystem with CUDA, a platform for leveraging the capabilities of the GPU that has been built into a lot of AI tools. Frameworks like Ollama are able to use CUDA to run very large models very quickly.

AMD 

AMD has been playing catch-up with NVIDIA for a while, but has the advantage that they also create general-purpose CPUs (competing with Intel as the core chip in an x86/x64 PC).

GPUs from AMD do not have the range of support that NVIDIA GPUs do, though frameworks like Ollama do support some of their hardware. Their ROCm platform is the equivalent of CUDA but has nowhere near the same adoption.

Apple Metal 

Apple provides a GPU API called Metal that works on older Apple devices with NVIDIA and AMD GPUs, as well as the newer Apple Silicon with integrated GPU and NPU cores. 

Metal doesn’t have the range of support of NVIDIA devices, but any Apple device you buy has the relevant hardware built in. Frameworks like Ollama support Metal.

Intel 

Intel has GPUs and dedicated NPUs in their Core Ultra devices. These are less well supported than NVIDIA or AMD hardware, and use a framework called OpenVINO to run. OpenVINO started life as a toolkit for vision models and has relatively poor software support. A lot of frameworks, including Ollama, don’t support Intel GPUs and NPUs.

Qualcomm Snapdragon X 

The Qualcomm Snapdragon X processors are the first Qualcomm ARM64 processors to have a powerful NPU built in. These are currently not well supported – Ollama only supports Llama 3.2 on this hardware, and no other models.

Copilot+ PCs 

Microsoft is trying to add an abstraction over different NPUs with their Copilot+ PC model. The idea here is that any Copilot+ PC will have an NPU of a certain speed, and in theory will work with all Copilot+ software that runs locally.

As of now, this only covers Qualcomm Snapdragon X processors, one series of AMD devices, and one series of Intel processors. Although there is a big push towards selling Copilot+ PCs, there is currently little support outside of Microsoft for this model – and indeed even inside Microsoft, with their flagship Phi Silica SLM for Copilot+ PCs only showing support for Snapdragon X processors.

Which one should you buy? 

It depends on what framework you are after. 

Pieces uses Ollama, so any device with an NVIDIA or AMD GPU, or any Apple device, will run accelerated.

In general though, for Windows or Linux, NVIDIA is the way to go due to the large amount of support for these GPUs, and the fact that they are generally the most performant. But this is a growing space, and Microsoft is pushing hard on Copilot+ PCs, so there will be lots of interesting developments in the near future.

Cost considerations

Getting approval for hardware purchasing for devices with GPUs can be difficult, as they generally tend to be more expensive than devices without a GPU.

For example, Lenovo ThinkPads have been the workhorse of business laptops for years. They start at less than $1,000, but if you want an NVIDIA GPU then you need to be paying at least $1,500 more – and this is for bottom-of-the-range models.

Imagine having to sign off on an extra $1,500 per machine for a workforce of thousands. Combine this with perception as well – a non-technical leader asking why developers need ‘gaming laptops’.

You need to ensure you have good justifications for the increased spend.

One justification is the cost of not paying more. Knowledge workers can be more productive with AI, and if your governance rules mean running SLMs on device, then the extra hardware cost easily pays for itself.

I see the average developer in the US as costing a company around $1,000 a day in wages, benefits, and other costs.

If you can save them one day’s productivity, then the device has already paid for itself.


Should your team use SLMs?

AI governance is becoming increasingly important as companies embrace the upsides and the downsides of generative AI in the workplace. One way to ensure your employees are not using public AI services with private or secure data, and are compliant with your rules, is to use SLMs running locally on device.

Try Pieces for free now as a way to run a wide range of offline models locally, keeping your data private and secure. And share your thoughts on AI governance and how SLMs can help with me on X, Bluesky, LinkedIn, or our Discord.

Written by Jim Bennett, Head of Developer Advocacy at Pieces
