
Nov 22, 2024

Managing context for repository-aware code generation in no time

Dive into practical insights on managing context for repository-aware code generation. Learn how RAG architecture works and how to present retrieved snippets to an LLM.

Over the past few years, Large Language Models (LLMs) such as OpenAI’s GPT-4o and Anthropic’s Claude have continually raised the bar for code generation, enabling developers to streamline coding tasks, refactor complex codebases, and even generate entire software components. From an end-user perspective, this presents a challenge: how do you show the model enough of your own code that the generated code fits functionally and stylistically within your repository?

In code generation workflows, Time to First Token (TTFT) is a crucial performance metric. Longer latency impacts developer productivity, particularly when generating code interactively, as in integrated development environments (IDEs). Various factors influence TTFT, such as model size, retrieval complexity, and network latency, especially when using online repositories. When designing a code-generation tool we must balance TTFT performance with generation quality.

One way to approach this is fine-tuning: TTFT stays as fast as possible because no additional prompt processing is needed, and the fine-tuned LLM has memorized the style and content of all of our custom, proprietary code.

While this might handle the code styling problem, your fine-tuned model will only be up to date on the data provided during fine-tuning. 

Once you pull main or check out a colleague's development branch, your fine-tuned LLM might already be stale.

Retrieval-Augmented Generation (RAG) sidesteps this issue by maintaining up-to-date local indexes and presenting the LLM with the code context exactly as it appears on your machine.

RAG has historically been limited by small context windows and poor needle-in-a-haystack (NIH) performance, but with leading LLM providers now offering 100k+ token context sizes and improved NIH benchmark results, we can effectively leverage RAG for repository-scale code generation problems.


What is RAG?

LLMs only know their training data. If you want to ask an LLM about something private to you (your software projects, emails, etc.), you need to either fine-tune the LLM on your data (thereby adding it to the training data), or use RAG. 

If you want to ask a question about a file on your computer, you could just paste the file into the chat. RAG essentially automates this process, by searching your machine for relevant files and ‘pasting’ them into the conversation behind the scenes.


How does RAG architecture work?

The traditional RAG strategy for natural language text involves chunking into snippets, embedding, and search. 

At the simplest level, for natural-language text, we could split everything by paragraph, embed each chunk with a small open-source BERT model from Hugging Face, and search this vector database by embedding the user's input, comparing cosine similarity, and appending the most relevant paragraphs to our system prompt. RAG for code generation, however, presents its own unique problems.
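Before getting to those problems, here is a minimal sketch of that baseline pipeline, assuming the sentence-transformers library and a small MiniLM encoder (the model choice and helper names are illustrative, not a prescribed stack):

```python
# A minimal sketch of the paragraph-chunk / embed / cosine-search pipeline.
# The model choice (all-MiniLM-L6-v2) is illustrative, not a recommendation.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small BERT-style encoder

def build_index(document: str) -> tuple[list[str], np.ndarray]:
    """Split text into paragraph chunks and embed each one."""
    chunks = [p.strip() for p in document.split("\n\n") if p.strip()]
    vectors = model.encode(chunks, normalize_embeddings=True)
    return chunks, vectors

def retrieve(query: str, chunks: list[str], vectors: np.ndarray, k: int = 3) -> list[str]:
    """Embed the query and return the k most similar chunks by cosine similarity."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ q  # cosine similarity, since embeddings are normalized
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

# The retrieved chunks would then be appended to the system prompt
# before calling the LLM.
```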

How should we effectively chunk code? Bracket matching? Line breaks? 

Some files might be thousands of lines long, some code might contain no natural paragraph-like breaks, and code dependencies might exist across different files. 

How can we ensure each function is chunked with its helper methods and utilities, a script is chunked with its config file, and a test is chunked along with its dependent methods?

Should we keep our snippets mutually exclusive, or allow overlap and potentially fill our downstream prompts with duplicated information? We could use downloaded SDKs or syntax-parsing tools like tree-sitter to track dependencies, though this can introduce significant technical debt, uneven support across programming languages, and failures on incomplete, work-in-progress code.
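As one lightweight illustration of syntax-aware chunking, the sketch below splits a Python file at top-level function and class boundaries using the standard-library ast module as a stand-in for a multi-language parser like tree-sitter. It only handles Python and fails outright on incomplete code, which is exactly the limitation noted above:

```python
# Syntax-aware chunking sketch for Python files only, using the standard
# library's ast module as a stand-in for a multi-language parser such as
# tree-sitter. ast.parse raises SyntaxError on incomplete, work-in-progress
# code, illustrating the failure mode described above.
import ast

def chunk_python_source(source: str) -> list[str]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)  # raises SyntaxError on unparseable code
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive (Python 3.8+);
            # decorators above the def line are not captured in this simple version.
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks
```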

Furthermore, we need our chunking method to be fast. To ensure our LLM has up-to-date knowledge of any code that has changed between conversations, we have to periodically re-index (chunk and embed). Changes could be small or significant (e.g. checking out a branch for PR review), and any noticeable delay in indexing will hurt the user experience and TTFT.
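One common way to keep re-indexing cheap is to hash each file's contents and only re-chunk and re-embed files whose hashes have changed. The sketch below assumes a hypothetical chunk_and_embed helper wrapping the chunking and embedding steps above:

```python
# Incremental re-indexing sketch: re-embed only files whose content hash has
# changed since the last pass. chunk_and_embed is a hypothetical helper that
# returns (chunks, vectors) for a file's text.
import hashlib
from pathlib import Path

def reindex(repo_root: str, index: dict[str, dict]) -> dict[str, dict]:
    """index maps file path -> {"hash", "chunks", "vectors"}; updated in place."""
    live_paths = {str(p) for p in Path(repo_root).rglob("*.py")}
    for path_str in live_paths:
        text = Path(path_str).read_text(errors="ignore")
        digest = hashlib.sha256(text.encode()).hexdigest()
        entry = index.get(path_str)
        if entry is None or entry["hash"] != digest:
            chunks, vectors = chunk_and_embed(text)  # hypothetical helper
            index[path_str] = {"hash": digest, "chunks": chunks, "vectors": vectors}
    # Drop entries for files that no longer exist (e.g. after switching branches).
    for stale in set(index) - live_paths:
        del index[stale]
    return index
```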

Once we have effectively chunked our code, we must choose a suitable embedding and search strategy. We could use any open-source code embedding model from Hugging Face, or a system like Facebook's Neural Code Search.

Pro tip 💡: While even a small transformer model might yield better search quality, we need to take our users' hardware limitations into account and weigh the computational cost of indexing against the weaker search performance of simpler models.

Given the decisions we have made for chunking, we might also need to refine our search strategy. If we are using our user’s input message as our search query, how can we ensure the right code is retrieved? 

The user's message might misname a function or class, be overly general and retrieve too much information, or be overly specific and miss critical components.

> We could opt for query expansion (sketched below), or utilize retrieved snippets and repurpose them as search queries for a graph-like traversal of our embeddings.

> We could also design an agentic system, utilizing an LLM to reformulate our query and filter our context, but with repeated sequential LLM calls we run the risk of significantly increasing our TTFT (not to mention the cost of our API usage). 
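To make the first option concrete, here is a rough sketch of query expansion: ask an LLM for a few rephrasings of the user's message, retrieve for each variant, and merge the deduplicated results. The model name is illustrative, the call assumes an OpenAI-compatible client, and retrieve() is the helper from the earlier sketch:

```python
# Query expansion sketch: generate a few query variants with an LLM, retrieve
# for each, and merge the results with deduplication. The model name is
# illustrative; retrieve() is the cosine-similarity helper sketched earlier.
from openai import OpenAI

client = OpenAI()

def expand_query(user_message: str, n_variants: int = 3) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": f"Rewrite this code-search query {n_variants} ways, "
                       f"one per line, using likely function/class names:\n{user_message}",
        }],
    )
    variants = response.choices[0].message.content.splitlines()
    return [user_message] + [v.strip() for v in variants if v.strip()]

def retrieve_expanded(user_message: str, chunks, vectors, k: int = 3) -> list[str]:
    seen, merged = set(), []
    for query in expand_query(user_message):
        for snippet in retrieve(query, chunks, vectors, k):
            if snippet not in seen:  # deduplicate across variants
                seen.add(snippet)
                merged.append(snippet)
    return merged
```

Note that the extra LLM call adds latency, which is the same TTFT trade-off the agentic option faces.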

Finally, with our context retrieved we are ready to build our final prompt.

How to present our snippets to the LLM

We can generate a file tree and add file paths to each snippet to augment our context. If our snippets overlap we may need to deduplicate.
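A simple version of that assembly step might look like the following sketch, where snippets is a list of (relative_path, code) pairs that have already been deduplicated:

```python
# Sketch of final prompt assembly: a compact file tree for orientation,
# followed by each retrieved snippet labeled with its file path.
from pathlib import Path

def build_prompt(repo_root: str, snippets: list[tuple[str, str]], question: str) -> str:
    """snippets is a list of (relative_path, code) pairs, already deduplicated."""
    tree = "\n".join(
        str(p.relative_to(repo_root)) for p in sorted(Path(repo_root).rglob("*.py"))
    )
    context = "\n\n".join(f"# File: {path}\n{code}" for path, code in snippets)
    return (
        "You are assisting with code in this repository.\n\n"
        f"Repository files:\n{tree}\n\n"
        f"Relevant snippets:\n{context}\n\n"
        f"User request: {question}"
    )
```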

How should we ensure that contiguous blocks of code across snippets are presented together? How much message history should we include from the current and historic LLM chats to ground the LLM to the user’s preferences and ensure we are generating code consistently tailored to the user’s needs and prompting style?
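For the message-history question, one illustrative approach is to keep only the most recent messages that fit within a fixed token budget (estimated crudely from character counts here; a real system would use the model's tokenizer):

```python
# Illustrative history trimming: keep the newest messages that fit within a
# token budget, using a rough ~4-characters-per-token estimate.
def trim_history(messages: list[dict], max_tokens: int = 4000) -> list[dict]:
    """messages is a list of {"role": ..., "content": ...}, oldest first."""
    kept, used = [], 0
    for message in reversed(messages):       # walk from newest to oldest
        cost = len(message["content"]) // 4  # crude token estimate
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))              # restore chronological order
```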

At Pieces, we have built a tailored stack for RAG-enabled code generation that is agnostic to programming language and hardware limitations, and constantly updated to respond to the quirks and capabilities of the newest state-of-the-art LLMs.

Try it out by adding your own code as context to the Pieces co-pilot chat.

We perform as much processing as possible on-device, maximizing code-generation quality while minimizing TTFT overhead and API costs to deliver an optimal user experience.

Written by

Kieran McGimsey