Exploring text segmentation in retrieval-augmented generation (RAG)
Uncover how effective text segmentation can enhance information retrieval and boost the performance of your generation models. Discover techniques and best practices to transform your approach to RAG.
Retrieval-augmented generation (RAG) systems draw on collections of text documents to generate responses to users' questions.
Text segmentation plays a crucial role in finding the text segments most useful for a query: it affects the relevance and accuracy of both the retrieved information and the generated response built on it.
What are the methods for text segmentation?
When a RAG system pulls context from a large document, good segmentation ensures each segment is cohesive and relevant to the query, giving the generative model the context it needs.
On the other hand, poor segmentation—whether through too-short, fragmented segments or large, unfocused chunks—can lead to incomplete, irrelevant, or confusing responses. Choosing the right segmentation method is therefore important in making RAG applications work effectively.
Method #1 Heuristic-based segmentation
A first approach to segmentation is to make use of existing break points in the text, such as paragraph breaks, headings, or other special markers such as double newlines in code.
This method is easy to set up but works best when documents are well-organized since it relies on the existing segmentation clues in the document.
In structured documents, heuristic-based segmentation can produce coherent segments, but in unstructured or inconsistently formatted text, this approach can fall short.
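As a rough illustration of the idea, the sketch below splits a document at blank lines and starts a new segment at every heading. It assumes Markdown-style `#` headings as the break markers; the function name `split_on_structure` and the heading pattern are illustrative choices, not part of any particular library.

```python
import re

# Assumption: headings look like Markdown headings ("# Title", "## Section").
HEADING = re.compile(r"^#{1,6}\s+\S")


def split_on_structure(text: str) -> list[str]:
    """Split text at blank lines and start a new segment at each heading."""
    segments: list[str] = []
    current: list[str] = []

    def flush() -> None:
        # Close the segment collected so far, if there is one.
        if current:
            segments.append("\n".join(current))
            current.clear()

    for line in text.splitlines():
        if not line.strip():        # a blank line ends the current segment
            flush()
        elif HEADING.match(line):   # a heading ends it and opens a new one
            flush()
            current.append(line)
        else:
            current.append(line)
    flush()
    return segments
```

Because the function only reacts to markers that are already in the text, a document without headings or blank lines comes back as a single segment, which is exactly the weakness described above.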
Method #2 Cohesion-based segmentation
To go beyond the segmentation clues already present in a document, the text can be split along patterns in its content.
The goal is to segment the text so that each segment covers a distinct topic, maximizing cohesion within each segment.
One way to approach this is by identifying shifts in word usage.
TextTiling, for example, compares word counts in adjacent sliding windows and places topic boundaries where the word frequencies shift.
This approach doesn’t interpret the actual meaning of words; it only considers how often and in what patterns words appear together.
Extensions to this approach consider the similarity between words in addition to word counts.
Cohesion-based segmentation can be effective for documents with clear topic shifts, like articles or reports, but can struggle in technical documents or documents with repetitive content.
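To make the word-count idea concrete, here is a simplified sketch in the spirit of TextTiling, not the full algorithm. It assumes the text is already split into sentences, compares the word counts of the windows before and after each gap with cosine similarity, and marks a boundary wherever the similarity falls below a fixed threshold. The `window` and `threshold` values are arbitrary assumptions that would need tuning.

```python
import math
import re
from collections import Counter


def word_counts(sentences: list[str]) -> Counter:
    """Count lowercase word occurrences across a window of sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"[a-z']+", sentence.lower()))
    return counts


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def gap_similarities(sentences: list[str], window: int = 2) -> list[float]:
    """Similarity of the windows before and after each gap between sentences."""
    sims = []
    for gap in range(1, len(sentences)):
        left = word_counts(sentences[max(0, gap - window):gap])
        right = word_counts(sentences[gap:gap + window])
        sims.append(cosine(left, right))
    return sims


def lexical_boundaries(
    sentences: list[str], window: int = 2, threshold: float = 0.1
) -> list[int]:
    """Return every index i at which a new segment starts at sentences[i]."""
    sims = gap_similarities(sentences, window)
    return [i + 1 for i, sim in enumerate(sims) if sim < threshold]
```

The sketch only counts words, so two passages that discuss the same topic with different vocabulary look unrelated, while repetitive boilerplate looks highly cohesive.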
Method #3 Neural network-based segmentation
Neural network-based approaches use embeddings—dense vector representations of text—to identify optimal segmentation points.
These methods can be especially effective for complex or domain-specific documents because embeddings capture more of a segment's meaning than word counts do, which makes it possible to detect similarity in meaning even when the wording differs.
A common approach is to use pre-trained language models to create embeddings for each sentence or paragraph in a document.
These embeddings ideally encapsulate the content of a segment in a form that makes it easy to calculate similarity scores between adjacent texts.
High similarity scores suggest that two texts cover the same topic and should not be separated, whereas low similarity scores can indicate a topic shift and therefore a potential segment boundary.
To increase segmentation accuracy, neural networks can additionally be fine-tuned on the target texts or, if human-annotated examples are available, directly on the segmentation boundaries.
This allows for a wide range of options for improving segmentation results, even in unstructured text.
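The adjacent-similarity idea can be sketched as follows. The embedding model is deliberately left out: `embed` is a hypothetical callable that maps a sentence to a vector, so any pre-trained sentence encoder could be plugged in. The threshold of 0.5 is an arbitrary assumption, not a recommended value.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def segment_by_similarity(
    sentences: list[str],
    embed: Callable[[str], Vector],  # hypothetical: supplied by the caller
    threshold: float = 0.5,          # assumption: tune per corpus and model
) -> list[list[str]]:
    """Group consecutive sentences into segments.

    A new segment starts whenever the similarity between a sentence and the
    one before it drops below the threshold.
    """
    if not sentences:
        return []
    vectors = [embed(sentence) for sentence in sentences]
    segments = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            segments.append([sentence])   # low similarity: likely topic shift
        else:
            segments[-1].append(sentence)
    return segments
```

Comparing only neighboring sentences keeps the sketch short. A single off-topic sentence will split a segment in two, so comparing small windows of sentences instead may be more robust.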
Pre- and post-processing for improved segmentation
In addition to the core segmentation algorithm, pre- and post-processing steps can significantly impact the accuracy of the segmentation.
In pre-processing, parts of the text that are not relevant to segmentation, such as stopwords or verb conjugations, can be removed or normalized before the segmentation boundaries are determined.
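A minimal sketch of such a pre-processing step is shown below. The stopword list is a tiny illustrative sample, and `crude_stem` is a naive stand-in for a real stemmer or lemmatizer; both are assumptions made to keep the example self-contained.

```python
import re

# Assumption: a tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "it", "on"}


def crude_stem(word: str) -> str:
    """Strip a few common English suffixes (naive, for illustration only)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text: str) -> list[str]:
    """Lowercase, tokenize, drop stopwords, and reduce words to crude stems."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(token) for token in tokens if token not in STOPWORDS]
```

With this normalization, "walked" and "walks" both become "walk", so a word-count comparison treats them as the same term. The normalized tokens are only used to find boundaries; the segments themselves keep the original text.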
In post-processing, the segmentation boundaries can then be adjusted to split the text at reasonable points, such as the end of a sentence.
In a structured text like source code, it can be helpful to match existing structures such as braces or indentation.
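For plain prose, the boundary adjustment can be sketched as snapping a proposed character offset to the nearest sentence end. The pattern below treats `.`, `!`, or `?` followed by whitespace or the end of the text as a sentence end, which is a simplifying assumption; abbreviations such as "e.g." would fool it.

```python
import re

# Assumption: a sentence ends at ., ! or ? followed by whitespace or end of text.
SENTENCE_END = re.compile(r"[.!?](?=\s|$)")


def snap_to_sentence_end(text: str, boundary: int) -> int:
    """Move a character-offset boundary to just after the nearest sentence end."""
    candidates = [match.end() for match in SENTENCE_END.finditer(text)]
    if not candidates:
        return boundary  # nothing to snap to, keep the proposed boundary
    return min(candidates, key=lambda pos: abs(pos - boundary))
```

For source code, the same idea would apply with different candidate positions, for example offsets just after a closing brace or before a line with less indentation.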
Overall, no single method of text segmentation fits every task, and the pros and cons of each approach have to be weighed carefully.
To better evaluate the appropriate segmentation approach, it can be helpful to collect human-annotated segmentation boundaries for a subset of documents.
Different methods of segmentation can then be compared directly on their performance, and nuanced design decisions, such as the choice of pre- and post-processing steps, can be backed by evidence.
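One simple way to run such a comparison is to score predicted boundaries against the annotated ones with precision, recall, and F1. The sketch below assumes boundaries are given as sets of positions (sentence indices or character offsets) and counts only exact matches.

```python
def boundary_scores(predicted: set[int], reference: set[int]) -> dict[str, float]:
    """Precision, recall, and F1 of predicted boundaries against annotated ones."""
    hits = len(predicted & reference)  # boundaries placed exactly right
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Exact matching is strict: a boundary that is off by one sentence counts as both a false positive and a miss. Metrics such as Pk or WindowDiff are designed to be more forgiving of near misses and may be a better fit when annotators themselves disagree slightly.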
Another, more direct approach is to evaluate the output of the whole RAG system against a set of desired answers to predefined questions.
Differences in text segmentation can then be evaluated by their downstream effect on the generated response.
This approach is most useful if the rest of the RAG pipeline is unlikely to change, since it tunes the text segmentation not only to the provided documents but also to the specific language model.
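A small harness for this kind of end-to-end comparison might look like the sketch below. Everything outside the harness is assumed: `answer_with` is a hypothetical hook that runs retrieval and generation over a list of segments, and token-overlap F1 is just one simple way to score a generated answer against a desired one.

```python
from typing import Callable


def token_f1(predicted: str, expected: str) -> float:
    """Token-overlap F1 between a generated answer and the desired answer."""
    pred, exp = predicted.lower().split(), expected.lower().split()
    common = sum(min(pred.count(t), exp.count(t)) for t in set(pred))
    if not common:
        return 0.0
    precision, recall = common / len(pred), common / len(exp)
    return 2 * precision * recall / (precision + recall)


def compare_segmenters(
    segmenters: dict[str, Callable[[str], list[str]]],
    documents: list[str],
    qa_pairs: list[tuple[str, str]],
    answer_with: Callable[[list[str], str], str],  # hypothetical pipeline hook
) -> dict[str, float]:
    """Average answer score per segmentation method.

    segmenters maps a method name to a function that splits one document
    into segments; answer_with(segments, question) returns the answer the
    rest of the RAG pipeline generates from those segments.
    """
    results = {}
    for name, segment in segmenters.items():
        segments = [seg for doc in documents for seg in segment(doc)]
        scores = [
            token_f1(answer_with(segments, question), answer)
            for question, answer in qa_pairs
        ]
        results[name] = sum(scores) / len(scores) if scores else 0.0
    return results
```

Keeping the question set, the retriever, and the language model fixed while swapping only the segmenter is what makes the resulting scores comparable.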
For those looking to experiment with different text segmentation methods in a custom RAG pipeline, give Pieces for Developers a try—it’s a powerful tool to keep your code organized and streamline your workflow every step of the way.