
AI & LLM

May 6, 2025


Building Long-Term Memories using hierarchical summarization

Learn how hierarchical summarization builds efficient Long-Term Memories by capturing, summarizing, and organizing data for faster context retrieval.

A recent post by the Anthropic Safeguards research team discussed how they use hierarchical summarization to monitor computer use.

Interestingly, this is similar to a process that we use at Pieces to help build context from captured memories so that we can make them available in the Pieces Copilot when you ask questions of the Pieces Long-Term Memory.

In this post, I’ll look at what hierarchical summarization is, and how it is used internally in Pieces.


What is hierarchical summarization?

Hierarchical summarization in the AI space is the process of using large language models to create summaries of context, then repeating the process to create summaries of summaries. 
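To make the idea concrete, here is a minimal Python sketch. The `summarize()` function is a hypothetical stand-in for a single LLM call, not a real API; a real implementation would prompt a model of your choice.

```python
def summarize(text: str) -> str:
    # Hypothetical stand-in: a real implementation would prompt an LLM here.
    return text[:200]

def hierarchical_summarize(chunks: list[str], fan_in: int = 5) -> str:
    """Summarize each chunk, then summarize groups of summaries,
    repeating until a single top-level summary remains.
    Assumes fan_in >= 2 so each pass shrinks the list."""
    if not chunks:
        return ""
    level = [summarize(chunk) for chunk in chunks]
    while len(level) > 1:
        # Group fan_in summaries together and summarize each group.
        level = [
            summarize("\n\n".join(level[i:i + fan_in]))
            for i in range(0, len(level), fan_in)
        ]
    return level[0]
```

Each pass through the loop produces one new, higher level of the hierarchy, so the amount of text shrinks by roughly the fan-in factor at every step.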

To quote Anthropic:

…we adapted prior work on recursive and decision-focused summarization to AI safety monitoring: we first summarize individual interactions, then summarize the summaries. We call this approach hierarchical summarization.

Anthropic does this to evaluate if their computer use capabilities are being used for nefarious purposes. 

For example, checking to see if you are using this capability to build a click farm. 

It is impossible to detect malicious or bad behavior from a single interaction, such as clicking a button on a website. 

However, by looking at patterns in summarized data, such as thousands of requests to click the same button across multiple machines, you can spot bad actors.

This kind of summarization can be applied to any kind of data. 

For example, often employees will send status reports to their manager that are summaries of their week's work.

Their manager then summarizes these and sends the summary to their manager, who summarizes all the summaries from their reports and sends them to their manager, and so on. 

At each stage, the recipient receives the right level of detail for their role.

Hierarchical summarization as a concept isn’t new – for example, back in 2014, Christensen et al. published a paper called “Hierarchical Summarization: Scaling Up Multi-Document Summarization”, which described a process for building such hierarchies over large collections of documents.

The rise of AI for natural language processing (NLP) makes this kind of summarization much easier to automate.


Hierarchical summarization in Pieces

The Long-Term Memory continually captures context from the active windows on your screen. 

This is a huge amount of information, and it’s unlikely you will need every captured word verbatim. Even if you did, storing it all would fill your hard drive pretty quickly, making it impossible to offer 9 months of Long-Term Memory.

To reduce this, Pieces creates detailed summaries of this information, capturing what is important and storing that. 

Even with detailed summaries, this is still a lot of information, which could be slow to search, so Pieces then creates summaries of these summaries, building up a hierarchy of summaries based on information types. 

This allows Pieces to quickly find the top-level summaries that match your prompt, then drill down into the lower-level summaries behind them to find relevant context to send to the LLM.
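Pieces’ internal implementation isn’t public, but the drill-down idea can be sketched like this. The `SummaryNode` structure and the word-overlap `relevance()` scorer are illustrative assumptions; a real system would likely score with embeddings.

```python
from dataclasses import dataclass, field

@dataclass
class SummaryNode:
    text: str
    children: list["SummaryNode"] = field(default_factory=list)

def relevance(query: str, text: str) -> float:
    # Placeholder scorer using word overlap; a real system would
    # likely use embeddings or another semantic similarity measure.
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_context(root: SummaryNode, query: str, top_k: int = 3) -> list[str]:
    """Rank the top-level summaries against the prompt, then return
    the lower-level summaries behind the best matches as LLM context."""
    ranked = sorted(root.children,
                    key=lambda node: relevance(query, node.text),
                    reverse=True)
    context: list[str] = []
    for node in ranked[:top_k]:
        if node.children:
            context.extend(child.text for child in node.children)
        else:
            context.append(node.text)
    return context
```

Because the search starts at the small set of top-level summaries, only the few relevant branches of the hierarchy ever need to be expanded.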

This use of summarization also allows different LLMs with different context sizes to be used. 

For example, if you are using a local LLM with a smaller context window, then the higher-level summaries can be sent; if you are using a cloud LLM with a larger context window, then lower-level summaries can be sent.
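As a sketch of that idea, assuming summaries are stored as a list of levels from most detailed to most compressed (the ~4 characters-per-token estimate is a rough heuristic, not Pieces’ actual logic):

```python
def pick_summary_level(levels: list[list[str]], context_tokens: int) -> list[str]:
    """levels[0] holds the most detailed summaries and the last entry the
    most compressed; return the most detailed level that fits the window."""
    for level in levels:
        estimated_tokens = sum(len(text) // 4 for text in level)  # ~4 chars/token
        if estimated_tokens <= context_tokens:
            return level
    return levels[-1]  # fall back to the most compressed level
```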

You can see an example of the results of this in the Workstream Activities view in the Pieces desktop app. 

These activities are summaries created every 20 minutes, containing sections that are themselves summaries grouped by theme, such as projects, tasks, discussions, and so on.


Hallucination propagation

One of the downsides to summarization is the propagation of hallucinations. 

Hallucinations occur when LLMs generate incorrect information, making things up rather than admitting they don’t know something.

If there are hallucinations in the original source content or the lower-level summaries, then each new summary built on top of them inherits those hallucinations, and so on up the hierarchy.

If you ground the LLM with hallucinated data, you will just get more hallucinated data.

With Pieces, we want to ensure all the summaries are well grounded in actual information. And interestingly enough, we can do this by putting less effort into pre-processing.

Reducing hallucinations

When Pieces captures information from your running applications, it uses optical character recognition, or OCR, to extract text from temporary screenshots. 

OCR is a traditional AI problem where text is extracted from an image.

The upside is that this works for every application – from text on a webpage, to code in a YouTube video, or a presentation shared in a meeting in Microsoft Teams.
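Pieces’ capture pipeline is internal, but the general approach can be sketched with off-the-shelf libraries (this assumes Pillow and pytesseract are installed, along with the Tesseract OCR binary):

```python
from PIL import ImageGrab   # pip install pillow
import pytesseract          # pip install pytesseract

def capture_screen_text() -> str:
    """Grab a temporary, in-memory screenshot and extract its text."""
    screenshot = ImageGrab.grab()
    return pytesseract.image_to_string(screenshot)
```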

The downside is that it can be hard to extract the text correctly. For example, OCR often confuses 0 (zero) with O (capital letter O), and similar ambiguities make it hard to always get accurate results.

Another hard problem for OCR is multi-column text.

For example, in an article with 2 columns, such as many scientific papers, each visual row contains words from a sentence in the left column and words from a different sentence in the right column. If the OCR doesn’t understand the multi-column layout, it might read each row as a single sentence.

Other OCR issues involve application layout. A chat tool, for example, has a list of names down the side next to the open conversation. The list of names has nothing to do with the conversation, so capturing raw text could incorrectly attach names to context.

Originally, we did a lot of pre-processing to tidy up the text before we started summarizing it, really trying hard to clean up the data. 

For example, correcting column errors, adding headings, and focusing on readability.

Ironically, this actually made the summaries worse! 

The more pre-processing we did, the more hallucinations were created, and the worse the final summaries became.

Our ML experts put a huge amount of effort into tuning this pipeline, doing just enough cleaning to deal with major issues, but not so much that it led to hallucinations.

This also had the added benefit of reducing the amount of AI power needed to get a summary.
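The exact rules are tuned internally, but the spirit of “just enough cleaning” can be sketched like this. The specific rules below are illustrative assumptions, not Pieces’ actual pipeline:

```python
import re

def light_clean(ocr_text: str) -> str:
    """Minimal normalization: collapse repeated whitespace and drop
    lines with no real content, without restructuring the text."""
    cleaned = []
    for line in ocr_text.splitlines():
        line = re.sub(r"[ \t]+", " ", line).strip()
        # Keep only lines with some alphanumeric content; stray
        # punctuation-only lines are usually OCR noise.
        if re.search(r"[A-Za-z0-9]", line):
            cleaned.append(line)
    return "\n".join(cleaned)
```

The key design choice is what the sketch does not do: no reordering of columns, no invented headings, no rewriting for readability. The LLM sees the text close to how it was captured.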

This is one of those fun traits of AI. 

Sometimes, the more specific you are and the more information you give, the worse the outcome. Our AI team has run huge amounts of analysis to hit the sweet spot.


Future directions

At Pieces, we are constantly working on improving our Long-Term Memory. We recently released our second-generation Long-Term Memory engine, LTM-2.5, and we’re hard at work on the next iterations.

As we build out the capabilities of the Long-Term Memory engine, it is important that you can find the answers you want quickly using natural language prompts, with accurate results that bring together information captured across multiple applications and time periods.

By creating these hierarchical summaries, we can bring together relevant information for fast retrieval.

If you haven’t already installed it, download Pieces for free now, and try out the Long-Term Memory powered by our hierarchical summaries.
