
AI & LLM

Jul 11, 2025

A different perspective on prompt evaluation

Learn what prompt evaluation is, why it matters in AI development, and how to systematically assess prompt quality to improve performance, accuracy, and reliability across use cases.

The AI boom has been nothing but a crazy roller coaster: hundreds of tweets and LinkedIn posts every day about AI and how it helps you build and ship faster.

Writing probably took one of the craziest turns, but frankly, AI assistance has started shaping every area.

Some folks are determined not to follow the boom and end up falling well behind it, while others take full advantage of it and love it. Sometimes it’s both (like me). But at the core of it all sits the prompt, and how to evaluate it so that it works for you.

And who can tell that better than the people working at an AI company? 

Anyway, I had a chat with Antreas Antoniou, our Senior ML Research Scientist, and we tried to walk through some steps. 


Why does one prompt work better than another?

That’s the question I keep coming back to, not “What’s the best prompt?” but rather, why does one prompt perform better than another?

The answer isn’t just about vibes or clever phrasing. It’s about structure.

To understand prompt effectiveness, you need more than intuition. You need visibility into performance: controlled inputs, repeatable testing, and tracked outputs. You need systems that help you assess not just how a prompt sounds, but how well it handles edge cases, varying contexts, and model updates.

That’s where test prompts come in.

Think of them like unit tests, but for language behavior: you feed standardized inputs into models and observe how they behave on tasks like summarization, reasoning, rewriting, and translation.
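
To make that concrete, here’s a minimal sketch of what a prompt “unit test” can look like, assuming the OpenAI Python SDK; the summarization prompt and test cases are made up, and the idea transfers to whatever model and client you actually use.

```python
# A sketch of "unit tests for prompts" (assumes the OpenAI Python SDK; the prompt
# and test cases below are made up for illustration).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SUMMARIZE_PROMPT = "Summarize the following text in one sentence:\n\n{text}"

# Standardized inputs paired with simple, checkable expectations.
TEST_CASES = [
    {"text": "The meeting moved from Tuesday to Thursday at 3 PM.", "must_mention": "Thursday"},
    {"text": "Pieces captures workflow context so developers can pick up where they left off.",
     "must_mention": "Pieces"},
]

def run_prompt(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # assumption: any chat model you have access to
        messages=[{"role": "user", "content": prompt}],
        temperature=0,         # keep runs as repeatable as possible
    )
    return response.choices[0].message.content

def test_summarization_prompt() -> None:
    for case in TEST_CASES:
        output = run_prompt(SUMMARIZE_PROMPT.format(text=case["text"]))
        # Crude checks, but they catch regressions when the prompt or model changes.
        assert case["must_mention"].lower() in output.lower(), f"Missing key fact: {case}"
        assert len(output.split()) < 40, "Output is longer than a one-sentence summary should be"

if __name__ == "__main__":
    test_summarization_prompt()
    print("All prompt tests passed")
```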

Now, let’s be real, our engineers are probably better equipped to build side-by-side model comparisons or publish deeply optimized LLM prompts. But what I want to do here is something different:

I want to share what I’ve learned about prompt evaluation in a way that’s dead simple, accessible to anyone, technical or not.

And it matters. Because as we’ve all seen (sometimes very publicly), it’s not always the model that fails; sometimes, it’s the prompt.

CNET published AI-generated finance articles riddled with factual errors. Apple paused its AI-written news summaries after pushing out misleading alerts. These weren’t just model failures. They were failures in prompt design and evaluation.

That’s why prompt evaluation has become a core discipline in working with LLMs.

At the heart of it, it’s about clarity: how well you explain what you want, how much context you provide, and how clearly you define what “good” looks like.

In many ways, it’s about breaking down complex goals into small, testable pieces, and that’s exactly what I’m going to show you.

(Oh, and if you’re still unsure about the difference between LLMs and GenAI, or the distinction between AI assistants and agents, those are worth clearing up first.)


What makes a “good” prompt?

There’s no perfect formula. 

But good prompts, the kind that actually hold up under scrutiny, tend to show strong performance across a few essential areas:

  • Relevance: Does the AI stay on-topic and actually address the task?

  • Factual Accuracy: Are the claims verifiable and grounded in truth?

  • Clarity: Is the output readable, coherent, and easy to follow?

  • Consistency: Do similar prompts yield similar quality?

  • Bias & Fairness: Are the results free from harmful assumptions or stereotypes?
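
If it helps to see those as something you can actually score against, here’s a tiny, hypothetical rubric; the 1–5 scale and equal weighting are my own choices, not a standard.

```python
# A tiny, hypothetical rubric built from the criteria above.
# The 1-5 scale and equal weighting are my assumptions; tune them to your use case.
CRITERIA = {
    "relevance":        "Does the output stay on-topic and address the task?",
    "factual_accuracy": "Are the claims verifiable and grounded in truth?",
    "clarity":          "Is the output readable, coherent, and easy to follow?",
    "consistency":      "Do similar prompts yield similar quality?",
    "bias_fairness":    "Is the output free from harmful assumptions or stereotypes?",
}

def overall_score(scores: dict[str, int]) -> float:
    """Average the per-criterion scores (each on a 1-5 scale)."""
    missing = set(CRITERIA) - set(scores)
    if missing:
        raise ValueError(f"Missing scores for: {missing}")
    return sum(scores.values()) / len(scores)

# Scores can come from human reviewers or from an LLM judge (more on that later).
print(overall_score({"relevance": 5, "factual_accuracy": 4, "clarity": 5,
                     "consistency": 4, "bias_fairness": 5}))
```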

At Pieces, we’ve tested far more than just a few dozen prompts. Some were used to automate growth workflows. Others supported internal content production.

And many were built to test the boundaries of AI-assisted technical writing, because when you’re a startup, you need to grow fast. Like… yesterday.

What surprised me most was watching how engineers actually approached prompt design.

Let’s just say, they don’t call themselves prompt engineers. They just do the thing, and sometimes it works, sometimes it doesn’t. And that’s okay. They rethink the prompt and search for a new solution.


Lessons from Antreas: context is the prompt

Antreas has thought deeply about what makes prompts effective. He calls his approach context engineering, and it hit me hard because it put words to something I’d been doing intuitively.

“Every detail in your prompt context, including persona, structure, and even tone, deeply influences the model’s output.”

Antreas shared his own strategy: he feeds the model his personal writing style so the output better mirrors his voice. 

He sometimes creates multiple personas to simulate a committee of reviewers (and that was something completely new to me), just to see how each would interpret and critique the same content. 

That alone changed how I think about feedback loops.

He walked me through techniques like:

  • Few-shot prompting: including a few examples in the prompt so the model picks up the pattern and generates similar outputs (see the sketch after this list).

  • Meta prompting: asking the AI to critique and improve its own prompt.
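
Here’s a rough, self-contained sketch of both techniques; the headline examples are invented, and the exact wording is just illustrative.

```python
# Few-shot prompting: the examples teach the model the pattern you want.
# Meta prompting: you ask the model to improve the prompt itself.
# The headlines below are invented purely for illustration.
FEW_SHOT_PROMPT = """Rewrite each headline to be concise and specific.

Headline: Our new feature is really great and will help you a lot
Rewrite: Ship features faster with workflow context capture

Headline: Some thoughts about why testing prompts might be a good idea
Rewrite: Why prompt testing catches failures before your users do

Headline: {headline}
Rewrite:"""

META_PROMPT = """Here is a prompt I use for rewriting headlines:

{prompt}

Critique it and return an improved version that produces more consistent results."""

print(FEW_SHOT_PROMPT.format(headline="An article on evaluating prompts"))
print(META_PROMPT.format(prompt=FEW_SHOT_PROMPT))
```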

And it doesn’t stop at theory. He demoed how he scores and refines AI-generated headlines by explaining why he’d rank one better than another. 

One part that stuck with me was from a Sakana AI paper he mentioned, which was about an “AI scientist” system that could fully automate scientific discovery. 

It ideated, coded, ran experiments, and wrote papers, and one of those papers was accepted at a workshop at a top machine learning conference without reviewers knowing it was written by a machine.

(Diagram from the AI Scientist project illustrating how LLMs can autonomously generate ideas, run experiments, and write research papers in a full-loop scientific workflow.)

This isn’t just writing automation. This is systems-level creativity.


Your take on prompt evaluation

#1: Create a persona who does the job

Start by building a clear persona: name, role, background, point of view. What do they care about? What are they trying to solve? You can even feed ChatGPT someone’s profile, tone of voice, or writing samples and say, “Act like this person evaluating this prompt.”

Cringe? A little. Useful? 100%.
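
Here’s a minimal sketch of that persona setup; the persona (“Maya”) and her background are entirely invented.

```python
# A persona-driven evaluation prompt; the persona ("Maya") and her background are invented.
PERSONA = """You are Maya, a senior technical writer with eight years of experience
documenting developer tools. You care about accuracy, concision, and whether a reader
can act on the text without asking follow-up questions."""

EVALUATION_TASK = """Acting as the persona above, evaluate the following prompt.
Explain what it does well, where it is ambiguous, and how you would rewrite it.

Prompt to evaluate:
{prompt_under_test}"""

full_prompt = PERSONA + "\n\n" + EVALUATION_TASK.format(
    prompt_under_test="Write a blog post about prompt evaluation."
)
print(full_prompt)
```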

#2: Simulate a review committee

Create multiple distinct personas (e.g., “Skeptical Engineer,” “Optimistic PM,” “Exhausted Tech Writer”) and have each “review” the same prompt or output. It’s like running your prompt through a panel of AI critics to stress-test how it holds up from different angles.

Great way to spot blind spots, tone mismatches, or assumptions you didn’t know you were making.
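
A hedged sketch of what running that panel might look like, assuming the OpenAI Python SDK; the personas and the draft under review are made up.

```python
# A "review committee" sketch: invented personas review the same output.
# The model call assumes the OpenAI Python SDK; swap in whatever access you have.
from openai import OpenAI

client = OpenAI()

PERSONAS = {
    "Skeptical Engineer": "You distrust hype and hunt for missing edge cases and vague requirements.",
    "Optimistic PM": "You look for user value and whether the output fits the product story.",
    "Exhausted Tech Writer": "You care about clarity, structure, and whether this creates more docs debt.",
}

OUTPUT_UNDER_REVIEW = "Draft announcement: our AI assistant now writes your release notes for you."

def committee_review(output: str) -> dict[str, str]:
    reviews = {}
    for name, persona in PERSONAS.items():
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: any capable chat model
            messages=[
                {"role": "system", "content": f"You are the {name}. {persona}"},
                {"role": "user", "content": f"Review this output in three bullet points:\n\n{output}"},
            ],
        )
        reviews[name] = response.choices[0].message.content
    return reviews

for reviewer, review in committee_review(OUTPUT_UNDER_REVIEW).items():
    print(f"--- {reviewer} ---\n{review}\n")
```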

#3: Layer in engineering context + use real tools

When evaluating prompts for technical tasks (dev workflows, code generation, etc.), lean on real engineering context:

  • Ask clarifying questions

  • Reference established docs or internal guides

  • Actually run the outputs inside your product or editor

And if you want more structure, here are some tools our engineering team swears by:

  • Promptfoo – Open-source tool for batch testing prompts side-by-side. Fast, simple, and very dev-friendly.

  • LLM-as-a-Judge – Uses another LLM to evaluate outputs based on your criteria (a minimal hand-rolled sketch follows this list).

  • DeepEval – Blends human review + AI scoring to keep quality consistent.

  • PromptBench (Microsoft) – Standardized benchmarking framework to test performance across real-world tasks.
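
If you’d rather see the LLM-as-a-Judge idea without any tooling, here’s a hand-rolled sketch (not any particular tool’s API); the JSON format, scoring scale, and choice of judge model are assumptions you’d tune to your own criteria.

```python
# A hand-rolled LLM-as-a-Judge sketch (not any specific tool's API): one model grades
# another model's output. Assumes the OpenAI Python SDK; the JSON shape, 1-5 scale,
# and choice of judge model are my own assumptions.
import json

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI-generated answer.

Task given to the model:
{task}

Model output:
{output}

Score the output from 1 to 5 on: relevance, factual_accuracy, clarity.
Respond with JSON only, e.g. {{"relevance": 4, "factual_accuracy": 5, "clarity": 3, "notes": "..."}}"""

def judge(task: str, output: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: use a strong model as the judge
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(task=task, output=output)}],
        response_format={"type": "json_object"},  # ask for parseable JSON
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

print(judge(
    task="Summarize the release notes in two sentences.",
    output="Version 2.1 adds offline search and fixes three sync bugs.",
))
```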

Prompt evaluation isn’t about perfection. It’s about building a repeatable feedback loop that makes your prompts better, clearer, and more useful, one iteration at a time.
