
Technical · February 3, 2026 · 6 min read

Testing Agentic Systems: A Practical Guide

Your unit tests assume deterministic outputs. LLMs don't provide them. Now what?


Run the same prompt twice and you get different outputs. Traditional testing, which assumes a fixed expected result, falls apart immediately. But "non-deterministic" doesn't mean "untestable."

QUICK ANSWER

Test agentic systems using behavioral assertions, curated evaluation sets, and LLM-as-Judge grading, and keep traditional unit tests for the deterministic components around the model.

Non-determinism means you need different testing strategies.

The Testing Pyramid Doesn't Apply

The traditional pyramid (lots of unit tests, fewer integration tests, even fewer E2E tests) assumes your code is deterministic. With LLMs, this inverts.

"The traditional testing pyramid inverts for AI systems. Unit tests provide almost no value - the real signal comes from behavioral and evaluation testing."

— Andrej Karpathy, former Director of AI at Tesla

Unit testing an LLM call is nearly useless. You're testing that the API works, not that your system works. The value is in higher-level tests that verify behavior.

Behavioral Testing

Instead of asserting exact outputs, assert properties of outputs.

// Bad: brittle, will fail randomly
expect(output).toBe("The answer is 42.");

// Good: tests behavior, not exact text
expect(output).toContain("42");
expect(output.length).toBeLessThan(200);

What properties matter? It depends on your use case: presence of required facts, length limits, output format, tone, and the absence of content you never want to ship.
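As a minimal sketch, here is what property checks might look like for a summarization use case; the `checkProperties` helper and the specific properties are assumptions for illustration, not a standard API:

```javascript
// Hypothetical behavioral checks for a summarization output.
// `output` stands in for whatever your system returned this run.
const output = "Summary: revenue grew 12% year over year.";

function checkProperties(text) {
  return {
    mentionsKeyFact: text.includes("12%"),        // required content is present
    withinLengthBudget: text.length < 500,        // no runaway generations
    noForbiddenPhrases: !/as an ai/i.test(text),  // boilerplate we never want
  };
}

const props = checkProperties(output);
console.log(props);
```

Each property tolerates wording variation while still failing on genuinely wrong outputs.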

Evaluation Sets

Build a curated set of inputs with expected outputs. Not exact matches, but graded examples.

For a customer support bot, that means pairing realistic questions with checks on the answer rather than exact transcripts.
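A sketch of what such graded examples might look like; the `mustMention` / `mustNotMention` shape is an assumption, not a standard format:

```javascript
// Hypothetical graded eval cases for a customer support bot.
const evalSet = [
  {
    input: "How do I get a refund?",
    mustMention: ["refund"],
    mustNotMention: ["I cannot help"],
  },
  {
    input: "My package arrived damaged.",
    mustMention: ["replacement"],
    mustNotMention: [],
  },
  {
    input: "What is your return window?",
    mustMention: ["30 days"], // whatever your actual policy says
    mustNotMention: [],
  },
];
console.log(evalSet.length);
```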

Run the eval set nightly. Track pass rate over time. A drop from 95% to 90% is a signal, even if your CI is green.
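The nightly run can be sketched as a simple loop that computes a pass rate; `runBot` is a stand-in for your real system, stubbed here so the example runs offline:

```javascript
// Sketch of a nightly eval runner. In production, runBot would call
// your actual agentic system instead of returning a canned string.
async function runBot(input) {
  return "We can issue a refund within our 30 days return window.";
}

const cases = [
  { input: "How do I get a refund?", mustMention: ["refund"] },
  { input: "What is your return window?", mustMention: ["30 days"] },
];

async function runEvals() {
  let passed = 0;
  for (const c of cases) {
    const output = (await runBot(c.input)).toLowerCase();
    if (c.mustMention.every((t) => output.includes(t.toLowerCase()))) passed++;
  }
  return passed / cases.length; // the pass rate you track over time
}

runEvals().then((rate) => console.log(`pass rate: ${rate}`));
```

Persisting that single number per night is enough to spot drift before users do.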

LLM-as-Judge

Use one LLM to evaluate another's output. Sounds circular. Works surprisingly well.

const grade = await judge({
  question: originalQuestion,
  answer: systemOutput,
  criteria: "Is factually accurate and helpful"
}); // returns 1-5

The judge should be a different (often larger) model than your production model. Calibrate it against human judgments initially.

This lets you test at scale without manually reviewing every output.
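A minimal sketch of what the `judge` helper above might look like; the prompt wording and score parsing are assumptions, and the model call is stubbed so the sketch runs offline (swap in your provider's SDK):

```javascript
// Stub for the judge model call. In production this would call a
// different (often larger) model than the one under test.
async function callJudgeModel(prompt) {
  return "4";
}

async function judge({ question, answer, criteria }) {
  const prompt =
    `Question: ${question}\nAnswer: ${answer}\n` +
    `Rate 1-5 on this criterion: ${criteria}. Reply with only the number.`;
  const raw = await callJudgeModel(prompt);
  const score = parseInt(raw.trim(), 10);
  // Judges sometimes reply with prose; fail loudly instead of mis-grading.
  if (!(score >= 1 && score <= 5)) throw new Error(`unparseable grade: ${raw}`);
  return score;
}

judge({
  question: "What is 2 + 2?",
  answer: "4",
  criteria: "Is factually accurate and helpful",
}).then((g) => console.log(g));
```

The parse-and-validate step matters: an unparseable grade should be an error, not a silent zero.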

Regression Testing

When something breaks in production, add it to your test suite. Specifically:

  1. Capture the input that caused the failure
  2. Define what a correct output looks like
  3. Add to your eval set
  4. Verify future changes don't regress
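The steps above can be sketched as a small data transformation; the `failureLog` and case shapes here are hypothetical, chosen to match the eval-set format:

```javascript
// Step 1: what you captured when production misbehaved (hypothetical shape).
const failureLog = {
  input: "Cancel my subscription but keep my data",
  badOutput: "Your account and all data have been deleted.",
};

// Steps 2-3: define correctness as properties and add it to the eval set.
const regressionCase = {
  input: failureLog.input,
  mustMention: ["subscription"],
  mustNotMention: ["deleted"], // the exact behavior that caused the incident
};

const evalSet = [];
evalSet.push(regressionCase); // step 4: every future eval run guards this
console.log(evalSet.length);
```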

Over time, your eval set becomes a comprehensive test of failure modes you've encountered.

Testing Non-LLM Components

Your agentic system isn't just LLM calls. It has routing logic, validation, data transformation, tool integrations. Test these traditionally.

Mock the LLM responses and test everything around them.
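A minimal sketch of this pattern, assuming a hypothetical `routeRequest` function that wraps the model; only the mock shape matters here:

```javascript
// Deterministic routing logic around the model: this is what we test.
// `routeRequest` and its signature are assumptions about your system.
function routeRequest(input, llm) {
  if (/refund|return/i.test(input)) return { tool: "billing", reply: llm(input) };
  return { tool: "general", reply: llm(input) };
}

// Mock LLM: canned response, no network, fully deterministic.
const mockLlm = () => "canned response";

const result = routeRequest("I want a refund", mockLlm);
console.log(result.tool); // the routing decision is testable the traditional way
```

The assertion targets the routing decision, not the model text, so the test never flakes.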

This gives you confidence in the deterministic parts while accepting uncertainty in the LLM parts.

When To Test What

Different tests belong at different points in the pipeline; match each test's cost to how often it runs.

Every PR: Unit tests for deterministic components. Fast, cheap.

Nightly: Full eval set against a staging environment. Takes longer, catches drift.

Before major releases: Human review of sampled outputs. Slow but catches subtle issues.

Continuously in production: Sample logging with quality scores. Alerts on degradation.
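The production piece can be sketched as a simple sampler; the sample rate, alert threshold, and rolling-window shape are all assumptions to adapt:

```javascript
// Score a random slice of live outputs and alert when quality degrades.
const SAMPLE_RATE = 0.05;   // fraction of traffic to score
const ALERT_BELOW = 3.5;    // on the judge's 1-5 scale

function maybeSample(score, window, rng = Math.random) {
  if (rng() < SAMPLE_RATE) window.push(score);
  const avg = window.reduce((a, b) => a + b, 0) / (window.length || 1);
  return { avg, alert: window.length > 0 && avg < ALERT_BELOW };
}

const window = [];
const out = maybeSample(4, window, () => 0); // rng forced low so it samples
console.log(out);
```

In practice the window would be bounded (say, the last 24 hours) and the alert wired to your paging system.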

The Honest Truth

You can't achieve 100% confidence in agentic systems. You can achieve "confident enough to ship" with good observability to catch issues fast.

Test what you can. Monitor what you can't. Iterate.

Building an agentic system and worried about testing? Let's talk strategy.
