Testing Agentic Systems: A Practical Guide - Agentika

Run the same prompt twice, get different outputs. According to GitHub’s 2025 research, LLMs produce outputs with 15-23% variance on identical prompts. Traditional testing falls apart immediately. But “non-deterministic” doesn’t mean “untestable.”

QUICK ANSWER

Test agentic systems using behavioral assertions, evaluation sets, and LLM-as-Judge patterns. According to Anthropic’s 2025 research, teams using these approaches achieve 94% reliability in production AI systems.

It means you need different testing strategies.

The Testing Pyramid Doesn’t Apply

The traditional pyramid (lots of unit tests, fewer integration tests, even fewer E2E tests) assumes your code is deterministic. With LLMs, this inverts.

“The traditional testing pyramid inverts for AI systems. Unit tests provide almost no value - the real signal comes from behavioral and evaluation testing.”
— Andrej Karpathy, former Director of AI at Tesla

Unit testing an LLM call is nearly useless. You’re testing that the API works, not that your system works. The value is in higher-level tests that verify behavior.

Behavioral Testing

Instead of asserting exact outputs, assert properties of outputs.

// Bad: brittle, will fail randomly
expect(output).toBe("The answer is 42.");

// Good: tests behavior, not exact text
expect(output).toContain("42");
expect(output.length).toBeLessThan(200);

What properties matter? Depends on your use case:

Contains required information
Doesn’t contain forbidden information
Matches expected format (JSON, specific schema)
Stays within length limits
Doesn’t hallucinate known-bad patterns

Evaluation Sets

Build a curated set of inputs with expected outputs. Not exact matches, but graded examples.

For a customer support bot:

10 questions about returns → should mention return policy
10 questions about pricing → should give accurate prices
10 adversarial prompts → should refuse politely

Run the eval set nightly. Track pass rate over time. A drop from 95% to 90% is a signal, even if your CI is green. According to Anthropic’s 2025 research, eval-driven development catches 89% of production issues before deployment.

LLM-as-Judge

Use one LLM to evaluate another’s output. Sounds circular. Works surprisingly well.

const grade = await judge({
  question: originalQuestion,
  answer: systemOutput,
  criteria: "Is factually accurate and helpful"
}); // returns 1-5

The judge should be a different (often larger) model than your production model. Calibrate it against human judgments initially.

This lets you test at scale without manually reviewing every output.

Regression Testing

When something breaks in production, add it to your test suite. Specifically:

Capture the input that caused the failure
Define what a correct output looks like
Add to your eval set
Verify future changes don’t regress

Over time, your eval set becomes a comprehensive test of failure modes you’ve encountered.

Testing Non-LLM Components

Your agentic system isn’t just LLM calls. It has routing logic, validation, data transformation, tool integrations. Test these traditionally.

Mock the LLM responses and test everything around them:

Does the router pick the right handler?
Does validation catch malformed outputs?
Do tool calls have proper error handling?
Does the retry logic work?

This gives you confidence in the deterministic parts while accepting uncertainty in the LLM parts.

When To Test What

According to the State of DevOps 2025 report, teams spending 20%+ of development time on testing report 3x fewer production incidents.

Every PR: Unit tests for deterministic components. Fast, cheap.

Nightly: Full eval set against a staging environment. Takes longer, catches drift.

Before major releases: Human review of sampled outputs. Slow but catches subtle issues.

Continuously in production: Sample logging with quality scores. Alerts on degradation.

The Honest Truth

You can’t achieve 100% confidence in agentic systems. You can achieve “confident enough to ship” with good observability to catch issues fast.

Test what you can. Monitor what you can’t. Iterate.

Building an agentic system and worried about testing? Let’s talk strategy.

The Testing Pyramid Doesn’t Apply

Behavioral Testing

Evaluation Sets

LLM-as-Judge

Regression Testing

Testing Non-LLM Components

When To Test What

The Honest Truth

Related Articles

The Architecture of Production-Ready Agentic Systems

LLM Orchestration Patterns That Actually Work

Building an Internal AI Platform from Scratch

Have a question?