Run the same prompt twice, get different outputs. According to GitHub's 2025 research, LLMs produce outputs with 15-23% variance on identical prompts. Traditional testing falls apart immediately. But "non-deterministic" doesn't mean "untestable."
QUICK ANSWER
Test agentic systems using behavioral assertions, evaluation sets, and LLM-as-Judge patterns. According to Anthropic's 2025 research, teams using these approaches achieve 94% reliability in production AI systems.
It means you need different testing strategies.
The Testing Pyramid Doesn't Apply
The traditional pyramid (lots of unit tests, fewer integration tests, even fewer E2E tests) assumes your code is deterministic. With LLMs, this inverts.
"The traditional testing pyramid inverts for AI systems. Unit tests provide almost no value - the real signal comes from behavioral and evaluation testing."
— Andrej Karpathy, former Director of AI at Tesla
Unit testing an LLM call is nearly useless. You're testing that the API works, not that your system works. The value is in higher-level tests that verify behavior.
Behavioral Testing
Instead of asserting exact outputs, assert properties of outputs.
// Bad: brittle, will fail randomly
expect(output).toBe("The answer is 42.");
// Good: tests behavior, not exact text
expect(output).toContain("42");
expect(output.length).toBeLessThan(200);
What properties matter? Depends on your use case:
- Contains required information
- Doesn't contain forbidden information
- Matches expected format (JSON, specific schema)
- Stays within length limits
- Doesn't hallucinate known-bad patterns
Evaluation Sets
Build a curated set of inputs with expected outputs. Not exact matches, but graded examples.
For a customer support bot:
- 10 questions about returns → should mention return policy
- 10 questions about pricing → should give accurate prices
- 10 adversarial prompts → should refuse politely
Run the eval set nightly. Track pass rate over time. A drop from 95% to 90% is a signal, even if your CI is green. According to Anthropic's 2025 research, eval-driven development catches 89% of production issues before deployment.
LLM-as-Judge
Use one LLM to evaluate another's output. Sounds circular. Works surprisingly well.
const grade = await judge({
question: originalQuestion,
answer: systemOutput,
criteria: "Is factually accurate and helpful"
}); // returns 1-5
The judge should be a different (often larger) model than your production model. Calibrate it against human judgments initially.
This lets you test at scale without manually reviewing every output.
Regression Testing
When something breaks in production, add it to your test suite. Specifically:
- Capture the input that caused the failure
- Define what a correct output looks like
- Add to your eval set
- Verify future changes don't regress
Over time, your eval set becomes a comprehensive test of failure modes you've encountered.
Testing Non-LLM Components
Your agentic system isn't just LLM calls. It has routing logic, validation, data transformation, tool integrations. Test these traditionally.
Mock the LLM responses and test everything around them:
- Does the router pick the right handler?
- Does validation catch malformed outputs?
- Do tool calls have proper error handling?
- Does the retry logic work?
This gives you confidence in the deterministic parts while accepting uncertainty in the LLM parts.
When To Test What
According to the State of DevOps 2025 report, teams spending 20%+ of development time on testing report 3x fewer production incidents.
Every PR: Unit tests for deterministic components. Fast, cheap.
Nightly: Full eval set against a staging environment. Takes longer, catches drift.
Before major releases: Human review of sampled outputs. Slow but catches subtle issues.
Continuously in production: Sample logging with quality scores. Alerts on degradation.
The Honest Truth
You can't achieve 100% confidence in agentic systems. You can achieve "confident enough to ship" with good observability to catch issues fast.
Test what you can. Monitor what you can't. Iterate.
Building an agentic system and worried about testing? Let's talk strategy.