Using Evals to Test Programmable AI
Why evals are becoming the unit tests of programmable intelligence.
"There's one kind of program that we don't understand even in principle," said David Deutsch, a renowned physicist, referring to AI. If there's no way to understand what's going on inside an AI system, then how can you debug and test the systems that use it? One of those types of systems is programmable AI, which lets you define workflows where each step is executed by AI agents. Well, testing programmable AI is about making sure all the involved AI-driven components behave consistently. To do that, you can write and execute many single tests that evaluate responses and behaviors based on inputs. Those tests are called evals. Keep reading to see how they work.
This article is brought to you with the help of our supporter, Apideck.
Apideck helps you integrate with multiple APIs through a single connection. Instead of building and maintaining integrations one by one, use our Unified APIs to connect with hundreds of leading providers across Accounting, HRIS, CRM, Ecommerce, and more.
All connections are built for secure, real-time data exchange, so you can trust that the data you access is always accurate and protected.
Ship integrations 10x faster, reduce dev costs, and accelerate your product roadmap without the integration bottleneck.
AI systems behave like magic black boxes. At least to our current understanding, there's no way to explain their behavior or why their responses aren't always deterministic. Debugging AI feels like a nightmare. With so many unknown parts, where do you even begin? Trying to test for a correspondence between input and output will fail because the results aren't reproducible and deterministic. Trying to test for specific parts of the AI system is very error-prone and difficult to reproduce because we don't have all the information about those different parts. What is left is making the outputs as deterministic as possible and testing only for those parts that you know should never change. Read on to see how that can be possible.
Enter evals. In 2023, OpenAI publicly released "a framework for evaluating LLMs and LLM systems." What does that even mean? Since it's not 100% accurate to check if there's a match between a certain input and output, you can write functions that do it, but in a more abstract way. You can even use AI to verify if an output is accurate, when its shape can vary considerably. There are, for example, evals that test if a model can solve basic algebra problems, solve 2-dimensional mazes, or even detect sarcasm in a piece of text. Evals evolved as an open-source, community-oriented project to include a registry with more than 400 items, their corresponding datasets, and multiple solvers, including ones that use competing solutions like Anthropic Claude and Google Gemini. Evals got so popular that, in April 2025, OpenAI decided to release it as an API and make it a part of its product offering. The API isn't much more than a CRUD that lets you create, read, and run evals. The power isn't in the API itself but in what it provides. Along with this API, you have a few other options from different vendors, so you don't get locked in.
As you can imagine, evals as a concept exploded in popularity. This explosion led to a number of eval types and their respective classes of tools. There are, for instance, benchmark evals that let you compare answers from different models, security evals that test adversarial inputs, jailbreak attempts, and toxicity, and other custom evals that are task-specific and can test anything. Among the many available tools, I want to mention a few because I feel they represent what the industry is able to offer, following the great introduction OpenAI1 has given us all. The first one worth mentioning is LangSmith2, a SaaS tool that lets teams "debug, test, and monitor AI app performance." Contrary to what its name implies, it can monitor any AI app, not just the ones that use LangChain. Then you have PromptPex3, an open-source solution by Microsoft, focused on generating tests for prompts. Speaking of prompts, the next tool on my list, ChainForge4, lets "evaluate the robustness of prompts and text generation models." It does that nicely using a visual programming UI. Finally, you can cover the security area with Garak5, an LLM vulnerability scanner open-sourced by NVIDIA. There are many more evals tools out there, I'm sure. However, these are the ones I feel you should take a look at.
Even with all these available tools, running evals isn't always a piece of cake. To begin with, you should version all your prompts and datasets so you know what you're running your tests against. Then, remember to run the evals in your CI/CD pipeline so you take advantage of their results to change the status of your builds. If you're using a different model to evaluate the results of evals, then be careful. Make sure the second model can correctly evaluate those test results. Finally, always make sure the results are fully documented so you can quickly compare them across versions to check for improvements (or regressions). While using evals seems like a big step forward, integrating it with all your tools certainly introduces complexity and takes time. Be aware of these details before you make a decision.
Whether or not you think evals are good, programmable AI requires moving from a manual trial-and-error approach to a more systematic and reproducible technique. Altogether, even with the complexity related to integrating evals into your toolchain, I see it as a positive change. In my head, there's no doubt that the future will see evals becoming a default in any AI setup. In the same way testing has become normalized in software and API development, evals will be mandatory going forward.
OpenAI Evals are available at https://github.com/openai/evals
LangSmith is available at https://www.langchain.com/langsmith
PromptPex is available at https://github.com/microsoft/promptpex
ChainForge is available at https://www.chainforge.ai/
Garak is available at https://github.com/NVIDIA/garak