Merged
45 changes: 44 additions & 1 deletion evals/README.md
@@ -3,7 +3,8 @@
Behavioral evaluations (evals) are tests designed to validate the agent's
behavior in response to specific prompts. They serve as a critical feedback loop
for changes to system prompts, tool definitions, and other model-steering
mechanisms.
mechanisms, and as a tool for assessing feature reliability by model and for
preventing regressions.

## Why Behavioral Evals?

@@ -30,6 +31,48 @@ CLI's features.
those that are generally reliable but might occasionally vary
(`USUALLY_PASSES`).

## Best Practices

When designing behavioral evals, aim for scenarios that accurately reflect
real-world usage while remaining small and maintainable.

- **Realistic Complexity**: Evals should be complicated enough to be
  "realistic." They should operate on actual files and a source directory,
  mirroring how a real agent interacts with a workspace. Remember that the agent
  may behave differently in a larger codebase, so we want to avoid scenarios
  that are too simple to be realistic.
  - _Good_: An eval that provides a small, functional React component and asks
    the agent to add a specific feature, requiring it to read the file,
    understand the context, and write the correct changes.
  - _Bad_: An eval that simply asks the agent a trivia question or asks it to
    write a generic script without providing any local workspace context.
- **Maintainable Size**: Evals should be small enough to reason about and
  maintain. We probably can't check in an entire repo as a test case, though
  over time we will want these evals to mature into more and more realistic
  scenarios.
  - _Good_: A test setup with 2-3 files (e.g., a source file, a config file, and
    a test file) that isolates the specific behavior being evaluated.
  - _Bad_: A test setup containing dozens of files from a complex framework
    where the setup logic itself is prone to breaking.
- **Unambiguous and Reliable Assertions**: Assertions must be clear and specific
  to ensure the test passes for the right reason.
  - _Good_: Checking that a modified file contains a specific AST node or exact
    string, or verifying that a tool was called with the right parameters.
  - _Bad_: Checking only that a tool was called, which could happen for an
    unrelated reason, or expecting specific LLM output.
- **Fail First**: Make sure the eval fails before your prompt or tool change
  and passes after it. It's easy to accidentally create a passing test that
  asserts behaviors we get for free. In general, every eval should be
  accompanied by a prompt change, and most prompt changes should be accompanied
  by an eval.
  - _Good_: Observing a failure, writing an eval that reliably reproduces the
    failure, modifying the prompt/tool, and then verifying the eval passes.
  - _Bad_: Writing an eval that passes on the first run and assuming your new
    prompt change was responsible.
- **Less is More**: Prefer fewer, more realistic tests that assert the major
  paths over many tests that are more unit-test-like. These are evals, so the
  value is in testing how the agent works in a semi-realistic scenario.
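
To illustrate the point about unambiguous assertions, here is a minimal sketch
contrasting a weak check (the tool was called at all) with a stronger one (the
tool was called with the expected parameters). The `ToolCall` shape, the helper
names, and the `write_file` tool name are hypothetical and only serve the
illustration; they are not the eval harness's actual API.

```typescript
// Hypothetical shape of a recorded tool call; the real harness may differ.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Weak check: passes whenever the tool was called at all, for any reason.
function wasToolCalled(calls: ToolCall[], name: string): boolean {
  return calls.some((call) => call.name === name);
}

// Stronger check: the tool was called and every expected argument matches.
// Values are compared by their JSON serialization, which is enough for
// simple string and number arguments.
function wasToolCalledWith(
  calls: ToolCall[],
  name: string,
  expected: Record<string, unknown>,
): boolean {
  return calls.some(
    (call) =>
      call.name === name &&
      Object.entries(expected).every(
        ([key, value]) =>
          JSON.stringify(call.args[key]) === JSON.stringify(value),
      ),
  );
}

// Example log: the agent wrote a file, but not the one the eval cares about.
const recordedCalls: ToolCall[] = [
  { name: "write_file", args: { path: "notes.txt", content: "scratch" } },
];

// The weak check is satisfied by the unrelated write.
const weakResult = wasToolCalled(recordedCalls, "write_file");

// The stronger check is not, because the path does not match.
const strongResult = wasToolCalledWith(recordedCalls, "write_file", {
  path: "src/Button.tsx",
});
```

In this sketch the weak check passes for an unrelated reason while the stronger
check fails, which is the situation the _Bad_ example warns about.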

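The "Fail First" workflow can be sketched as a small check over pass/fail
results collected before and after a prompt or tool change. This is an
illustrative helper under stated assumptions, not part of the eval framework:
the `judgeFailFirst` name, the verdict labels, and the strict "every run"
thresholds are all hypothetical.

```typescript
// Pass/fail outcomes of repeated runs of one eval (true = passed).
type RunResults = boolean[];

type Verdict =
  | "demonstrates-fix" // failed reliably before, passes after
  | "passed-for-free" // never failed before the change
  | "flaky-repro" // failed only sometimes before the change
  | "not-fixed"; // failed reliably before, still fails after

// Assumption: "reliably reproduces" means every run before the change failed,
// and "fixed" means every run after the change passed.
function judgeFailFirst(before: RunResults, after: RunResults): Verdict {
  if (before.length === 0 || after.length === 0) {
    throw new Error("need at least one run before and after the change");
  }
  const failedBefore = before.filter((passed) => !passed).length;
  if (failedBefore === 0) return "passed-for-free";
  if (failedBefore < before.length) return "flaky-repro";
  return after.every((passed) => passed) ? "demonstrates-fix" : "not-fixed";
}

// An eval that failed three times before the change and passed three times after.
const goodEval = judgeFailFirst([false, false, false], [true, true, true]);

// An eval that already passed before the change tells us nothing about it.
const freeEval = judgeFailFirst([true, true, true], [true, true, true]);
```

Here `goodEval` is `"demonstrates-fix"` and `freeEval` is `"passed-for-free"`;
the second case corresponds to the _Bad_ example of an eval that passes on the
first run.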
## Creating an Evaluation

Evaluations are located in the `evals` directory. Each evaluation is a Vitest