A benchmarking system for evaluating OpenClaw agents across various tasks.
PinchBench loads task definitions from the `tasks/` directory and provides a framework for creating and benchmarking OpenClaw agents. Each task includes:
- Task metadata (ID, name, category, timeout)
- User prompt
- Expected behavior description
- Grading criteria
- Automated grading functions (where applicable)
- LLM judge rubrics (where applicable)
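A loaded task maps naturally onto a small structured object; a minimal sketch, with field names assumed from the metadata list above (the actual `Task` class may differ):

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One benchmark task parsed from a markdown file (field names assumed)."""
    id: str
    name: str
    category: str
    grading_type: str
    timeout_seconds: int
    prompt: str = ""
    expected_behavior: str = ""
    grading_criteria: list = field(default_factory=list)
```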
Run the benchmark script using uv (no virtual environment setup needed):
```shell
uv run benchmark.py
```

This will:
- Load all tasks from the `tasks/` directory
- Display a summary of loaded tasks
- Run the configured agent across the selected tasks and emit results
The TaskLoader class handles:
- Reading task markdown files
- Parsing YAML frontmatter
- Extracting task sections (Prompt, Expected Behavior, Grading Criteria, etc.)
- Creating structured `Task` objects
The OpenClawAgent class provides:
- Agent initialization with configuration
- Task execution interface
- Result tracking structure
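The agent interface described above can be sketched roughly like this; the method and attribute names are assumptions, and the body is a stub rather than the real OpenClaw integration:

```python
class OpenClawAgent:
    """Minimal sketch of the agent interface (names assumed, body stubbed)."""

    def __init__(self, config: dict):
        self.config = config
        self.results = []  # per-task result tracking

    def execute_task(self, task) -> dict:
        # A real implementation would drive the OpenClaw agent here;
        # this stub just records a placeholder outcome.
        outcome = {"task_id": task.id, "output": "", "execution_time": 0.0}
        self.results.append(outcome)
        return outcome
```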
The BenchmarkRunner class orchestrates:
- Task loading and management
- Agent creation
- Benchmark execution across tasks
- Result aggregation
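Putting the pieces together, the orchestration loop might look like the following sketch (method names assumed, not the actual `BenchmarkRunner` API):

```python
class BenchmarkRunner:
    """Minimal sketch of the benchmark orchestration loop (names assumed)."""

    def __init__(self, loader, agent):
        self.loader = loader
        self.agent = agent

    def run(self) -> list:
        results = []
        for task in self.loader.load_tasks():
            outcome = self.agent.execute_task(task)  # run the agent on one task
            results.append({"task_id": task.id, **outcome})
        return results
```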
Tasks are defined in markdown files with YAML frontmatter:
---
id: task_01_example
name: Example Task
category: example
grading_type: automated
timeout_seconds: 120
workspace_files: []
---
## Prompt
[User-facing task prompt]
## Expected Behavior
[Description of expected agent behavior]
## Grading Criteria
- [ ] Criterion 1
- [ ] Criterion 2
## Automated Checks
```python
def grade(transcript: list, workspace_path: str) -> dict:
    # Grading logic
    return scores
```

The system includes 10 benchmark tasks:
- task_01_calendar - Calendar Event Creation
- task_02_stock - Stock Price Research
- task_03_blog - Blog Post Writing
- task_04_weather - Weather Script Creation
- task_05_summary - Document Summarization
- task_06_events - Tech Conference Research
- task_07_email - Professional Email Drafting
- task_08_memory - Memory Retrieval from Context
- task_09_files - File Structure Creation
- task_10_workflow - Multi-step API Workflow
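As a concrete illustration of the automated-check style, a grader for a calendar-style task might look like the following. The transcript shape, file name, and score keys here are hypothetical, not the benchmark's actual contract:

```python
import os

def grade(transcript: list, workspace_path: str) -> dict:
    """Illustrative grader: did the agent mention a keyword and write a file?"""
    text = " ".join(msg.get("content", "") for msg in transcript)
    mentioned = "calendar" in text.lower()                                   # criterion 1
    wrote_file = os.path.exists(os.path.join(workspace_path, "event.ics"))   # criterion 2
    return {
        "passed": mentioned and wrote_file,
        "score": (int(mentioned) + int(wrote_file)) / 2,
        "details": {"mentioned_keyword": mentioned, "wrote_file": wrote_file},
    }
```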
The script uses Python's built-in logging with:
- Console output (INFO level)
- File output to `benchmark.log`
- Structured log messages for debugging
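The logging setup described above can be reproduced with Python's standard `logging` module; a sketch, with the logger name, format string, and parametrized path being assumptions:

```python
import logging

def setup_logging(log_path: str = "benchmark.log") -> logging.Logger:
    """Console output at INFO plus a log file, as described above."""
    logger = logging.getLogger("pinchbench")  # logger name assumed
    logger.setLevel(logging.DEBUG)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    console = logging.StreamHandler()
    console.setLevel(logging.INFO)           # console shows INFO and above
    file_handler = logging.FileHandler(log_path)  # file captures everything
    for handler in (console, file_handler):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```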
- Python >= 3.10
- PyYAML >= 6.0.1
Dependencies are automatically managed by uv using inline script metadata.
To extend the system:
- Add new tasks: Create markdown files in `tasks/` following the template
- Customize agent execution: Adjust the `execute_task` method in `OpenClawAgent`
- Tune grading: Update grading logic and rubrics in task definitions
- Report results: Add downstream reporting for your benchmarks
Assuming you have a benchmark results JSON file, here are some helpful jq snippets:
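These snippets assume a results file shaped roughly like the following (field names inferred from the jq paths; values and exact schema are illustrative):

```json
{
  "tasks": [
    {
      "task_id": "task_01_calendar",
      "execution_time": 12.4,
      "grading": {"passed": true, "score": 0.9}
    }
  ]
}
```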
- List all task scores:
  `jq '.tasks[] | {task_id, score: .grading.score}' file.json`
- Show per-task pass/fail and score:
  `jq '.tasks[] | {task_id, passed: .grading.passed, score: .grading.score}' file.json`
- Sort tasks by score (ascending):
  `jq '.tasks | sort_by(.grading.score)[] | {task_id, score: .grading.score}' file.json`
- Sort tasks by execution time (ascending):
  `jq '.tasks | sort_by(.execution_time)[] | {task_id, execution_time: .execution_time}' file.json`
- Aggregate average score across tasks:
  `jq '{average_score: ([.tasks[].grading.score] | add / length)}' file.json`
- Filter to failed tasks only:
  `jq '.tasks[] | select(.grading.passed == false) | {task_id, score: .grading.score}' file.json`

See project license file.
