A benchmarking system for evaluating OpenClaw agents across various tasks.
PinchBench loads task definitions from the `tasks/` directory and provides a framework for creating and benchmarking OpenClaw agents. Each task includes:
- Task metadata (ID, name, category, timeout)
- User prompt
- Expected behavior description
- Grading criteria
- Automated grading functions (where applicable)
- LLM judge rubrics (where applicable)
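A loaded task maps naturally onto a small structured object; a minimal sketch, with field names assumed from the metadata list above (the actual `Task` class may differ):

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One benchmark task parsed from a markdown file (field names assumed)."""
    id: str
    name: str
    category: str
    grading_type: str
    timeout_seconds: int
    prompt: str = ""
    expected_behavior: str = ""
    grading_criteria: list = field(default_factory=list)
```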
Run the benchmark script using uv (no virtual environment setup needed):
```shell
uv run benchmark.py
```

This will:
- Load all tasks from the `tasks/` directory
- Display a summary of loaded tasks
- Run the configured agent across the selected tasks and emit results
The TaskLoader class handles:
- Reading task markdown files
- Parsing YAML frontmatter
- Extracting task sections (Prompt, Expected Behavior, Grading Criteria, etc.)
- Creating structured `Task` objects
The OpenClawAgent class provides:
- Agent initialization with configuration
- Task execution interface
- Result tracking structure
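The agent interface described above can be sketched roughly like this; the method and attribute names are assumptions, and the body is a stub rather than the real OpenClaw integration:

```python
class OpenClawAgent:
    """Minimal sketch of the agent interface (names assumed, body stubbed)."""

    def __init__(self, config: dict):
        self.config = config
        self.results = []  # per-task result tracking

    def execute_task(self, task) -> dict:
        # A real implementation would drive the OpenClaw agent here;
        # this stub just records a placeholder outcome.
        outcome = {"task_id": task.id, "output": "", "execution_time": 0.0}
        self.results.append(outcome)
        return outcome
```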
The BenchmarkRunner class orchestrates:
- Task loading and management
- Agent creation
- Benchmark execution across tasks
- Result aggregation
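Putting the pieces together, the orchestration loop might look like the following sketch (method names assumed, not the actual `BenchmarkRunner` API):

```python
class BenchmarkRunner:
    """Minimal sketch of the benchmark orchestration loop (names assumed)."""

    def __init__(self, loader, agent):
        self.loader = loader
        self.agent = agent

    def run(self) -> list:
        results = []
        for task in self.loader.load_tasks():
            outcome = self.agent.execute_task(task)  # run the agent on one task
            results.append({"task_id": task.id, **outcome})
        return results
```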
Tasks are defined in markdown files with YAML frontmatter:
---
id: task_01_example
name: Example Task
category: example
grading_type: automated
timeout_seconds: 120
workspace_files: []
---
## Prompt
[User-facing task prompt]
## Expected Behavior
[Description of expected agent behavior]
## Grading Criteria
- [ ] Criterion 1
- [ ] Criterion 2
## Automated Checks
```python
def grade(transcript: list, workspace_path: str) -> dict:
    # Grading logic
    return scores
```

The system includes 10 benchmark tasks:
- task_01_calendar - Calendar Event Creation
- task_02_stock - Stock Price Research
- task_03_blog - Blog Post Writing
- task_04_weather - Weather Script Creation
- task_05_summary - Document Summarization
- task_06_events - Tech Conference Research
- task_07_email - Professional Email Drafting
- task_08_memory - Memory Retrieval from Context
- task_09_files - File Structure Creation
- task_10_workflow - Multi-step API Workflow
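As a concrete illustration of the automated-check style, a grader for a calendar-style task might look like the following. The transcript shape, file name, and score keys here are hypothetical, not the benchmark's actual contract:

```python
import os

def grade(transcript: list, workspace_path: str) -> dict:
    """Illustrative grader: did the agent mention a keyword and write a file?"""
    text = " ".join(msg.get("content", "") for msg in transcript)
    mentioned = "calendar" in text.lower()                                   # criterion 1
    wrote_file = os.path.exists(os.path.join(workspace_path, "event.ics"))   # criterion 2
    return {
        "passed": mentioned and wrote_file,
        "score": (int(mentioned) + int(wrote_file)) / 2,
        "details": {"mentioned_keyword": mentioned, "wrote_file": wrote_file},
    }
```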
The script uses Python's built-in logging with:
- Console output (INFO level)
- File output to `benchmark.log`
- Structured log messages for debugging
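The logging setup described above can be reproduced with Python's standard `logging` module; a sketch, with the logger name, format string, and parametrized path being assumptions:

```python
import logging

def setup_logging(log_path: str = "benchmark.log") -> logging.Logger:
    """Console output at INFO plus a log file, as described above."""
    logger = logging.getLogger("pinchbench")  # logger name assumed
    logger.setLevel(logging.DEBUG)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    console = logging.StreamHandler()
    console.setLevel(logging.INFO)           # console shows INFO and above
    file_handler = logging.FileHandler(log_path)  # file captures everything
    for handler in (console, file_handler):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```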
- Python >= 3.10
- PyYAML >= 6.0.1
Dependencies are automatically managed by uv using inline script metadata.
To extend the system:
- Add new tasks: Create markdown files in `tasks/` following the template
- Customize agent execution: Adjust the `execute_task` method in `OpenClawAgent`
- Tune grading: Update grading logic and rubrics in task definitions
- Report results: Add downstream reporting for your benchmarks
Assuming you have a benchmark results JSON file, here are some helpful jq snippets:
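These snippets assume a results file shaped roughly like the following (field names inferred from the jq paths; values and exact schema are illustrative):

```json
{
  "tasks": [
    {
      "task_id": "task_01_calendar",
      "execution_time": 12.4,
      "grading": {"passed": true, "score": 0.9}
    }
  ]
}
```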
- List all task scores:
  `jq '.tasks[] | {task_id, score: .grading.score}' file.json`
- Show per-task pass/fail and score:
  `jq '.tasks[] | {task_id, passed: .grading.passed, score: .grading.score}' file.json`
- Sort tasks by score (ascending):
  `jq '.tasks | sort_by(.grading.score)[] | {task_id, score: .grading.score}' file.json`
- Sort tasks by execution time (ascending):
  `jq '.tasks | sort_by(.execution_time)[] | {task_id, execution_time: .execution_time}' file.json`
- Aggregate average score across tasks:
  `jq '{average_score: ([.tasks[].grading.score] | add / length)}' file.json`
- Filter to failed tasks only:
  `jq '.tasks[] | select(.grading.passed == false) | {task_id, score: .grading.score}' file.json`

See project license file.
