PinchBench - OpenClaw Agent Benchmarking System

A benchmarking system for evaluating OpenClaw agents across various tasks.

PinchBench terminal output

Overview

PinchBench loads task definitions from the tasks/ directory and provides a framework for creating and benchmarking OpenClaw agents. Each task includes:

  • Task metadata (ID, name, category, timeout)
  • User prompt
  • Expected behavior description
  • Grading criteria
  • Automated grading functions (where applicable)
  • LLM judge rubrics (where applicable)

Quick Start

Run the benchmark script using uv (no virtual environment setup needed):

uv run benchmark.py

This will:

  1. Load all tasks from the tasks/ directory
  2. Display a summary of loaded tasks
  3. Run the configured agent across the selected tasks and emit results

Script Features

Task Loading

The TaskLoader class handles:

  • Reading task markdown files
  • Parsing YAML frontmatter
  • Extracting task sections (Prompt, Expected Behavior, Grading Criteria, etc.)
  • Creating structured Task objects
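The parsing steps above can be sketched roughly as follows. This is an assumed, simplified version of what `TaskLoader` does (the function name `parse_task_file` is illustrative), using PyYAML for the frontmatter:

```python
# Minimal frontmatter-parsing sketch: split the "---"-delimited YAML
# header from the markdown body, then collect "## "-headed sections.
# Assumed behavior of TaskLoader, not the actual implementation.
import yaml


def parse_task_file(text: str) -> tuple[dict, dict]:
    # Everything between the first two "---" markers is YAML metadata.
    _, frontmatter, body = text.split("---", 2)
    meta = yaml.safe_load(frontmatter)

    # Group the remaining lines under their "## Section" headings.
    sections: dict[str, str] = {}
    current = None
    for line in body.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return meta, {k: v.strip() for k, v in sections.items()}
```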

Agent Runtime

The OpenClawAgent class provides:

  • Agent initialization with configuration
  • Task execution interface
  • Result tracking structure
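A skeleton of that interface might look like this. Only the class name `OpenClawAgent` and the `execute_task` method come from this README; the constructor signature and result fields are assumptions:

```python
# Sketch of the agent runtime interface described above. The result
# dict's keys (task_id, transcript, execution_time) are assumptions.
import time


class OpenClawAgent:
    def __init__(self, config: dict):
        self.config = config

    def execute_task(self, task) -> dict:
        start = time.monotonic()
        transcript: list = []  # messages/tool calls collected during the run
        # ... drive the agent loop against task.prompt here ...
        return {
            "task_id": task.id,
            "transcript": transcript,
            "execution_time": time.monotonic() - start,
        }
```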

Benchmark Runner

The BenchmarkRunner class orchestrates:

  • Task loading and management
  • Agent creation
  • Benchmark execution across tasks
  • Result aggregation
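Put together, the orchestration loop reduces to something like the sketch below. Method names (`load_all`, `run`) are illustrative; only the class names appear in this README:

```python
# Sketch of BenchmarkRunner's orchestration: load every task, run the
# agent on each, and collect the per-task result dicts.
class BenchmarkRunner:
    def __init__(self, loader, agent):
        self.loader = loader
        self.agent = agent

    def run(self) -> list[dict]:
        results = []
        for task in self.loader.load_all():
            results.append(self.agent.execute_task(task))
        return results
```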

Task Structure

Tasks are defined in markdown files with YAML frontmatter:

---
id: task_01_example
name: Example Task
category: example
grading_type: automated
timeout_seconds: 120
workspace_files: []
---

## Prompt

[User-facing task prompt]

## Expected Behavior

[Description of expected agent behavior]

## Grading Criteria

- [ ] Criterion 1
- [ ] Criterion 2

## Automated Checks

```python
def grade(transcript: list, workspace_path: str) -> dict:
    scores: dict = {}
    # Grading logic populates scores here
    return scores
```

Current Tasks

The system includes 10 benchmark tasks:

  1. task_01_calendar - Calendar Event Creation
  2. task_02_stock - Stock Price Research
  3. task_03_blog - Blog Post Writing
  4. task_04_weather - Weather Script Creation
  5. task_05_summary - Document Summarization
  6. task_06_events - Tech Conference Research
  7. task_07_email - Professional Email Drafting
  8. task_08_memory - Memory Retrieval from Context
  9. task_09_files - File Structure Creation
  10. task_10_workflow - Multi-step API Workflow

Logging

The script uses Python's built-in logging with:

  • Console output (INFO level)
  • File output to benchmark.log
  • Structured log messages for debugging

Dependencies

  • Python >= 3.10
  • PyYAML >= 6.0.1

Dependencies are automatically managed by uv using inline script metadata.
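Inline script metadata is the PEP 723 comment block embedded at the top of the script, which `uv run` reads to resolve dependencies. Given the versions listed above, it would look like:

```python
# PEP 723 inline script metadata block at the top of benchmark.py.
# uv reads this and installs the dependencies before running the script.
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "pyyaml>=6.0.1",
# ]
# ///
```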

Development

To extend the system:

  1. Add new tasks: Create markdown files in tasks/ following the template
  2. Customize agent execution: Adjust the execute_task method in OpenClawAgent
  3. Tune grading: Update grading logic and rubrics in task definitions
  4. Report results: Add downstream reporting for your benchmarks

Results Exploration (jq)

Assuming you have a benchmark results JSON file, here are some helpful jq snippets:

  • List all task scores:

      jq '.tasks[] | {task_id, score: .grading.score}' file.json

  • Show per-task pass/fail and score:

      jq '.tasks[] | {task_id, passed: .grading.passed, score: .grading.score}' file.json

  • Sort tasks by score (ascending):

      jq '.tasks | sort_by(.grading.score)[] | {task_id, score: .grading.score}' file.json

  • Sort tasks by execution time (ascending):

      jq '.tasks | sort_by(.execution_time)[] | {task_id, execution_time: .execution_time}' file.json

  • Compute the average score across tasks:

      jq '{average_score: ([.tasks[].grading.score] | add / length)}' file.json

  • Filter to failed tasks only:

      jq '.tasks[] | select(.grading.passed == false) | {task_id, score: .grading.score}' file.json

License

See project license file.
