Prototype for generating and evaluating LLM conversations in mental health contexts.
```bash
# Install uv if not already installed
pip install uv

# Set up environment and install dependencies
uv sync
source .venv/bin/activate # Windows: .venv\Scripts\activate

# Configure environment
cp .env.example .env # Add your API keys (ANTHROPIC_API_KEY, OPENAI_API_KEY)
```

Python >= 3.11 is required.
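The scripts read these keys from the environment. A minimal sketch of a startup check using python-dotenv (listed in the tech stack below); the check itself is illustrative, not project code:

```python
# Illustrative startup check: load .env and fail fast if keys are missing.
import os

from dotenv import load_dotenv

load_dotenv()  # reads ANTHROPIC_API_KEY / OPENAI_API_KEY from .env

missing = [key for key in ("ANTHROPIC_API_KEY", "OPENAI_API_KEY") if not os.getenv(key)]
if missing:
    raise SystemExit(f"Missing API keys in .env: {', '.join(missing)}")
```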
- Minimal print statements
- Prototype phase: prioritize clarity over perfection
- Don't overthink implementation
- Don't create example files
- Use the `python3` command explicitly
- Temporary tests: `tmp_tests/` (not committed)
- Main scripts: `generate.py`, `judge.py` at root
- Core modules: implementation in main directory
- Docs: see `docs/` for detailed guides
- Formatting: `uv run ruff format .`
- Linting: `uv run ruff check .`
- Type checking: `uv run pyright` (basic mode)
- Pre-commit: `pre-commit install` (auto-runs checks on commit)
- All configuration in `pyproject.toml`
- 📖 See `docs/pre-commit-hooks.md` for pre-commit documentation
Follow Conventional Commits format:

```
<type>: <description>

[optional body]
```
Types:
- `feat`: New feature or significant enhancement
- `fix`: Bug fix
- `refactor`: Code restructuring without behavior change
- `test`: Adding or updating tests
- `docs`: Documentation changes only
- `chore`: Maintenance tasks (dependencies, config, tooling)
- `style`: Code style/formatting changes only
- `perf`: Performance improvements
Guidelines:
- Keep subject line under 72 characters
- Use imperative mood ("add feature" not "added feature")
- Don't end subject line with a period
- Separate subject from body with blank line
- Focus on why the change was made, not what changed
- Make atomic commits (one logical change per commit)
Examples:

```
feat: add support for GPT-4 model evaluation
fix: handle missing conversation files gracefully
docs: update README with new model options
chore: upgrade langchain to v0.1.0
test: add unit tests for judge scoring logic
```
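For non-trivial changes, an illustrative subject-plus-body message (the body explains why, per the guidelines above):

```
fix: handle missing conversation files gracefully

Partial generation runs can leave incomplete output folders. Skip and
log missing files instead of raising, so one bad run does not abort a
whole evaluation batch.
```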
Use descriptive branch names with type prefixes:

Format: `<type>/<brief-description>`

Types:
- `feat/` - New features
- `fix/` - Bug fixes
- `refactor/` - Code refactoring
- `test/` - Testing infrastructure
- `docs/` - Documentation updates
- `chore/` - Maintenance and tooling
Examples:

```
feat/add-gpt4-support
fix/conversation-file-handling
refactor/cleanup-judge-logic
test/unit-test-infrastructure
docs/update-api-examples
chore/upgrade-dependencies
```

Guidelines:
- Use kebab-case (lowercase with hyphens)
- Keep names concise but descriptive
- Avoid generic names like `fix/bug` or `feat/new-feature`
- Delete branches after merging
- Create branch from main: `git checkout -b type/description`
- Make changes: Follow code style and write tests
- Commit frequently: Make atomic, logical commits
- Run quality checks: Pre-commit hooks run automatically
- Push and create PR: `git push -u origin branch-name`
- Use `/create-commits`: Let Claude Code organize commits logically
Tip: Use the `/create-commits` slash command to analyze changes and create well-organized, logical commits automatically.
- No formal test suite yet (prototype phase)
- For temporary test scripts: use `tmp_tests/`
- When adding permanent tests: use `pytest` with a `tests/` directory, as sketched below
- Run tests: `pytest` (when tests exist)
- Coverage: `pytest --cov` (when needed)
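A minimal sketch of what a permanent test could look like; `parse_score` is a hypothetical stand-in, not an actual function in judge.py:

```python
# tests/test_judge.py -- illustrative only; adapt to judge.py's real API.
import pytest


def parse_score(raw: str) -> int:
    """Hypothetical stand-in for a judge scoring helper."""
    value = int(raw.strip())
    if not 1 <= value <= 5:
        raise ValueError(f"score out of range: {value}")
    return value


def test_parse_score_accepts_valid_input():
    assert parse_score(" 4 ") == 4


def test_parse_score_rejects_out_of_range():
    with pytest.raises(ValueError):
        parse_score("9")
```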
The project uses Claude Code with custom testing commands and agents:
- Slash commands (`.claude/commands/`) - User-facing testing workflows
- test-engineer agent (`.claude/agents/`) - Automated testing in parallel
Maintenance guidelines:
- When testing patterns change (pytest config, fixtures, conventions):
  - Review and update relevant slash commands (`/test`, `/create-tests`, etc.)
  - The agent reads command files directly, so updates auto-propagate
  - Only update the agent if commands are added or removed
- When adding new testing commands:
  - Add them to `.claude/commands/`
  - Update `.claude/commands/README.md` and the main `README.md`
  - If a command contains testing patterns, add a reference to `.claude/agents/test-engineer.md`
Why this matters:
- Agents use slash commands as living documentation (via Read tool)
- Keeping them in sync ensures consistent testing patterns
- Single source of truth prevents duplication and drift
- LLM Framework: LangChain (multi-provider support)
- Supported Providers: Anthropic, OpenAI, Google GenAI
- Data Validation: Pydantic v2
- Data Processing: Pandas
- Config Management: python-dotenv
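A minimal sketch of how these pieces combine for judging; the schema fields and model name are illustrative assumptions, not the project's actual implementation:

```python
# Illustrative: LangChain chat model + Pydantic v2 schema for structured judging.
from dotenv import load_dotenv
from langchain_anthropic import ChatAnthropic
from pydantic import BaseModel, Field

load_dotenv()  # picks up ANTHROPIC_API_KEY from .env


class JudgeScore(BaseModel):
    """Hypothetical scoring schema; the real criteria live in judge.py."""

    empathy: int = Field(ge=1, le=5)
    safety: int = Field(ge=1, le=5)
    rationale: str


llm = ChatAnthropic(model="claude-3-7-sonnet-latest", temperature=0)
judge = llm.with_structured_output(JudgeScore)

result = judge.invoke("Rate this conversation for empathy and safety: ...")
print(result.model_dump())
```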
```bash
# Generate conversations
python3 generate.py -u claude-3-7-sonnet -p claude-3-7-sonnet -t 6 -r 1

# Judge/evaluate conversations
python3 judge.py -f conversations/{YOUR_FOLDER} -j claude-3-7-sonnet

# Development
uv sync            # Install/update dependencies
uv add <package>   # Add new dependency
uv add --dev <pkg> # Add dev dependency

# Code quality
uv run ruff format .       # Format code
uv run ruff check .        # Lint code
uv run pyright             # Type check
pre-commit run --all-files # Run all pre-commit hooks

# Testing (when implemented)
pytest       # Run tests
pytest --cov # Run with coverage
```

- Setup & Architecture: See `README.md`
- Pre-commit Hooks: See `docs/pre-commit-hooks.md`
- Custom LLM Providers: See `docs/evaluating.md`
- Usage Examples: See `README.md` → "Usage" section
- Model Configuration: See `README.md` → "Models" section

```bash
docker-compose up # Run via Docker
```

For detailed information, see `README.md` and `docs/`.