Use this file primarily when operating as a coding agent. Its intent is to capture the stable workflows, tooling, and conventions that keep the project healthy even as internals evolve.
- Before editing, skim README.md, the GitHub wiki/issue thread tied to your task, and recent commits or PRs so you understand the current goals and accepted solutions. Many open bugs already describe reproduction datasets or GNVerifier nuances—start there instead of rediscovering them.
- Use `git log -20 --oneline` (or more if PRs include multiple commits) plus `gh issue list`, `gh pr list --state open`, and `gh pr list --state closed --limit 5` to catch both in-progress and freshly merged work.
- Identify which branch is currently active and what other branches exist locally and remotely.
- When instructions here conflict with new information, trust the current codebase and update AGENTS.md alongside your change. If critical context is still missing, pause and ask the maintainer rather than guessing.
- CLI-first tool for normalizing taxonomy: ingest (Parquet/CSV) → parse/group (`TaxonomicEntry`/`EntryGroupRef`) → plan + run GNVerifier queries → classify via strategy profiles → write resolved & unsolved outputs → optional common-name enrichment.
- Source layout: CLI entry (`src/taxonopy/cli.py`), parsing/grouping/cache (`input_parser`, `entry_grouper`, `cache_manager`), query stack (`query/planner|executor|gnverifier_client`), resolution logic (`resolution/attempt_manager` + profiles), outputs (`output_manager`), tracing (`trace/entry.py`).
- Dependencies (see `pyproject.toml`): Python ≥ 3.10, Polars, Pandas/PyArrow, Pydantic v2, tqdm, requests; dev extras provide Ruff, pytest scaffolding, datamodel-code-generator, pre-commit.
- Create and activate a Python 3.10–3.13 virtual environment. Examples:

  ```shell
  python -m venv .venv && source .venv/bin/activate
  ```

  or using `uv`:

  ```shell
  uv venv
  source .venv/bin/activate
  ```

- Install in editable mode with dev extras:

  ```shell
  pip install -e '.[dev]'
  # or, if you created the env with uv:
  uv pip install -e '.[dev]'
  ```
- The client auto-detects whether a local GNVerifier binary or a container runtime is available, and will try to pull a pinned `gnames/gnverifier:vx.x.x` container if Docker is available.
- Primary command: `taxonopy resolve -i <input_dir_or_file> -o <output_dir> [--output-format parquet|csv]`.
- Defaults to querying Catalogue of Life first (`DATA_SOURCE_PRECEDENCE` in `constants.py`); keep COL as the authoritative source unless directed otherwise.
- Example using the bundled sample:

  ```shell
  taxonopy resolve \
    -i examples/input \
    -o out_test \
    --log-level INFO
  ```

- The CLI counts & groups entries (cached), initializes the GNVerifier client, runs strategy workflows, and emits `.resolved`/`.unsolved` files mirroring the input structure.
- Inspect provenance for a UUID:

  ```shell
  taxonopy trace entry \
    --uuid <uuid> \
    --from-input path/to/input \
    --format text
  ```

- Leverages cached parsing/grouping; add `--verbose` to dump every UUID in the group.
- Requires the GBIF backbone download (~926 MB) if not cached:

  ```shell
  taxonopy common-names \
    --resolved-dir out_test \
    --output-dir out_test_cn
  ```

- Runs `resolve_common_names.py`; expect long runtimes and large temporary files under the configured cache directory.
- Cache default root: `~/.cache/taxonopy`, with command/version/input fingerprints stored as subdirectories (e.g., `resolve_v0.1.0b0_ab12cd34ef56`). `diskcache` manages the store; point `TAXONOPY_CACHE_DIR` (or `--cache-dir`) at the root and let the CLI derive namespaces via `set_cache_namespace`.
- Use CLI flags to inspect or clear the cache: `--show-cache-path`, `--cache-stats`, `--clear-cache`, and `--refresh-cache` (per run, to ignore stale grouping/parsing caches).
- Don't delete cache files manually unless instructed; prefer the flags above.
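The fingerprinted-subdirectory scheme above can be sketched in plain Python. This is illustrative only: the function name, hash choice, and fingerprint length are assumptions, not the actual `set_cache_namespace` implementation.

```python
import hashlib


def derive_namespace(command: str, version: str, input_path: str) -> str:
    """Hypothetical sketch: build a cache subdirectory name shaped like
    'resolve_v0.1.0b0_ab12cd34ef56' from a command name, the package
    version, and a short fingerprint of the input path. The real
    taxonopy code may fingerprint different inputs entirely."""
    fingerprint = hashlib.sha256(input_path.encode()).hexdigest()[:12]
    return f"{command}_v{version}_{fingerprint}"


namespace = derive_namespace("resolve", "0.1.0b0", "examples/input")
```

The point of the scheme is that changing the command, the installed version, or the input data lands you in a fresh namespace instead of silently reusing stale cached groupings.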
- Run `ruff check .` after modifying Python files (requires the `dev` extra).
- Run `pytest` even though the suite is sparse today; it protects future additions and should pass cleanly.
- Validate functional changes by running `taxonopy resolve` against `examples/input` (or issue-specific datasets) and reviewing outputs/logs, plus `taxonopy trace entry ...` when touching parsing/grouping logic.
- Prefer frozen dataclasses (`types/data_classes.py`) for shared structures; mutate via new objects rather than in-place edits.
- Rely on strong typing + Pydantic models for external data (`types/gnverifier.py`); regenerate via the helper script instead of editing generated files.
- Log through the standard logging config (`logging_config.setup_logging`) and keep tqdm progress bars for long-running loops.
- Deterministic hashing (group keys, attempt keys) is intentional—preserve inputs to those hashes when refactoring.
- Respect caching decorators (`@cached` in `cache_manager.py`) and update cache keys/metadata if function signatures change.
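The "mutate via new objects" convention can be sketched with stdlib dataclasses. The field names here are stand-ins for illustration, not the actual schema in `types/data_classes.py`:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Entry:
    """Stand-in for a shared frozen structure; the real fields differ."""
    scientific_name: str
    kingdom: str


original = Entry(scientific_name="Puma concolor", kingdom="Animalia")

# Frozen dataclasses raise FrozenInstanceError on attribute assignment,
# so produce an updated copy instead of editing in place:
updated = replace(original, kingdom="Metazoa")

assert original.kingdom == "Animalia"  # the original object is untouched
assert updated.kingdom == "Metazoa"
```

Because instances are never mutated, they stay safe to share across the grouping/caching layers and to use as inputs to the deterministic hashes mentioned above.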
- Profiles live in `src/taxonopy/resolution/strategy/profiles/`, each exporting a `check_and_resolve` function that inspects a `ResolutionAttempt` and either finalizes it (setting `ResolutionStatus`, `resolved_classification`, `resolution_strategy_name`) or schedules retries through the `ResolutionAttemptManager`.
- They all build atop helpers from `ResolutionStrategy` (`strategy/base.py`) for extracting classifications, canonicalizing kingdoms, and filtering ranks. `ResolutionAttemptManager.CLASSIFICATION_CASES` defines the evaluation order—when adding or modifying a profile, register its `check_and_resolve` there and keep the list ordered from most specific/safe to most permissive fallbacks.
- To debug or extend a profile, run `taxonopy resolve` on a minimal repro dataset, review the `resolution_strategy` column in the output, and/or trace impacted UUIDs with `taxonopy trace entry ... --format json` to inspect attempt chains.
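A minimal sketch of the profile contract described above, using stand-in types. Everything here is hypothetical: the real `ResolutionAttempt`, `ResolutionStatus`, and manager APIs live under `src/taxonopy/resolution/` and have different fields and signatures — this only shows the "finalize or pass" shape a `check_and_resolve` follows.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ResolutionStatus(Enum):
    """Stand-in for the real status enum."""
    PENDING = "pending"
    RESOLVED = "resolved"


@dataclass
class Attempt:
    """Stand-in for ResolutionAttempt with invented fields."""
    canonical_name: str
    match_kingdom: Optional[str] = None
    status: ResolutionStatus = ResolutionStatus.PENDING
    resolution_strategy_name: Optional[str] = None


def check_and_resolve(attempt: Attempt) -> Optional[Attempt]:
    """Illustrative profile: finalize only when a kingdom was matched;
    return None so a more permissive profile can run next."""
    if attempt.match_kingdom is None:
        return None
    attempt.status = ResolutionStatus.RESOLVED
    attempt.resolution_strategy_name = "exact_kingdom_match"  # hypothetical name
    return attempt


resolved = check_and_resolve(Attempt("Puma concolor", match_kingdom="Animalia"))
unresolved = check_and_resolve(Attempt("Puma concolor"))
```

The ordering requirement in `CLASSIFICATION_CASES` exists precisely because profiles decline like this: a strict profile returning `None` hands the attempt to the next, more permissive case in the list.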
- `scripts/generate_gnverifier_types.py` fetches the GNVerifier OpenAPI spec and regenerates the Pydantic models. Run it when the API changes; avoid manual edits to `src/taxonopy/types/gnverifier.py`.
- The `common-names` flow downloads `backbone.zip` into the cache; ensure enough disk space and don't commit extracted TSVs.
- Follow version-control best practices, including but not limited to the following:
  - At the start of a session, ensure that work happens on a relevant branch (not `main`), and pull the latest changes from `main` before starting.
  - Make commit messages imperative, one line, and descriptive of the change's "what" and "why" (not "how"). Any needed description beyond this can go in the extended body.
  - For every commit you produce, append "[AI-assisted session]" as the final line of the extended commit message body.
  - Do not use Git or the GitHub CLI for destructive actions such as `git reset --hard`, `git rebase`, `git push --force`, `git branch -D`, `gh repo delete`, `gh issue delete`, and so on, nor commands like `rm -rf` that delete files or directories. If you consider a destructive command necessary, stop and discuss the situation with a maintainer.
- When modifying CLI behavior, resolution strategies, or caching semantics, update this AGENTS file so future agents follow the latest contract.
- Run `ruff check .`, `pytest`, and the sample `taxonopy resolve` workflow before handing off changes or opening discussions with maintainers.
- Favor clean, well-explained fixes over quick hacks. If a solution benefits from domain guidance (e.g., taxonomy edge cases) or the correct approach is unclear, stop, summarize the blocker, and ask for feedback instead of layering temporary workarounds.
When guidance in this file conflicts with recent activity (i.e. AGENTS.md is out-of-date), trust the current codebase and update AGENTS.md alongside your change.