Pipeline for collecting and merging species metadata from multiple sources. Covers birds, mammals, insects, reptiles, and amphibians. All configuration (locales, taxon groups, API settings) lives in config.yml.
- Setup
- Contributing
- Project Structure
- Taxon Groups
- Pipeline
- Step 1 - AviList
- Step 2 - iNaturalist
- Step 3 - eBird
- Step 4 - Wikidata
- Step 5 - Wikipedia
- Step 6 - Macaulay Library
- Step 7 - Xeno-Canto
- Step 8 - observation.org
- Step 9 - Claude
- Step 10 - Images
- Step 11 - Build
- Web Server
- Data Sources
- License
- Funding
- Partners
git clone https://github.com/birdnet-team/birdnet-taxonomy.git
cd birdnet-taxonomy
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
For Claude translations and Xeno-Canto lookups, add API keys to a .env file:
ANTHROPIC_API_KEY=...
XC_API_KEY=...
For sub-path deployment behind a reverse proxy (e.g. https://example.com/taxonomy/),
add the URL prefix:
ROOT_PATH=/taxonomy
HOST_NAME=https://birdnet.cornell.edu
The web app then generates links under that prefix. It works whether the reverse proxy preserves the prefix or strips it before forwarding.
HOST_NAME is used for absolute image URLs in built metadata and API responses. With the example above, JSON image URLs and CSV image_url values are emitted under https://birdnet.cornell.edu/taxonomy/api/image/....
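As a rough illustration of how the two settings combine, the sketch below joins HOST_NAME and ROOT_PATH into an absolute image URL. The helper name `absolute_image_url` is hypothetical and is not taken from the project code; only the variable names and the /api/image/ route come from this README.

```python
import os
from typing import Optional
from urllib.parse import quote


def absolute_image_url(
    name: str,
    host_name: Optional[str] = None,
    root_path: Optional[str] = None,
) -> str:
    """Join HOST_NAME, ROOT_PATH, and the image route into one absolute URL.

    Hypothetical helper: falls back to the environment when arguments are omitted.
    """
    host = host_name if host_name is not None else os.environ.get("HOST_NAME", "")
    root = root_path if root_path is not None else os.environ.get("ROOT_PATH", "")
    host = host.rstrip("/")
    root = root.strip("/")
    prefix = f"/{root}" if root else ""
    # Percent-encode the species name so spaces survive in the path.
    return f"{host}{prefix}/api/image/{quote(name)}"
```

With the example values above, `absolute_image_url("Turdus merula", "https://birdnet.cornell.edu", "/taxonomy")` yields a URL under https://birdnet.cornell.edu/taxonomy/api/image/.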
See CONTRIBUTING.md for contribution workflow and CODE_OF_CONDUCT.md for community expectations.
Community-maintained manual overrides live in overrides/species_overrides.csv and are applied during python -m build.metadata.
These overrides are persistent, tracked in Git, and always take precedence over fetched image data.
If an override changes the effective image URL or crop anchor, the image pipeline regenerates the same named .webp file in place. Cache freshness is tracked separately in per-image JSON sidecars under dev/images/*/.state/ or dist/images/*/.state/.
Supported columns:
- `scientific_name` — required, exact species name
- `image_url`, `image_author`, `image_license`, `image_source` — optional, but if any one is set then all four are required and replace the fetched image metadata
- `image_crop_anchor` — optional 3x3 crop anchor (1..9), where `5` is center crop, `3` is top-right, `7` is bottom-left, etc.
- `source_url`, `notes` — optional review context for contributors
Build validation is strict. The build fails on duplicate species rows, invalid crop anchors, partial image overrides, or species names not present in the taxonomy.
Current manual override support covers image replacement and manual crop anchoring. The crop anchor bypasses smart crop and uses a fixed 3×3 grid position in the final image pipeline.
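The row-level rules and the anchor numbering can be sketched as follows. This is a minimal illustration, not the project's validator: the function names are hypothetical, the row-major numbering (1 = top-left, 9 = bottom-right) is inferred from the three examples above, and the real crop geometry may differ.

```python
IMAGE_FIELDS = ("image_url", "image_author", "image_license", "image_source")


def validate_override(row: dict) -> list:
    """Return a list of problems for one CSV row (an empty list means valid)."""
    problems = []
    if not (row.get("scientific_name") or "").strip():
        problems.append("scientific_name is required")
    filled = [f for f in IMAGE_FIELDS if (row.get(f) or "").strip()]
    if filled and len(filled) != len(IMAGE_FIELDS):
        problems.append("partial image override")
    anchor = (row.get("image_crop_anchor") or "").strip()
    if anchor and anchor not in {str(n) for n in range(1, 10)}:
        problems.append("invalid crop anchor")
    return problems


def anchor_crop_box(width, height, target_w, target_h, anchor):
    """Largest crop with the target aspect ratio, positioned on a 3x3 grid.

    Assumes anchors are numbered row by row from the top-left corner.
    Returns (left, top, right, bottom) in pixels.
    """
    col, row = (anchor - 1) % 3, (anchor - 1) // 3
    scale = min(width / target_w, height / target_h)
    crop_w, crop_h = round(target_w * scale), round(target_h * scale)
    # col/row of 0, 1, 2 map to start, middle, and end of the free space.
    left = (width - crop_w) * col // 2
    top = (height - crop_h) * row // 2
    return left, top, left + crop_w, top + crop_h
```

Duplicate rows and names missing from the taxonomy need the full file and species list, so they are left out of this per-row sketch.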
config.py # Configuration helpers for config.yml access
config.yml # Project settings: taxonomy version, groups, filters, API params
requirements.txt # Python dependencies
bn_ids.json # Persistent BirdNET species ID registry (git-tracked)
build/
metadata.py # Merge collected sources into final metadata outputs
collectors/
_common.py # Shared collector utilities (cache, JSON I/O, shutdown)
avilist.py # Download and normalize AviList taxonomy input
claude.py # Claude enrichment for shortened/translated descriptions
ebird.py # eBird names and localized common-name collection
images.py # Batch image generation for dev/dist outputs
inat.py # iNaturalist taxa, sounds, and observation-photo fallback
macaulay.py # Macaulay Library taxon code discovery
wikidata.py # Wikidata licenses and cross-reference metadata
wikipedia.py # Wikipedia summaries, langlinks, and image metadata
xenocanto.py # Xeno-Canto scientific name mapping
observationorg.py # observation.org species ID mapping
dev/ # Development metadata snapshots and local build artifacts
dist/ # Published metadata and generated site image assets
overrides/
species_overrides.csv # Git-tracked manual image and crop overrides
raw_data/ # Cached upstream source payloads and intermediate collector output
utils/
images.py # Image download, crop, cache-state, and WebP helpers
web/
server.py # FastAPI app serving HTML, REST API, and image endpoints
static/ # Static web assets
templates/ # Jinja2 templates for home and species pages
Configured in config.yml. Birds include all species; other groups are limited to species with sound observations on iNaturalist.
| Group | iNat Taxon ID | Mode | Min sound observations |
|---|---|---|---|
| Aves | 3 | All species | — |
| Mammalia | 40151 | Sounds only | 1 |
| Insecta | 47158 | Sounds only | 5 |
| Reptilia | 26036 | Sounds only | 5 |
| Amphibia | 20978 | Sounds only | 5 |
Run collectors in order — later steps depend on earlier output. All scripts are incremental (rerunning skips already-processed species). Use --limit N to cap new items per run, or --dry-run to preview.
| Step | Command | Output |
|---|---|---|
| 1. AviList | `python -m collectors.avilist` | `raw_data/AviList-*.csv` |
| 2. iNaturalist | `python -m collectors.inat` | `raw_data/inat_data.json` |
| 3. eBird | `python -m collectors.ebird` | `raw_data/ebird_data.json`, `raw_data/ebird_names.json` |
| 4. Wikidata | `python -m collectors.wikidata` | `raw_data/wikidata_data.json` |
| 5. Wikipedia | `python -m collectors.wikipedia` | `raw_data/wikipedia_data.json` |
| 6. Macaulay Library | `python -m collectors.macaulay` | `raw_data/macaulay_data.json` |
| 7. Xeno-Canto | `python -m collectors.xenocanto` | `raw_data/xc_data.json` |
| 8. observation.org | `python -m collectors.observationorg` | `raw_data/observationorg_data.json` |
| 9. Claude (optional) | `python -m collectors.claude` | `raw_data/claude_data.json` |
| 10. Images (optional) | `python -m collectors.images` | `dist/images/` (`--dev` → `dev/images/`) |
| 11. Build | `python -m build.metadata` | `dist/species_metadata.{json,csv,zip}` |
Steps 1–2 collect taxonomy. Steps 3–4 enrich species with eBird descriptions, common names, external identifiers, and Wikidata images. Step 5 fetches localized Wikipedia summaries. Steps 6–8 discover Macaulay Library taxon codes, Xeno-Canto name mappings, and observation.org species IDs for cross-referencing audio sources. Step 9 uses Claude to shorten excessively long extracts and translate missing locales. Step 10 downloads species images. Step 11 merges everything into the final output — no API calls, purely offline.
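For unattended runs, the ordering can be captured in a small driver script. The sketch below is an assumption about how one might chain the steps and is not part of the repository. It uses only the module names listed in the table, and the function names are hypothetical.

```python
import subprocess
import sys

# Module names in pipeline order, as listed in the step table.
STEPS = [
    "collectors.avilist",
    "collectors.inat",
    "collectors.ebird",
    "collectors.wikidata",
    "collectors.wikipedia",
    "collectors.macaulay",
    "collectors.xenocanto",
    "collectors.observationorg",
    "collectors.claude",
    "collectors.images",
    "build.metadata",
]
OPTIONAL = {"collectors.claude", "collectors.images"}


def build_commands(include_optional=False):
    """Return one argv list per step, skipping optional steps unless requested."""
    return [
        [sys.executable, "-m", module]
        for module in STEPS
        if include_optional or module not in OPTIONAL
    ]


def run_pipeline(include_optional=False):
    """Run each step in order and stop at the first failure."""
    for command in build_commands(include_optional):
        subprocess.run(command, check=True)
```

Because every collector is incremental, calling `run_pipeline()` again after a failure should resume rather than start over.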
Downloads the AviList Global Avian Checklist (XLSX) and converts it to CSV. Provides authoritative bird taxonomy and AviList IDs.
| Flag | Description |
|---|---|
| `--force` | Re-download even if CSV already exists |
Paginates the iNat taxa API to fetch all species for each taxon group. For birds, fetches all species. For other groups, queries the iNat sounds API to find species with audio observations meeting the min_observations threshold. Collects taxonomy, common names (all locales when all_names: true), observation counts, default photos, and Wikipedia URLs.
After group fetching, runs an observation photo fallback phase: for any species whose default taxon photo is missing or not CC-licensed, queries the iNat observations API for a CC-licensed photo from a research-grade observation (sorted by community votes). The result is stored in the obs_photo field, and unsuccessful lookups are cached in inat_data.json so later runs do not repeat the same slow checks.
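The decision of which species enter the fallback phase can be sketched roughly as below. The field names `default_photo` and `license_code` follow the iNaturalist taxon payload as commonly returned by its API. The function names and the `no_photo_cache` set are hypothetical stand-ins for whatever the collector actually stores.

```python
def is_cc_licensed(license_code):
    """Treat any license code starting with 'cc' (cc0, cc-by, ...) as usable."""
    return bool(license_code) and license_code.lower().startswith("cc")


def needs_obs_photo(taxon, no_photo_cache):
    """Decide whether a species should go through the observation photo fallback.

    no_photo_cache: set of taxon IDs already checked without success
    (assumed shape, so later runs can skip the slow lookup).
    """
    if taxon["id"] in no_photo_cache:
        return False
    photo = taxon.get("default_photo")
    return photo is None or not is_cc_licensed(photo.get("license_code"))
```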
| Flag | Description |
|---|---|
| `--group NAME` | Fetch only this taxon group |
| `--limit N` | Cap new species per group (0 = all) |
| `--save-every N` | Save progress every N new species (default: from config.yml) |
| `--refresh` | Bypass cached Phase 1 data and re-fetch from API |
| `--obs-photos-only` | Only run observation photo fallback |
| `--skip-obs-photos` | Skip observation photo fallback |
| `--refresh-obs-photos` | Recheck species cached as having no obs photo |
| `--avilist-only` | Only run AviList reconciliation |
| `--skip-avilist` | Skip AviList reconciliation phase |
| `--dry-run` | Preview without fetching |
Collects eBird species data in two phases:
- Phase 1 — Scraper: Scrapes eBird species pages for English descriptions (`og:description`) and Macaulay Library images (`og:image`). Parallel fetching with configurable workers (default 4) and rate limiting (default 5 rps). Only for bird species with an eBird code.
- Phase 2 — Common names: Downloads the eBird taxonomy CSV for 62 locales to collect common names in all available languages. Each locale download is cached to avoid re-fetching.
| Flag | Description |
|---|---|
| `--limit N` | Cap new species for scraping (0 = all) |
| `--workers N` | Parallel scrapers (default: 4) |
| `--rps N` | Max requests per second (default: 5) |
| `--names-only` | Only download common names (Phase 2) |
| `--skip-names` | Skip common names, scrape only |
| `--dry-run` | Preview without fetching |
Fetches species identifiers, common name labels, and images from Wikidata and Wikimedia Commons via SPARQL queries.
- Phase 1 — eBird codes: Resolves eBird species codes (P3444) for species not yet matched via AviList, using scientific name (P225) and iNat taxon ID (P3151) as lookup keys.
- Phase 2 — Identifiers: Fetches external identifiers: GBIF (P846), NCBI (P685), Avibase (P2026), BirdLife (P5257).
- Phase 3 — Labels: Fetches `rdfs:label` common names in all available languages.
- Phase 4 — Images: Fetches Wikidata P18 images and checks Wikimedia Commons licenses (CC, PD, GFDL).
| Flag | Description |
|---|---|
| `--new-only` | Only species not yet in wikidata_data.json |
| `--no-cache` | Bypass request cache |
| `--dry-run` | Show species count without querying |
Fetches multilingual Wikipedia data in four phases:
- Phase 1 — English Wikipedia: Batch-fetches extracts, langlinks, page images, and Wikidata descriptions for each species' Wikipedia article. Up to 50 titles per request. Includes a search fallback for titles not found in batch results.
- Phase 1b — Extract backfill: Scans existing data for species that have an English Wikipedia URL but are missing the English extract (can happen due to API glitches during bulk fetching). Re-fetches just the extracts for those species with redirect resolution.
- Phase 2 — Locale extracts: For each target language (20 configured locales), batch-fetches intro extracts from the corresponding Wikipedia. Runs locales concurrently with a thread pool. Skips species that already have extracts for a given locale.
- Phase 3 — Image licenses: Batch-fetches license metadata (artist, license, license URL) from Wikimedia Commons for all page images found in Phase 1.
Rate-limited (default 25 rps), with exponential backoff on 429s and server errors. All phases save incrementally.
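The batching and retry behaviour described above can be sketched as follows. This is an illustration under stated assumptions: the helper names, the `RetryableError` class, the retry count, and the delay cap are all hypothetical. Only the batch size of 50 and the use of exponential backoff come from this README.

```python
import time


class RetryableError(Exception):
    """Stand-in for an HTTP 429 or server error response."""


def batched(titles, size=50):
    """Yield consecutive chunks of at most `size` titles."""
    for start in range(0, len(titles), size):
        yield list(titles[start:start + size])


def backoff_delays(retries=5, base=1.0, cap=60.0):
    """Delays that double on each attempt, limited to `cap` seconds."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def call_with_backoff(fetch, batch, retries=5, base=1.0, sleep=time.sleep):
    """Call fetch(batch), sleeping with exponential backoff after retryable errors."""
    for delay in backoff_delays(retries, base):
        try:
            return fetch(batch)
        except RetryableError:
            sleep(delay)
    # Final attempt: let any error propagate to the caller.
    return fetch(batch)
```

Passing `sleep` as a parameter keeps the retry logic testable without real waiting.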
Wikipedia locales: en, de, fr, es, pt, it, nl, pl, sv, da, no, fi, cs, zh, ru, ar, ja, ko, tr, sw
| Flag | Description |
|---|---|
| `--limit N` | Cap new species (0 = all) |
| `--rps N` | Max requests per second (default: 25) |
| `--new-only` | Only species not yet in wikipedia_data.json |
| `--refetch` | Re-fetch species with few locale extracts (conflicts with `--new-only`) |
| `--dry-run` | Preview without fetching |
Discovers Macaulay Library taxon codes for all species. Birds use their eBird species code (e.g. eurblk1); non-birds get a t-prefixed numeric ID (e.g. t-11032766) resolved via the ML taxonomy API.
Resolution cascade:
- eBird code — reuses existing eBird species code for birds (instant, no API call)
- ML taxonomy API — queries by scientific name for non-birds
- Wikidata P10794 — bulk SPARQL lookup of Macaulay Library taxon IDs
- GBIF synonym fallback — resolves alternate names via GBIF, then retries the ML API
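A resolution cascade of this kind boils down to trying lookups in order and stopping at the first hit. The sketch below shows that pattern in isolation. The function name is hypothetical, and the resolver callables are stand-ins for the real API and SPARQL lookups, which are not reproduced here.

```python
def resolve_with_cascade(name, resolvers):
    """Try each (label, resolver) pair in order and return the first hit.

    resolvers: list of (label, callable) pairs; each callable takes a
    scientific name and returns a code, or None when it finds nothing.
    Returns (label, code), or (None, None) if every stage fails.
    """
    for label, resolver in resolvers:
        code = resolver(name)
        if code:
            return label, code
    return None, None
```

Recording which stage produced the match makes it easier to audit results that came from the weaker fallbacks.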
| Flag | Description |
|---|---|
| `--limit N` | Cap new species to process (0 = all) |
| `--group NAME` | Process only this taxon group |
| `--new-only` | Only species not yet in macaulay_data.json |
| `--dry-run` | Preview without API calls |
Maps each species to its Xeno-Canto scientific name. XC uses IOC taxonomy which may differ from the iNat/eBird names used in this pipeline (e.g. Dryobates pubescens → Picoides pubescens). Requires an API key in .env (XC_API_KEY=...).
Resolution cascade:
- Wikidata P2426 — bulk SPARQL fetch of XC species IDs (~31k species pre-mapped)
- XC API direct — queries by genus + epithet
- XC API epithet search — epithet-only search with group filter (catches genus transfers)
- GBIF synonym fallback — resolves alternate names via GBIF, then retries the XC API
- XC English name search — last resort, matches by common name
| Flag | Description |
|---|---|
| `--limit N` | Cap new species to process (0 = all) |
| `--group NAME` | Process only this taxon group |
| `--new-only` | Only species not yet in xc_data.json |
| `--dry-run` | Preview without API calls |
Maps each species to its observation.org species ID, enabling direct links to species pages on the platform. Observation.org uses AviList taxonomy for birds (same authority as this pipeline), so matching is straightforward.
Resolution cascade:
- Direct API search — queries `/api/v1/species/search/` by scientific name
- GBIF synonym fallback — resolves alternate names via GBIF, then retries the API
| Flag | Description |
|---|---|
| `--limit N` | Cap new species to process (0 = all) |
| `--group NAME` | Process only this taxon group |
| `--workers N` | Parallel workers (default: from config.yml) |
| `--save-every N` | Save every N completed species (default: from config.yml) |
| `--new-only` | Only species not yet in observationorg_data.json |
| `--dry-run` | Preview without API calls |
Uses the Claude API (Sonnet 4) for two tasks on existing Wikipedia extracts — no content is generated from scratch:
- Phase 1 — Shorten: Finds extracts exceeding `max_extract_words` (default 500 words) and asks Claude to condense them to `target_words` (default 150 words), preserving the original language and key facts (appearance, habitat, range, behaviour).
- Phase 2 — Translate: Finds species that have an English extract but are missing translations for Claude's target locales. Sends the English text to Claude for translation into all missing locales at once.
Claude's output is stored separately in claude_data.json and overlaid on top of Wikipedia extracts during the build step. Claude only fills gaps — it never overwrites existing Wikipedia extracts for a locale.
Translation batches are grouped by the exact set of missing locales for each species, then packed by source-text size. This keeps prompts smaller and makes parallel API calls practical on large repair runs.
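The grouping and packing strategy can be sketched as below. This is a simplified illustration, not the collector's code: the function name and the item shape (`name`, `text`, `missing`) are assumptions, while the defaults mirror the `--batch-size` and `--char-budget` values from the flag table.

```python
from collections import defaultdict


def plan_batches(items, batch_size=12, char_budget=12000):
    """Group items by their exact set of missing locales, then pack by size.

    items: list of dicts with 'name', 'text' (English source), and
    'missing' (iterable of locale codes). Hypothetical shape.
    Returns a list of (sorted_locales, [items]) tuples, one per API call.
    """
    groups = defaultdict(list)
    for item in items:
        groups[frozenset(item["missing"])].append(item)

    batches = []
    for locales, members in groups.items():
        current, used = [], 0
        for item in members:
            size = len(item["text"])
            # Start a new batch when either limit would be exceeded.
            if current and (len(current) >= batch_size or used + size > char_budget):
                batches.append((sorted(locales), current))
                current, used = [], 0
            current.append(item)
            used += size
        if current:
            batches.append((sorted(locales), current))
    return batches
```

Since every species in a batch needs the same locales, one prompt can request the same output structure for all of them.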
Claude locales: en, de, fr, es, pt, it, nl, zh, ru, ar (subset of Wikipedia locales)
| Flag | Description |
|---|---|
| `--limit N` | Cap total work items (0 = all) |
| `--batch-size N` | Species per API call (default: 12) |
| `--workers N` | Parallel translation workers (default: 4) |
| `--char-budget N` | Source-character budget per API call (default: 12000) |
| `--max-source-chars N` | Max source chars per species sent to Claude |
| `--save-every N` | Save every N completed batches |
| `--shorten-only` | Only shorten long extracts |
| `--translate-only` | Only translate missing locales |
| `--dry-run` | Preview without API calls |
Batch-downloads species images as WebP files with content-aware smart cropping. Each species gets two sizes stored in subdirectories:
| Size | Dimensions | Quality | Path |
|---|---|---|---|
| thumb | 150×100 | 40 | images/thumb/ |
| medium | 480×320 | 60 | images/medium/ |
Filename format: <scientific name>_<common name>_<author>.webp
Cache state: each generated image has a sidecar JSON file in .state/ storing the effective source URL and optional manual crop anchor. This keeps filenames stable while still forcing regeneration when an override changes.
Smart cropping uses YOLOv8-nano (ONNX) for animal detection. The model prefers COCO animal classes (bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe) and centers the crop on the detected subject. For tall subjects (e.g. a woodpecker on a trunk), the crop prefers the upper portion to keep the head visible. Falls back to center-crop if no animal is detected or if the ONNX runtime is unavailable.
When image_crop_anchor is set in overrides/species_overrides.csv, smart crop is bypassed and a fixed 3×3 anchor crop is used instead.
Dummy images: On startup, generates a grayscale dummy WebP (neutral gray background with centered BirdNET logo) for each size. When a download or conversion fails, the dummy is copied as the species' named file so every species has an image file.
The collector also prunes obsolete cached .webp files and stale .state metadata left behind by older naming schemes or old fallback files.
| Flag | Description |
|---|---|
| `--limit N` | Cap species to process (0 = all) |
| `--workers N` | Parallel download threads (default: from config.yml) |
| `--quality N` | WebP quality 1–100 (default: from config.yml) |
| `--dev` | Save to dev/images/ instead of dist/images/ |
| `--new-only` | Only species with no cached image files yet |
| `--dry-run` | Preview without downloading |
Merges all pre-collected data into the final metadata file. Runs purely offline — no API calls. Two phases:
Taxonomy phase:
- Cross-references iNaturalist, AviList, and Wikidata to build a canonical species list
- Resolves eBird codes from AviList and pre-collected Wikidata data
- Loads external identifiers from Wikidata (GBIF, NCBI, Avibase, BirdLife)
- Collects common names from eBird (62 locales) and Wikidata labels
- Selects the best image for each species through a priority chain:
- iNaturalist taxon photo — default photo if CC-licensed
- Macaulay Library — eBird image (source tagged as "Macaulay Library ML{asset_id}")
- Wikimedia Commons — Wikidata P18 image if CC/PD/GFDL licensed
- iNaturalist observation photo — CC-licensed photo from research-grade observations (last resort)
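The priority chain amounts to a first-match lookup over an ordered list of sources. The sketch below assumes license filtering has already happened upstream and that manual overrides win over everything, as stated in the overrides section. The function name and source keys are hypothetical.

```python
# Source keys in priority order (hypothetical names for illustration).
PRIORITY = (
    "inat_taxon_photo",
    "macaulay",
    "wikimedia_commons",
    "inat_observation_photo",
)


def select_image(candidates, override=None):
    """Return (source, image) for the best available image.

    candidates: dict mapping a source key to an image dict, or None when
    that source has no usable (license-checked) image.
    override: image dict from the manual overrides file, if any.
    """
    if override:
        return "override", override
    for source in PRIORITY:
        image = candidates.get(source)
        if image:
            return source, image
    return None, None
```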
Merge phase: Assembles per-species descriptions from multiple sources with the following priority:
- Base layer — Wikipedia: English extract plus all locale extracts and Wikipedia URLs
- Fallback — eBird: English description only, used when no Wikipedia article exists
- Claude overlay: For each locale Claude provides, replaces the description for that locale. Claude locales are tracked in the `claude_locales` field
The effective priority is Claude > Wikipedia > eBird, applied per-locale.
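The layering can be expressed as a short merge function. This is a sketch of the stated priority only. The function name, argument shapes, and return value are assumptions rather than the build script's actual interface.

```python
def merge_descriptions(wikipedia, ebird_en, claude):
    """Layer descriptions per locale: Wikipedia base, eBird fallback, Claude overlay.

    wikipedia: dict locale -> extract
    ebird_en: English eBird description or None
    claude: dict locale -> shortened or translated text
    Returns (descriptions, claude_locales).
    """
    descriptions = dict(wikipedia)  # base layer
    if "en" not in descriptions and ebird_en:
        descriptions["en"] = ebird_en  # English-only fallback
    for locale, text in claude.items():
        descriptions[locale] = text  # overlay wins for the locales it covers
    return descriptions, sorted(claude)
```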
The JSON output contains full multilingual descriptions. The CSV output is a lighter export and does not include description excerpts.
Image fields in the final metadata:
- JSON metadata stores an `image` object with `src`, `thumb`, and `medium`
- CSV metadata flattens this to a single `image_url` column containing the locally served medium image URL
BirdNET species IDs:
Each species receives a permanent BirdNET ID in the format BN{5 digits} (e.g. BN00498). IDs are assigned once during the first build and stored in the git-tracked bn_ids.json registry. New species receive the next available number; removed species keep their ID reserved and it is never reassigned. This ensures stable, machine-readable identifiers that never change regardless of taxonomic renames or reordering.
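An append-only registry of this kind can be sketched as follows. The function name is hypothetical, and keying the registry by scientific name is an assumption made for brevity. The real bn_ids.json may use a different key to stay stable across renames.

```python
def assign_ids(registry, species_names):
    """Give every new species the next free BirdNET ID; never reuse or change IDs.

    registry: dict mapping a species key to an ID such as 'BN00498'.
    Entries for removed species stay in the registry, so their numbers
    remain reserved. The dict is updated in place and returned.
    """
    highest = max((int(bn_id[2:]) for bn_id in registry.values()), default=0)
    next_number = highest + 1
    for name in species_names:
        if name not in registry:
            registry[name] = f"BN{next_number:05d}"
            next_number += 1
    return registry
```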
| Flag | Description |
|---|---|
| `--dev` | Write to dev/ instead of dist/ |
| `--merge-only` | Skip taxonomy rebuild, re-merge only |
| `--no-zip` | Skip zip archive creation |
| `--reassign-ids` | Regenerate all BirdNET IDs from scratch (pre-release only) |
| `--dry-run` | Show stats without writing |
The raw_data/, dev/, and dist/ directories are all gitignored. Zip archives from dist/ are attached to GitHub releases.
Browse and search the dataset through a web UI and REST API.
| Flag | Description |
|---|---|
| `--host ADDR` | Bind address (default: 127.0.0.1) |
| `--port N` | Bind port (default: 8000) |
| `--dev` | Load metadata from dev/ instead of dist/ |
| `--reload` | Auto-reload on code changes |
Or with hot-reload during development:
uvicorn web.server:app --reload
Species lookup supports multiple identifier types. The /species/{name} and /api/species/{name} endpoints accept any of:
- Scientific name (e.g., `Turdus merula`)
- Common name in any locale (e.g., `Amsel`, `Merle noir`)
- BirdNET ID (e.g., `BN10600`)
- eBird species code (e.g., `eurblk1`)
- iNaturalist taxon ID (e.g., `12727`)
HTML species pages redirect to the canonical scientific name URL when accessed via an alias.
Image proxy (/api/image/{name}?size=thumb|medium) serves species images with on-demand downloading, smart cropping, and caching. It returns a dummy image (BirdNET logo on gray background) for unknown species or failed downloads. All image responses carry 24-hour Cache-Control headers.
| Route | Description |
|---|---|
| `/` | Home page — search, browse, filter by taxon group |
| `/species/{name}` | Species detail page (HTML) |
| `/api/image/{name}?size=` | Image proxy — thumb (150×100), medium (480×320, default) |
| `/api/species` | List species (JSON/CSV) with filtering, sorting, field selection; CSV omits description excerpts |
| `/api/species/{name}` | Single species detail (JSON) with field selection |
| `/api/search?q=` | Search species by name with full query options |
| `/api/fields` | List all available field names |
| `/api/groups` | List taxon groups with counts |
| `/api/stats` | Dataset statistics |
| `/docs` | Interactive API docs (Swagger UI) |
All list/search endpoints (/api/species, /api/search) support these parameters:
| Parameter | Example | Description |
|---|---|---|
| `fields` | `?fields=scientific_name,common_name` | Return only specified fields (comma-separated) |
| `exclude` | `?exclude=common_names,descriptions` | Return all fields except these |
| `locale` | `?locale=en,de,fr` | Filter common_names and descriptions to specific locales |
| `sort` | `?sort=-observations_count` | Sort by field; prefix `-` for descending |
| `group` | `?group=Aves` | Filter by taxon group |
| `has_image` | `?has_image=true` | Filter species with/without images |
| `has_description` | `?has_description=true` | Filter species with/without English description |
| `description_source` | `?description_source=claude,wikipedia` | Filter by description source |
| `min_observations` | `?min_observations=10000` | Minimum iNaturalist observation count |
| `max_observations` | `?max_observations=50000` | Maximum iNaturalist observation count |
| `format` | `?format=csv` | Response format — json (default) or csv |
| `page` | `?page=2` | Page number (default 1) |
| `per_page` | `?per_page=100` | Results per page (1–500, default 50) |
The detail endpoint (/api/species/{name}) supports fields, exclude, and locale.
Examples:
# Top 10 most observed birds with images
curl '/api/species?group=Aves&has_image=true&sort=-observations_count&per_page=10&fields=scientific_name,common_name,observations_count'
# German and French names/descriptions for a species
curl '/api/species/Anas%20platyrhynchos?locale=de,fr&fields=scientific_name,common_names,descriptions'
# Export all mammals as CSV
curl '/api/species?group=Mammalia&per_page=500&format=csv' > mammals.csv
# Search with field selection
curl '/api/search?q=eagle&fields=scientific_name,common_name&per_page=20'
# Look up a species by eBird code
curl '/api/species/eurblk1'
- iNaturalist — Taxonomy, common names, observation counts, and photos via the public API. Data licensed under various Creative Commons licenses by individual contributors.
- eBird — Species descriptions, images, and common names (62 locales) from the Cornell Lab of Ornithology. Species codes from the eBird/Clements taxonomy.
- Wikidata — External identifiers (GBIF, NCBI, Avibase, BirdLife), eBird codes, common name labels, and P18 images via SPARQL. Data available under CC0.
- Wikipedia — English summaries and localized article links via the REST and MediaWiki APIs. Content available under CC BY-SA 4.0.
- AviList — The Global Avian Checklist (v2025). AviList Core Team, 2025. Licensed under CC BY 4.0. doi:10.2173/avilist.v2025.
- Macaulay Library — Taxon codes for cross-referencing audio and visual media from the Cornell Lab of Ornithology.
- Xeno-Canto — Scientific name mappings for cross-referencing the world's largest shared bird and wildlife sound collection.
- observation.org — Species IDs for cross-referencing to one of Europe's largest biodiversity recording platforms. Uses AviList taxonomy for birds.
- Claude (Anthropic) — AI-powered translation of Wikipedia extracts to missing locales and shortening of excessively long extracts.
This project is licensed under the MIT License - see the LICENSE file for details.
Our work in the K. Lisa Yang Center for Conservation Bioacoustics is made possible by the generosity of K. Lisa Yang to advance innovative conservation technologies to inspire and inform the conservation of wildlife and habitats.
The development of BirdNET is supported by the German Federal Ministry of Research, Technology and Space (FKZ 01|S22072), the German Federal Ministry for the Environment, Climate Action, Nature Conservation and Nuclear Safety (FKZ 67KI31040E), the German Federal Ministry of Economic Affairs and Energy (FKZ 16KN095550), the Deutsche Bundesstiftung Umwelt (project 39263/01) and the European Social Fund.
BirdNET is a joint effort of partners from academia and industry. Without these partnerships, this project would not have been possible. Thank you!

