Astro dashboard backed by a Python pipeline that turns vendor customer stories (AWS / Microsoft / Google Cloud / Oracle / Snowflake / Databricks / Alibaba Cloud / …) into a normalized, comparable dataset.
The pipeline discovers candidate URLs from vendor sitemaps, fetches HTML,
prepares extraction bundles, and lets an agent (this assistant) author one
JSON record per page using the case-study-extraction
skill. Records are merged, seeded into SQLite, and re-exported as the JSON
the dashboard reads.
For the full system design (data flow, component responsibilities, schemas, extension recipes, and known design debt) see ARCHITECTURE.md.
66 records · 58 extracted from live vendor pages · 8 synthetic seeds.
| Vendor | Records |
|---|---|
| Alibaba Cloud | 15 |
| Databricks | 15 |
| Oracle | 13 |
| Microsoft | 10 |
| Google Cloud | 9 |
| AWS | 4 |
| Seed samples | 8 |
Average confidence ≈ 0.97, average maturity ≈ 5.9 / 6.
Each record carries a structured per-component view:
- `solution_components[]`: `{name, role, layer?}` per product/service that is actually part of the implementation. `layer` comes from `taxonomy.json → component_layers` (Ingest / Compute / Storage / Serving / Orchestration / Governance).
- `data_flow`: one paragraph tracing how data moves end-to-end.
- `integration_points[]`: external systems, upstream sources, downstream consumers.

solution_summary targets 600–1000 characters covering ingest → process →
store → serve. Older records that predate the structured fields render via a
fallback chip list of products_used. See
.agents/skills/case-study-extraction/SKILL.md
for the full extraction contract.
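
As a rough illustration of the record shape described above, the following sketch checks one record against the stated conventions. It is not the project's validator: `check_record`, `display_components`, the `example` record, and the product names in it are hypothetical, and the layer set is hard-coded here only because the real list lives in `taxonomy.json`.

```python
from typing import Any

# Layer names as listed above; the real project reads them from
# taxonomy.json -> component_layers (hard-coded here as an assumption).
COMPONENT_LAYERS = {
    "Ingest", "Compute", "Storage", "Serving", "Orchestration", "Governance",
}


def check_record(record: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the record looks fine."""
    problems: list[str] = []
    components = record.get("solution_components") or []
    for i, comp in enumerate(components):
        for key in ("name", "role"):
            if not comp.get(key):
                problems.append(f"solution_components[{i}] is missing {key!r}")
        layer = comp.get("layer")  # optional field
        if layer is not None and layer not in COMPONENT_LAYERS:
            problems.append(f"solution_components[{i}] has unknown layer {layer!r}")
    if not components and not record.get("products_used"):
        problems.append("no solution_components and no products_used fallback")
    summary = record.get("solution_summary", "")
    if not 600 <= len(summary) <= 1000:
        problems.append(f"solution_summary is {len(summary)} chars, target is 600-1000")
    return problems


def display_components(record: dict[str, Any]) -> list[str]:
    """Prefer structured components; fall back to the products_used chip list."""
    components = record.get("solution_components") or []
    if components:
        return [c["name"] for c in components]
    return list(record.get("products_used", []))


# Hypothetical record, for illustration only.
example = {
    "solution_components": [
        {"name": "ExampleQueue", "role": "event ingest", "layer": "Ingest"},
        {"name": "ExampleWarehouse", "role": "analytics store", "layer": "Storage"},
    ],
    "data_flow": "Events land in the queue and are loaded into the warehouse.",
    "integration_points": ["CRM export"],
    "solution_summary": "s" * 700,
}
```

Calling `check_record(example)` returns an empty list, while a legacy record that carries only `products_used` still renders through `display_components`.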
npm install # install Astro tooling
python3 -m pip install -r requirements.txt
npm run dev # local dashboard at http://localhost:4321
npm run build # astro check && astro build → dist/
# Pipeline (Python 3)
npm run data:discover # walk data/sources.json sitemaps → data/discovered_urls.jsonl
npm run data:fetch # fetch raw content → data/raw_content/, append data/fetch_manifest.jsonl
npm run data:probe-fetch -- <url> # probe/update Python fetch strategies for a hard host
npm run data:extract # build prompt bundles in data/extract_jobs/{vendor}/
npm run data:merge # fold data/records/**/*.json + samples → data/case-studies.merged.json
npm run data:build # merge → seed SQLite → export data/case-studies.generated.json

The dashboard imports data/case-studies.generated.json, so any rerun of
npm run data:build is picked up by the next npm run dev / build.
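
The merge → seed → export sequence behind `npm run data:build` can be pictured with the sketch below. It is a simplification under stated assumptions, not the code in `pipeline/usecase_intel/`: the `build` function, the `case_studies` table layout, the `slug` key, and the rule that authored records override samples with the same slug are all hypothetical.

```python
import json
import sqlite3


def build(records: list[dict], samples: list[dict]) -> str:
    """Merge authored records with samples, seed SQLite, and export JSON text."""
    # Merge step: assume authored records win over samples sharing a slug.
    merged = {r["slug"]: r for r in samples}
    merged.update({r["slug"]: r for r in records})

    # Seed step: an in-memory database stands in for data/usecase_intel.sqlite.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE case_studies (slug TEXT PRIMARY KEY, vendor TEXT, payload TEXT)"
    )
    conn.executemany(
        "INSERT INTO case_studies VALUES (?, ?, ?)",
        [(r["slug"], r.get("vendor"), json.dumps(r)) for r in merged.values()],
    )

    # Export step: read the rows back in a stable order and emit one JSON array.
    rows = conn.execute("SELECT payload FROM case_studies ORDER BY slug").fetchall()
    conn.close()
    return json.dumps([json.loads(payload) for (payload,) in rows], indent=2)
```

Routing the export through the database rather than the merged list means the dashboard JSON reflects exactly what was seeded, which is one plausible reason for the ordering of the three steps.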
End-to-end agent loop:
- Add the vendor to `data/sources.json` with a sitemap, index page(s), include `url_patterns`, and `exclude_patterns`.
- `npm run data:discover` to populate `data/discovered_urls.jsonl`.
- `npm run data:fetch` to mirror raw content and write the fetch manifest.
- `npm run data:extract` to produce `data/extract_jobs/{vendor_slug}/{sha}.json` bundles (self-contained prompts with cleaned text + related article image candidates + the extraction skill + the taxonomy).
- Open the bundle and write the structured record to `data/records/{vendor_slug}/{slug}.json` following the `case-study-extraction` skill.
- `npm run data:build` to merge, seed SQLite, and refresh the dashboard JSON.

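
The include and exclude regex filters configured in the first step might be applied during discovery roughly as follows. This is a minimal sketch, assuming `url_patterns` and `exclude_patterns` are lists of regular expressions; `filter_urls` is a hypothetical helper and the real `discover` module may behave differently.

```python
import re


def filter_urls(
    urls: list[str],
    url_patterns: list[str],
    exclude_patterns: list[str],
) -> list[str]:
    """Keep URLs that match any include pattern and no exclude pattern."""
    include = [re.compile(p) for p in url_patterns]
    exclude = [re.compile(p) for p in exclude_patterns]
    kept = []
    for url in urls:
        # Assumption: an empty include list means "accept everything".
        if include and not any(p.search(url) for p in include):
            continue
        if any(p.search(url) for p in exclude):
            continue
        kept.append(url)
    return kept


# Hypothetical sitemap entries and patterns, for illustration only.
candidates = filter_urls(
    [
        "https://example.com/customers/acme",
        "https://example.com/customers/tag/retail",
        "https://example.com/blog/launch",
    ],
    url_patterns=[r"/customers/"],
    exclude_patterns=[r"/tag/"],
)
```

With these inputs only the first URL survives: the second matches an exclude pattern and the third matches no include pattern.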
npm run data:fetch uses the full Python-native fetch chain by default: it
tries the project user agent, known per-host profile strategies, and
reader/wayback recovery before giving up.
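
The ordering of that chain can be sketched as a list of strategies tried in turn. The code below only illustrates the fallback pattern: `fetch_with_fallback` and the stub strategies are hypothetical, make no network calls, and do not reflect the actual `fetch_client` API.

```python
from typing import Callable, Optional

# A strategy takes a URL and returns the page body, or None when it has nothing.
Strategy = Callable[[str], Optional[str]]


def fetch_with_fallback(
    url: str, strategies: list[tuple[str, Strategy]]
) -> tuple[str, str]:
    """Try each named strategy in order; return (strategy_name, body) on first success."""
    attempts: list[str] = []
    for name, strategy in strategies:
        try:
            body = strategy(url)
        except Exception as exc:  # a failing strategy must not stop the chain
            attempts.append(f"{name}: {exc}")
            continue
        if body:
            return name, body
        attempts.append(f"{name}: empty response")
    raise RuntimeError(f"all strategies failed for {url}: " + "; ".join(attempts))


# Stub strategies standing in for the real ones (assumptions, no I/O).
def project_user_agent(url: str) -> Optional[str]:
    raise ConnectionError("403 Forbidden")


def host_profile(url: str) -> Optional[str]:
    return None


def wayback(url: str) -> Optional[str]:
    return "<html>archived copy</html>"


CHAIN = [
    ("project-user-agent", project_user_agent),
    ("host-profile", host_profile),
    ("wayback", wayback),
]
```

Returning the winning strategy name alongside the body is a design choice that would let a fetch manifest record how each page was recovered.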
Use the Diagram Extractor custom agent when a fetched page, PDF, SVG, or image
contains an architecture/workflow diagram that should be converted to Mermaid.
Give it a local path, extract job path, or source URL, for example:
Use Diagram Extractor on data/extract_jobs/<vendor>/<sha>.json and convert the
architecture image to Mermaid. Write outputs under data/diagrams/.
npm run data:extract stores article-scoped visual candidates in each bundle as
related_images[], including image URLs, source type, local asset path, alt
text, captions, nearby context, dimensions, score, and is_diagram_like. By
default it attempts a source-diverse set of candidates until it has up to six
local assets beside the bundle under data/extract_jobs/{vendor}/{sha}.assets/;
pass --image-assets-limit 0 to keep URLs only. Asset downloads use a short
timeout by default; tune it with --image-timeout-seconds for slower networks.
Low-value candidates are filtered by score before local asset download; the
default threshold is intentionally strict (--image-min-score 8) and can be
lowered when you want a wider run.
The score uses visible context plus image URL/file-name hints, so generic
backgrounds, heroes, banners, thumbnails, social cards, and Open Graph images are
filtered. Diagram-like candidates are strongly retained when hints include terms
such as architecture, workflow, topology, service map, dependency map, process
flow, sequence diagram, block diagram, ERD, UML, BPMN, swimlane, tech stack,
data lineage, or network topology. Non-diagram images also need useful hints
such as app, dashboard, screenshot, assistant, agent, GenAI, search,
recommendation, or analytics.
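
To make the filtering idea concrete, here is a small hint-based scorer. It is only a sketch: the weights, the shortened hint lists, and the `score_candidate` and `select_for_download` helpers are assumptions chosen for illustration, and only the threshold of 8 and the limit of six assets come from the description above.

```python
MIN_SCORE = 8  # default of --image-min-score, per the text above

# Shortened hint lists; the weights below are made up for illustration.
DIAGRAM_HINTS = ("architecture", "workflow", "topology", "sequence diagram",
                 "block diagram", "swimlane", "data lineage")
USEFUL_HINTS = ("dashboard", "screenshot", "analytics", "recommendation")
GENERIC_HINTS = ("background", "hero", "banner", "thumbnail", "social", "og image")


def score_candidate(url: str, context: str = "") -> dict:
    """Score one image from its URL/file name plus visible context (alt text, caption)."""
    # Normalize separators so "block-diagram.png" matches the hint "block diagram".
    text = f"{url} {context}".lower().replace("-", " ").replace("_", " ")
    is_diagram_like = any(hint in text for hint in DIAGRAM_HINTS)
    score = 0
    if is_diagram_like:
        score += 10  # diagram-like candidates are strongly retained
    if any(hint in text for hint in USEFUL_HINTS):
        score += 8
    if any(hint in text for hint in GENERIC_HINTS):
        score -= 5  # heroes, banners, social cards and similar
    return {"url": url, "score": score, "is_diagram_like": is_diagram_like}


def select_for_download(
    candidates: list[tuple[str, str]], min_score: int = MIN_SCORE, limit: int = 6
) -> list[dict]:
    """Drop low scorers, then keep at most `limit` candidates, best first."""
    scored = [score_candidate(url, context) for url, context in candidates]
    kept = [s for s in scored if s["score"] >= min_score]
    kept.sort(key=lambda s: s["score"], reverse=True)
    return kept[:limit]
```

Under these made-up weights a file named `reference-architecture.png` clears the threshold, while `hero-banner.jpg` does not; lowering `min_score` widens the run in the same way the flag is described to.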
CSS background-image assets are included as low-confidence article imagery and
can be disabled with --no-css-background-images. The agent reads local assets
first, then visible labels, arrows, captions, SVG text, and nearby page context,
and returns Mermaid plus evidence and uncertainty notes. Generated diagram
outputs belong under data/diagrams/ and are ignored by git.
- `src/pages/index.astro`: dashboard (top use cases, products, outcomes, maturity, vendor × use-case and industry × use-case coverage matrices, filterable record table).
- `src/pages/cases/[slug].astro`: per-record detail page.
- `src/lib/`: `caseStudies.ts` data import, `analytics.ts` aggregations, `types.ts` schema.
- `taxonomy.json`: canonical vocabulary for vendors / industries / technical areas / use cases / outcomes.
- `pipeline/usecase_intel/`: `discover`, `fetch`, `probe`, `clean`, `media`, `extract`, `merge_records`, `seed`, `scoring`, `export`, plus `fetch_client`, `settings`, `utils`, and `config/host-strategies.json`.
- `requirements.txt`: Python extraction dependencies (`beautifulsoup4`, `pypdf`) used by the fetch/extract pipeline.
- `data/sources.json`: vendor source config (sitemaps, index pages, regex filters).
- `data/records/{vendor_slug}/{slug}.json`: authored records (tracked).
- `data/case-studies.sample.json`: synthetic seed records (`is_sample: true`).
- `data/case-studies.generated.json`: built artifact consumed by the dashboard (gitignored).
- `data/usecase_intel.sqlite`: built SQLite database (gitignored).
- `.agents/skills/`: agent skills. `case-study-extraction` is the active record-authoring contract; URL fetching now lives in Python.
- `ARCHITECTURE.md`: system architecture covering data flow, component boundaries, schemas, extension recipes, and known design debt.
- `DESIGN.md`: external design-system reference (Miro) used to inform the dashboard styling.