squisher-corpus

Empirical schema corpus for protocol-squisher. It crawls GitHub for real-world schema files, analyzes them with protocol-squisher’s corpus-analyze subcommand, stores the results in SQLite, and mines them for patterns that yield empirical compatibility data.

Architecture

GitHub Code Search → Fetch Raw → protocol-squisher corpus-analyze → SQLite → Pattern Mining → Hypatia

Pipeline Stages (Oban Workers)

  1. SearchWorker — GitHub code search for schema files by format

  2. FetchWorker — Download raw file content, deduplicate by SHA

  3. AnalyzeWorker — Run protocol-squisher corpus-analyze via System.cmd

  4. MineWorker — Cross-schema pattern detection

  5. SyncWorker — Export patterns as Logtalk facts for Hypatia
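As an illustration of stage 2, here is a minimal Python sketch of content-based deduplication. It is not the project's Elixir code: the function names are hypothetical, and it assumes "SHA" means a SHA-256 digest of the raw file bytes (the actual FetchWorker may instead rely on the blob SHA reported by GitHub).

```python
import hashlib


def content_sha(raw: bytes) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes (assumed hash)."""
    return hashlib.sha256(raw).hexdigest()


def deduplicate(files):
    """Keep the first path seen for each distinct content digest.

    `files` is an iterable of (path, raw_bytes) pairs. The result maps
    digest -> path of the first file that had that content.
    """
    seen = {}
    for path, raw in files:
        # setdefault leaves an existing entry untouched, so later
        # duplicates of the same content are ignored.
        seen.setdefault(content_sha(raw), path)
    return seen
```

Keying on the content digest rather than the path means the same schema vendored into many repositories is analyzed only once.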

Composer (Gleam Orchestration)

Component         Language   Modules   Completion
Contract Types    Gleam      8         30%
Pipeline Engine   Gleam      3         30%
Tests             Gleam      22        30%

Build: cd composer && gleam build && gleam test

Supported Formats

Format        Search Pattern        Extension
Protobuf      syntax = "proto3"     .proto
Avro          type record           .avsc
Thrift        struct                .thrift
Cap’n Proto   struct.*@0x           .capnp
FlatBuffers   table                 .fbs
JSON Schema   $schema               .json
Pydantic      class.*BaseModel      .py
Serde         #[derive(Serialize    .rs
MessagePack   msgpack definitions   .msgpack
GraphQL       type Query            .graphql
OpenAPI       openapi:              .yaml
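The table above can be read as data: each row pairs a content pattern with a file extension. The Python sketch below shows one way such rows might be used to classify a fetched file locally. It is an illustration only, not how SearchWorker queries GitHub. The helper names are hypothetical, and it assumes that patterns containing `.*` are regular expressions while all others are literal substrings.

```python
import re
from pathlib import PurePosixPath

# (format, search pattern, extension) rows copied from the table above.
FORMATS = [
    ("Protobuf", 'syntax = "proto3"', ".proto"),
    ("Avro", "type record", ".avsc"),
    ("Thrift", "struct", ".thrift"),
    ("Cap'n Proto", "struct.*@0x", ".capnp"),
    ("FlatBuffers", "table", ".fbs"),
    ("JSON Schema", "$schema", ".json"),
    ("Pydantic", "class.*BaseModel", ".py"),
    ("Serde", "#[derive(Serialize", ".rs"),
    ("MessagePack", "msgpack definitions", ".msgpack"),
    ("GraphQL", "type Query", ".graphql"),
    ("OpenAPI", "openapi:", ".yaml"),
]


def pattern_matches(pattern: str, text: str) -> bool:
    """Assumption: '.*' marks a regex; anything else is a literal substring."""
    if ".*" in pattern:
        return re.search(pattern, text) is not None
    return pattern in text


def candidate_formats(path: str, text: str) -> list:
    """Return formats whose extension and pattern both match the file."""
    ext = PurePosixPath(path).suffix
    return [
        name
        for name, pattern, extension in FORMATS
        if extension == ext and pattern_matches(pattern, text)
    ]
```

Requiring both the extension and the pattern matters for the generic extensions (`.json`, `.yaml`, `.py`, `.rs`), where the extension alone says little about whether a file is a schema.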

Quick Start

# Install dependencies
mix deps.get

# Create and migrate database
mix ecto.create
mix ecto.migrate

# Seed initial search queries
mix run scripts/seed_searches.exs

# Start the application (pipeline runs via Oban)
iex -S mix

Exports

  • exports/corpus_statistics.json — Aggregate corpus statistics

  • exports/pattern_catalog.json — Discovered patterns with confidence scores

  • exports/type_frequency.json — Empirical type frequency data

  • exports/hypatia_facts.lgt — Logtalk facts for Hypatia learning engine

Prerequisites

  • Elixir ~> 1.16

  • protocol-squisher CLI on PATH (with corpus-analyze subcommand)

  • GitHub personal access token (set GITHUB_TOKEN env var)
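A quick preflight check for two of these prerequisites could look like the Python sketch below. The function name is hypothetical, and it assumes the CLI binary is named `protocol-squisher`. It does not verify the Elixir version or that the `corpus-analyze` subcommand is available.

```python
import os
import shutil


def missing_prerequisites(env=None):
    """Return a list of human-readable problems; empty means all checks passed."""
    env = os.environ if env is None else env
    problems = []
    # Look for the CLI on the PATH taken from the given environment.
    if shutil.which("protocol-squisher", path=env.get("PATH")) is None:
        problems.append("protocol-squisher CLI not found on PATH")
    # An empty token is treated the same as a missing one.
    if not env.get("GITHUB_TOKEN"):
        problems.append("GITHUB_TOKEN is not set")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("missing:", problem)
```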

License

SPDX-License-Identifier: MPL-2.0

Copyright (c) 2026 Jonathan D.A. Jewell (hyperpolymath) <j.d.a.jewell@open.ac.uk>

Architecture

See TOPOLOGY.md for a visual architecture map and completion dashboard.
