Cross-repo pass + community detection + paginated cross_project_links (PR 3/5) #378
Draft
Shidfar wants to merge 10 commits into
Core framework for 14 protocol linkers:
- servicelink.h: shared types, endpoint registry, pattern matching helpers
- pass_servicelinks: pipeline pass that dispatches to per-protocol linkers
- Endpoint persistence: protocol_endpoints table in each project DB
- MCP tool registration and cross_project_links handler
- Build system, test harness, and CI integration
GraphQL and gRPC linkers:
- GraphQL: schema field detection, gql template parsing, field-name extraction, operation name matching across producer/consumer pairs
- gRPC: proto service/rpc definitions, client stub calls, streaming patterns across Go, Python, Java, TypeScript, and Rust
Cloud messaging linkers for AWS and Apache Kafka:
- Kafka: producer/consumer topic detection across Java, Python, Go, TS
- SQS: queue URL and queue name extraction, send/receive matching
- SNS: topic ARN detection, publish/subscribe patterns
- EventBridge: event bus, rule, and put-events pattern detection
Message broker protocol linkers:
- GCP Pub/Sub: topic/subscription detection, Terraform subscriber configs
- RabbitMQ: exchange/queue binding, AMQP topic wildcard matching
- MQTT: topic publish/subscribe with wildcard (+/#) matching
- NATS: subject publish/subscribe with wildcard (*/>) matching
- Redis Pub/Sub: channel publish/subscribe detection
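The wildcard matching named for MQTT and NATS above can be sketched as one level-by-level comparison. This is a minimal illustration, not the code in the servicelink_*.c files: wildcard_match, mqtt_topic_match, and nats_subject_match are hypothetical names, and MQTT's special handling of topics starting with `$` is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Length of the level starting at s, up to (not including) sep or NUL. */
static size_t level_len(const char *s, char sep)
{
    size_t n = 0;
    while (s[n] != '\0' && s[n] != sep) n++;
    return n;
}

/* Hypothetical helper: `one` matches exactly one level, `rest` matches all
 * remaining levels and is only valid as the final pattern level. */
static bool wildcard_match(const char *pattern, const char *topic,
                           char sep, char one, char rest,
                           bool rest_matches_parent)
{
    assert(pattern != NULL && topic != NULL);
    const char *p = pattern;
    const char *t = topic;

    for (;;) {
        size_t plen = level_len(p, sep);
        bool p_last = (p[plen] == '\0');

        /* Multi-level wildcard: accept whatever remains of the topic. */
        if (plen == 1 && p[0] == rest)
            return p_last;

        size_t tlen = level_len(t, sep);
        bool t_last = (t[tlen] == '\0');
        bool single = (plen == 1 && p[0] == one);

        if (!single && (plen != tlen || memcmp(p, t, plen) != 0))
            return false;

        if (p_last && t_last) return true;
        if (p_last) return false;          /* topic has extra levels */
        if (t_last) {
            /* Topic ended first: only "parent/<rest>" may still match. */
            const char *next = p + plen + 1;
            return rest_matches_parent && next[0] == rest && next[1] == '\0';
        }
        p += plen + 1;
        t += tlen + 1;
    }
}

/* MQTT: '/' separator, '+' single level, '#' multi level (also matches the parent). */
bool mqtt_topic_match(const char *filter, const char *topic)
{
    return wildcard_match(filter, topic, '/', '+', '#', true);
}

/* NATS: '.' separator, '*' single token, '>' one or more trailing tokens. */
bool nats_subject_match(const char *pattern, const char *subject)
{
    return wildcard_match(pattern, subject, '.', '*', '>', false);
}
```

With these rules, `sport/#` matches both `sport` and `sport/tennis/score`, while the NATS pattern `orders.>` matches `orders.eu.created` but not `orders` on its own.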
Real-time and RPC protocol linkers: - WebSocket: connection URL detection, send/receive message matching - SSE: EventSource URL detection, event stream endpoint matching - tRPC: router procedure definitions, client hook call matching
Activates the linker files added by the prior cherry-picks:
- Makefile.cbm: add 14 servicelink_*.c to PIPELINE_SRCS, add 14 TEST_SERVICELINK_*_SRCS test declarations, extend ALL_TEST_SRCS
- pass_servicelinks.c: restore the LINKERS dispatch table to the full 14-entry list and remove the empty-table guard
- pipeline.c: allocate cbm_sl_endpoint_list_t at function top (alongside path_aliases) so cleanup can free it safely even when the early cancel check goto's into cleanup before ctx is declared
- test_main.c: register the 14 suite_servicelink_* test suites
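The pipeline.c item describes a common C hazard: a `goto cleanup` that runs before a later declaration leaves that variable uninitialized when the cleanup code reads it. A minimal sketch of the fix, assuming a simplified stand-in for the real types (endpoint_list_t and run_pipeline are hypothetical, not the PR's cbm_sl_endpoint_list_t or cbm_pipeline_run):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in; the real endpoint list layout is not shown here. */
typedef struct {
    int *items;
    size_t count;
} endpoint_list_t;

/* Returns 0 on success, -1 if cancelled or out of memory. */
int run_pipeline(bool cancelled, size_t *out_count)
{
    assert(out_count != NULL);
    int rc = -1;

    /* Allocated at function top, before any goto, so the cleanup label
     * always sees either NULL or a fully initialized list. */
    endpoint_list_t *endpoints = calloc(1, sizeof *endpoints);
    *out_count = 0;
    if (endpoints == NULL) goto cleanup;

    if (cancelled) goto cleanup;            /* early cancel check */

    endpoints->items = malloc(3 * sizeof *endpoints->items);
    if (endpoints->items == NULL) goto cleanup;
    endpoints->count = 3;

    *out_count = endpoints->count;
    rc = 0;

cleanup:
    if (endpoints != NULL) {
        free(endpoints->items);             /* free(NULL) is a no-op */
        free(endpoints);
    }
    return rc;
}
```

Because calloc zeroes the struct, the cancelled path frees a NULL items pointer, which is safe.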
Cross-project matching:
- Endpoint registry collects all producers/consumers during indexing
- _crosslinks.db stores cross-project links with confidence scores (exact=0.95 for identical strings, normalized=0.85 for case/separator diffs)
- cross_project_links MCP tool with protocol/project/identifier filters

Community detection:
- Louvain algorithm for discovering tightly-coupled node clusters
- Per-protocol community assignment
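The two confidence tiers above can be illustrated with a small scoring function. This is a sketch under assumptions: match_confidence is a hypothetical name, and the separator set (`-`, `_`, `.`) is a guess at what "separator diffs" covers, not taken from pass_crossrepolinks.c.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define CONF_EXACT      0.95   /* identical strings */
#define CONF_NORMALIZED 0.85   /* equal after ignoring case and separators */

/* Assumed separator set; the real matcher may use a different one. */
static int is_separator(char c)
{
    return c == '-' || c == '_' || c == '.';
}

/* Compare two identifiers, ignoring case and skipping separators. */
static int normalized_equal(const char *a, const char *b)
{
    for (;;) {
        while (*a != '\0' && is_separator(*a)) a++;
        while (*b != '\0' && is_separator(*b)) b++;
        if (*a == '\0' || *b == '\0')
            return *a == *b;               /* equal only if both ended */
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
}

/* Returns 0.95, 0.85, or 0.0 (no match) for a producer/consumer pair. */
double match_confidence(const char *producer, const char *consumer)
{
    assert(producer != NULL && consumer != NULL);
    if (strcmp(producer, consumer) == 0) return CONF_EXACT;
    if (normalized_equal(producer, consumer)) return CONF_NORMALIZED;
    return 0.0;
}
```

Under these assumptions, `order-created` against `Order_Created` scores 0.85, while `orders` against `order` does not match at all.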
Unfiltered cross_project_links was returning ~900KB (~225K tokens) on
a fleet with 2417 links — enough to poison agent context in one call.
Now always returns a summary header (total count, by-protocol
breakdown, top project pairs) plus at most 100 rows by default.
Adds limit, offset, and summary_only parameters.
Before: unfiltered = 898,308 bytes (~224K tokens)
After: unfiltered = 36,589 bytes (~9K tokens), 25× smaller
summary_only = 1,028 bytes (~257 tokens)
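The limit/offset/summary_only behaviour can be sketched as a window computation over the full result set. The default of 100 comes from the commit message and the cap of 1000 from the PR description; page_window and page_t are hypothetical names, and treating a non-positive limit as "use the default" is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define DEFAULT_LIMIT 100
#define MAX_LIMIT     1000

typedef struct {
    size_t first;   /* index of the first row to emit */
    size_t count;   /* number of rows to emit after the summary header */
} page_t;

/* Compute which rows follow the summary header for one call. */
page_t page_window(size_t total, long limit, long offset, int summary_only)
{
    page_t page = {0, 0};

    if (summary_only) return page;          /* header only, no rows */

    if (limit <= 0) limit = DEFAULT_LIMIT;  /* assumed: unset means default */
    if (limit > MAX_LIMIT) limit = MAX_LIMIT;
    if (offset < 0) offset = 0;

    if ((size_t)offset >= total) {          /* past the end: empty page */
        page.first = total;
        return page;
    }

    page.first = (size_t)offset;
    size_t remaining = total - page.first;
    page.count = remaining < (size_t)limit ? remaining : (size_t)limit;

    assert(page.first + page.count <= total);
    return page;
}
```

For the 2417-link fleet mentioned above, an unfiltered call with no parameters would emit rows 0 through 99, and a call with offset 2400 would emit the final 17 rows regardless of how large a limit is requested.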
Activates the files added by the prior cherry-picks:
- Makefile.cbm: add pass_communities.c and pass_crossrepolinks.c to PIPELINE_SRCS; add TEST_COMMUNITIES_SRCS, TEST_ENDPOINT_PERSISTENCE_SRCS, and TEST_CROSS_PROJECT_LINKS_SRCS to ALL_TEST_SRCS
- pipeline_internal.h: declare cbm_pipeline_pass_communities
- pipeline.c: call cbm_pipeline_pass_communities after the service-link pass; call cbm_persist_endpoints to persist collected endpoints; call cbm_cross_project_link to compute cross-project links after dump
- test_main.c: register suite_communities, suite_endpoint_persistence, and suite_cross_project_links
- tests/test_endpoint_persistence.c: restored (exercises cbm_persist_endpoints which lands in this PR)
Removes stale-fact drift from the fork era (language/agent counts, install one-liner, feature bullets) flagged in PR DeusData#295's close comment. No URL substitutions involved: README's links already pointed at DeusData; this only reverts the content body.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Cross-repo protocol matching, Louvain community detection on the resulting graph, and the paginated cross_project_links MCP tool that surfaces matches.

Stacked on #377; please review #376 then #377 first.
Commits
- feat: add cross-repo protocol linking and community detection
  Creates pass_crossrepolinks.c (the producer↔consumer matcher, exact 0.95 + normalized 0.85) and pass_communities.c (Louvain clustering on service-link edges). Stores results in a separate _crosslinks.db (this storage choice is unified in PR 5).
- feat: add paginated summary guard to cross_project_links
  New params: limit (default 100, max 1000), offset, summary_only. Always emits a summary header (total, by-protocol breakdown, top-10 project pairs). Unfiltered output dropped from ~225K tokens to ~9K tokens on a 19-project cache in the original measurement from #295 (Cross-project HTTP edges + unified storage + paginated cross_project_links).
- build: wire cross-repo pass + community detection into pipeline
  Calls cbm_pipeline_pass_communities, cbm_persist_endpoints, and cbm_cross_project_link from cbm_pipeline_run; restores tests/test_endpoint_persistence.c (its dependency cbm_persist_endpoints lives in this PR); wires Makefile + test_main.c.

Test plan
- ./scripts/test.sh passes (3810/3810, ASan + UBSan)
- cross_link_* battery exercises exact/normalized/same-project/no-match/multi-protocol/missing-table/http-skipped/unresolved-qn/idempotent-rerun

Upstream overlap audit (re-checked against upstream/main @ 6226972)
Since this PR was opened, the audit has been re-run on current upstream. Findings:
- src/pipeline/pass_cross_repo.c:107-684: full cross-repo matcher; writes CROSS_* edges to per-project edges tables
- src/mcp/mcp.c:append_cross_repo_summary (line 1777): cross-repo results surfaced in get_architecture
- src/mcp/mcp.c index_repository(mode=cross-repo-intelligence) (line 2399): exposes the matcher via MCP
- pass_communities.c (Louvain community detection): no upstream equivalent
- cross_project_links MCP tool with pagination, summary_only, and per-protocol filters: upstream surfaces cross-repo only via the narrative summary inside get_architecture
- Drop pass_crossrepolinks.c as a parallel matcher (it duplicates upstream). Keep pass_communities.c and the dedicated MCP reader. The MCP reader should query upstream's per-project CROSS_* edges directly, which is exactly the storage unification that PR #380 (Storage unification + incremental parity + MCP reader migration, PR 5/5) establishes.

This PR remains a draft until it has been reviewed against this audit. PR #380 establishes the architectural reconciliation (cedes 4 protocols to upstream); the consolidated shape of this PR depends on how that lands.
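For readers unfamiliar with what pass_communities.c optimizes: Louvain greedily moves nodes between communities to increase modularity. The sketch below only computes that modularity score for a given assignment on an unweighted, undirected edge list; it is not the PR's implementation, and edge_t, modularity, and the fixed MAX_COMMUNITIES bound are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_COMMUNITIES 64   /* arbitrary bound for this sketch */

typedef struct {
    int a;
    int b;
} edge_t;

/* Modularity Q = sum over communities c of
 *   intra_edges(c) / m  -  (degree_sum(c) / (2m))^2
 * where m is the total number of edges. community[i] is the community
 * index of node i and must be in [0, n_communities). */
double modularity(const edge_t *edges, size_t n_edges,
                  const int *community, int n_communities)
{
    assert(edges != NULL && community != NULL);
    assert(n_edges > 0);
    assert(n_communities > 0 && n_communities <= MAX_COMMUNITIES);

    double intra[MAX_COMMUNITIES] = {0};
    double degree[MAX_COMMUNITIES] = {0};

    for (size_t i = 0; i < n_edges; i++) {
        int ca = community[edges[i].a];
        int cb = community[edges[i].b];
        degree[ca] += 1.0;                 /* each endpoint adds one degree */
        degree[cb] += 1.0;
        if (ca == cb) intra[ca] += 1.0;    /* edge stays inside a community */
    }

    double m = (double)n_edges;
    double q = 0.0;
    for (int c = 0; c < n_communities; c++) {
        double share = degree[c] / (2.0 * m);
        q += intra[c] / m - share * share;
    }
    return q;
}
```

On two triangles joined by a single bridge edge, assigning each triangle to its own community gives a modularity of about 0.357, while putting all six nodes in one community gives 0. That gap is what makes a tightly-coupled cluster show up as a separate community.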