perf(onboarding): optimize onboarding pipeline from ~5min to ~2min #2777
Merged
… policies
The onboarding task blocked for ~4 minutes waiting for all 30+ policy LLM calls to finish before marking completion. Sales demos suffered because the prospect stared at a spinner the entire time.
Key changes:
- Policy fan-out switched from batchTriggerAndWait to batchTrigger so the main task completes in ~35-70s instead of ~4-5min. Per-policy progress is already tracked via child metadata.
- Policy fan-out moved before the linkage gate since policies don't depend on linkage; they start generating ~30-60s earlier.
- Vendor and risk extraction (both independent LLM calls) now run in parallel via Promise.all instead of serially.
- Owner lookup parallelized with frameworkInstances query.
- Two task.updateMany calls (assignee + frequency) merged into one.
- triggerVendorResearch serial for-await loop replaced with Promise.allSettled for parallel trigger.dev RPCs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
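The last bullet above (serial loop replaced with Promise.allSettled) might look roughly like the following. This is a minimal sketch, not the PR's code: `TriggerFn`, `TriggerSummary`, and `triggerAllVendors` are hypothetical names, and the injected `trigger` function stands in for whatever trigger.dev RPC the real task makes.

```typescript
// Hypothetical stand-in for a trigger.dev RPC; the real call is not shown in this PR text.
type TriggerFn = (vendorId: string) => Promise<string>;

interface TriggerSummary {
  started: string[]; // handles returned by successful triggers
  failed: string[]; // vendor ids whose trigger call rejected
}

// Start every trigger at once and collect outcomes without aborting on the first failure.
async function triggerAllVendors(
  vendorIds: string[],
  trigger: TriggerFn,
): Promise<TriggerSummary> {
  const results = await Promise.allSettled(vendorIds.map((id) => trigger(id)));
  const summary: TriggerSummary = { started: [], failed: [] };
  results.forEach((result, index) => {
    if (result.status === "fulfilled") {
      summary.started.push(result.value);
    } else {
      summary.failed.push(vendorIds[index]);
    }
  });
  return summary;
}
```

Unlike Promise.all, Promise.allSettled never rejects, so one failed vendor trigger does not hide the results of the others.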
The link-risks-and-vendors-to-work task was taking ~2.5 minutes because risk matching (11 LLM rerank calls) and vendor matching (10 LLM rerank calls) ran sequentially. Both phases write to independent DB tables (Risk.tasks vs Vendor.tasks) and share only a read-only taskById map, so they can safely overlap. Wraps the two mapWithConcurrency calls in Promise.all. Expected wall-time reduction: ~75s (from ~150s to ~75s) since the slower phase no longer has to wait for the other to finish. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
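As an illustration of overlapping the two phases, here is a minimal sketch. The `mapWithConcurrency` below is an assumed implementation (the PR only names the helper, not its signature), and `runLinkage` and `rerank` are hypothetical stand-ins for the real matching code.

```typescript
// Assumed shape of a concurrency-limited map; the real helper's signature may differ.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker pulls the next unclaimed index until the list is exhausted.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index]);
    }
  }
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Hypothetical wrapper: both phases start together instead of one after the other.
async function runLinkage(
  risks: string[],
  vendors: string[],
  limit: number,
  rerank: (entity: string) => Promise<number>,
): Promise<{ riskScores: number[]; vendorScores: number[] }> {
  const [riskScores, vendorScores] = await Promise.all([
    mapWithConcurrency(risks, limit, rerank),
    mapWithConcurrency(vendors, limit, rerank),
  ]);
  return { riskScores, vendorScores };
}
```

Note that each side keeps its own limit, so peak concurrency across both phases can reach twice `limit`.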
…-flash
The LLM reranker dominates linkage wall time (~7s per entity). Gemini 3 Flash is significantly faster and cheaper while being sufficient for the scoring task (0-10 relevance rating on compliance task candidates). Uses the AI SDK gateway so the model swap is a one-line change with no new API key management needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… gateway
Upgrades ai@6.0.175, @ai-sdk/openai@3.0.62, @ai-sdk/anthropic@3.0.75, @ai-sdk/gateway@3.0.110, @ai-sdk/google@3.0.68 across all workspaces. The gateway v3 upgrade resolves the v2 specification compatibility warning. Reranker now uses google/gemini-3-flash via the AI gateway for faster, cheaper linkage scoring.
Fixes:
- gateway.ts ModelOptions type: LanguageModelV2 → LanguageModelV3
- policies.controller.ts: await convertToModelMessages (now async in v6)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The inlined RDS CA bundle was dropped in cd5046c in favor of Node's default trust store. The extension was failing in dev because rds-global-bundle.pem no longer exists. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same ai SDK v6 breaking change as policies.controller.ts — convertToModelMessages is now async. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…get flow
Three issues caused the onboarding tracker to lose step tracking:
1. Merged "Researching Vendors and Risks..." currentStep never matched the "Creating Risks" label in the tracker. Fixed by setting currentStep to "Researching Vendors..." initially and switching to "Creating Risks..." when vendors finish or risks start.
2. policies: true was set immediately after batchTrigger (fire-and-forget) before any policy was actually generated. Removed; the tracker already derives completion from policiesCompleted >= policiesTotal.
3. The tracker auto-minimized on run.status === COMPLETED, which now fires before policies/mitigations finish. Changed to also check that all counters (policies, vendors, risks) have reached their totals before auto-minimizing.
Also added a hasBackgroundWork branch to the COMPLETED render path so the tracker shows live per-step progress while child tasks are still running, instead of prematurely showing "Setup Complete".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
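The counter check described in item 3 could be expressed as a small pure function. The sketch below is illustrative only: `policiesCompleted` and `policiesTotal` appear in the commit message, while the vendor and risk field names and the `OnboardingMetadata` shape are assumptions.

```typescript
// Assumed metadata shape; only the policy counter names are confirmed by the commit message.
interface OnboardingMetadata {
  policiesCompleted: number;
  policiesTotal: number;
  vendorsCompleted: number;
  vendorsTotal: number;
  risksCompleted: number;
  risksTotal: number;
}

// True while any counter is still below its total, i.e. child tasks are still running.
function hasBackgroundWork(m: OnboardingMetadata): boolean {
  return (
    m.policiesCompleted < m.policiesTotal ||
    m.vendorsCompleted < m.vendorsTotal ||
    m.risksCompleted < m.risksTotal
  );
}

// Auto-minimize only when the run is COMPLETED and every counter has reached its total.
function shouldAutoMinimize(runStatus: string, m: OnboardingMetadata): boolean {
  return runStatus === "COMPLETED" && !hasBackgroundWork(m);
}
```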
…licy templates
Policy templates use {{#if pipeda}}...{{/if}} handlebars syntax but
the prompt only evaluated soc2 and hipaa, defaulting everything else
to false. Added a generic framework matcher that covers pipeda, gdpr,
iso27001, pci, nist, and ccpa so new framework conditionals don't
silently strip content.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
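One way such a generic matcher could work is sketched below. The framework keys come from the commit message; the normalization rule (lowercase, strip non-alphanumerics, substring match) and the names `KNOWN_FRAMEWORK_KEYS` and `frameworkFlags` are assumptions for illustration, not the PR's implementation.

```typescript
// Keys listed in the commit message; the matching strategy below is an assumption.
const KNOWN_FRAMEWORK_KEYS = [
  "soc2", "hipaa", "pipeda", "gdpr", "iso27001", "pci", "nist", "ccpa",
] as const;
type FrameworkKey = (typeof KNOWN_FRAMEWORK_KEYS)[number];

// Build one boolean flag per known key from the framework names selected in onboarding.
function frameworkFlags(selected: string[]): Record<FrameworkKey, boolean> {
  // "ISO 27001" -> "iso27001", "PCI DSS" -> "pcidss", "SOC 2" -> "soc2"
  const normalized = selected.map((name) =>
    name.toLowerCase().replace(/[^a-z0-9]/g, ""),
  );
  const flags = {} as Record<FrameworkKey, boolean>;
  for (const key of KNOWN_FRAMEWORK_KEYS) {
    flags[key] = normalized.some((name) => name.includes(key));
  }
  return flags;
}
```

Because every known key always receives an explicit true or false, a template conditional such as `{{#if pipeda}}` is evaluated against the selection rather than defaulting silently.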
…ound work
When the main task is COMPLETED but child tasks are still running, the switch now maps to the EXECUTING render path (which has the full expandable vendor/risk/policy lists) instead of a simplified flat view that was missing the expand arrows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nt metadata
updatePolicy.batchTrigger() creates independent runs with no parent relationship, so metadata.parent in the policy tasks was null and completion counters never reached the onboarding task's metadata. Switched to tasks.batchTrigger() which creates child runs, restoring the parent-child relationship needed for per-policy progress tracking.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The concurrencyKey was lost when switching from updatePolicy.batchTrigger to tasks.batchTrigger. Without it, the queue's concurrencyLimit: 5 applies globally across all orgs instead of per-org, causing policy runs from different onboarding sessions to block each other. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each Loader2 spinner computed its own Date.now() offset independently, causing them to rotate out of phase. Moved the offset into a single useMemo so all spinners share the same animation delay. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Policies now fire before mitigations (different phase) and mitigations use their own queues, so the original concern about slot starvation no longer applies. Running 15 policies concurrently cuts policy wall time roughly 3x. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…3-flash
All three LLM calls in update-policies-helpers.ts (content generation, format reconciliation, format check) now use google/gemini-3-flash via the AI gateway. Gemini Flash is significantly faster for structured output tasks like TipTap JSON generation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ing vendor side-effects
Three changes:
1. Policy fan-out now fires BEFORE vendor/risk extraction. Policies only need frameworks + Q&A context, so they start draining immediately instead of waiting 10-50s for vendor/risk creation.
2. All four LLM calls in onboard-organization-helpers (vendor extraction, risk extraction, vendor mitigation, risk mitigation) switched from openai gpt-4.1-mini/gpt-5-mini to google/gemini-3-flash via the AI gateway.
3. triggerVendorRiskAssessmentsViaApi and triggerVendorResearch are now fire-and-forget (void instead of await). These are side effects that were adding 2-15s to the vendor creation path without contributing to the onboarding flow.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…plate processor
Policy generation was the single slowest step — each of the 28 policies
required a 25-35s LLM call to regenerate the entire TipTap JSON document.
Since the templates already contain the full policy structure with
handlebars placeholders ({{COMPANY}}, {{#if hipaa}}...{{/if}}), the LLM
was doing work that can be done programmatically:
- {{PLACEHOLDER}} replacement with values from the onboarding Q&A
- {{#if framework}}...{{/if}} conditional evaluation based on selected
frameworks (soc2, hipaa, pipeda, gdpr, iso27001, pci, nist, ccpa)
The new process-policy-template.ts walks the TipTap JSON tree, replaces
placeholders, evaluates conditionals (including nested and multi-node
spans), and returns the processed content. No LLM call needed.
Result: ~10ms per policy instead of ~30s. 28 policies finish in under
1 second total instead of ~60s.
Also removes ~350 lines of dead LLM-based code (generatePolicyContent,
reconcileFormatWithTemplate, aiCheckFormatWithTemplate, and all their
helpers) and the OPENAI_API_KEY requirement from update-policy.ts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
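To make the placeholder step concrete, here is a minimal sketch of walking a TipTap-style JSON tree and substituting `{{PLACEHOLDER}}` tokens. It is not the code in process-policy-template.ts: the `TipTapNode` shape is simplified, `fillPlaceholders` is a hypothetical name, and conditional evaluation is deliberately left out.

```typescript
// Simplified node shape for illustration; real TipTap nodes also carry attrs and marks.
interface TipTapNode {
  type: string;
  text?: string;
  content?: TipTapNode[];
}

// Return a copy of the tree with {{UPPER_CASE}} placeholders replaced from `values`.
function fillPlaceholders(
  node: TipTapNode,
  values: Record<string, string>,
): TipTapNode {
  const next: TipTapNode = { ...node };
  if (typeof node.text === "string") {
    // Unknown placeholders and {{#if ...}} / {{/if}} markers are left untouched.
    next.text = node.text.replace(/\{\{([A-Z_]+)\}\}/g, (match, key: string) =>
      key in values ? values[key] : match,
    );
  }
  if (node.content) {
    next.content = node.content.map((child) => fillPlaceholders(child, values));
  }
  return next;
}
```

The sketch does not trim text nodes, so whitespace between adjacent inline nodes survives the substitution.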
…lling
useRealtimeRun's SSE stream closes when the parent run completes, causing the last few child task metadata updates (counter increments) to never reach the UI. This left counters stuck at e.g. 25/28. Switched to useRun with refreshInterval: 1000ms which polls the run metadata every second. This guarantees the UI eventually reflects all child completions regardless of when the parent run finishes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Some policy templates contain instruction text ("State that...",
"Define...", "Add a...") that needs to be converted to direct policy
language. The programmatic processor can't handle these since they
require judgment.
Adds a refineCueLines step that:
1. Scans processed content for cue line patterns
2. If found, batches them into a single LLM call (claude-sonnet-4.6)
3. Splices the rewrites back into the TipTap nodes
Only fires for ~10/28 policies that have cue lines. Policies without
cue lines skip the LLM entirely and remain instant. Fails soft —
if the LLM call errors, the original instruction text is preserved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
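The scan, batch, splice, and fail-soft behaviour described above can be sketched over plain lines of text. This is an assumption-laden illustration: the real step operates on TipTap nodes and calls an LLM, whereas here `CueRewriter` is an injected hypothetical function and `CUE_PATTERN` covers only the three cue phrases quoted in the commit message.

```typescript
// Hypothetical rewriter: receives all cue lines in one batch, returns one rewrite per line.
type CueRewriter = (cueLines: string[]) => Promise<string[]>;

// Only the cue phrases quoted in the commit message; the real pattern list may be longer.
const CUE_PATTERN = /^(State that|Define|Add a)\b/;

async function refineCueLines(
  lines: string[],
  rewrite: CueRewriter,
): Promise<string[]> {
  const cueIndexes = lines
    .map((line, index) => (CUE_PATTERN.test(line) ? index : -1))
    .filter((index) => index >= 0);
  // No cue lines: skip the rewriter entirely and return the input unchanged.
  if (cueIndexes.length === 0) return lines;
  try {
    const rewritten = await rewrite(cueIndexes.map((index) => lines[index]));
    // A mismatched response cannot be spliced safely, so keep the original text.
    if (rewritten.length !== cueIndexes.length) return lines;
    const result = [...lines];
    cueIndexes.forEach((lineIndex, k) => {
      result[lineIndex] = rewritten[k];
    });
    return result;
  } catch {
    // Fail soft: on any rewriter error, preserve the original instruction text.
    return lines;
  }
}
```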
The reranker just scores 0-10 relevance on pre-filtered candidates — doesn't need a heavy model. Flash lite should cut the per-call latency that was causing 44s stragglers in the linkage pipeline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same issue as the policy fix — task.batchTrigger() creates independent runs with no parent relationship, so metadata.root was null in the individual mitigation tasks and counter updates never reached the onboarding run's metadata. The tracker showed 0/N permanently. Switched both fan-out tasks to tasks.batchTrigger() which preserves the parent-child hierarchy needed for metadata.root to resolve. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
useRealtimeRun's SSE stream closes when the parent run completes, leaving metadata stale. This caused the policies page to show "Tailoring your policies" with a spinner even after 28/28 were done. Switched to useRun with refreshInterval: 1000ms in all components that track onboarding run metadata: policies-table, policy onboarding status hook, risk/vendor onboarding status hook, and ToDoOverview. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… tracking
The tracker now shows granular progress for each phase:
1. Tailoring Policies (28/28): expandable per-policy list
2. Creating Vendors: simple checkmark when done
3. Creating Risks: simple checkmark when done
4. Linking to Controls: checkmark when linkage completes
5. Assessing Vendors (0/11): expandable per-vendor mitigation list
6. Assessing Risks (0/12): expandable per-risk mitigation list
Previously vendor/risk creation and mitigation were conflated into one step, showing 0/N for a long time during linkage even though vendors and risks were already created.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 issues found across 24 files
Confidence score: 2/5
- High-confidence logic risk in `apps/app/src/trigger/tasks/onboarding/process-policy-template.ts`: nested `{{#if}}` handling may leak content from an outer false branch because `openMatch` does not short-circuit when `skipDepth > 0`, which can produce incorrect rendered policy output.
- `apps/app/src/lib/embedding/run-linkage.ts` may exceed intended rate-limit-safe throughput: running risk and vendor pipelines in parallel while each uses `MATCH_CONCURRENCY` can effectively double peak concurrency.
- A second rendering regression is likely in `apps/app/src/trigger/tasks/onboarding/process-policy-template.ts`: trimming every text node can remove meaningful spacing between adjacent TipTap nodes and merge words in final policies, so overall merge risk is elevated beyond a minor fix.
- Pay close attention to `apps/app/src/trigger/tasks/onboarding/process-policy-template.ts` and `apps/app/src/lib/embedding/run-linkage.ts`: template correctness and concurrency caps are the main user-impacting risks.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="apps/app/src/lib/embedding/run-linkage.ts">
<violation number="1" location="apps/app/src/lib/embedding/run-linkage.ts:621">
P2: Running risk and vendor pipelines in parallel with `MATCH_CONCURRENCY` on both sides can double peak concurrency and exceed the intended rate-limit-safe cap.</violation>
</file>
<file name="apps/app/src/app/(app)/[orgId]/risk/(overview)/hooks/use-onboarding-status.ts">
<violation number="1" location="apps/app/src/app/(app)/[orgId]/risk/(overview)/hooks/use-onboarding-status.ts:18">
P3: Passing an empty string to useRun triggers an unnecessary HTTP request when onboardingRunId is null/undefined. Consider passing null (with a type assertion) or undefined instead to rely on SWR's built-in conditional-fetch behavior, which skips the request entirely.</violation>
</file>
<file name="apps/app/src/trigger/tasks/onboarding/process-policy-template.ts">
<violation number="1" location="apps/app/src/trigger/tasks/onboarding/process-policy-template.ts:82">
P2: Trimming every text node can remove meaningful spacing between adjacent TipTap nodes, producing merged words in rendered policies.</violation>
<violation number="2" location="apps/app/src/trigger/tasks/onboarding/process-policy-template.ts:130">
P1: Nested `{{#if}}` blocks can leak content from a skipped outer false branch because `openMatch` processing doesn't short-circuit when `skipDepth > 0`.</violation>
</file>
…ta flag
Trigger.dev silently drops metadata writes to completed runs. Since
the main task completed before mitigations started, all per-entity
status updates (vendor_${id}_status, risksCompleted, etc.) were lost.
Fix: switch mitigations from tasks.trigger (fire-and-forget) to
tasks.triggerAndWait so the main task stays alive and metadata.root
remains writable. The user still redirects early via a readyForDashboard
metadata flag set before mitigations begin.
The redirect page now checks metadata.readyForDashboard instead of
only run.status === COMPLETED, so the user hits the dashboard in ~27s
while the task continues running for another ~90s to complete mitigations
with live progress tracking.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
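The redirect condition described above reduces to a small predicate. In this sketch the `RunSnapshot` type and `shouldRedirectToDashboard` name are hypothetical; only `readyForDashboard` and the `COMPLETED` status come from the commit message.

```typescript
// Assumed minimal view of a run; the real run object carries many more fields.
interface RunSnapshot {
  status: string;
  metadata?: { readyForDashboard?: boolean };
}

// Redirect as soon as the flag is set, without waiting for the run to finish mitigations.
function shouldRedirectToDashboard(run: RunSnapshot | undefined): boolean {
  if (!run) return false;
  return run.status === "COMPLETED" || run.metadata?.readyForDashboard === true;
}
```

Keeping the `COMPLETED` check as a fallback means a run that finishes without ever setting the flag still releases the user.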
Creating Vendors and Creating Risks steps now show the total count (e.g. "Creating Vendors 11") so the user sees how many were created. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trigger.dev doesn't support Promise.all with triggerAndWait. Changed to sequential awaits — vendor mitigations finish first, then risk mitigations. Both fan-out tasks now use batchTriggerAndWait so they stay alive until all children complete, keeping the full task hierarchy alive for metadata.root writes. Individual mitigations still run with full queue concurrency (50) within each fan-out. Risk mitigations start queueing while vendor mitigations drain, so the sequential overhead is minimal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Vendor and risk links in the tracker now include ?tab=treatment-plan so clicking a completed assessment goes directly to the generated mitigation plan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P1: Nested {{#if}} inside a false block could leak content. When
skipDepth > 0 (inside a skipped block), any new {{#if}} now always
increments skipDepth without evaluating its condition or processing
content. Previously a true inner condition would emit its content
even though the outer block was false.
P2: Text node trim removed meaningful whitespace between adjacent
TipTap inline nodes, causing merged words. Removed the trim() call
from processTextNode — placeholder replacement and conditional
evaluation don't introduce extra whitespace that needs cleaning.
P2: Parallel risk+vendor matching could double peak LLM concurrency.
Reduced MATCH_CONCURRENCY from 32 to 16 so both sides combined stay
under ~32 concurrent rerank calls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
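The P1 fix can be illustrated on a flat string rather than a TipTap tree. The sketch below is a simplified model, not the code in process-policy-template.ts: `renderConditionals` is a hypothetical name and the tokenizer handles only `{{#if name}}` and `{{/if}}`. The key point is that once `skipDepth > 0`, a nested `{{#if}}` increments the depth without its condition being evaluated.

```typescript
// Evaluate {{#if flag}}...{{/if}} blocks, including nested ones, over a plain string.
function renderConditionals(
  template: string,
  flags: Record<string, boolean>,
): string {
  // The capturing group keeps the markers in the token list.
  const tokens = template.split(/(\{\{#if \w+\}\}|\{\{\/if\}\})/);
  let skipDepth = 0; // > 0 means we are inside a block whose content is dropped
  let output = "";
  for (const token of tokens) {
    const open = /^\{\{#if (\w+)\}\}$/.exec(token);
    if (open) {
      // Inside a skipped block every nested open is skipped too, whatever its flag says.
      if (skipDepth > 0 || !flags[open[1]]) skipDepth++;
      continue;
    }
    if (token === "{{/if}}") {
      if (skipDepth > 0) skipDepth--;
      continue;
    }
    if (skipDepth === 0) output += token;
  }
  return output;
}
```

For example, with `hipaa` false and `soc2` true, `A{{#if hipaa}}B{{#if soc2}}C{{/if}}D{{/if}}E` renders as `AE`: the inner block no longer emits `C`.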
2 issues found across 7 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="apps/app/src/app/(app)/[orgId]/components/OnboardingTracker.tsx">
<violation number="1" location="apps/app/src/app/(app)/[orgId]/components/OnboardingTracker.tsx:30">
P2: The vendors step label no longer matches one emitted `currentStep` value (`Researching Vendors...`), so active-step mapping returns `null` during that phase.</violation>
<violation number="2" location="apps/app/src/app/(app)/[orgId]/components/OnboardingTracker.tsx:569">
P3: `assessing` is an active status, but item rows only treat `processing` as active, so running mitigations appear queued/idle.</violation>
</file>
- Initial currentStep was still 'Researching Vendors...' which doesn't match any tracker label. Changed to 'Tailoring Policies...' since that's the first step now.
- Vendor/risk item rows only treated 'processing' as active (spinner). The 'assessing' status (set during creation) is also an active state and now shows a spinner instead of a clock.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cubic-dev-ai re-review this
@Marfuen I have started the AI code review. It will take a few minutes to complete.
This was referenced May 7, 2026
🎉 This PR is included in version 3.45.0 🎉 The release is available on GitHub release. Your semantic-release bot 📦🚀
Summary
Major performance overhaul of the onboarding pipeline. Sales demos were suffering because prospects stared at a spinner for 4-5 minutes.
Orchestration changes
- `batchTriggerAndWait` → `tasks.batchTrigger` so the main task completes in ~27s instead of blocking for 3-4min
- `for await` loop → `Promise.allSettled`
Policy generation rewrite
- … `{{COMPANY}}` placeholders with Q&A values, evaluates `{{#if hipaa}}...{{/if}}` conditionals based on framework flags
- … `claude-sonnet-4.6` call per policy instead of regenerating the entire document
Linkage optimization
- … `Promise.all`
Model migrations
UI fixes
- `useRealtimeRun` → `useRun` with polling: SSE stream closes when parent completes, leaving metadata stale
- `task.batchTrigger` → `tasks.batchTrigger` for policies, vendors, and risks to preserve parent-child metadata hierarchy
Cleanup
- … `caBundleExtension` from trigger config (rds-global-bundle.pem no longer exists)
- … `convertToModelMessages` async change from AI SDK v6
Results
Test plan
🤖 Generated with Claude Code
Summary by cubic
Cuts onboarding from 5–6 minutes to ~2 minutes and sends users to the dashboard in ~27 seconds by parallelizing the pipeline, replacing slow policy LLMs with a fast TipTap template processor, and keeping the run alive for mitigation tracking while redirecting early. Also fixes tracking edge cases so the progress UI stays accurate during background work.
Performance
- … `tasks.batchTrigger` (fire-and-forget); vendor and risk extraction run in parallel; DB updates parallelized; mitigations use `tasks.batchTriggerAndWait` with sequential `triggerAndWait` so metadata stays writable while a `readyForDashboard` flag redirects users early.
- … `google/gemini-3.1-flash-lite-preview` via `@ai-sdk/gateway`; lowered MATCH_CONCURRENCY to 16.
- … `google/gemini-3-flash` via `@ai-sdk/gateway`; upgraded to `ai@6` and `@ai-sdk/*@3`; policy queue concurrency raised to 15; removed `caBundleExtension`; awaited `convertToModelMessages` per v6.
UI
- … `useRun` polling, stays visible during background work, and shows live per-step progress after the early redirect.
- … `currentStep` label to match “Tailoring Policies…” and treat “assessing” as an active spinner.
Written for commit da3618d. Summary will update on new commits.