[Hackathon] feat: Data Profiling Panel — Instant Dataset Analysis Before You Run#5114
Open
EmilySun621 wants to merge 4 commits into
This bundles the feature work that built up on this branch:
- Custom agents: dashboard CRUD page and editor dialog (48px icon tile,
chip-style guardrails, model selector). Each custom agent now carries a
LiteLLM model_name (Opus 4.7 / Haiku 4.5) that is passed through to the
agent-service so different agents can use different models.
- Conversation history is scoped per (workflowId, agentId): switching
agent or workflow yields a different conversation list. localStorage
key: texera.workflowConversations.v1.{workflowId}.{agentId}.
- Time machine: workflow snapshot list, revert, and agent-tagged
checkpoints. New workflow-history-tool in agent-service backs the
"undo my last change" flow; amber gains a WorkflowSnapshotResource;
sql/updates/23.sql adds the snapshot table.
- Operator-aware custom-agent prompts: the system prompt now injects the
full operator catalog with a "prefer built-in operators over Python
UDFs" rule, sourced from WorkflowSystemMetadata at request time.
- LiteLLM: added the claude-opus-4.7 entry alongside claude-haiku-4.5
and gpt-5-mini in bin/litellm-config.yaml.
- Agent panel rewritten around the (conversation list / chat) two-view
model with subscription-managed list reloads and per-step persistence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
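The per-(workflowId, agentId) scoping described above comes down to how the localStorage key is built. A minimal sketch, assuming a hypothetical helper name and assuming IDs that serialize cleanly into the key (the commit does not show the actual helper or ID types):

```typescript
// Key prefix taken from the commit message; the helper itself is hypothetical.
const CONVERSATION_KEY_PREFIX = "texera.workflowConversations.v1";

// Builds the storage key for one (workflow, agent) pair.
// Assumption: workflowId is numeric or string, agentId is a string.
function conversationStorageKey(workflowId: number | string, agentId: string): string {
  return `${CONVERSATION_KEY_PREFIX}.${workflowId}.${agentId}`;
}
```

Because both IDs are part of the key, switching either the workflow or the agent reads a different conversation list without any extra filtering.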
…, role detection
Adds a Data Profiling Panel triggered from data-source operator properties
(CSV/JSON/Parquet/FileScan). The panel surfaces three derived views on top of
a single profile response — no new backend calls:
- Data Quality Score (0–100): completeness, duplicates, outliers, constant
columns, high-cardinality categoricals, and class-imbalance penalties,
with a colored progress bar and sub-score badges.
- Auto-Suggest Cleaning Actions: severity-sorted rules (drop sparse/ID/
constant cols, impute via median/mode, deduplicate, review outliers) with
an Add-to-Workflow button that copies an operator hint to the clipboard.
- Column Relationship Detector: heuristic ID/target/feature/datetime/
constant classification with badges per column and an auto-detected
summary section.
Wires a small "📊 Profile Data" button into the operator property editor that
opens the panel as a draggable modal seeded with the operator's file path.
Backend integration is intentionally a follow-up; the service ships a
deterministic mock so the UX is fully exercised.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
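The penalty-based Data Quality Score described above could look roughly like the sketch below. The input shape, the field names, and every weight are assumptions made for illustration; the commit lists the penalty categories but not how they are weighted.

```typescript
// Hypothetical inputs; percentages are in the range 0..100.
interface QualityInputs {
  missingPercent: number;         // share of empty cells
  duplicatePercent: number;       // share of duplicate rows
  outlierPercent: number;         // share of numeric cells flagged as outliers
  constantColumns: number;        // columns with a single distinct value
  highCardinalityColumns: number; // categoricals with too many distinct values
  imbalancedTarget: boolean;      // whether the likely target is imbalanced
}

// Starts at 100 and subtracts one penalty per category.
// All weights below are illustrative assumptions, not the PR's values.
function qualityScore(q: QualityInputs): number {
  let score = 100;
  score -= q.missingPercent * 0.5;
  score -= q.duplicatePercent * 0.3;
  score -= q.outlierPercent * 0.2;
  score -= q.constantColumns * 5;
  score -= q.highCardinalityColumns * 3;
  if (q.imbalancedTarget) score -= 10;
  // Clamp so the result always fits the 0-100 scale.
  return Math.max(0, Math.min(100, Math.round(score)));
}
```

Keeping each penalty as a separate term makes it straightforward to surface the same numbers as sub-score badges.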
…ble rule
Adds a console.debug so we can see what operatorType is on the selected
operator (helps when the rule doesn't match an unexpected name). Also
broadens the profileable regex to include Text/File so anything that looks
remotely like a data source shows the button.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
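A sketch of what such a broadened rule might look like. The pattern and function name are hypothetical, since the commit does not show the actual regex; only the operator families (CSV/JSON/Parquet/FileScan plus Text/File) come from the commit messages.

```typescript
// Hypothetical pattern: matches any operator type whose name hints at a data source.
// No "g" flag, so test() keeps no state between calls.
const PROFILEABLE_OPERATOR = /csv|json|parquet|scan|text|file/i;

// Returns true when the "Profile Data" button should be shown for this operator type.
function isProfileable(operatorType: string): boolean {
  return PROFILEABLE_OPERATOR.test(operatorType);
}
```

A substring match this loose trades precision for coverage: it is meant to show the button on anything that looks like a source, so an unrelated operator whose name happens to contain one of these words would also match.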
DataProfilingService now fetches the actual dataset file via
DatasetService.retrieveDatasetVersionSingleFile (presign-download endpoint),
parses with papaparse (first 5000 rows for performance), and runs a new
pure-TS profiler that computes:
- dtype inference per column (numeric / datetime / boolean / categorical / text)
- per-column: count, missing, missingPercent, unique, plus dtype-specific stats
- numeric: mean, median, std, min, max, ±3σ outlier count, 10-bin histogram
- categorical/boolean: top-5 value counts
- dataset-level: row-key duplicate count
- Pearson correlation matrix across (up to 8) numeric columns
If the source isn't a dataset path or any step fails (fetch / parse / empty
headers), we fall back to the deterministic mock so the panel always renders.
The panel header now shows a short filename (full path on hover) and surfaces
fetch/parse errors inline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
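The numeric statistics and the correlation entries listed above can be sketched in plain TypeScript as follows. Function and field names are hypothetical, and two details are assumptions because the commit does not state them: the standard deviation here is the population form (divide by n), and histogram bins are equal-width between min and max.

```typescript
interface NumericStats {
  mean: number;
  median: number;
  std: number;
  min: number;
  max: number;
  outlierCount: number; // values more than 3 standard deviations from the mean
  histogram: number[];  // counts per equal-width bin
}

function numericStats(values: number[], bins: number = 10): NumericStats {
  if (values.length === 0) {
    throw new Error("numericStats needs at least one value");
  }
  const n = values.length;
  const sorted = values.slice().sort((a, b) => a - b);
  const min = sorted[0];
  const max = sorted[n - 1];
  let sum = 0;
  for (const v of values) sum += v;
  const mean = sum / n;
  const mid = Math.floor(n / 2);
  const median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  let squared = 0;
  for (const v of values) squared += (v - mean) * (v - mean);
  const std = Math.sqrt(squared / n); // population std (assumption)
  const histogram: number[] = [];
  for (let i = 0; i < bins; i++) histogram.push(0);
  const width = (max - min) / bins;
  let outlierCount = 0;
  for (const v of values) {
    if (std > 0 && Math.abs(v - mean) > 3 * std) outlierCount += 1;
    // The maximum value would land one past the end, so clamp to the last bin.
    const idx = width === 0 ? 0 : Math.min(bins - 1, Math.floor((v - min) / width));
    histogram[idx] += 1;
  }
  return { mean, median, std, min, max, outlierCount, histogram };
}

// Pearson correlation of two equally indexed columns; NaN when either is constant.
function pearson(xs: number[], ys: number[]): number {
  const n = Math.min(xs.length, ys.length);
  let sx = 0;
  let sy = 0;
  for (let i = 0; i < n; i++) {
    sx += xs[i];
    sy += ys[i];
  }
  const mx = sx / n;
  const my = sy / n;
  let cov = 0;
  let vx = 0;
  let vy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - mx;
    const dy = ys[i] - my;
    cov += dx * dy;
    vx += dx * dx;
    vy += dy * dy;
  }
  const denom = Math.sqrt(vx * vy);
  return denom === 0 ? NaN : cov / denom;
}
```

A correlation matrix is then `pearson` applied to each pair of numeric columns. Capping the number of columns, as the commit does at 8, bounds the quadratic number of pairs.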
🎯 Problem
Researchers drag a CSV onto the canvas and start building operators blind — they don't know how many missing values there are, which columns are IDs, or whether the target variable is imbalanced. They find out after the workflow fails.
💡 Solution
One-click data profiling that reads the real CSV and reports missing values, likely ID and target columns, and class imbalance before you add a single operator.
✨ Features
1. Data Quality Score (0-100)
Single number summarizing dataset health, with sub-score breakdown:
Color scale: 🟢 90-100 Excellent · 🟡 70-89 Good · 🟠 50-69 Needs attention · 🔴 0-49 Poor
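The color scale translates directly into threshold checks. A minimal sketch, with a hypothetical function name and the assumption that scores are rounded to integers before banding:

```typescript
type QualityBand = "Excellent" | "Good" | "Needs attention" | "Poor";

// Maps a 0-100 score to the band named in the color scale.
// Rounding and clamping first are assumptions, so fractional or
// out-of-range inputs still land in exactly one band.
function qualityBand(score: number): QualityBand {
  const s = Math.max(0, Math.min(100, Math.round(score)));
  if (s >= 90) return "Excellent";
  if (s >= 70) return "Good";
  if (s >= 50) return "Needs attention";
  return "Poor";
}
```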
2. Suggested Cleaning Actions
Rule-based, no LLM — zero hallucination risk:
Sorted by severity: critical → warning → info.
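The severity ordering can be expressed as a rank lookup plus a sort. The `Suggestion` shape below is a hypothetical simplification; only the three severity levels and their order come from the description.

```typescript
type Severity = "critical" | "warning" | "info";

// Hypothetical minimal shape; the real Suggestion type likely carries more fields.
interface Suggestion {
  severity: Severity;
  message: string;
}

// Lower rank sorts first: critical, then warning, then info.
const SEVERITY_RANK: Record<Severity, number> = { critical: 0, warning: 1, info: 2 };

// Returns a sorted copy so the caller's array is left untouched.
function sortBySeverity(suggestions: Suggestion[]): Suggestion[] {
  return suggestions.slice().sort((a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity]);
}
```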
3. Column Role Detection (auto)
Heuristic classification of each column's ML role:
Summary: "1 possible target: Species · 1 ID: Id · 4 features: SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm"
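One way such a heuristic could be written is sketched below. The rules, thresholds, and names are all assumptions chosen to reproduce the Iris summary above; the description does not spell out the rules the panel actually applies.

```typescript
type Dtype = "numeric" | "datetime" | "boolean" | "categorical" | "text";
type Role = "id" | "target" | "feature" | "datetime" | "constant";

// Hypothetical per-column summary, reduced to what the rules below need.
interface ColumnSummary {
  name: string;
  dtype: Dtype;
  count: number;  // non-missing values
  unique: number; // distinct non-missing values
}

// Illustrative rules, checked in order; every threshold is an assumption.
function detectRole(col: ColumnSummary, isLastColumn: boolean): Role {
  if (col.unique <= 1) return "constant";
  if (col.dtype === "datetime") return "datetime";
  // All values distinct and the name ends in "id" (or the column is free text).
  const looksLikeIdName = /(^|_)id$/i.test(col.name);
  if (col.unique === col.count && (looksLikeIdName || col.dtype === "text")) return "id";
  // A low-cardinality label in the last position is treated as a possible target.
  const lowCardinality = col.unique >= 2 && col.unique <= 20;
  const labelLike = col.dtype === "categorical" || col.dtype === "boolean";
  if (isLastColumn && lowCardinality && labelLike) return "target";
  return "feature";
}
```

With illustrative counts for an Iris-like file, `Id` (all values distinct) is classified as an ID, `Species` (three classes, last column) as a possible target, and the four measurement columns as features.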
4. Per-Column Statistics
Each column card displays:
5. Overview Tabs
6. Real Data, Not Mock
Reads the actual CSV through Texera's file-service API, then parses it, computes statistics, and renders the results in real time. Mock data is used only as a fallback, when the source is not a dataset path or a fetch or parse step fails.
📸 Screenshots
🎬 Demo
📁 Files Changed
New files:
- data-profiling-panel/data-profiling.types.ts: Profile, Column, Suggestion, Role types
- data-profiling-panel/data-profiling.utils.ts: Quality score, suggestions, role detection algorithms
- data-profiling-panel/data-profiling.service.ts: CSV fetch, parse, compute stats
- data-profiling-panel/data-profiling-panel.component.*: Panel UI with score, suggestions, columns, correlations
- data-profiling-panel/data-profiling-modal.component.ts: Modal wrapper
Modified (additive only):
- operator-property-edit-frame: Added "📊 Profile Data" button for CSV/scan operators
✅ Testing