[Hackathon] feat: Data Profiling Panel — Instant Dataset Analysis Before You Run #5114

Open

EmilySun621 wants to merge 4 commits into apache:main from EmilySun621:hackathon/data-profiling


EmilySun621 commented on May 16, 2026

🎯 Problem

Researchers drag a CSV onto the canvas and start building operators blind — they don't know how many missing values there are, which columns are IDs, or whether the target variable is imbalanced. They find out after the workflow fails.

💡 Solution

One-click data profiling that reads the real CSV and tells you everything before you write a single operator.


✨ Features

1. Data Quality Score (0-100)

Single number summarizing dataset health, with sub-score breakdown:

  • ✅ Completeness — missing value percentage across columns
  • ✅ Duplicates — duplicate row detection
  • ⚠️ Outliers — values beyond 3 standard deviations
  • ⚠️ Constant columns — zero-information columns
  • ⚠️ High cardinality — likely ID columns
  • ⚠️ Class imbalance — skewed target distribution

Color scale: 🟢 90-100 Excellent · 🟡 70-89 Good · 🟠 50-69 Needs attention · 🔴 0-49 Poor
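The sub-scores above can be folded into the single 0-100 number as a penalty sum. A minimal sketch of that idea; the `SubScores` shape, weights, and caps below are illustrative assumptions, not the PR's exact formula:

```typescript
// Hypothetical sub-score inputs; field names and weights are assumptions.
interface SubScores {
  missingPercent: number;    // 0-100, missing cells across all columns
  duplicatePercent: number;  // 0-100, exact duplicate rows
  outlierPercent: number;    // 0-100, values beyond 3 standard deviations
  constantColumns: number;   // count of zero-information columns
  highCardinality: number;   // count of likely ID columns
  imbalanceRatio: number;    // majority/minority class ratio, 1 = balanced
}

function qualityScore(s: SubScores): number {
  let score = 100;
  score -= s.missingPercent * 0.5;                    // completeness penalty
  score -= s.duplicatePercent * 0.5;                  // duplicates penalty
  score -= s.outlierPercent * 0.3;                    // outliers penalty
  score -= s.constantColumns * 5;                     // per constant column
  score -= s.highCardinality * 3;                     // per likely ID column
  score -= Math.min((s.imbalanceRatio - 1) * 2, 15);  // imbalance, capped
  return Math.max(0, Math.round(score));
}

// Maps the score onto the color-scale labels from the description above.
function scoreLabel(score: number): string {
  if (score >= 90) return "Excellent";
  if (score >= 70) return "Good";
  if (score >= 50) return "Needs attention";
  return "Poor";
}
```

A perfectly clean dataset (all penalties zero, ratio 1) scores 100 and lands in "Excellent", matching the Iris demo below.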


2. Suggested Cleaning Actions

Rule-based, no LLM — zero hallucination risk:

  • 🔧 Impute — "HbA1c has 12.3% missing — use median imputation" → [Add to Workflow]
  • 🗑️ Drop column — "smoker_flag has only 1 unique value" → [Add to Workflow]
  • 📋 Remove duplicates — "23 rows (3.0%) are exact duplicates" → [Add to Workflow]
  • 🏷️ Flag ID — "patient_id is 100% unique — drop before modeling" → [Add to Workflow]
  • 📊 Review outliers — "income has 42 outliers (5.5%) beyond 3σ" → [Copy hint]

Sorted by severity: critical → warning → info.
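A rule engine like this is a loop over per-column stats plus a severity sort. The sketch below shows the shape of that logic; the `ColumnProfile` fields, thresholds, and message wording are assumptions rather than the PR's exact rules:

```typescript
type Severity = "critical" | "warning" | "info";

// Assumed per-column inputs for the rules; not the PR's actual types.
interface ColumnProfile {
  name: string;
  missingPercent: number;
  uniqueCount: number;
  rowCount: number;
}

interface Suggestion {
  severity: Severity;
  message: string;
}

const SEVERITY_ORDER: Record<Severity, number> = { critical: 0, warning: 1, info: 2 };

function suggestActions(cols: ColumnProfile[]): Suggestion[] {
  const out: Suggestion[] = [];
  for (const c of cols) {
    if (c.missingPercent > 50) {
      out.push({ severity: "critical", message: `${c.name} is ${c.missingPercent}% missing; consider dropping` });
    } else if (c.missingPercent > 0) {
      out.push({ severity: "warning", message: `${c.name} has ${c.missingPercent}% missing; use median imputation` });
    }
    if (c.uniqueCount === 1) {
      out.push({ severity: "warning", message: `${c.name} has only 1 unique value; drop column` });
    }
    if (c.uniqueCount === c.rowCount) {
      out.push({ severity: "info", message: `${c.name} is 100% unique; likely an ID, drop before modeling` });
    }
  }
  // Sort critical -> warning -> info, as described above.
  return out.sort((a, b) => SEVERITY_ORDER[a.severity] - SEVERITY_ORDER[b.severity]);
}
```

Because every rule is a plain threshold check on computed statistics, the output is deterministic for a given dataset, which is the "zero hallucination risk" property claimed above.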


3. Column Role Detection (auto)

Heuristic classification of each column's ML role:

  • 🎯 Target — columns named target/label/class/outcome, or low-cardinality categoricals
  • 🏷️ ID — high-cardinality columns or names matching id/index/patient_id
  • 📊 Feature — numeric and categorical columns for modeling
  • 📅 Datetime — date/time columns
  • Constant — single-value columns (flag for removal)

Summary: "1 possible target: Species · 1 ID: Id · 4 features: SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm"
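The role heuristics above boil down to an ordered series of checks per column. A hypothetical version of that classifier; the name regexes and the cardinality threshold for "possible target" are guesses at the PR's rules:

```typescript
type Role = "target" | "id" | "feature" | "datetime" | "constant";

// Assumed per-column inputs; dtype inference happens upstream.
interface ColumnInfo {
  name: string;
  dtype: "numeric" | "categorical" | "datetime";
  uniqueCount: number;
  rowCount: number;
}

// Name patterns are illustrative approximations of the heuristics above.
const TARGET_NAMES = /^(target|label|class|outcome)$/i;
const ID_NAMES = /(^id$|_id$|^index$)/i;

function detectRole(c: ColumnInfo): Role {
  if (c.uniqueCount <= 1) return "constant";                        // zero information
  if (c.dtype === "datetime") return "datetime";
  if (ID_NAMES.test(c.name) || c.uniqueCount === c.rowCount) return "id";
  if (TARGET_NAMES.test(c.name) ||
      (c.dtype === "categorical" && c.uniqueCount <= 10)) return "target";
  return "feature";
}
```

On Iris this yields the summary quoted above: `Species` is a low-cardinality categorical (possible target), `Id` matches the ID pattern, and the four measurement columns fall through to "feature".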


4. Per-Column Statistics

Each column card displays:

  • Name + type badge (numeric/categorical) + role badge (Target/ID/Feature)
  • Numeric columns: mean, median, std, min, max, range
  • Numeric columns: inline SVG histogram (10 bins)
  • Categorical columns: unique count, top values with counts
  • Missing value warning (red highlight if > 10%)
  • Role suggestion ("Use as input feature" / "Drop before modeling")
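The inline SVG histograms need a binning step first. A minimal sketch of the 10-bin computation, assuming equal-width bins (the actual binning strategy is not spelled out in the PR):

```typescript
// Bucket numeric values into equal-width bins; returns per-bin counts.
function histogram(values: number[], binCount = 10): number[] {
  const nums = values.filter(v => Number.isFinite(v)); // drop missing/NaN
  if (nums.length === 0) return new Array(binCount).fill(0);
  const min = Math.min(...nums);
  const max = Math.max(...nums);
  const width = (max - min) / binCount || 1; // avoid divide-by-zero for constant columns
  const bins = new Array(binCount).fill(0);
  for (const v of nums) {
    // Clamp so the maximum value lands in the last bin, not one past it.
    const i = Math.min(Math.floor((v - min) / width), binCount - 1);
    bins[i] += 1;
  }
  return bins;
}
```

The bin counts then map directly to SVG `rect` heights in the column card.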

5. Overview Tabs

  • Columns — all column cards with stats and histograms
  • Missing — missing value summary across all columns
  • Correlations — correlation matrix for numeric columns

6. Real Data, Not Mock

Reads the actual CSV through Texera's file-service API. Parses, computes statistics, and renders in real time. Mock data fallback only if API fails.
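Per the commit messages below, the real parse step uses papaparse with a 5,000-row cap. As a self-contained illustration of that cap-and-type flow, here is a simplified stand-in parser (no quoting or escaping support, unlike papaparse; the dynamic typing and missing-to-null mapping are assumptions about the service's behavior):

```typescript
interface ParsedCsv {
  headers: string[];
  rows: Record<string, string | number | null>[];
}

// Simplified CSV parse: header row, then up to maxRows data rows.
function parseCsv(text: string, maxRows = 5000): ParsedCsv {
  const lines = text.trim().split(/\r?\n/);
  if (lines.length === 0 || lines[0] === "") return { headers: [], rows: [] };
  const headers = lines[0].split(",");
  const rows = lines.slice(1, 1 + maxRows).map(line => {
    const cells = line.split(",");
    const row: Record<string, string | number | null> = {};
    headers.forEach((h, i) => {
      const cell = cells[i] ?? "";
      if (cell === "") row[h] = null;                                // missing value
      else if (!Number.isNaN(Number(cell))) row[h] = Number(cell);   // dynamic typing
      else row[h] = cell;                                            // keep as string
    });
    return row;
  });
  return { headers, rows };
}
```

Capping the parse at 5,000 rows keeps profiling interactive on large files while still giving stable statistics.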


📸 Screenshots

| Quality Score + Suggestions | Column Roles + Stats |
| -- | -- |
| Quality Score bar, sub-score badges, cleaning action cards with "Add to Workflow" buttons | Auto-detected roles, overview stats (rows/cols/dupes), per-column histograms |

🎬 Demo

  1. Open workflow → click CSV File Scan operator
  2. Properties panel → click "📊 Profile Data"
  3. Modal opens → Quality Score: 100/100 "Excellent" (Iris is clean)
  4. Column Roles: Species = 🎯 possible target, Id = 🏷️ ID, 4 📊 features
  5. Scroll columns → see histograms for SepalLength, PetalWidth
  6. Switch to diabetes dataset → Score drops, suggestions appear: "Impute HbA1c", "Drop patient_id"

📁 Files Changed

New files:

  • data-profiling-panel/data-profiling.types.ts — Profile, Column, Suggestion, Role types
  • data-profiling-panel/data-profiling.utils.ts — Quality score, suggestions, role detection algorithms
  • data-profiling-panel/data-profiling.service.ts — CSV fetch, parse, compute stats
  • data-profiling-panel/data-profiling-panel.component.* — Panel UI with score, suggestions, columns, correlations
  • data-profiling-panel/data-profiling-modal.component.ts — Modal wrapper

Modified (additive only):

  • operator-property-edit-frame — Added "📊 Profile Data" button for CSV/scan operators

✅ Testing

  • Angular typecheck: clean
  • Profile button renders on CSV File Scan operators
  • Real Iris.csv: 150 rows, 6 columns, Score 100, correct role detection
  • Histograms render for all numeric columns
  • Suggestions generate correctly for datasets with issues
  • Quality score formula produces consistent results

Emily Sun and others added 4 commits May 15, 2026 21:55
This bundles the feature work that built up on this branch:

- Custom agents: dashboard CRUD page and editor dialog (48px icon tile,
  chip-style guardrails, model selector). Each custom agent now carries a
  LiteLLM model_name (Opus 4.7 / Haiku 4.5) that is passed through to the
  agent-service so different agents can use different models.

- Conversation history is scoped per (workflowId, agentId): switching
  agent or workflow yields a different conversation list. localStorage
  key: texera.workflowConversations.v1.{workflowId}.{agentId}.
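  The scoping rule above amounts to a one-line key builder; a hypothetical helper matching the quoted format:

  ```typescript
  // Hypothetical helper; the key format comes from the commit message above.
  function conversationKey(workflowId: string, agentId: string): string {
    return `texera.workflowConversations.v1.${workflowId}.${agentId}`;
  }
  ```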

- Time machine: workflow snapshot list, revert, and agent-tagged
  checkpoints. New workflow-history-tool in agent-service backs the
  "undo my last change" flow; amber gains a WorkflowSnapshotResource;
  sql/updates/23.sql adds the snapshot table.

- Operator-aware custom-agent prompts: the system prompt now injects the
  full operator catalog with a "prefer built-in operators over Python
  UDFs" rule, sourced from WorkflowSystemMetadata at request time.

- LiteLLM: added the claude-opus-4.7 entry alongside claude-haiku-4.5
  and gpt-5-mini in bin/litellm-config.yaml.

- Agent panel rewritten around the (conversation list / chat) two-view
  model with subscription-managed list reloads and per-step persistence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, role detection

Adds a Data Profiling Panel triggered from data-source operator properties
(CSV/JSON/Parquet/FileScan). The panel surfaces three derived views on top of
a single profile response — no new backend calls:

  - Data Quality Score (0–100): completeness, duplicates, outliers, constant
    columns, high-cardinality categoricals, and class-imbalance penalties,
    with a colored progress bar and sub-score badges.
  - Auto-Suggest Cleaning Actions: severity-sorted rules (drop sparse/ID/
    constant cols, impute via median/mode, deduplicate, review outliers) with
    an Add-to-Workflow button that copies an operator hint to the clipboard.
  - Column Relationship Detector: heuristic ID/target/feature/datetime/
    constant classification with badges per column and an auto-detected
    summary section.

Wires a small "📊 Profile Data" button into the operator property editor that
opens the panel as a draggable modal seeded with the operator's file path.
Backend integration is intentionally a follow-up; the service ships a
deterministic mock so the UX is fully exercised.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ble rule

Adds a console.debug so we can see which operatorType the selected operator
has (helps when the rule fails to match because of an unexpected name). Also
broadens the profileable regex to include Text/File so anything that looks
remotely like a data source shows the button.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DataProfilingService now fetches the actual dataset file via
DatasetService.retrieveDatasetVersionSingleFile (presign-download endpoint),
parses with papaparse (first 5000 rows for performance), and runs a new
pure-TS profiler that computes:

  - dtype inference per column (numeric / datetime / boolean / categorical / text)
  - per-column: count, missing, missingPercent, unique, plus dtype-specific stats
  - numeric: mean, median, std, min, max, ±3σ outlier count, 10-bin histogram
  - categorical/boolean: top-5 value counts
  - dataset-level: row-key duplicate count
  - Pearson correlation matrix across (up to 8) numeric columns
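The Pearson matrix mentioned above reduces to a pairwise coefficient over each column pair. A minimal sketch of that computation (how the real profiler handles missing values per pair is an assumption; this version just truncates to equal length):

```typescript
// Pearson correlation coefficient of two numeric samples, in [-1, 1].
function pearson(x: number[], y: number[]): number {
  const n = Math.min(x.length, y.length);
  const mx = x.slice(0, n).reduce((a, b) => a + b, 0) / n;
  const my = y.slice(0, n).reduce((a, b) => a + b, 0) / n;
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < n; i++) {
    num += (x[i] - mx) * (y[i] - my);
    dx += (x[i] - mx) ** 2;
    dy += (y[i] - my) ** 2;
  }
  const denom = Math.sqrt(dx * dy);
  return denom === 0 ? 0 : num / denom; // 0 for zero-variance columns
}
```

Limiting the matrix to 8 numeric columns keeps this O(k^2 * n) step cheap in the browser.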

If the source isn't a dataset path or any step fails (fetch / parse / empty
headers), we fall back to the deterministic mock so the panel always renders.
The panel header now shows a short filename (full path on hover) and surfaces
fetch/parse errors inline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions bot added the engine, ddl-change, frontend, dev, common, and agent-service labels on May 16, 2026