
[Hackathon] feat: BioFlow Genesis — the AI reads your dataset, not your prompt #5122

Open
yangzhang75 wants to merge 5 commits into apache:main from yangzhang75:hackathon-bioflow-genesis



Demo Video

https://drive.google.com/file/d/10wiyRbZVvXGEn5lw5Wvws5WyjD521ICz/view?usp=sharing

What changes were proposed in this PR?

This PR adds BioFlow Genesis, a drag-and-drop entry point that turns any CSV into a running ML workflow. The user drops a file onto the dashboard, an LLM profiles the columns and proposes four analyses grounded in the actual data, and one click materializes a wired Texera workflow with real sklearn training and a Python UDF that writes a five-section interpretation of the run.

The interaction is the point. Genesis reads the dataset, not the user. A biology PhD or social scientist doesn't need to know what task type fits their data, which target column matters, or how to wire operators — they drop the file, look at four typed recommendations, and click. Free-text input below the cards lets advanced users override the recommendation in plain English (typing "predict diabetes using random forest" swaps the trainer node). Genesis is not a workflow preview, not a schema mockup, not a code template — it is a working pipeline that trains a model on real data and returns results, on every drop.

How the workflow is generated

Workflow JSON generation is the most failure-prone surface in LLM-driven workflow tools — one wrong port name or one inverted link breaks the run. Most existing approaches handle this with a self-repair retry loop: prompt the LLM, validate, retry up to N times on failure. Genesis avoids that entirely by keeping the LLM out of the JSON path.

The LLM (Claude Haiku via the existing LLM_ENDPOINT) returns plain text only: profiling notes, recommended task type, target column, algorithm, and the four card titles. A deterministic Python module (core/workflow_builder.py) emits Texera JSON. The skeletons are tested code paths, so LLM output never touches operator IDs, port maps, or link wiring. Validation cannot fail at the LLM layer because the LLM doesn't produce structure. The wiring is correct by construction, the LLM token budget per workflow drops from thousands to a few hundred, and any future skeleton author edits Python instead of debugging a prompt.
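The division of labor can be sketched as follows. This is a minimal illustration, not the service's actual schema: the `Recommendation` fields, operator property names, and link keys are assumptions. The point it demonstrates is the one above — the LLM only fills plain-string fields, and a deterministic builder owns every ID and every link.

```python
import uuid
from dataclasses import dataclass

@dataclass
class Recommendation:
    """Plain-text fields the LLM is allowed to produce -- no structure."""
    task_type: str       # e.g. "classification"
    target_column: str   # e.g. "Outcome"
    algorithm: str       # e.g. "LogisticRegression"

def build_workflow(rec: Recommendation, csv_path: str) -> dict:
    """Deterministic builder: operator IDs, ports, and links come from
    tested Python code, never from the model."""
    def op(op_type: str, **props) -> dict:
        return {"operatorID": f"{op_type}-{uuid.uuid4()}",
                "operatorType": op_type,
                "operatorProperties": props}

    ops = [
        op("CSVFileScan", fileName=csv_path),
        op(f"Sklearn{rec.algorithm}", target=rec.target_column),
    ]
    # Links are derived from adjacency in the skeleton, so the wiring is
    # correct by construction no matter what text the LLM returned.
    links = [{"fromOpId": a["operatorID"], "toOpId": b["operatorID"]}
             for a, b in zip(ops, ops[1:])]
    return {"operators": ops, "links": links}
```

Because the model's output never reaches the structural layer, a hallucinated port name can at worst produce a bad label, never a broken graph.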

The skeleton library

  • Classification — 6 nodes (CSVFileScan → Projection → Split → SklearnLogisticRegression or RandomForest → SklearnPrediction → AI Insight UDF).
  • Regression — 7 nodes, same shape as classification plus automatic feature preprocessing (median NaN imputation and categorical encoding) so real-world CSVs with mixed types work out of the box.
  • Exploration — 4 nodes computing Pearson correlations against the target, with the AI Insight card surfacing the strongest drivers.
  • AutoML — 10 nodes running three trainers (LogReg, DecisionTree, RandomForest) in parallel and aggregating their holdout accuracies.
  • Visualization — 5 nodes producing distribution charts (Aggregate → PieChart → AI Insight).
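A registry mapping the recommended task type to its operator chain might look like the sketch below. The classification and visualization chains follow the operators named in the list above; the exploration chain's middle operators are assumptions (only the node count and the Pearson/AI Insight endpoints are stated in the PR).

```python
# Hypothetical skeleton registry: one ordered operator chain per task type.
SKELETONS: dict[str, list[str]] = {
    "classification": ["CSVFileScan", "Projection", "Split",
                       "SklearnLogisticRegression", "SklearnPrediction",
                       "AIInsightUDF"],
    "exploration": ["CSVFileScan", "Projection",            # assumed
                    "PearsonCorrelation", "AIInsightUDF"],
    "visualization": ["CSVFileScan", "Projection",          # assumed
                      "Aggregate", "PieChart", "AIInsightUDF"],
}

def skeleton_for(task_type: str) -> list[str]:
    """Look up the operator chain; an unknown type fails loudly rather
    than letting the LLM invent a workflow shape."""
    if task_type not in SKELETONS:
        raise ValueError(f"no skeleton for task type {task_type!r}")
    return SKELETONS[task_type]
```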

All skeletons compose existing Texera operators only. No new operator types were added, no engine changes, no protocol changes. The skeletons demonstrate that Texera's operator catalog is already sufficient to express the entire non-technical-researcher entry path.

The AI Insight card

The AI Insight node at the end of every skeleton is what closes the loop for non-technical users. Most workflow tools stop at "your model scored X%." The Insight UDF reads the prediction table, computes accuracy or R² depending on the task, picks the top three features from the column metadata, and emits a five-section result table — summary, top predictors, interpretation, next steps, caveat — that renders as a structured card in the result panel. This is the artifact a researcher actually hands to their team: not a number, but a readable explanation of what the model learned.
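The metric-selection logic the Insight UDF needs can be sketched as below; the section keys mirror the five sections described above, while the function name and return shape are illustrative, not the UDF's real interface.

```python
def insight_sections(task_type: str, y_true: list, y_pred: list,
                     top_features: list[str]) -> dict:
    """Sketch of the five-section insight table: metric choice follows
    the detected task type (accuracy vs. R^2)."""
    if task_type == "classification":
        correct = sum(t == p for t, p in zip(y_true, y_pred))
        metric = f"accuracy {correct / len(y_true):.1%}"
    else:  # regression: R^2 against the mean baseline
        mean = sum(y_true) / len(y_true)
        ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
        ss_tot = sum((t - mean) ** 2 for t in y_true)
        metric = f"R^2 {1 - ss_res / ss_tot:.2f}"
    return {
        "summary": f"Model scored {metric} on the holdout split.",
        "top_predictors": ", ".join(top_features[:3]),
        "interpretation": "(LLM-written prose goes here)",
        "next_steps": "(LLM-written prose goes here)",
        "caveat": "Holdout metrics may not generalize beyond this sample.",
    }
```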

Verified end-to-end on real public datasets

The same product was run on a medical CSV and a real-estate CSV to confirm that skeleton choice comes from the data itself, not from any preset. Both ran to completion with real sklearn training, real holdout splits, and real metrics.

Pima Indians Diabetes (768 rows, originally collected by the U.S. National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) — the same NIH agency that funds dkNET — and a standard medical ML benchmark since 1988) was auto-detected as classification, produced a 6-node LogisticRegression pipeline, scored 72.5% accuracy on a 20% holdout split, and the AI Insight card surfaced Glucose, BMI, and Age as the top three predictors.

California Housing (20,640 rows, UC Irvine source) was auto-detected as regression, produced a 7-node LinearRegression pipeline with automatic feature preprocessing, scored R² = 0.63 and MAE around $51K, and the AI Insight card surfaced longitude, latitude, and housing_median_age as top predictors with a residuals-and-extrapolation caveat.

Free-text input: typing "predict diabetes using random forest" produced a workflow with SklearnRandomForest as the trainer instead of the default LogisticRegression. The LLM parsed the algorithm name from the natural-language sentence; the Python builder swapped the trainer node accordingly, and the pipeline ran end-to-end to completion.
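The trainer swap itself is deterministic once the algorithm name is extracted. A sketch of the mapping (the dictionary and the keyword fallback are illustrative; in the service the extraction is done by the LLM):

```python
# Hypothetical map from parsed algorithm phrases to Texera trainer nodes.
KNOWN_TRAINERS = {
    "random forest": "SklearnRandomForest",
    "logistic regression": "SklearnLogisticRegression",
    "decision tree": "SklearnDecisionTree",
}

def trainer_from_text(sentence: str,
                      default: str = "SklearnLogisticRegression") -> str:
    """Extract only the algorithm *name* from the user's sentence; the
    builder swaps the trainer node, never the graph structure."""
    lowered = sentence.lower()
    for phrase, node_type in KNOWN_TRAINERS.items():
        if phrase in lowered:
            return node_type
    return default
```

An unrecognized phrase falls back to the recommended default rather than failing the build.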

The cross-domain switch is the part worth checking. The only user action in all three cases is one drop (or one drop plus one sentence). The skeleton diverges fundamentally between domains because the recommendation is grounded in the data, not in a template.

What is in this PR

The new directory bioflow-genesis-service/ is a standalone FastAPI service on port 9099:

  • core/classifier.py — data profiling, task inference, target detection, algorithm selection, four-card generation, and free-text intent parsing.
  • core/workflow_builder.py — skeleton builders that emit Texera JSON. Includes the preprocessing UDF code, the AI Insight UDF code, and trainer-node selection.
  • core/texera_client.py — thin wrapper around the standard POST /api/workflow/persist endpoint.
  • core/llm_client.py and core/prompts.py — LLM client and prompt templates.
  • api/build.py — POST /api/genesis/build endpoint consumed by the frontend.
  • tests/test_workflow_builder.py — 10 unit tests covering all skeletons. All passing.
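For the thin persist wrapper, the contract can be sketched with the standard library alone. Only the endpoint path comes from this PR; the payload field names and base URL here are assumptions.

```python
import json
import urllib.request

def persist_request(base_url: str, workflow: dict,
                    name: str) -> urllib.request.Request:
    """Build a request against the standard Texera persist endpoint.
    Payload keys are illustrative; sending is left to the caller."""
    payload = json.dumps({"name": name,
                          "content": json.dumps(workflow)}).encode()
    return urllib.request.Request(
        f"{base_url}/api/workflow/persist",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```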

Frontend integration ships in a companion commit on the same branch: a drop zone on the dashboard, a card grid modal that animates from analysis to build, a free-text natural language input below the cards, and the 5-section AI Insight card rendering in the result panel.

This PR does not touch the Amber engine, does not add any new Texera operators, and uses the standard /api/workflow/persist endpoint without modification. The LLM never emits workflow JSON, only text.

Any related issues, documentation, discussions?

See Discussion #5059.

How was this PR tested?

pytest bioflow-genesis-service/tests/test_workflow_builder.py — 10 passed.
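The central invariant such skeleton tests can enforce is sketched below (the helper name and exact JSON keys are assumptions): every link in a builder's output must reference operator IDs that actually exist, so a wiring bug is caught before any LLM or engine is involved.

```python
def assert_wiring_closed(workflow: dict) -> None:
    """Every link endpoint must be a real operator ID in the same graph."""
    ids = {op["operatorID"] for op in workflow["operators"]}
    for link in workflow["links"]:
        assert link["fromOpId"] in ids, f"dangling source {link['fromOpId']}"
        assert link["toOpId"] in ids, f"dangling target {link['toOpId']}"

# A well-formed two-node graph passes; a typo'd ID would raise.
workflow = {
    "operators": [{"operatorID": "scan-1"}, {"operatorID": "train-1"}],
    "links": [{"fromOpId": "scan-1", "toOpId": "train-1"}],
}
assert_wiring_closed(workflow)
```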

End-to-end manual testing on three cases:

  • Pima Diabetes: dropped diabetes.csv onto the dashboard, four recommendation cards appeared in a few seconds, clicked the first card, 6-node workflow generated, hit Run, completed in about 10 seconds, 72.5% accuracy on the holdout split, AI Insight card rendered with Glucose / BMI / Age as top predictors.
  • California Housing: dropped houses.csv, four cards (all regression) appeared, clicked the first card, 7-node workflow with feature preprocessing generated, hit Run, completed in about 15 seconds, R² = 0.63, MAE around $51K, AI Insight rendered with longitude / latitude / housing_median_age.
  • Free-text input: dropped diabetes.csv, typed "predict diabetes using random forest", hit Build, workflow generated with RandomForest trainer, hit Run, ran end-to-end to completion.

Cross-domain check confirmed: dropping the two CSVs in sequence produces fundamentally different running pipelines with no user instruction beyond the drop itself.

Future work

The drag-and-drop recommendation entry point is the core interaction shape this PR establishes. Several directions extend naturally from here:

More skeletons. Time-series forecasting, clustering, anomaly detection, and multi-class with imbalance handling all fit the existing recommendation pipeline — each is one new skeleton function plus a profiling rule.

Beyond CSV. The data-profiling layer is the only file-format-aware code in the system; the workflow builder is format-agnostic. Adding FASTQ / VCF / BAM (bioinformatics), Parquet, or image folders is a matter of new profilers and matching Texera operators. The recommendation logic and the AI Insight card stay unchanged. The path to genuinely biomedical, omics-grade workflows is short.
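The format boundary described here could be expressed as a small profiler interface; everything in this sketch (class names, profile keys) is hypothetical, but it shows why adding FASTQ or Parquet support would not touch the builder.

```python
from typing import Protocol

class Profiler(Protocol):
    """Only this layer knows about file formats; the workflow builder
    consumes the resulting profile dict and stays format-agnostic."""
    def can_profile(self, path: str) -> bool: ...
    def profile(self, path: str) -> dict: ...

class CsvProfiler:
    def can_profile(self, path: str) -> bool:
        return path.lower().endswith(".csv")
    def profile(self, path: str) -> dict:
        # the real profiler samples rows and infers column types
        return {"format": "csv", "columns": []}

PROFILERS: list[Profiler] = [CsvProfiler()]  # a FASTQ/Parquet profiler slots in here

def profile(path: str) -> dict:
    for p in PROFILERS:
        if p.can_profile(path):
            return p.profile(path)
    raise ValueError(f"unsupported file type: {path}")
```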

One-drop compound workflows. A research dataset often feeds multiple analyses. An obvious extension is letting the cards combine — "run classification AND find drivers" — by composing two skeletons into a parent workflow with shared upstream nodes, still produced from a single drop.

The intent of this submission is to demonstrate that the data-driven recommendation entry point — drop a file, see four typed analyses, click one, run — is the right interaction shape for non-technical researchers, and to ship it as a working, end-to-end system rather than a preview or design document.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude 4.7

Yang Zhang and others added 5 commits May 16, 2026 14:44
Wire Genesis agent POST with ?source=genesis & wid, cap ReAct at 30 steps,
strip delete tools from the model tool map, and align Bob prompts with
plan-then-execute plus non-template Iris reference.

Co-authored-by: Cursor <cursoragent@cursor.com>