Improve sqllogictest speed by creating only a single large file rather than 2 by Tim-53 · Pull Request #20586 · apache/datafusion

Tim-53 · 2026-02-26T23:00:59Z

Draft as it builds on #20576

Which issue does this PR close?

Part of Speedup execution of sqllogictests with more parallelization #20524
Follow on to Speedup sqllogictests by running long running tests first #20576 from @alamb

Rationale for this change

The execution time of the test is dominated by writing the parquet files. By writing the file once and reusing it, instead of creating two, we reduce the execution time by around 30%.

What changes are included in this PR?

Building on #20576, we reuse the parquet file the test needs instead of recreating it.

Are these changes tested?

Ran the test with the following results:

| | Baseline (2 files) | Optimized (1 file) |
|---|---|---|
| Min | 33.000s | 22.653s |
| Max | 37.662s | 25.489s |
| Avg | 34.427s | 24.092s |
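As a quick check of the "around 30%" figure, the relative reduction can be computed directly from the timings in the table. This is only an arithmetic sketch; the helper name `reduction_pct` is made up for illustration and is not part of the PR.

```python
# Timings in seconds, copied from the results table: (baseline, optimized).
timings = {
    "Min": (33.000, 22.653),
    "Max": (37.662, 25.489),
    "Avg": (34.427, 24.092),
}


def reduction_pct(baseline: float, optimized: float) -> float:
    """Relative reduction in runtime, as a percentage of the baseline."""
    return (baseline - optimized) / baseline * 100.0


reductions = {name: reduction_pct(b, o) for name, (b, o) in timings.items()}

for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% less time than baseline")
```

On the average row this works out to roughly 30%, which matches the claim in the rationale.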

One open question: does the correctness of this regression test rely on having two physically separate files? The race condition in #17197 was in the execution layer — both scans would still be independent DataSourceExec nodes with independent readers, so I believe the behavior is preserved. But if there's any concern, we could use system cp to copy the file and register two physical files while still only paying the generate_series cost once.
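The `cp` fallback mentioned above could look roughly like the following sketch: generate the data once, then copy it so that two physically separate files exist. Everything here is hypothetical. `write_expensive_file` is a plain-text stand-in for the `generate_series` plus parquet write, and the file names are invented; the actual test is a `.slt` file, not Python.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path


def write_expensive_file(path: Path, rows: int) -> None:
    """Hypothetical stand-in for the costly generate_series + parquet write."""
    with path.open("w") as f:
        for i in range(rows):
            f.write(f"{i}\n")


def make_two_physical_files(workdir: Path, rows: int) -> tuple[Path, Path]:
    """Generate the data once, then copy it so two distinct files exist on disk."""
    first = workdir / "t1.data"
    second = workdir / "t2.data"
    write_expensive_file(first, rows)  # pay the generation cost once
    shutil.copyfile(first, second)     # cheap byte-for-byte copy, like `cp`
    return first, second


with tempfile.TemporaryDirectory() as tmp:
    a, b = make_two_physical_files(Path(tmp), rows=1000)
    same_content = filecmp.cmp(a, b, shallow=False)  # identical bytes
    distinct_files = not a.samefile(b)               # but separate files
    print(same_content, distinct_files)
```

The point of the design is that the copy is far cheaper than regenerating the data, so the test would still see two physical files if that ever turned out to matter.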

Are there any user-facing changes?

alamb · 2026-02-27T11:56:06Z

Thank you 🙏

I left a note on Fix HashJoinExec sideways information passing for partitioned queries #17197 (comment), asking the original authors if they could double check.

adriangb · 2026-02-27T12:03:39Z

datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs

    }
 }

+// trigger ci test


Can this be removed? Also, just in case it's helpful: `git commit -m "ci" --allow-empty --no-verify`

(I think this is left over from #20566 -- when this PR gets rebased it should be removed)

adriangb

I don't think the test needs two physically distinct files. As long as it's two different execution nodes that should be good enough!

alamb

I took the liberty of rebasing this PR against main.

I think it looks good to me

alamb · 2026-03-02T16:15:31Z

Thanks again @Tim-53

@Tim-53

…#20652)

## Which issue does this PR close?

- Closes #20524

## Rationale for this change

`push_down_filter_regression.slt` is the sqllogictest that takes the longest to run, even after @Tim-53 reduced its time in

- #20586

While reviewing #20586 and trying to make the sqllogictest runs faster, I noticed that a substantial amount of the unit test time was spent doing zstd compression/decompression:

<img width="2423" height="841" alt="Screenshot 2026-03-02 at 12 50 24 PM" src="https://github.com/user-attachments/assets/75cfe12b-3bb2-4ffa-9c36-63ca00b8c3ff" />

Thus, we can improve the test speed by skipping the zstd step

## What changes are included in this PR?

1. Don't compress the parquet files in the test

## Are these changes tested?

Yes by CI

Here are my performance runs using @kosiew's new timing feature

```shell
cargo test --profile=ci --test sqllogictests -- --timing-summary top
```

Main:

```
Per-file elapsed summary (deterministic):
1. 4.035s push_down_filter_regression.slt <-- takes over 4 seconds
2. 3.573s joins.slt
3. 3.492s aggregate.slt
4. 3.316s imdb.slt
5.
```

This PR

```
Per-file elapsed summary (deterministic):
1. 3.308s aggregate.slt
2. 3.290s joins.slt
3. 3.181s imdb.slt
4. 2.914s push_down_filter_regression.slt <--- takes less than 3 seconds and is no longer the tallest pole
```

## Are there any user-facing changes?

Faster tests
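To compare the two timing summaries quoted in the commit message above file by file, a small script along these lines could be used. It is a sketch that assumes each summary line has the shape `N. <seconds>s <file>` as pasted there; the regular expression and helper name are illustrative and not part of the sqllogictest runner.

```python
import re

# Lines copied from the two summaries in the commit message (annotations dropped).
main_summary = """\
1. 4.035s push_down_filter_regression.slt
2. 3.573s joins.slt
3. 3.492s aggregate.slt
4. 3.316s imdb.slt
"""

pr_summary = """\
1. 3.308s aggregate.slt
2. 3.290s joins.slt
3. 3.181s imdb.slt
4. 2.914s push_down_filter_regression.slt
"""

# Assumed line shape: rank, elapsed seconds with an "s" suffix, file name.
LINE = re.compile(r"^\s*\d+\.\s+([\d.]+)s\s+(\S+)")


def parse(summary: str) -> dict[str, float]:
    """Map each file name to its elapsed seconds."""
    out = {}
    for line in summary.splitlines():
        m = LINE.match(line)
        if m:
            out[m.group(2)] = float(m.group(1))
    return out


before, after = parse(main_summary), parse(pr_summary)
deltas = {name: before[name] - after[name] for name in before}

# Print the files ordered by how much time was saved.
for name, saved in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {before[name]:.3f}s -> {after[name]:.3f}s (saved {saved:.3f}s)")
```

With these numbers, `push_down_filter_regression.slt` shows the largest saving (about 1.1 seconds) and `aggregate.slt` becomes the slowest file.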

github-actions bot added the optimizer (Optimizer rules) and sqllogictest (SQL Logic Tests (.slt)) labels Feb 26, 2026

Tim-53 mentioned this pull request Feb 26, 2026

Speedup sqllogictests by running long running tests first #20576

Merged

alamb changed the title ~~Perf/reuse parquet file push down filter regression~~ Improve sqllogictest speed by creating only a single large file rather than 2 Feb 27, 2026

alamb mentioned this pull request Feb 27, 2026

Fix HashJoinExec sideways information passing for partitioned queries #17197

Merged

alamb mentioned this pull request Feb 27, 2026

Split push_down_filter.slt into standalone sqllogictest files to reduce long-tail runtime #20566

Merged

adriangb reviewed Feb 27, 2026


reuse parquet file in push_down_filter_regression test

6a36b9e

alamb force-pushed the perf/reuse-parquet-file-push-down-filter-regression branch from acce9a4 to 6a36b9e March 1, 2026 13:40

github-actions bot removed the optimizer (Optimizer rules) label Mar 1, 2026

alamb approved these changes Mar 1, 2026


alamb marked this pull request as ready for review March 1, 2026 13:41

alamb approved these changes Mar 1, 2026


alamb added this pull request to the merge queue Mar 2, 2026

Merged via the queue into apache:main with commit 0af9ff5 Mar 2, 2026
28 checks passed

alamb mentioned this pull request Mar 2, 2026

Speedup push_down_filter_regression.slt by using uncompressed parquet #20652

Merged

alamb merged 1 commit into apache:main from Tim-53:perf/reuse-parquet-file-push-down-filter-regression

3 participants
