Improve sqllogictest speed by creating only a single large file rather than 2 #20586
Conversation
Thank you 🙏 I left a note asking the original authors if they could double check
`// trigger ci test`
Can be removed? Also just in case it's helpful: `git commit -m "ci" --allow-empty --no-verify`
(I think this is left over from #20566 -- when this PR gets rebased it should be removed)
adriangb left a comment:
I don't think the test needs two physically distinct files. As long as it's two different execution nodes that should be good enough!
Force-pushed from acce9a4 to 6a36b9e
alamb left a comment:
I took the liberty of rebasing this PR against main.
I think it looks good to me
Thanks again @Tim-53
…#20652)

## Which issue does this PR close?

- Closes #20524

## Rationale for this change

`push_down_filter_regression.slt` is the sqllogictest that takes the longest to run, even after @Tim-53 reduced its time in
- #20586

While reviewing #20586 and trying to make the sqllogictest runs faster, I noticed that a substantial amount of the unit test time was spent doing zstd compression/decompression:

<img width="2423" height="841" alt="Screenshot 2026-03-02 at 12 50 24 PM" src="https://github.com/user-attachments/assets/75cfe12b-3bb2-4ffa-9c36-63ca00b8c3ff" />

Thus, we can improve the test speed by skipping the zstd step.

## What changes are included in this PR?

1. Don't compress the parquet files in the test

## Are these changes tested?

Yes, by CI.

Here are my performance runs using @kosiew's new timing feature:

```shell
cargo test --profile=ci --test sqllogictests -- --timing-summary top
```

Main:

```
Per-file elapsed summary (deterministic):
  1. 4.035s push_down_filter_regression.slt   <-- takes over 4 seconds
  2. 3.573s joins.slt
  3. 3.492s aggregate.slt
  4. 3.316s imdb.slt
  5.
```

This PR:

```
Per-file elapsed summary (deterministic):
  1. 3.308s aggregate.slt
  2. 3.290s joins.slt
  3. 3.181s imdb.slt
  4. 2.914s push_down_filter_regression.slt   <--- takes less than 3 seconds and is no longer the tallest pole
```

## Are there any user-facing changes?

Faster tests
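For reference, writing an uncompressed parquet fixture from SQL is a one-option change. The following is a hypothetical sketch of what such a statement could look like, assuming DataFusion's `COPY ... TO ... OPTIONS` syntax; the query, path, and row count are illustrative and not taken from the PR:

```sql
-- Hypothetical sketch (path and row count are illustrative):
-- write the test fixture without zstd so both the write and every
-- subsequent scan skip the compression step.
COPY (SELECT * FROM generate_series(1, 100000))
TO 'test_files/scratch/push_down_filter/data.parquet'
STORED AS PARQUET
OPTIONS ('format.compression' 'uncompressed');
```

The trade-off is a larger file on disk, which is usually acceptable for a scratch fixture that exists only for the duration of a test run.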
Draft as it builds on #20576
Which issue does this PR close?
Rationale for this change
Execution time of the test is dominated by the time spent writing the parquet files. By reusing the file, we gain around a 30% improvement in execution time here.
What changes are included in this PR?
Building on #20576, we reuse the needed parquet file for the test instead of recreating it.
Are these changes tested?
Ran the test with the following results:
One open question: does the correctness of this regression test rely on having two physically separate files? The race condition in #17197 was in the execution layer; both scans would still be independent `DataSourceExec` nodes with independent readers, so I believe the behavior is preserved. But if there's any concern, we could use the system `cp` to copy the file and register two physical files while still only paying the `generate_series` cost once.

Are there any user-facing changes?
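If two physically distinct files ever turn out to matter, the copy fallback is cheap. A minimal shell sketch with illustrative paths (the real test's scratch paths will differ): pay the expensive `generate_series` write once, then duplicate the file so the two scans read separate files:

```shell
# Illustrative paths; not the actual test's scratch directory.
src=/tmp/push_down_filter_a.parquet
dst=/tmp/push_down_filter_b.parquet

# Stand-in for the one expensive parquet write in the real test.
printf 'parquet-bytes-placeholder' > "$src"

# Duplicating the finished file is far cheaper than regenerating it.
cp "$src" "$dst"

cmp -s "$src" "$dst" && echo "two identical physical files"
```

Duplicating preserves byte-for-byte identical contents, so both registered tables see the same data while each scan opens its own file handle.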