Skip to content

Improve sqllogicteset speed by creating only a single large file rather than 2#20586

Merged
alamb merged 1 commit intoapache:mainfrom
Tim-53:perf/reuse-parquet-file-push-down-filter-regression
Mar 2, 2026
Merged

Improve sqllogicteset speed by creating only a single large file rather than 2#20586
alamb merged 1 commit intoapache:mainfrom
Tim-53:perf/reuse-parquet-file-push-down-filter-regression

Conversation

@Tim-53
Copy link
Contributor

@Tim-53 Tim-53 commented Feb 26, 2026

Draft as it builds on #20576

Which issue does this PR close?

Rationale for this change

Execution time of the test is dominated by the time writing the parquet files. By reusing the file we can gain around 30% improvement on the execution time here.

What changes are included in this PR?

Building on #20576 we reuse the needed parquet file for the test instead of recreating it.

Are these changes tested?

Ran the test with following results:

Baseline (2 files) Optimized (1 file)
Min 33.000s 22.653s
Max 37.662s 25.489s
Avg 34.427s 24.092s

One open question: does the correctness of this regression test rely on having two physically separate files? The race condition in #17197 was in the execution layer — both scans would still be independent DataSourceExec nodes with independent readers, so I believe the behavior is preserved. But if there's any concern, we could use system cp to copy the file and register two physical files while still only paying the generate_series cost once.

Are there any user-facing changes?

@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Feb 26, 2026
@alamb alamb changed the title Perf/reuse parquet file push down filter regression Improve sqllogicteset speed by creating only a single large file rather than 2 Feb 27, 2026
@alamb
Copy link
Contributor

alamb commented Feb 27, 2026

Thank you 🙏

I left a note on

Asking the original authors if they could double check

}
}

// trigger ci test
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be removed? Also just in case it's helpful: git commit -m "ci" --allow-empty --no-verify

Copy link
Contributor

@alamb alamb Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I think this is left over from #20566 -- when this PR gets rebased it should be removed)

Copy link
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the test needs two physically distinct files. As long as it's two different execution nodes that should be good enough!

@alamb alamb force-pushed the perf/reuse-parquet-file-push-down-filter-regression branch from acce9a4 to 6a36b9e Compare March 1, 2026 13:40
@github-actions github-actions bot removed the optimizer Optimizer rules label Mar 1, 2026
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took the liberty of rebasing this PR against main.

I think it looks good to me

@alamb alamb marked this pull request as ready for review March 1, 2026 13:41
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took the liberty of rebasing this PR against main.

I think it looks good to me

@alamb alamb added this pull request to the merge queue Mar 2, 2026
@alamb
Copy link
Contributor

alamb commented Mar 2, 2026

Thanks again @Tim-53

Merged via the queue into apache:main with commit 0af9ff5 Mar 2, 2026
28 checks passed
github-merge-queue bot pushed a commit that referenced this pull request Mar 4, 2026
…#20652)

## Which issue does this PR close?

- Closes #20524

## Rationale for this change


`push_down_filter_regression.slt ` is the sqllogictest that takes the
longest to run, even after @Tim-53 reduced its time in
- #20586

While reviewing #20586 and
trying to make the sqllogictest runs faster, I noticed that a
substantial amount of the unit test time was spent doing zstd
compression/decompression:

<img width="2423" height="841" alt="Screenshot 2026-03-02 at 12 50
24 PM"
src="https://github.com/user-attachments/assets/75cfe12b-3bb2-4ffa-9c36-63ca00b8c3ff"
/>

Thus, we can improve the test speed by skipping the zstd step

## What changes are included in this PR?

1. Don't compress the parquet files in the test

## Are these changes tested?

Yes by CI

Here are my performance runs using @kosiew 's new timing feature
```shell
cargo test --profile=ci --test sqllogictests  -- --timing-summary top
```

Main:
```
Per-file elapsed summary (deterministic):
1.    4.035s  push_down_filter_regression.slt  <-- takes  over 4 seconds
2.    3.573s  joins.slt
3.    3.492s  aggregate.slt
4.    3.316s  imdb.slt
5. ```

This PR
```
Per-file elapsed summary (deterministic):
1.    3.308s  aggregate.slt 
2.    3.290s  joins.slt
3.    3.181s  imdb.slt
4. 2.914s push_down_filter_regression.slt <--- takes less than 3 seconds
and is no longer the tallest pole
```

## Are there any user-facing changes?

Faster tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants