
Add --dry-run to check how much data would be processed, but don't actually run query #168

Merged
hugovk merged 2 commits into ofek:master from hugovk:dry-run on Oct 15, 2025
@hugovk hugovk commented Apr 29, 2025

Collaborator

BigQuery can dry-run a query to report how much data it would process without actually executing it. This is done by setting dry_run=True and use_query_cache=False on the job config:

https://cloud.google.com/bigquery/docs/samples/bigquery-query-dry-run#bigquery_query_dry_run-python

Let's expose it with the --dry-run option.
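
As a minimal sketch of the idea (not the code from this PR), the two settings can be applied to any config object that exposes dry_run and use_query_cache attributes, as google.cloud.bigquery.QueryJobConfig does. The helper name apply_dry_run is hypothetical, and SimpleNamespace stands in for the real config so the snippet runs without BigQuery credentials.

```python
from types import SimpleNamespace


def apply_dry_run(config, dry_run=False):
    """Mark a query job config as a dry run (hypothetical helper).

    `config` is assumed to expose `dry_run` and `use_query_cache`
    attributes, like google.cloud.bigquery.QueryJobConfig.
    """
    if dry_run:
        config.dry_run = True
        # Disable the cache so a cached result does not mask the
        # number of bytes the query would process.
        config.use_query_cache = False
    return config


# Stand-in for bigquery.QueryJobConfig(), to keep the sketch self-contained.
config = apply_dry_run(
    SimpleNamespace(dry_run=False, use_query_cache=True), dry_run=True
)
print(config.dry_run, config.use_query_cache)
```

With the real library, the returned config would be passed as job_config to client.query(), and the estimate read from the job's total_bytes_processed.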

Dry run

pypinfo --dry-run --all --days 1 --percent pillow pyversion
Served from cache: False
Data processed: 179.27 MiB
Data billed: 0.00 B
Estimated cost: $0.00
pypinfo --dry-run --all --days 1 --percent --json pillow pyversion
{"last_update":"2025-04-29 12:49:20","query":{"bytes_billed":0,"bytes_processed":187974046,"cached":false,"estimated_cost":"0.00"},"rows":[]}

Real run

pypinfo --all --days 1 --percent pillow pyversion
Served from cache: False
Data processed: 179.27 MiB
Data billed: 180.00 MiB
Estimated cost: $0.01

| python_version | percent | download_count |
| -------------- | ------- | -------------- |
| 3.11           |  21.14% |        885,918 |
| 3.12           |  17.61% |        738,188 |
| 3.10           |  17.24% |        722,728 |
| 3.9            |  13.72% |        574,983 |
| 3.7            |  10.09% |        423,114 |
| 3.8            |   6.99% |        292,813 |
| 3.13           |   6.58% |        275,766 |
| None           |   4.99% |        209,134 |
| 3.6            |   1.41% |         58,972 |
| 2.7            |   0.24% |          9,901 |
| Total          |         |      4,191,517 |
pypinfo --all --days 1 --percent --json pillow pyversion
{"last_update":"2025-04-29 12:49:45","query":{"bytes_billed":188743680,"bytes_processed":187974046,"cached":false,"estimated_cost":"0.01"},"rows":[{"download_count":885960,"percent":"0.21","python_version":"3.11"},{"download_count":738283,"percent":"0.18","python_version":"3.12"},{"download_count":722784,"percent":"0.17","python_version":"3.10"},{"download_count":575054,"percent":"0.14","python_version":"3.9"},{"download_count":423088,"percent":"0.1","python_version":"3.7"},{"download_count":292835,"percent":"0.07","python_version":"3.8"},{"download_count":275799,"percent":"0.066","python_version":"3.13"},{"download_count":209172,"percent":"0.05","python_version":"None"},{"download_count":58961,"percent":"0.014","python_version":"3.6"},{"download_count":9899,"percent":"0.0024","python_version":"2.7"}]}

@gemini-code-assist gemini-code-assist Bot left a comment


Hello @hugovk, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request introduces a --dry-run option to the pypinfo CLI tool. This allows users to check the amount of data that would be processed by a query without actually running the query. The changes involve adding the --dry-run option using click, modifying the create_config function to include dry_run functionality, and adjusting the output to display the data processed and estimated cost without showing the actual results when --dry-run is enabled. Additionally, tests have been added to verify the correct behavior of the create_config function with the dry_run option.

Highlights

  • New Feature: --dry-run option: Adds a --dry-run option to the pypinfo CLI, enabling users to estimate data processing costs before running a query.
  • Configuration Modification: Modifies the create_config function to set dry_run and use_query_cache options based on the --dry-run flag.
  • Output Adjustment: Adjusts the CLI output to display data processing information without showing query results when --dry-run is active.
  • Testing: Adds a new test case to verify the correct behavior of the create_config function when dry_run is enabled.

Changelog

  • README.md
    • Added -n, --dry-run option to the documentation.
  • pypinfo/cli.py
    • Added --dry-run option using click (line 85).
    • Modified pypinfo function to accept dry_run parameter (line 108).
    • Modified pypinfo function to pass dry_run to create_config (line 172).
    • Modified output logic to conditionally display results based on dry_run (lines 192, 197, 210).
  • pypinfo/core.py
    • Modified create_config function to accept dry_run parameter (line 29).
    • Added logic to set config.dry_run and config.use_query_cache based on dry_run flag (lines 32-34).
  • tests/test_core.py
    • Added test_create_config_dry_run to verify create_config with dry_run=True (lines 40-46).
    • Added assertion for dry_run in test_create_config (line 36).
  • tox.ini
    • Updated coverage report commands to include --show-missing and html (lines 21-22).
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature              | Command             | Description                                                                             |
| -------------------- | ------------------- | --------------------------------------------------------------------------------------- |
| Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.               |
| Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.                    |
| Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help                 | /gemini help        | Displays a list of available commands.                                                  |

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


A dry run's gentle grace,
No data spills all over the place,
Just costs we see,
For you and me,
Before the query we embrace.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a --dry-run option to the pypinfo CLI, allowing users to estimate the data processed and cost of a query without actually running it. The changes are well-structured and include necessary modifications to the CLI, core logic, and tests. The addition of this feature enhances the tool's usability and transparency regarding resource consumption.

Summary of Findings

  • Missing Error Handling for Dry Run: The code does not explicitly handle potential errors that might occur during the dry run execution. While the dry run itself doesn't process data, it still interacts with the BigQuery API and could encounter issues like authentication failures or invalid query syntax. Adding error handling would improve the robustness of the feature.
  • Inconsistent Output for Dry Run: The output format for the dry run is slightly different from a real run, particularly when using the --json flag. While the core information is present, ensuring a consistent output structure would enhance the user experience and simplify parsing of the results.

Merge Readiness

The pull request is well-structured and introduces a valuable feature. However, addressing the missing error handling and ensuring consistent output for the dry run would further improve the quality of the code. I recommend addressing these points before merging. I am unable to directly approve this pull request, and other reviewers should review and approve this code before merging.

codecov Bot commented Oct 8, 2025


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (0720138) to head (2eb947e).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##            master      #168   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            6         6           
  Lines          547       558   +11     
  Branches        74        43   -31     
=========================================
+ Hits           547       558   +11     
| Flag           | Coverage Δ            |
| -------------- | --------------------- |
| macOS-latest   | 100.00% <100.00%> (ø) |
| ubuntu-latest  | 100.00% <100.00%> (ø) |
| windows-latest | 100.00% <100.00%> (ø) |


@cursor cursor Bot left a comment


This PR is being reviewed by Cursor Bugbot


Comment thread pypinfo/cli.py
     if run:
         with create_client(get_credentials()) as client:
-            query_job = client.query(built_query, job_config=create_config())
+            query_job = client.query(built_query, job_config=create_config(dry_run))


Bug: Dry Run Metadata Parsing Error

When dry_run=True, BigQuery queries return only metadata, not data rows. The code incorrectly attempts to retrieve and parse query results via query_job.result() and parse_query_result(), which causes errors or invalid data in subsequent processing.


Owner

@hugovk Legitimate?

Collaborator Author

I get no errors running this:

pypinfo --dry-run --all --days 1 --percent pillow pyversion
Served from cache: False
Data processed: 344.62 MiB
Data billed: 0.00 B
Estimated cost: $0.00

Adding some debug print:

            query_job = client.query(built_query, job_config=create_config(dry_run))
            print(query_job)
            query_rows = query_job.result(timeout=timeout // 1000)
            print(query_rows)
            rows = parse_query_result(query_rows)
            print(rows)

Gives:

pypinfo --dry-run --all --days 1 --percent pillow pyversion
QueryJob<project=pypinfo-hugovk, location=US, id=None>
<google.cloud.bigquery.table._EmptyRowIterator object at 0x10b0e2f90>
[['python_version', 'download_count']]
Served from cache: False
Data processed: 344.62 MiB
Data billed: 0.00 B
Estimated cost: $0.00

Indeed, rows does not contain download data.

But this PR also adds `not dry_run` guards around the subsequent processing, so we don't get errors.
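
To illustrate the guard pattern, here is a simplified sketch; it is not the actual code in pypinfo/cli.py, and render_report is a hypothetical name. The metadata line is always emitted, while anything that depends on data rows only runs when dry_run is false.

```python
MIB = 2**20


def render_report(rows, bytes_processed, dry_run=False):
    """Build CLI output; `rows` has the header first, as in the debug print."""
    lines = [f"Data processed: {bytes_processed / MIB:.2f} MiB"]
    if not dry_run:
        # Only a real run has data rows worth tabulating.
        header, *data = rows
        lines.append(" | ".join(header))
        lines.extend(" | ".join(str(cell) for cell in row) for row in data)
    return "\n".join(lines)


# A dry run yields only the header row, so the table is skipped entirely.
print(render_report([["python_version", "download_count"]], 187974046, dry_run=True))
```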

@ofek ofek left a comment

Owner

Thanks a lot!

@hugovk hugovk merged commit aa28405 into ofek:master Oct 15, 2025
19 checks passed
@hugovk hugovk deleted the dry-run branch October 15, 2025 12:24
hugovk commented Oct 15, 2025

Collaborator Author

I'll do a 23.0.0 release.
