Reduce flakiness of Monitoring system tests. #2095
tseaver merged 1 commit into googleapis:master from tseaver:2092-2093-monitoring-flaky-system-tests
    from system_test_utils import unique_resource_id

    retry_404 = RetryErrors(NotFound)
    retry_503 = RetryErrors(ServiceUnavailable)
|
No real concerns, but I hesitate to give the LGTM because I don't understand why any of the errors occur |
|
Crazy we've got one service using "503 Service Unavailable" to mean "409 Conflict" and another using "403 Forbidden" to mean "429 Too Many Requests". |
|
"We don't need no stinking RFCs." |
|
@dhermes This is a matter of debugging each particular mysterious error, starting with experimentation to isolate the circumstances under which it occurs. If @supriyagarg has time to work on this, and some reproducible error remains mysterious, she and I can work together internally to try to trace it to the underlying cause. A good first step would be to create some issues. |
|
The issues are good. supriyagarg@ is willing to work on reproducing them in isolation. |
|
@dhermes Have you seen a delete return 404 after a create? |
|
@rimey That is why I added the retries for descriptor / group deletes. I assume that the newly-created entity hasn't yet propagated to the host / layer handling the deletions. |
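For readers following the thread, the retry-on-404 idea for deletes can be sketched roughly as below. This is a minimal illustration, not the `RetryErrors` helper used by the system tests: `RetryOnError`, `FakeGroup`, and the local `NotFound` class are hypothetical stand-ins, and the propagation delay is simulated with a call counter.

```python
import functools
import time


class NotFound(Exception):
    """Stand-in for a 404 error class (hypothetical, not the library's)."""


class RetryOnError:
    """Retry a callable while it raises the given exception type."""

    def __init__(self, exception, max_tries=4, delay=0.01, backoff=2):
        self.exception = exception
        self.max_tries = max_tries
        self.delay = delay
        self.backoff = backoff

    def __call__(self, to_wrap):
        @functools.wraps(to_wrap)
        def wrapped(*args, **kwargs):
            delay = self.delay
            for attempt in range(1, self.max_tries + 1):
                try:
                    return to_wrap(*args, **kwargs)
                except self.exception:
                    if attempt == self.max_tries:
                        raise  # out of attempts: surface the real error
                    time.sleep(delay)
                    delay *= self.backoff

        return wrapped


class FakeGroup:
    """Simulates a delete that fails with 404 until the create is visible."""

    def __init__(self, visible_after):
        self.calls = 0
        self.visible_after = visible_after

    def delete(self):
        self.calls += 1
        if self.calls <= self.visible_after:
            raise NotFound("group not visible yet")
        return "deleted"


retry_404 = RetryOnError(NotFound)
group = FakeGroup(visible_after=2)
result = retry_404(group.delete)()
```

Here the first two delete attempts fail and the third succeeds, which is the behaviour one would hope for if the 404 really is caused by propagation lag.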
|
ISTM we should just merge these retries before the PR bitrots, unless @rimey or @supriyagarg need us to hold off for their own testing. |
|
I'd like to have a clear record of what errors have been observed before retries are added everywhere. #2092 reports a 503 on deleting a group. #2093 reports a 503 on creating a group. This is not completely consistent with the discussion above. Where did you definitely see 404s, and did you see errors for metric descriptors?
Ditto |
I agree. @rimey and @supriyagarg can we put a fixed end time to get this resolved before merging? Let's say EOD Monday August 22? |
|
@rimey I saw the 404s for |
|
We failed to reproduce any of these transient errors internally. Nevertheless, we have clear reports of 503s on creation and on deletion of groups in #2093 and #2092, respectively. Thank you for those. We would be grateful for any additional reports of other transient errors. @tseaver You mention "404s for I urge you not to add retries except where you have actually observed the error code in question on that type of request. We currently believe that 503 is the appropriate error code for the errors reported in #2093 and #2092. We will be updating the error message to be less misleading. |
|
@rimey yes, I saw 404s explicitly here for custom metric descriptors and here for groups. |
A 5xx error signals to a developer that the backend has failed to handle the request. The fact that the API can successfully give an error with useful information means the backend is working just fine, but the operation doesn't work for some reason. UPDATE: Your service, your call, but it will confound more developers than just me. |
|
@dhermes That is correct. In particular, the 503 is signaling that the operation failed due to a transient condition and can be retried as-is. |
|
@rimey Are you asking me to back out the |
|
@tseaver No. If it happened anywhere, we know it can happen. I'll leave the details of this PR up to you and @dhermes, but I want to make one more (admittedly unhelpful) comment: While it's correct by definition to retry on 503, retrying on 404 is generally questionable. We presume that the 404 is because the resource doesn't exist yet, but it could also be because it has been deleted. Nevertheless, I'm okay with retrying on 404 in situations like this where you have good reason to presume that it's because the resource doesn't exist yet. |
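One way to encode that distinction is to retry 503s unconditionally and make 404 retries an explicit opt-in at the call site. The sketch below is only an illustration under that assumption: `call_with_retries`, `flaky`, and the two exception classes are hypothetical names, not part of the library or of this PR.

```python
import time


class NotFound(Exception):
    """Stand-in for a 404 response (hypothetical)."""


class ServiceUnavailable(Exception):
    """Stand-in for a 503 response (hypothetical)."""


def call_with_retries(func, retry_404=False, max_tries=4, delay=0.01):
    """Call func, retrying 503s always and 404s only when asked to."""
    if retry_404:
        retryable = (ServiceUnavailable, NotFound)
    else:
        retryable = (ServiceUnavailable,)
    for attempt in range(1, max_tries + 1):
        try:
            return func()
        except retryable:
            if attempt == max_tries:
                raise  # give up and surface the last error
            time.sleep(delay)
            delay *= 2


def flaky(errors):
    """Build a callable that raises each error in turn, then succeeds."""
    pending = list(errors)

    def func():
        if pending:
            raise pending.pop(0)
        return "ok"

    return func


# A transient 503 is retried regardless of the flag.
after_503 = call_with_retries(flaky([ServiceUnavailable()]))

# A 404 right after a create is retried only because the caller opted in.
after_404 = call_with_retries(flaky([NotFound()]), retry_404=True)

# Without the opt-in, a 404 surfaces immediately.
try:
    call_with_retries(flaky([NotFound()]))
    surfaced = False
except NotFound:
    surfaced = True
```

Keeping the 404 case behind a flag means each call site has to state why it expects the resource to exist, which matches the caution expressed above about a 404 that might instead mean the resource was deleted.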
|
@rimey I'm in violent agreement that we don't want to sprinkle |
|
For the record, we changed the message for this particular 503 from "Write collision, please retry." to "The service is currently unavailable, please retry." The change is expected to roll out next week. |
|
Thanks for the heads up |
Towards: #2092, #2093.