fix: Robust and Idempotent Backfill for Search Pipeline Run API#166
fix: Robust and Idempotent Backfill for Search Pipeline Run API#166yuechao-qin wants to merge 1 commit intomasterfrom
Conversation
237096b to
43d763a
Compare
43d763a to
8442213
Compare
| # TODO: Do we need a final catchall backfill that inserts empty string | ||
| # for all pipeline names, which happens to not have a name in | ||
| # component_spec nor extra_data? |
There was a problem hiding this comment.
Do we need this? A catchall backfill for pipeline name if 1) extra_data and 2) component_spec_name doesn't exist/had issues?
There was a problem hiding this comment.
Update backfill from component spec, that if name does not exist to backfill with empty string.
| session: orm.Session, | ||
| key: str, | ||
| ) -> bool: | ||
| """Return True if at least one annotation with the given key exists.""" |
There was a problem hiding this comment.
This is not safe anymore.
Due to changes in API Server, pipeline names are already being inserted into the DB (while bulk inserts weren't added).
The best way to check whether the backfill is complete is to compare the number of rows in pipeline runs table and the number of pipeline run annotations.
| ) | ||
| existing_ann = orm.aliased(bts.PipelineRunAnnotation) | ||
|
|
||
| stmt = sqlalchemy.insert(bts.PipelineRunAnnotation).from_select( |
There was a problem hiding this comment.
JFYI: If we expect most annotations to be missing, we could use the INSERT ... ON CONFLICT DO NOTHING statement: https://docs.sqlalchemy.org/en/20/dialects/sqlite.html#sqlalchemy.dialects.sqlite.Insert.on_conflict_do_nothing
But if most annotations exist, then outer join might be better.
| def run_all_annotation_backfills( | ||
| *, | ||
| session: orm.Session, | ||
| do_skip_already_backfilled: bool, | ||
| ) -> None: |
There was a problem hiding this comment.
Alexey: Look into different transaction modes
https://www.postgresql.org/docs/current/transaction-iso.html
For backfills let's use the most reliable mode (serializable).
For example: expectations is if there are multiple migrations, that failure shouldn't of been duplicate entry, but a transaction conflict error (i.e. notice rows being added from another migration).
Try to see if this can be simulated/reproduced locally.
There was a problem hiding this comment.
Style: Modules should be names as plural or uncountable nouns.
Example: database_migrations

TL;DR
Refactored database migration logic to ensure data parity by creating system annotations for all pipeline runs, even when source values are null or empty, and moved backfill functions to a dedicated module with comprehensive error handling.
What changed?
_mirror_system_annotationsfunction always creates annotation rows forcreated_byandpipeline_name, storing empty string""when source values are null/empty, with warning logs for null casesdatabase_migrate.pycontaining three idempotent backfill functions:backfill_created_by_annotations: UsesCOALESCEto handle null valuesbackfill_pipeline_names_from_extra_data: Extracts from JSON with null filteringbackfill_pipeline_names_from_component_spec: Extracts from nested JSON path with anti-join logicrun_all_annotation_backfillswraps all backfills in try-catch with configurable skip guards and single transaction commitHow to test?
Run the existing test suite - the new
test_database_migrate.pyprovides extensive coverage including:Why make this change?