fix!: make Document.id deterministic regardless of meta key order#11446
Conversation
|
@Aarkin7 is attempting to deploy a commit to the deepset Team on Vercel. A member of the Team first needs to authorize it. |
Coverage reportClick to see where and how coverage changed
This report was generated by python-coverage-comment-action |
||||||||||||||||||||||||
biswajeetdev
left a comment
There was a problem hiding this comment.
The core fix is right — str(dict) includes Python insertion order in its output, so two Document objects with identical content and logically identical meta dicts could get different IDs just because the dicts were constructed differently. json.dumps(..., sort_keys=True) produces a canonical string that is order-independent.
One thing worth calling out in the release note: the default=str argument also affects documents with non-JSON-serialisable meta values (e.g. datetime objects, custom classes). Previously str({"date": datetime(2024,1,1)}) produced "datetime.datetime(2024, 1, 1, 0, 0)" in the hash; now json.dumps(..., default=str) produces "2024-01-01 00:00:00". These are different strings, so any documents with non-serialisable meta fields will also get new IDs — not just documents where key order varied. The release note currently says only non-empty-meta documents are affected, which is true but understates the impact for users with datetime or custom-type meta.
The if self.meta else "{}" guard correctly matches str({}) == "{}" for empty-dict meta, so the "empty meta is unaffected" claim holds.
|
Thanks for catching this. You're right that the impact wording was too narrow. Updated the upgrade note to spell out both cases explicitly:
|
|
Hi @Aarkin7 Thank you for opening this pull request. I agree that a document id should be independent of meta key order. However, it's a big breaking change for production users who have indexed documents and rely on document ids to check for duplicates when re-indexing documents. We'll discuss if that change could be postponed until the next major release of Haystack (version 3.0) and be grouped with any other breaking changes that are necessary. Right now, I believe users benefit more from the stability than from making the ID meta key order-independent. |
The hash was built from dict's repr, which reflects insertion order, so two Documents with equal meta could get different IDs. Serialize meta with sort_keys=True before hashing. Empty-meta IDs are unchanged.
Two BDD scenarios pinned IDs that were computed from documents with non-empty meta, so the deterministic-id fix changes them. Recompute and update the expected values; no behavior change.
496d262 to
ff2fe99
Compare
|
We decided to postpone that fix until the next major Haystack release, which is version 3.0. I changed the target of your PR therefore to our v3 branch. We'll need to document in more detail what users need to be aware of when they migrate from v2 to v3 before merging the PR. We'll take it from here. Thanks again! |
Add a Breaking Changes entry covering the new key-sorted JSON hashing of meta when auto-generating Document.id. Documents with non-empty meta get different IDs in v3.0; the entry explains why and how to migrate (re-ingest or pass the previous id explicitly). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Clarified migration instructions for auto-generated IDs in DocumentStore.
Show a runnable InMemoryDocumentStore example that seeds the store with the IDs Haystack 2.x generated and a migrate_document_ids() helper that recomputes each id with the 3.x hashing and overwrites the stored documents. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Updated migration instructions to clarify the process of migrating document IDs without rerunning the indexing pipeline. Removed outdated comments and provided a clearer example.
Process documents in batches to bound extra memory, write all new documents before deleting any, and delete only the IDs that actually changed (empty-meta documents keep their ID). Note that the DocumentStore API has no pagination, so very large indexes should be read in chunks via the backend's scroll API. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Updated migration instructions and code examples for regenerating document IDs without interrupting the index.
Updated migration instructions to include Haystack 3.0 for ID regeneration.
julian-risch
left a comment
There was a problem hiding this comment.
Looks good to me! Thank you @Aarkin7 . I only added more detailed migration instructions so that it is easier for users of Haystack 2.x to upgrade to 3.0 eventually.
Related Issues
Proposed Changes:
Fixes a bug where Document.id depended on the insertion order of keys in meta. The hash was built from dict's repr, which reflects insertion order, so two Documents with the same content and the same metadata could end up with different IDs, silently breaking DuplicatePolicy.SKIP/FAIL and any cache keyed on the document ID.
Serializing meta with json.dumps(..., sort_keys=True) before hashing makes the ID order-independent. Empty-meta IDs are
preserved by keeping the legacy "{}" string, so only documents with non-empty meta get new IDs.
How did you test it?
Notes for the reviewer
Checklist
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:and added!in case the PR includes breaking changes.