fix!: make Document.id deterministic regardless of meta key order by Aarkin7 · Pull Request #11446 · deepset-ai/haystack

Aarkin7 · 2026-05-31T18:02:18Z

Related Issues

fixes #11445 (fix: make Document.id deterministic regardless of meta key order)

Proposed Changes:

Fixes a bug where Document.id depended on the insertion order of keys in meta. The hash was built from dict's repr, which reflects insertion order, so two Documents with the same content and the same metadata could end up with different IDs, silently breaking DuplicatePolicy.SKIP/FAIL and any cache keyed on the document ID.
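The order dependence is easy to reproduce outside Haystack. The snippet below is a minimal illustration, not Haystack's actual code: `legacy_style_id` is a hypothetical stand-in that hashes the content together with `str(meta)`, which is the behaviour described above.

```python
import hashlib


def legacy_style_id(content: str, meta: dict) -> str:
    """Hypothetical stand-in for the old scheme: hash content plus str(meta)."""
    data = f"{content}{meta}"  # formatting a dict in an f-string uses its insertion-ordered repr
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


meta_a = {"source": "wiki", "lang": "en"}
meta_b = {"lang": "en", "source": "wiki"}  # same keys and values, different insertion order

same_meta = meta_a == meta_b            # dict equality ignores insertion order
same_repr = str(meta_a) == str(meta_b)  # the string form does not
same_id = legacy_style_id("hello", meta_a) == legacy_style_id("hello", meta_b)

print(same_meta, same_repr, same_id)  # prints: True False False
```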

Serializing meta with json.dumps(..., sort_keys=True) before hashing makes the ID order-independent. Empty-meta IDs are preserved by keeping the legacy "{}" string, so only documents with non-empty meta get new IDs.
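A rough sketch of the approach, under stated assumptions: `canonical_meta` and `sketch_id` are hypothetical names, and the real hash also involves other Document fields (the notes below mention embeddings, for example). Only the meta handling is modelled here.

```python
import hashlib
import json


def canonical_meta(meta: dict) -> str:
    """Order-independent serialization; empty meta keeps the legacy "{}" string."""
    return json.dumps(meta, sort_keys=True, default=str) if meta else "{}"


def sketch_id(content: str, meta: dict) -> str:
    """Hypothetical stand-in for the ID hash, restricted to content and meta."""
    data = f"{content}{canonical_meta(meta)}"
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


meta_a = {"source": "wiki", "lang": "en"}
meta_b = {"lang": "en", "source": "wiki"}

ids_match = sketch_id("hello", meta_a) == sketch_id("hello", meta_b)
empty_unchanged = canonical_meta({}) == str({})  # both are the two-character string "{}"

print(canonical_meta(meta_b))      # {"lang": "en", "source": "wiki"}
print(ids_match, empty_unchanged)  # True True
```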

How did you test it?

Added two regression tests in test/dataclasses/test_document.py covering flat and nested meta key orderings.
Updated two stale hardcoded IDs in existing tests (test_init_with_parameters, test_init_with_legacy_field) since the hash changes for non-empty meta.
hatch run test:unit → 5135 passed, 0 failed.
hatch run test:types → clean.
hatch run fmt → clean.
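For readers without the diff at hand, regression tests of this kind could look roughly like the following. This is a sketch against a hypothetical `sketch_id` helper rather than the real `Document` class, and the test names are illustrative, not the ones added in test/dataclasses/test_document.py.

```python
import hashlib
import json


def sketch_id(content: str, meta: dict) -> str:
    """Hypothetical stand-in for the fixed hash (sorted-key JSON of meta)."""
    serialized = json.dumps(meta, sort_keys=True, default=str) if meta else "{}"
    return hashlib.sha256(f"{content}{serialized}".encode("utf-8")).hexdigest()


def test_id_ignores_flat_key_order() -> None:
    first = {"a": 1, "b": 2}
    second = {"b": 2, "a": 1}
    assert sketch_id("text", first) == sketch_id("text", second)


def test_id_ignores_nested_key_order() -> None:
    # sort_keys=True is applied to nested dicts as well, not only the top level
    first = {"outer": {"x": 1, "y": 2}, "tag": "t"}
    second = {"tag": "t", "outer": {"y": 2, "x": 1}}
    assert sketch_id("text", first) == sketch_id("text", second)


def test_id_still_depends_on_values() -> None:
    # Sanity check: canonicalization must not collapse genuinely different meta
    assert sketch_id("text", {"a": 1}) != sketch_id("text", {"a": 2})
```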

Notes for the reviewer

Breaking-ish: any auto-generated Document.id for documents with non-empty meta changes. The release note's upgrade section calls this out and tells users how to handle it (re-ingest, or pass id explicitly). Empty-meta IDs are intentionally preserved to minimize churn.
Scope was kept tight: the bug report mentions embedding and sparse_embedding hashing as related concerns. SparseEmbedding.to_dict() is order-stable today via dataclass field order, and float-embedding nondeterminism isn't fixable at this layer, so this PR sticks to the actual reported root cause.
default=str in json.dumps is a safety net for the (unsupported but tolerated) case of non-JSON-serializable values in meta, matching the prior behavior of not crashing on such inputs.
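The effect of the `default=str` safety net can be seen with the standard library alone. The meta values below are made up for illustration.

```python
import json
from datetime import datetime

meta = {"created": datetime(2024, 1, 1), "source": "wiki"}

try:
    json.dumps(meta, sort_keys=True)
    strict_ok = True
except TypeError:
    strict_ok = False  # datetime is not JSON serializable without a fallback

# With default=str, unsupported values are passed through str() instead of raising
tolerant = json.dumps(meta, sort_keys=True, default=str)

print(strict_ok)  # False
print(tolerant)   # {"created": "2024-01-01 00:00:00", "source": "wiki"}
```

One caveat with this fallback: for a custom class that does not define `__str__` or `__repr__`, `str()` yields the default object representation, which includes a memory address, so such values would not hash stably across processes.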

Checklist

I have read the contributors guidelines and the code of conduct.
I have updated the related issue with new insights and changes.
I have added unit tests and updated the docstrings.
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
I have documented my code.
I have added a release note file, following the contributors guidelines.
I have run pre-commit hooks and fixed any issue.

biswajeetdev

The core fix is right — str(dict) includes Python insertion order in its output, so two Document objects with identical content and logically identical meta dicts could get different IDs just because the dicts were constructed differently. json.dumps(..., sort_keys=True) produces a canonical string that is order-independent.

One thing worth calling out in the release note: the default=str argument also affects documents with non-JSON-serialisable meta values (e.g. datetime objects, custom classes). Previously str({"date": datetime(2024,1,1)}) produced "datetime.datetime(2024, 1, 1, 0, 0)" in the hash; now json.dumps(..., default=str) produces "2024-01-01 00:00:00". These are different strings, so any documents with non-serialisable meta fields will also get new IDs — not just documents where key order varied. The release note currently says only non-empty-meta documents are affected, which is true but understates the impact for users with datetime or custom-type meta.
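The two serializations mentioned in this comment can be compared directly. This is a standard-library illustration only; it shows the strings that would feed the hash, not the hash computation itself.

```python
import json
from datetime import datetime

meta = {"date": datetime(2024, 1, 1)}

old_style = str(meta)                                      # values rendered via repr()
new_style = json.dumps(meta, sort_keys=True, default=str)  # values rendered via str()

print(old_style)  # {'date': datetime.datetime(2024, 1, 1, 0, 0)}
print(new_style)  # {"date": "2024-01-01 00:00:00"}

# Even plain string values serialize differently (single vs double quotes),
# which is why every non-empty meta produces a different input to the hash.
plain = {"lang": "en"}
print(str(plain))                          # {'lang': 'en'}
print(json.dumps(plain, sort_keys=True))   # {"lang": "en"}
```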

The if self.meta else "{}" guard correctly matches str({}) == "{}" for empty-dict meta, so the "empty meta is unaffected" claim holds.

Aarkin7 · 2026-06-02T15:14:26Z

Thanks for catching this. You're right that the impact wording was too narrow. Updated the upgrade note to spell out both cases explicitly:

documents with non-empty meta (repr -> JSON serialization change)
documents whose meta contains non-JSON-serializable values like datetime or custom classes (now serialized via str(...) instead of repr(...) thanks to default=str)

julian-risch · 2026-06-05T12:42:11Z

Hi @Aarkin7, thank you for opening this pull request. I agree that a document ID should be independent of meta key order. However, it's a big breaking change for production users who have indexed documents and rely on document IDs to check for duplicates when re-indexing. We'll discuss whether the change could be postponed until the next major release of Haystack (version 3.0) and grouped with any other necessary breaking changes. Right now, I believe users benefit more from ID stability than from making the ID independent of meta key order.

julian-risch · 2026-06-05T13:12:22Z

We decided to postpone this fix until the next major Haystack release, version 3.0, so I changed the target of your PR to our v3 branch. Before merging, we'll need to document in more detail what users need to be aware of when they migrate from v2 to v3. We'll take it from here. Thanks again!

Add a Breaking Changes entry covering the new key-sorted JSON hashing of meta when auto-generating Document.id. Documents with non-empty meta get different IDs in v3.0; the entry explains why and how to migrate (re-ingest or pass the previous id explicitly). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Show a runnable InMemoryDocumentStore example that seeds the store with the IDs Haystack 2.x generated and a migrate_document_ids() helper that recomputes each id with the 3.x hashing and overwrites the stored documents. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Process documents in batches to bound extra memory, write all new documents before deleting any, and delete only the IDs that actually changed (empty-meta documents keep their ID). Note that the DocumentStore API has no pagination, so very large indexes should be read in chunks via the backend's scroll API. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
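The batching strategy in this commit message can be sketched as follows. Everything here is an assumption for illustration: the store is a plain dict mapping id to a record, not a Haystack DocumentStore; `legacy_id` and `new_id` are hypothetical stand-ins for the 2.x and 3.x hashes; and `migrate_document_ids` borrows the helper name mentioned above but is not the code from the release note.

```python
import hashlib
import json
from itertools import islice
from typing import Iterable, Iterator


def legacy_id(content: str, meta: dict) -> str:
    """Hypothetical 2.x-style hash: content plus str(meta)."""
    return hashlib.sha256(f"{content}{str(meta)}".encode("utf-8")).hexdigest()


def new_id(content: str, meta: dict) -> str:
    """Hypothetical 3.x-style hash: content plus key-sorted JSON of meta."""
    serialized = json.dumps(meta, sort_keys=True, default=str) if meta else "{}"
    return hashlib.sha256(f"{content}{serialized}".encode("utf-8")).hexdigest()


def batched(ids: Iterable[str], size: int) -> Iterator[list]:
    """Yield lists of at most `size` ids so only one batch is handled at a time."""
    iterator = iter(ids)
    while chunk := list(islice(iterator, size)):
        yield chunk


def migrate_document_ids(store: dict, batch_size: int = 100) -> int:
    """Re-key auto-generated ids in place; returns how many ids changed."""
    changed = 0
    for batch in batched(list(store), batch_size):  # snapshot of ids only
        renames = {}
        for doc_id in batch:
            doc = store[doc_id]
            # Skip ids the user set explicitly: only auto-generated ids are re-keyed
            if doc_id != legacy_id(doc["content"], doc["meta"]):
                continue
            fresh = new_id(doc["content"], doc["meta"])
            if fresh != doc_id:  # empty-meta documents keep their id
                renames[doc_id] = fresh
        for stale, fresh in renames.items():  # write every new record first
            store[fresh] = store[stale]
        for stale in renames:                 # delete only ids that actually changed
            del store[stale]
        changed += len(renames)
    return changed


store = {}
for content, meta in [("a", {"k": 1, "j": 2}), ("b", {}), ("c", {"z": "x"})]:
    store[legacy_id(content, meta)] = {"content": content, "meta": meta}
store["custom-id"] = {"content": "d", "meta": {"p": 1}}

print(migrate_document_ids(store, batch_size=2))  # 2
```

Writing before deleting means an interruption mid-batch leaves duplicates rather than missing documents, which is the safer failure mode for a live index.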

julian-risch

Looks good to me! Thank you, @Aarkin7. I only added more detailed migration instructions so that it is easier for users of Haystack 2.x to upgrade to 3.0 eventually.

Aarkin7 requested a review from a team as a code owner May 31, 2026 18:02

Aarkin7 requested review from julian-risch and removed request for a team May 31, 2026 18:02

github-actions Bot added the topic:tests label May 31, 2026

biswajeetdev reviewed Jun 2, 2026

github-actions Bot added the type:documentation Improvements on the docs label Jun 2, 2026

julian-risch changed the title ~~fix: make Document.id deterministic regardless of meta key order~~ fix!: make Document.id deterministic regardless of meta key order Jun 5, 2026

julian-risch changed the base branch from main to v3 June 5, 2026 12:57

julian-risch requested a review from a team as a code owner June 5, 2026 12:57

Aarkin7 added 3 commits June 5, 2026 15:04

fix: make Document.id deterministic regardless of meta key order

987d02d

The hash was built from dict's repr, which reflects insertion order, so two Documents with equal meta could get different IDs. Serialize meta with sort_keys=True before hashing. Empty-meta IDs are unchanged.

test: update stale Document IDs in pipeline BDD scenarios

e3d1ffe

Two BDD scenarios pinned IDs that were computed from documents with non-empty meta, so the deterministic-id fix changes them. Recompute and update the expected values; no behavior change.

docs: clarify Document.id upgrade note for non-JSON-serializable meta

ff2fe99

julian-risch force-pushed the fix/document-id-deterministic-hashing branch from 496d262 to ff2fe99 Compare June 5, 2026 13:08

julian-risch and others added 7 commits June 17, 2026 17:12

Update migration guide for ID stability changes

7c10072

Clarified migration instructions for auto-generated IDs in DocumentStore.

Clarify migration process for document IDs

4c21a10

Updated migration instructions to clarify the process of migrating document IDs without rerunning the indexing pipeline. Removed outdated comments and provided a clearer example.

simplify code example regenerating document IDs

41399b1

Updated migration instructions and code examples for regenerating document IDs without interrupting the index.

Clarify regenerating IDs requires Haystack 3.0

6588e26

Updated migration instructions to include Haystack 3.0 for ID regeneration.

julian-risch approved these changes Jun 17, 2026

julian-risch added 2 commits June 17, 2026 18:41

Update haystack/dataclasses/document.py

a07abcb

Merge branch 'v3' into fix/document-id-deterministic-hashing

96dc17b

julian-risch enabled auto-merge (squash) June 17, 2026 16:47

julian-risch merged commit 5d7ee05 into deepset-ai:v3 Jun 17, 2026
22 of 23 checks passed

julian-risch merged 12 commits into deepset-ai:v3 from Aarkin7:fix/document-id-deterministic-hashing
