fix: avoid NaN cosine scores for zero-norm embeddings in InMemoryDocumentStore #11628

Merged
julian-risch merged 3 commits into deepset-ai:main from i-anubhav-anand:fix/in-memory-cosine-zero-vector-nan on Jun 16, 2026

Conversation

@i-anubhav-anand

Contributor

Related Issues

Self-found bug (no existing issue). InMemoryDocumentStore.embedding_retrieval returns NaN similarity scores when a document (or the query) has a zero-norm embedding, silently corrupting ranking.

Proposed Changes:

For cosine similarity, embeddings are normalized by their L2 norm:

query_embedding /= np.linalg.norm(x=query_embedding, axis=1, keepdims=True)
document_embeddings /= np.linalg.norm(x=document_embeddings, axis=1, keepdims=True)

A zero-norm vector (e.g. a zero embedding, which some models emit for empty/whitespace input) makes this divide by zero, producing NaN scores (numpy even emits a RuntimeWarning: invalid value encountered in divide). NaN then sorts unpredictably and silently corrupts the ranking.
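
Why NaN is harmful for ranking can be seen without any Haystack code. The following is a minimal, standard-library-only sketch; the score values are made up for illustration, and it does not claim to reproduce how the document store sorts its results internally.

```python
import math

nan = float("nan")

# Every ordering comparison involving NaN is False, and NaN is not equal to itself,
# so a comparison-based sort has no consistent place to put it.
comparisons = [nan < 0.5, nan > 0.5, nan == nan]

# Hypothetical similarity scores, one of them corrupted by a zero-norm embedding.
scores = [1.0, nan, 2.0]

# The result contains all three values, but their order is not a reliable ranking.
ranked = sorted(scores, reverse=True)

# One defensive option (an assumption, not what this PR does): drop NaN before ranking.
clean_ranked = sorted((s for s in scores if not math.isnan(s)), reverse=True)
```

Filtering NaN after the fact only hides the symptom, which is why the change described here addresses the normalization step instead.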

This PR guards the normalization: a zero norm is replaced with 1.0 before dividing, so zero-norm vectors stay zero and such documents get a cosine score of 0.0 instead of NaN. Non-zero embeddings are unaffected.
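
A minimal NumPy sketch of such a guard is shown here. The helper name `normalize_rows` and the use of `np.where` are assumptions made for illustration; the merged change may implement the guard differently.

```python
import numpy as np


def normalize_rows(vectors: np.ndarray) -> np.ndarray:
    """L2-normalize each row; all-zero rows are returned unchanged.

    Hypothetical helper for illustration, not part of the Haystack API.
    """
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Replace zero norms with 1.0 so the division below never computes 0 / 0.
    safe_norms = np.where(norms == 0.0, 1.0, norms)
    return vectors / safe_norms


query = normalize_rows(np.array([[1.0, 0.0, 0.0]]))
docs = normalize_rows(np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]))

# Cosine similarity of the query against each document row.
scores = (query @ docs.T).flatten()
```

With this guard the zero-embedding document scores 0.0 and the non-zero document scores 1.0, matching the behavior described above.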

Reproduction (before the fix):

from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

store = InMemoryDocumentStore(embedding_similarity_function="cosine")
store.write_documents([Document(content="zero", embedding=[0.0, 0.0, 0.0])])
print(store.embedding_retrieval(query_embedding=[1.0, 0.0, 0.0])[0].score)  # nan

How did you test it?

Added test_embedding_retrieval_with_zero_vector_does_not_produce_nan in test/document_stores/test_in_memory.py: a zero-embedding document no longer yields a NaN score (it gets 0.0) while a normal document is unaffected. It fails on main (NaN) and passes with this change. Ran hatch run test:unit test/document_stores/test_in_memory.py (148 passed, 4 skipped), hatch run fmt (clean), hatch run test:types haystack/document_stores/in_memory/document_store.py (mypy clean), and added a release note.

Notes for the reviewer

Behavior for non-zero embeddings is unchanged; only the zero-norm edge case is guarded.

Checklist

  • I have read the contributors guidelines and the code of conduct.
  • I have added unit tests and updated the docstrings.
  • I've used a conventional commit type for my PR title (fix:).
  • I have added a release note file.
  • I have run pre-commit hooks / hatch run fmt and fixed any issue.

This PR was generated with the help of an AI assistant. I have reviewed the changes, reproduced the bug, and run the relevant tests locally.

…mentStore

embedding_retrieval normalized embeddings by their L2 norm for cosine
similarity, dividing by zero when a document or the query had a zero-norm
embedding and producing NaN scores that silently corrupt ranking. Guard the
normalization so zero-norm vectors stay zero (score 0.0) instead of NaN.
@i-anubhav-anand i-anubhav-anand requested a review from a team as a code owner June 14, 2026 19:56
@i-anubhav-anand i-anubhav-anand requested review from julian-risch and removed request for a team June 14, 2026 19:56

@julian-risch julian-risch enabled auto-merge (squash) June 16, 2026 08:37

@julian-risch julian-risch left a comment

Member

Looks good to me! Thank you for your contribution @i-anubhav-anand !

@github-actions github-actions Bot added the type:documentation Improvements on the docs label Jun 16, 2026
julian-risch and others added 2 commits June 16, 2026 10:40
Updated formatting for code snippets in release notes.
Document.score is Optional[float], so math.isnan() needs a None guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
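
The None guard mentioned in the commit message above matters because `math.isnan(None)` raises a `TypeError`. A small sketch of a guarded check follows; the helper name `is_nan_score` is hypothetical and not taken from the PR.

```python
import math
from typing import Optional


def is_nan_score(score: Optional[float]) -> bool:
    """Return True only for an actual NaN; a missing (None) score is not NaN."""
    # Check for None first so math.isnan() is never called on a non-number.
    return score is not None and math.isnan(score)


results = [is_nan_score(None), is_nan_score(float("nan")), is_nan_score(0.0)]
```

In a test, the same idea can be written as an assertion that the score is not None and not NaN.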
@julian-risch julian-risch merged commit dd05a63 into deepset-ai:main Jun 16, 2026
23 checks passed

Labels

topic:tests type:documentation Improvements on the docs
