
[Bug]: Default vector_size=3072 in IndexSchema causes LanceDB FixedSizeList dimension mismatch for non-OpenAI embedding models #2231

@laomomo

Description


Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

Default vector_size=3072 in IndexSchema causes LanceDB FixedSizeList dimension mismatch for non-OpenAI embedding models

GraphRAG version: 3.0.2
graphrag-vectors version: 3.0.2


Summary

When using any embedding model that does not output 3072-dimensional vectors, the pipeline crashes during the generate_text_embeddings workflow (the final step) with the error LanceError(Arrow): Size of FixedSizeList is not the same. The root cause is a hardcoded default of vector_size=3072 in graphrag_vectors/index_schema.py, which assumes OpenAI text-embedding-3-large as the universal embedding model.


Steps to Reproduce

  1. Initialize a GraphRAG project:

    graphrag init --root ./my_project
    
  2. Configure settings.yaml to use any embedding model that does not output 3072-dimensional vectors, for example:

    • NVIDIA NIM baai/bge-m3 → 1024d
    • OpenAI text-embedding-3-small → 1536d
    • OpenAI text-embedding-ada-002 → 1536d
    • Ollama nomic-embed-text → 768d
  3. Do not set vector_size explicitly under vector_store in settings.yaml (which is the case for all graphrag init-generated configs, since the template contains no index_schema entries at all)

  4. Run the indexing pipeline:

    graphrag index --root ./my_project
    
  5. Pipeline fails during generate_text_embeddings (workflow 10/10) with:

    LanceError(Arrow): Size of FixedSizeList is not the same.
    input list: fixed_size_list<item: float>[1024]
    output list: fixed_size_list<item: float>[3072]
    

Root Cause

graphrag_vectors/index_schema.py line 10:

DEFAULT_VECTOR_SIZE: int = 3072   # hardcoded to OpenAI text-embedding-3-large dims

graphrag_vectors/vector_store.py line 42:

def __init__(self, ..., vector_size: int = 3072, ...):

graphrag/config/models/graph_rag_config.py lines 259–265 (_validate_vector_store):
When an embedding's schema entry is missing from settings.yaml, it is auto-created as IndexSchema(index_name=embedding) — which inherits DEFAULT_VECTOR_SIZE=3072.
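The silent inheritance can be sketched in a few lines. This is a hypothetical reconstruction of the schema model and validator, not the exact graphrag_vectors definitions; field and function names are illustrative:

```python
from dataclasses import dataclass

DEFAULT_VECTOR_SIZE = 3072  # module-level default, as in index_schema.py


@dataclass
class IndexSchema:
    index_name: str
    vector_size: int = DEFAULT_VECTOR_SIZE  # inherited by auto-created entries


def validate_vector_store(schemas: dict, embeddings: list) -> dict:
    # Sketch of _validate_vector_store: any embedding without an explicit
    # schema entry in settings.yaml is auto-created and silently picks up
    # the 3072 default, regardless of the configured embedding model.
    for embedding in embeddings:
        if embedding not in schemas:
            schemas[embedding] = IndexSchema(index_name=embedding)
    return schemas


schemas = validate_vector_store({}, ["entity_description"])
print(schemas["entity_description"].vector_size)  # 3072
```

Because the init-generated settings.yaml contains no schema entries, every entry goes through this auto-creation path.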

graphrag_vectors/lancedb.py create_index():
The LanceDB table schema is created using self.vector_size (3072). Then load_documents() attempts to write actual embedding vectors (e.g., 1024-dim), triggering the Arrow FixedSizeList mismatch at line 80:

vector_column = pa.FixedSizeListArray.from_arrays(flat_array, self.vector_size)

Note that load_documents() does attempt to update self.vector_size from the first document (lines 65–66), but by that point the LanceDB table schema has already been created with the wrong size in create_index().
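The ordering problem can be demonstrated with a minimal mock of the store. Method names mirror lancedb.py, but the bodies are simplified stand-ins, not the real implementation:

```python
# Mock of the LanceDB store showing why the late vector_size update
# in load_documents() cannot fix the already-created table schema.
class MockLanceDBStore:
    def __init__(self, vector_size: int = 3072):
        self.vector_size = vector_size
        self.table_dim = None  # dimension baked into the table schema

    def create_index(self):
        # The table schema is frozen here, using the (default) vector_size.
        self.table_dim = self.vector_size

    def load_documents(self, docs):
        # vector_size is corrected here -- but the table already exists.
        self.vector_size = len(docs[0]["vector"])
        if self.vector_size != self.table_dim:
            raise ValueError(
                f"Size of FixedSizeList is not the same: "
                f"input {self.vector_size}, table {self.table_dim}"
            )


store = MockLanceDBStore()  # defaults to vector_size=3072
store.create_index()        # table created with dim 3072
docs = [{"id": "e1", "vector": [0.0] * 1024}]  # bge-m3-style 1024-dim vector
error = None
try:
    store.load_documents(docs)
except ValueError as exc:
    error = exc
print(error)  # Size of FixedSizeList is not the same: input 1024, table 3072
```

The late update leaves self.vector_size correct in memory while the on-disk schema stays at 3072, which is exactly the state the Arrow error reports.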


Impact

This affects all users who use any embedding model other than OpenAI text-embedding-3-large:

Model                            Dims     Affected
OpenAI text-embedding-3-large    3072     ✅ Works (this is the hardcoded default)
OpenAI text-embedding-3-small    1536     ❌ Crashes
OpenAI text-embedding-ada-002    1536     ❌ Crashes
NVIDIA NIM baai/bge-m3           1024     ❌ Crashes
Ollama nomic-embed-text          768      ❌ Crashes
Any other non-3072-dim model     varies   ❌ Crashes

The error occurs at workflow 10/10 — the very last step — after all LLM calls for entity extraction and community reports have already been made and billed. There is no graceful error message pointing to vector_size or settings.yaml as the cause.


Workaround

Explicitly set vector_size under each schema entry in settings.yaml to match the actual embedding model output dimensions:

vector_store:
  default:
    type: lancedb
    db_uri: ./output/lancedb
  entity_description:
    vector_size: 1024   # set to match your embedding model
  community_full_content:
    vector_size: 1024
  text_unit_text:
    vector_size: 1024
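If you are not sure what dimension your model outputs, you can probe it by embedding one short string and taking the vector length. The helper name and the stand-in embedder below are illustrative; replace `fake_bge_m3` with whatever embedding client call you already use:

```python
def detect_vector_size(embed_fn, probe_text: str = "dimension probe") -> int:
    """Embed one short string and return the length to put in settings.yaml."""
    return len(embed_fn(probe_text))


# Stand-in embedder for illustration only; a real setup would call the
# NVIDIA NIM, Ollama, or OpenAI embeddings endpoint here.
fake_bge_m3 = lambda text: [0.0] * 1024

print(detect_vector_size(fake_bge_m3))  # 1024 -> set vector_size: 1024
```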

Suggested Fix

Option A — Auto-detect from first document (minimal change):
In lancedb.py, move the self.vector_size update from load_documents() to before create_index() is called, by inspecting the first document's vector length before creating the table schema.
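A sketch of Option A, reusing the mock-store shape rather than the real lancedb.py code: the only change is that the data is inspected before the table schema is created.

```python
class FixedLanceDBStore:
    def __init__(self, vector_size: int = 3072):
        self.vector_size = vector_size
        self.table_dim = None

    def create_index(self):
        self.table_dim = self.vector_size

    def load_documents(self, docs):
        # Option A: correct vector_size from the first document *before*
        # the table schema is frozen, instead of after.
        if docs:
            self.vector_size = len(docs[0]["vector"])
        self.create_index()
        # ... write documents; dimensions now match by construction


store = FixedLanceDBStore()                       # still defaults to 3072
store.load_documents([{"vector": [0.0] * 1024}])  # schema created at 1024
print(store.table_dim)  # 1024
```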

Option B — Infer from model config at init time (better UX):
In _validate_vector_store() in graph_rag_config.py, when auto-creating missing IndexSchema entries, look up the output dimension from the configured embedding model's configuration (e.g., via a dimensions field in ModelConfig), instead of defaulting to 3072.
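Option B could be backed by a small lookup of published model dimensions, falling back to an explicit config value and only then to the legacy default. The table and function below are a hypothetical sketch, not an existing graphrag API:

```python
# Dimensions from the models' published specs.
KNOWN_MODEL_DIMS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "text-embedding-ada-002": 1536,
    "baai/bge-m3": 1024,
    "nomic-embed-text": 768,
}


def infer_vector_size(model_name, configured=None, default=3072):
    """Prefer an explicit vector_size/dimensions from config, then a known
    model spec, then the legacy default as a last resort."""
    if configured is not None:
        return configured
    return KNOWN_MODEL_DIMS.get(model_name, default)


print(infer_vector_size("baai/bge-m3"))    # 1024
print(infer_vector_size("unknown-model"))  # 3072 (falls back to default)
```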

Option C — Warning at config validation (minimal, immediate improvement):
Emit a clear warning during config loading if vector_size is left at default 3072 and the configured embedding model is not text-embedding-3-large, prompting the user to set it explicitly in settings.yaml.
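Option C is a one-function change at config-load time; a minimal sketch using the standard warnings module (function name and message wording are suggestions):

```python
import warnings

DEFAULT_VECTOR_SIZE = 3072


def warn_if_suspicious(model_name: str, vector_size: int) -> None:
    # Flag the untouched default whenever the model is not the one
    # the default was derived from.
    if vector_size == DEFAULT_VECTOR_SIZE and model_name != "text-embedding-3-large":
        warnings.warn(
            f"vector_size is at its default of {DEFAULT_VECTOR_SIZE}, but "
            f"'{model_name}' likely outputs a different dimension; set "
            f"vector_size explicitly under vector_store in settings.yaml.",
            UserWarning,
        )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_suspicious("nomic-embed-text", 3072)
print(len(caught))  # 1 warning emitted
```

This would surface the misconfiguration before any LLM calls are made, rather than at workflow 10/10.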


Additional Context

The graphrag init-generated settings.yaml contains no vector_store.index_schema entries, so new users have no indication that vector_size must be set manually for non-OpenAI embeddings. The only hint is buried in the LanceDB Arrow error message, which does not reference settings.yaml or vector_size as the fix.



Labels: backlog (We've confirmed some action is needed on this and will plan it), bug (Something isn't working)
