### Do you need to file an issue?

### Describe the bug

**Default `vector_size=3072` in `IndexSchema` causes a LanceDB `FixedSizeList` dimension mismatch for non-OpenAI embedding models**

GraphRAG version: 3.0.2
graphrag-vectors version: 3.0.2

#### Summary

When using any embedding model that does not output 3072-dimensional vectors, the pipeline crashes with a `LanceError(Arrow): Size of FixedSizeList is not the same` error during the `generate_text_embeddings` workflow (the final step). The root cause is a hardcoded default of `vector_size=3072` in `graphrag_vectors/index_schema.py`, which assumes OpenAI `text-embedding-3-large` as the universal embedding model.
#### Steps to Reproduce

1. Initialize a GraphRAG project:

   ```shell
   graphrag init --root ./my_project
   ```

2. Configure `settings.yaml` to use any embedding model that does not output 3072-dimensional vectors, for example:

   - NVIDIA NIM `baai/bge-m3` → 1024d
   - OpenAI `text-embedding-3-small` → 1536d
   - OpenAI `text-embedding-ada-002` → 1536d
   - Ollama `nomic-embed-text` → 768d

3. Do not set `vector_size` explicitly under `vector_store` in `settings.yaml` (the case for all `graphrag init`-generated configs, since the template contains no `index_schema` entries at all).

4. Run the indexing pipeline:

   ```shell
   graphrag index --root ./my_project
   ```

5. The pipeline fails during `generate_text_embeddings` (workflow 10/10) with:

   ```
   LanceError(Arrow): Size of FixedSizeList is not the same.
   input list: fixed_size_list<item: float>[1024]
   output list: fixed_size_list<item: float>[3072]
   ```
#### Root Cause

1. `graphrag_vectors/index_schema.py` line 10:

   ```python
   DEFAULT_VECTOR_SIZE: int = 3072  # hardcoded to OpenAI text-embedding-3-large dims
   ```

2. `graphrag_vectors/vector_store.py` line 42:

   ```python
   def __init__(self, ..., vector_size: int = 3072, ...):
   ```

3. `graphrag/config/models/graph_rag_config.py` lines 259–265 (`_validate_vector_store`): when an embedding's schema entry is missing from `settings.yaml`, it is auto-created as `IndexSchema(index_name=embedding)`, which inherits `DEFAULT_VECTOR_SIZE=3072`.

4. `graphrag_vectors/lancedb.py` `create_index()`: the LanceDB table schema is created using `self.vector_size` (3072). `load_documents()` then attempts to write the actual embedding vectors (e.g., 1024-dim), triggering the Arrow `FixedSizeList` mismatch at line 80:

   ```python
   vector_column = pa.FixedSizeListArray.from_arrays(flat_array, self.vector_size)
   ```

Note that `load_documents()` does attempt to update `self.vector_size` from the first document (lines 65–66), but by that point the LanceDB table schema has already been created with the wrong size in `create_index()`.
#### Impact

This affects all users of any embedding model other than OpenAI `text-embedding-3-large`:

| Model | Dims | Affected |
|---|---|---|
| OpenAI `text-embedding-3-large` | 3072 | ✅ Works (the hardcoded default) |
| OpenAI `text-embedding-3-small` | 1536 | ❌ Crashes |
| OpenAI `text-embedding-ada-002` | 1536 | ❌ Crashes |
| NVIDIA NIM `baai/bge-m3` | 1024 | ❌ Crashes |
| Ollama `nomic-embed-text` | 768 | ❌ Crashes |
| Any other non-3072-dim model | varies | ❌ Crashes |

The error occurs at workflow 10/10, the very last step, after all LLM calls for entity extraction and community reports have already been made and billed. There is no graceful error message pointing to `vector_size` or `settings.yaml` as the cause.
#### Workaround

Explicitly set `vector_size` under each schema entry in `settings.yaml` to match the actual embedding model output dimensions:

```yaml
vector_store:
  default:
    type: lancedb
    db_uri: ./output/lancedb
    entity_description:
      vector_size: 1024  # set to match your embedding model
    community_full_content:
      vector_size: 1024
    text_unit_text:
      vector_size: 1024
```
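If you are unsure what dimension your model emits, you can embed a short probe string and read off the vector length. The helper below is a hypothetical convenience, not part of GraphRAG; `embed_fn` stands in for whatever embeddings client you use:

```python
def detect_vector_size(embed_fn, probe_text: str = "dimension probe") -> int:
    """Return the output dimension of an embedding callable.

    embed_fn: any callable mapping a string to a list of floats, e.g. a thin
    wrapper around your provider's embeddings API.
    """
    return len(embed_fn(probe_text))


# Stand-in embedder for illustration; swap in your real client, e.g.
#   lambda t: client.embeddings.create(model="baai/bge-m3", input=t).data[0].embedding
fake_embedder = lambda text: [0.0] * 1024
print(detect_vector_size(fake_embedder))  # 1024
```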
#### Suggested Fix

**Option A: auto-detect from the first document (minimal change).** In `lancedb.py`, move the `self.vector_size` update from `load_documents()` to before `create_index()` is called, by inspecting the first document's vector length before creating the table schema.
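A minimal sketch of Option A. The class and method bodies here are hypothetical stand-ins for the real `lancedb.py` code; only the reordering matters — the size is taken from the data before the table schema exists:

```python
class LanceDBVectorStore:
    """Simplified stand-in for the real store; illustrates Option A's call order."""

    def __init__(self, vector_size: int = 3072):
        self.vector_size = vector_size
        self._table = None

    def load_documents(self, documents: list[dict]) -> None:
        # Inspect the first document's vector BEFORE the table schema is created,
        # instead of patching self.vector_size after create_index() has run.
        if documents:
            self.vector_size = len(documents[0]["vector"])
        if self._table is None:
            self.create_index()  # schema now uses the detected size
        # ... write documents to self._table ...

    def create_index(self) -> None:
        # Placeholder for real LanceDB table creation; only the width matters here.
        self._table = {"vector_width": self.vector_size}
```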
**Option B: infer from the model config at init time (better UX).** In `_validate_vector_store()` in `graph_rag_config.py`, when auto-creating missing `IndexSchema` entries, look up the output dimension from the configured embedding model's configuration (e.g., via a `dimensions` field in `ModelConfig`) instead of defaulting to 3072.
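Option B could look roughly like this. The `ModelConfig.dimensions` field and the `make_index_schema` helper are assumptions for illustration, not existing GraphRAG API:

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_VECTOR_SIZE = 3072  # current default in index_schema.py


@dataclass
class ModelConfig:
    """Stand-in; assumes a `dimensions` field is added to the real ModelConfig."""
    model: str
    dimensions: Optional[int] = None


@dataclass
class IndexSchema:
    index_name: str
    vector_size: int = DEFAULT_VECTOR_SIZE


def make_index_schema(embedding: str, model_config: ModelConfig) -> IndexSchema:
    # Prefer the configured model's dimension; fall back to the old default
    # only when the config does not declare one.
    size = model_config.dimensions or DEFAULT_VECTOR_SIZE
    return IndexSchema(index_name=embedding, vector_size=size)
```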
**Option C: warn at config validation (minimal, immediate improvement).** Emit a clear warning during config loading if `vector_size` is left at the default 3072 and the configured embedding model is not `text-embedding-3-large`, prompting the user to set it explicitly in `settings.yaml`.
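A sketch of the Option C check; the function name and its call site are hypothetical, but the condition is exactly the one described above:

```python
import warnings

DEFAULT_VECTOR_SIZE = 3072  # the current default in index_schema.py


def warn_on_default_vector_size(model_name: str, vector_size: int) -> bool:
    """Hypothetical validation hook: warn when the 3072 default is kept
    but the configured embedding model is not text-embedding-3-large.
    Returns True if a warning was emitted."""
    if vector_size == DEFAULT_VECTOR_SIZE and model_name != "text-embedding-3-large":
        warnings.warn(
            f"vector_size is the default {DEFAULT_VECTOR_SIZE}, but embedding model "
            f"'{model_name}' may output a different dimension; set vector_size "
            "explicitly under vector_store in settings.yaml.",
            stacklevel=2,
        )
        return True
    return False
```

Unlike Options A and B this does not fix the crash, but it surfaces the problem at config load time instead of at workflow 10/10.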
#### Additional Context

The `graphrag init`-generated `settings.yaml` contains no `vector_store.index_schema` entries, so new users have no indication that `vector_size` must be set manually for non-OpenAI embeddings. The only hint is buried in the LanceDB Arrow error message, which references neither `settings.yaml` nor `vector_size` as the fix.
### Steps to reproduce

No response

### Expected Behavior

No response

### GraphRAG Config Used

```yaml
# Paste your config here
```

### Logs and screenshots

No response

### Additional Information

- GraphRAG Version:
- Operating System:
- Python Version:
- Related Issues: