
[Bug]: Default vector_size=3072 in IndexSchema causes LanceDB FixedSizeList dimension mismatch for non-OpenAI embedding models #2231

@laomomo

Description


Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

Default vector_size=3072 in IndexSchema causes LanceDB FixedSizeList dimension mismatch for non-OpenAI embedding models

GraphRAG version: 3.0.2
graphrag-vectors version: 3.0.2


Summary

When using any embedding model that does not output 3072-dimensional vectors, the pipeline crashes during the generate_text_embeddings workflow (the final step) with the error LanceError(Arrow): Size of FixedSizeList is not the same. The root cause is a hardcoded default of vector_size=3072 in graphrag_vectors/index_schema.py, which assumes OpenAI text-embedding-3-large as the universal embedding model.


Steps to Reproduce

  1. Initialize a GraphRAG project:

    graphrag init --root ./my_project
    
  2. Configure settings.yaml to use any embedding model that does not output 3072-dimensional vectors, for example:

    • NVIDIA NIM baai/bge-m3 → 1024d
    • OpenAI text-embedding-3-small → 1536d
    • OpenAI text-embedding-ada-002 → 1536d
    • Ollama nomic-embed-text → 768d
  3. Do not set vector_size explicitly under vector_store in settings.yaml (which is the case for all graphrag init-generated configs, since the template contains no index_schema entries at all)

  4. Run the indexing pipeline:

    graphrag index --root ./my_project
    
  5. Pipeline fails during generate_text_embeddings (workflow 10/10) with:

    LanceError(Arrow): Size of FixedSizeList is not the same.
    input list: fixed_size_list<item: float>[1024]
    output list: fixed_size_list<item: float>[3072]
    

Root Cause

graphrag_vectors/index_schema.py line 10:

DEFAULT_VECTOR_SIZE: int = 3072   # hardcoded to OpenAI text-embedding-3-large dims

graphrag_vectors/vector_store.py line 42:

def __init__(self, ..., vector_size: int = 3072, ...):

graphrag/config/models/graph_rag_config.py lines 259–265 (_validate_vector_store):
When an embedding's schema entry is missing from settings.yaml, it is auto-created as IndexSchema(index_name=embedding) — which inherits DEFAULT_VECTOR_SIZE=3072.
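The silent inheritance can be sketched in a few lines. This is a hypothetical reconstruction of the schema model and validator, not the exact graphrag_vectors definitions; field and function names are illustrative:

```python
from dataclasses import dataclass

DEFAULT_VECTOR_SIZE = 3072  # module-level default, as in index_schema.py


@dataclass
class IndexSchema:
    index_name: str
    vector_size: int = DEFAULT_VECTOR_SIZE  # inherited by auto-created entries


def validate_vector_store(schemas: dict, embeddings: list) -> dict:
    # Sketch of _validate_vector_store: any embedding without an explicit
    # schema entry in settings.yaml is auto-created and silently picks up
    # the 3072 default, regardless of the configured embedding model.
    for embedding in embeddings:
        if embedding not in schemas:
            schemas[embedding] = IndexSchema(index_name=embedding)
    return schemas


schemas = validate_vector_store({}, ["entity_description"])
print(schemas["entity_description"].vector_size)  # 3072
```

Because the init-generated settings.yaml contains no schema entries, every entry goes through this auto-creation path.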

graphrag_vectors/lancedb.py create_index():
The LanceDB table schema is created using self.vector_size (3072). Then load_documents() attempts to write actual embedding vectors (e.g., 1024-dim), triggering the Arrow FixedSizeList mismatch at line 80:

vector_column = pa.FixedSizeListArray.from_arrays(flat_array, self.vector_size)

Note that load_documents() does attempt to update self.vector_size from the first document (lines 65–66), but by that point the LanceDB table schema has already been created with the wrong size in create_index().
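The ordering problem can be demonstrated with a minimal mock of the store. Method names mirror lancedb.py, but the bodies are simplified stand-ins, not the real implementation:

```python
# Mock of the LanceDB store showing why the late vector_size update
# in load_documents() cannot fix the already-created table schema.
class MockLanceDBStore:
    def __init__(self, vector_size: int = 3072):
        self.vector_size = vector_size
        self.table_dim = None  # dimension baked into the table schema

    def create_index(self):
        # The table schema is frozen here, using the (default) vector_size.
        self.table_dim = self.vector_size

    def load_documents(self, docs):
        # vector_size is corrected here -- but the table already exists.
        self.vector_size = len(docs[0]["vector"])
        if self.vector_size != self.table_dim:
            raise ValueError(
                f"Size of FixedSizeList is not the same: "
                f"input {self.vector_size}, table {self.table_dim}"
            )


store = MockLanceDBStore()  # defaults to vector_size=3072
store.create_index()        # table created with dim 3072
docs = [{"id": "e1", "vector": [0.0] * 1024}]  # bge-m3-style 1024-dim vector
error = None
try:
    store.load_documents(docs)
except ValueError as exc:
    error = exc
print(error)  # Size of FixedSizeList is not the same: input 1024, table 3072
```

The late update leaves self.vector_size correct in memory while the on-disk schema stays at 3072, which is exactly the state the Arrow error reports.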


Impact

This affects all users who use any embedding model other than OpenAI text-embedding-3-large:

Model                            Dims     Affected
OpenAI text-embedding-3-large    3072     ✅ Works (this is the hardcoded default)
OpenAI text-embedding-3-small    1536     ❌ Crashes
OpenAI text-embedding-ada-002    1536     ❌ Crashes
NVIDIA NIM baai/bge-m3           1024     ❌ Crashes
Ollama nomic-embed-text          768      ❌ Crashes
Any other non-3072-dim model     varies   ❌ Crashes

The error occurs at workflow 10/10 — the very last step — after all LLM calls for entity extraction and community reports have already been made and billed. There is no graceful error message pointing to vector_size or settings.yaml as the cause.


Workaround

Explicitly set vector_size under each schema entry in settings.yaml to match the actual embedding model output dimensions:

vector_store:
  default:
    type: lancedb
    db_uri: ./output/lancedb
  entity_description:
    vector_size: 1024   # set to match your embedding model
  community_full_content:
    vector_size: 1024
  text_unit_text:
    vector_size: 1024
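If you are not sure what dimension your model outputs, you can probe it by embedding one short string and taking the vector length. The helper name and the stand-in embedder below are illustrative; replace `fake_bge_m3` with whatever embedding client call you already use:

```python
def detect_vector_size(embed_fn, probe_text: str = "dimension probe") -> int:
    """Embed one short string and return the length to put in settings.yaml."""
    return len(embed_fn(probe_text))


# Stand-in embedder for illustration only; a real setup would call the
# NVIDIA NIM, Ollama, or OpenAI embeddings endpoint here.
fake_bge_m3 = lambda text: [0.0] * 1024

print(detect_vector_size(fake_bge_m3))  # 1024 -> set vector_size: 1024
```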

Suggested Fix

Option A — Auto-detect from first document (minimal change):
In lancedb.py, move the self.vector_size update from load_documents() to before create_index() is called, by inspecting the first document's vector length before creating the table schema.
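A sketch of Option A, reusing the mock-store shape rather than the real lancedb.py code: the only change is that the data is inspected before the table schema is created.

```python
class FixedLanceDBStore:
    def __init__(self, vector_size: int = 3072):
        self.vector_size = vector_size
        self.table_dim = None

    def create_index(self):
        self.table_dim = self.vector_size

    def load_documents(self, docs):
        # Option A: correct vector_size from the first document *before*
        # the table schema is frozen, instead of after.
        if docs:
            self.vector_size = len(docs[0]["vector"])
        self.create_index()
        # ... write documents; dimensions now match by construction


store = FixedLanceDBStore()                       # still defaults to 3072
store.load_documents([{"vector": [0.0] * 1024}])  # schema created at 1024
print(store.table_dim)  # 1024
```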

Option B — Infer from model config at init time (better UX):
In _validate_vector_store() in graph_rag_config.py, when auto-creating missing IndexSchema entries, look up the output dimension from the configured embedding model's configuration (e.g., via a dimensions field in ModelConfig), instead of defaulting to 3072.
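Option B could be backed by a small lookup of published model dimensions, falling back to an explicit config value and only then to the legacy default. The table and function below are a hypothetical sketch, not an existing graphrag API:

```python
# Dimensions from the models' published specs.
KNOWN_MODEL_DIMS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "text-embedding-ada-002": 1536,
    "baai/bge-m3": 1024,
    "nomic-embed-text": 768,
}


def infer_vector_size(model_name, configured=None, default=3072):
    """Prefer an explicit vector_size/dimensions from config, then a known
    model spec, then the legacy default as a last resort."""
    if configured is not None:
        return configured
    return KNOWN_MODEL_DIMS.get(model_name, default)


print(infer_vector_size("baai/bge-m3"))    # 1024
print(infer_vector_size("unknown-model"))  # 3072 (falls back to default)
```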

Option C — Warning at config validation (minimal, immediate improvement):
Emit a clear warning during config loading if vector_size is left at default 3072 and the configured embedding model is not text-embedding-3-large, prompting the user to set it explicitly in settings.yaml.
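Option C is a one-function change at config-load time; a minimal sketch using the standard warnings module (function name and message wording are suggestions):

```python
import warnings

DEFAULT_VECTOR_SIZE = 3072


def warn_if_suspicious(model_name: str, vector_size: int) -> None:
    # Flag the untouched default whenever the model is not the one
    # the default was derived from.
    if vector_size == DEFAULT_VECTOR_SIZE and model_name != "text-embedding-3-large":
        warnings.warn(
            f"vector_size is at its default of {DEFAULT_VECTOR_SIZE}, but "
            f"'{model_name}' likely outputs a different dimension; set "
            f"vector_size explicitly under vector_store in settings.yaml.",
            UserWarning,
        )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_suspicious("nomic-embed-text", 3072)
print(len(caught))  # 1 warning emitted
```

This would surface the misconfiguration before any LLM calls are made, rather than at workflow 10/10.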


Additional Context

The graphrag init-generated settings.yaml contains no vector_store.index_schema entries, so new users have no indication that vector_size must be set manually for non-OpenAI embeddings. The only hint is buried in the LanceDB Arrow error message, which does not reference settings.yaml or vector_size as the fix.



Labels: backlog (We've confirmed some action is needed on this and will plan it), bug (Something isn't working)
