fix: use crawl4ai result.markdown instead of removed markdown_v2 by obchain · Pull Request #52 · sentient-agi/OpenDeepSearch

obchain · 2026-05-13T11:20:06Z

What

Stop reading the legacy result.markdown_v2 attribute in WebScraper.extract and use result.markdown (the current MarkdownGenerationResult) instead, with a getattr fallback to markdown_v2 for older crawl4ai builds.

Why

crawl4ai 0.5.x removed markdown_v2; accessing it now raises a hard AttributeError from __getattr__:

AttributeError: The 'markdown_v2' attribute is deprecated and has been removed.
Please use 'markdown' instead, which now returns a MarkdownGenerationResult

Two call sites in context_scraping/crawl4ai_scraper.py were hitting it:

The no_extraction / cosine branch guards with hasattr(result, 'markdown_v2'). On 0.5.x that returns False, so content silently stays None and the scraper returns an empty payload, which is exactly what issue #34 ("About results of the WebScraper") reports (Debug: Processed content: None).
A few lines later, len(result.markdown_v2.raw_markdown) runs unconditionally, so even a successful crawl raises the same AttributeError and the extraction loop falls into the broad except block.
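
For reviewers who want to see both failure modes in isolation, here is a minimal sketch. LegacyResult and RemovedAttrResult are hypothetical stand-ins written for this illustration, not the real crawl4ai classes, and the two old_* functions only mirror the shape of the call sites described above.

```python
from types import SimpleNamespace


class LegacyResult:
    """Hypothetical stand-in for a pre-0.5 result that still exposes markdown_v2."""

    def __init__(self, raw):
        self.markdown_v2 = SimpleNamespace(raw_markdown=raw)


class RemovedAttrResult:
    """Hypothetical stand-in for a 0.5.x result: markdown_v2 access raises."""

    def __init__(self, raw):
        self.markdown = SimpleNamespace(raw_markdown=raw)

    def __getattr__(self, name):
        # __getattr__ is only called when normal attribute lookup fails.
        if name == "markdown_v2":
            raise AttributeError(
                "The 'markdown_v2' attribute is deprecated and has been removed."
            )
        raise AttributeError(name)


def old_guarded_read(result):
    """Shape of the first call site: guarded by hasattr."""
    content = None
    # hasattr() swallows AttributeError and returns False, so on the
    # 0.5.x stand-in this branch is skipped without any error.
    if hasattr(result, "markdown_v2"):
        content = result.markdown_v2.raw_markdown
    return content


def old_unguarded_length(result):
    """Shape of the second call site: unconditional attribute access."""
    return len(result.markdown_v2.raw_markdown)
```

With the 0.5.x stand-in, old_guarded_read returns None without raising anything, while old_unguarded_length raises AttributeError. With the legacy stand-in both behave as before.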

Closes #34

How

src/opendeepsearch/context_scraping/crawl4ai_scraper.py:

compute markdown_obj = getattr(result, 'markdown', None) or getattr(result, 'markdown_v2', None) once per result
read raw_markdown from markdown_obj for no_extraction / cosine strategies
guard the raw_markdown_length / citations_markdown_length bookkeeping on markdown_obj is not None and use getattr(markdown_obj, '<field>', '') or '' for the individual fields, so a partially-populated MarkdownGenerationResult does not crash the bookkeeping either

The getattr fallback to markdown_v2 keeps the path working for anyone still on a pre-0.5 crawl4ai build (the project's pinned crawl4ai @ git+...salzubi401/crawl4ai.git@main is left untouched on purpose).
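
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the literal diff: the function name markdown_lengths, the markdown_with_citations field name, and returning zeros when no markdown object is found are all choices made for this example.

```python
from types import SimpleNamespace


def markdown_lengths(result):
    """Return length bookkeeping for a crawl result without touching
    markdown_v2 unless result.markdown is missing or empty."""
    # Prefer the current attribute; fall back to the legacy one.
    # getattr with a default also absorbs the AttributeError that
    # newer builds raise for markdown_v2.
    markdown_obj = getattr(result, "markdown", None) or getattr(
        result, "markdown_v2", None
    )

    if markdown_obj is None:
        # Assumption for this sketch: report zero lengths when nothing is found.
        return {"raw_markdown_length": 0, "citations_markdown_length": 0}

    # "or ''" covers fields that exist but are set to None.
    raw = getattr(markdown_obj, "raw_markdown", "") or ""
    cited = getattr(markdown_obj, "markdown_with_citations", "") or ""
    return {
        "raw_markdown_length": len(raw),
        "citations_markdown_length": len(cited),
    }


# Usage with simple doubles standing in for crawl results:
new_style = SimpleNamespace(
    markdown=SimpleNamespace(raw_markdown="abc", markdown_with_citations=None)
)
old_style = SimpleNamespace(
    markdown_v2=SimpleNamespace(
        raw_markdown="hello", markdown_with_citations="hello [1]"
    )
)
print(markdown_lengths(new_style))
print(markdown_lengths(old_style))
```

Resolving markdown_obj once keeps the version check in a single place, so the strategy branch and the bookkeeping cannot disagree about which attribute they read.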

Testing

python3 -m py_compile src/opendeepsearch/context_scraping/crawl4ai_scraper.py — clean
grep -n markdown_v2 src/opendeepsearch/context_scraping/crawl4ai_scraper.py — only the intentional fallback line remains
Manual repro path (matches the issue): with crawl4ai 0.5.x installed and WebScraper(debug=True).scrape(<any URL>), the unpatched code logs Debug: Processed content: None and surfaces AttributeError: 'markdown_v2' attribute is deprecated. With this patch, result.markdown.raw_markdown is read instead and content is populated.

No new unit tests added — WebScraper instantiates AsyncWebCrawler and reaches the network, so a meaningful test would need either VCR fixtures or a fake result double. Happy to add a small unit test for _pick_markdown_object if you'd prefer to extract a helper.
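
If the helper route is preferred, a network-free test could look something like the sketch below. _pick_markdown_object does not exist in the codebase yet; its body here and the _RemovedV2Double class are assumptions made for illustration only.

```python
import unittest
from types import SimpleNamespace


def _pick_markdown_object(result):
    """Hypothetical helper: current attribute first, legacy attribute second."""
    return getattr(result, "markdown", None) or getattr(result, "markdown_v2", None)


class _RemovedV2Double:
    """Fake result double that mimics a build where markdown_v2 access raises."""

    def __init__(self, markdown):
        self.markdown = markdown

    def __getattr__(self, name):
        # Only reached for attributes that are not set on the instance.
        raise AttributeError(f"{name!r} has been removed")


class PickMarkdownObjectTests(unittest.TestCase):
    def test_prefers_markdown_when_present(self):
        md = SimpleNamespace(raw_markdown="# new")
        self.assertIs(_pick_markdown_object(_RemovedV2Double(md)), md)

    def test_falls_back_to_markdown_v2_on_legacy_result(self):
        legacy_md = SimpleNamespace(raw_markdown="# old")
        result = SimpleNamespace(markdown=None, markdown_v2=legacy_md)
        self.assertIs(_pick_markdown_object(result), legacy_md)

    def test_returns_none_when_nothing_is_available(self):
        # markdown is None and markdown_v2 raises; the helper must not crash.
        self.assertIsNone(_pick_markdown_object(_RemovedV2Double(None)))
```

Saved in a test module, this runs with python3 -m unittest and never instantiates AsyncWebCrawler, so no VCR fixtures are needed.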

`crawl4ai` 0.5.x deletes the legacy `markdown_v2` attribute and raises an `AttributeError` from `__getattr__` whenever it is accessed, which breaks the `no_extraction` / `cosine` scraping path in `WebScraper.extract` — `content` stays `None` and the subsequent `len(result.markdown_v2.raw_markdown)` re-raises. Read the markdown payload from `result.markdown` (the current `MarkdownGenerationResult`) and fall back to `markdown_v2` via `getattr` so installations on older crawl4ai builds keep working. Guard the length bookkeeping against missing attributes too. Fixes sentient-agi#34

This was referenced May 19, 2026

BasicWebScraper.extract carries the same markdown_v2 deprecation bug as #34 (#59, open)

fix: drop markdown_v2 access in BasicWebScraper.extract (#60, open)

obchain wants to merge 1 commit into sentient-agi:main from obchain:fix/34-crawl4ai-markdown-v2
