
fix: use crawl4ai result.markdown instead of removed markdown_v2 #52

Open
obchain wants to merge 1 commit into sentient-agi:main from obchain:fix/34-crawl4ai-markdown-v2

obchain commented May 13, 2026

What

Stop reading the legacy result.markdown_v2 attribute in WebScraper.extract and use result.markdown (the current MarkdownGenerationResult) instead, with a getattr fallback to markdown_v2 for older crawl4ai builds.

Why

crawl4ai 0.5.x removed markdown_v2; accessing it now raises a hard AttributeError from __getattr__:

AttributeError: The 'markdown_v2' attribute is deprecated and has been removed.
Please use 'markdown' instead, which now returns a MarkdownGenerationResult

Two call sites in context_scraping/crawl4ai_scraper.py were hitting it:

  1. The no_extraction / cosine branch is guarded by hasattr(result, 'markdown_v2'). On 0.5.x that returns False, because hasattr treats the AttributeError as "attribute missing", so content silently stays None and the scraper returns an empty payload. This is exactly what issue #34 (About results of the WebScraper) reports (Debug: Processed content: None).
  2. A few lines later, len(result.markdown_v2.raw_markdown) runs unconditionally, so even after a successful crawl the scraper raises the same AttributeError and the extraction loop falls into the broad except block.
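The two failure modes above can be reproduced without crawl4ai installed. The sketch below is only an illustration: FakeCrawlResult is a hypothetical stand-in that mimics the described 0.5.x behaviour (a __getattr__ that raises AttributeError for markdown_v2), not the library's actual class.

```python
from types import SimpleNamespace


class FakeCrawlResult:
    """Hypothetical stand-in for a crawl4ai 0.5.x result (not the real class)."""

    def __init__(self, raw_markdown):
        # In this sketch, `markdown` is the only supported payload attribute.
        self.markdown = SimpleNamespace(raw_markdown=raw_markdown)

    def __getattr__(self, name):
        # Invoked only when normal attribute lookup fails.
        if name == "markdown_v2":
            raise AttributeError(
                "The 'markdown_v2' attribute is deprecated and has been removed."
            )
        raise AttributeError(name)


result = FakeCrawlResult("# Title")

# Call site 1: hasattr() swallows the AttributeError and returns False,
# so the branch is skipped and content silently stays None.
content = None
if hasattr(result, "markdown_v2"):
    content = result.markdown_v2.raw_markdown

# Call site 2: the unconditional access lets the AttributeError propagate.
try:
    length = len(result.markdown_v2.raw_markdown)
    error = None
except AttributeError as exc:
    error = str(exc)
```

After this runs, content is still None and error holds the deprecation message, which matches the symptom pair described in the list.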

Closes #34

How

src/opendeepsearch/context_scraping/crawl4ai_scraper.py:

  • compute markdown_obj = getattr(result, 'markdown', None) or getattr(result, 'markdown_v2', None) once per result
  • read raw_markdown from markdown_obj for no_extraction / cosine strategies
  • guard the raw_markdown_length / citations_markdown_length bookkeeping on markdown_obj is not None, and read the individual fields with getattr(..., '<field>', '') or '', so a partially populated MarkdownGenerationResult does not crash the bookkeeping either

The getattr fallback to markdown_v2 keeps the path working for anyone still on a pre-0.5 crawl4ai build (the project's pinned crawl4ai @ git+...salzubi401/crawl4ai.git@main is left untouched on purpose).
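A minimal sketch of the selection and bookkeeping logic described above, not the actual diff. The function names pick_markdown_object and markdown_lengths are hypothetical, the SimpleNamespace objects stand in for real crawl results, and markdown_with_citations is assumed to be the field behind citations_markdown_length.

```python
from types import SimpleNamespace


def pick_markdown_object(result):
    """Prefer result.markdown; fall back to the legacy markdown_v2 attribute."""
    # getattr with a default also absorbs an AttributeError raised by a
    # custom __getattr__, so this is safe on both old and new builds.
    return getattr(result, "markdown", None) or getattr(result, "markdown_v2", None)


def markdown_lengths(markdown_obj):
    """Return (raw_length, citations_length), tolerating missing or None fields."""
    if markdown_obj is None:
        return 0, 0
    raw = getattr(markdown_obj, "raw_markdown", "") or ""
    # Assumption: this field name is not stated in the PR description.
    cited = getattr(markdown_obj, "markdown_with_citations", "") or ""
    return len(raw), len(cited)


# Stand-ins for a current result, a pre-0.5 result, and an empty result.
current = SimpleNamespace(
    markdown=SimpleNamespace(raw_markdown="abc", markdown_with_citations=None)
)
legacy = SimpleNamespace(markdown_v2=SimpleNamespace(raw_markdown="hello"))
empty = SimpleNamespace()

current_lengths = markdown_lengths(pick_markdown_object(current))
legacy_lengths = markdown_lengths(pick_markdown_object(legacy))
empty_lengths = markdown_lengths(pick_markdown_object(empty))
```

The trailing `or ''` matters for the partially populated case: a field that exists but is None would otherwise reach len() and raise a TypeError.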

Testing

  • python3 -m py_compile src/opendeepsearch/context_scraping/crawl4ai_scraper.py — clean
  • grep -n markdown_v2 src/opendeepsearch/context_scraping/crawl4ai_scraper.py — only the intentional fallback line remains
  • Manual repro path (matches the issue): with crawl4ai 0.5.x installed and WebScraper(debug=True).scrape(<any URL>), the unpatched code logs Debug: Processed content: None and surfaces AttributeError: 'markdown_v2' attribute is deprecated. With this patch, result.markdown.raw_markdown is read instead and content is populated.

No new unit tests added — WebScraper instantiates AsyncWebCrawler and reaches the network, so a meaningful test would need either VCR fixtures or a fake result double. Happy to add a small unit test for _pick_markdown_object if you'd prefer to extract a helper.
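If the helper were extracted, a test with a fake result double might look roughly like the following. Everything here is hypothetical: _pick_markdown_object does not exist in the codebase yet, and RemovedAttrResult only imitates the 0.5.x behaviour described earlier, so treat this as a sketch of the test shape rather than a proposed file.

```python
import unittest
from types import SimpleNamespace


def _pick_markdown_object(result):
    """Hypothetical helper: the selection expression from this PR, extracted."""
    return getattr(result, "markdown", None) or getattr(result, "markdown_v2", None)


class RemovedAttrResult:
    """Fake result double whose unknown attributes raise, as on crawl4ai 0.5.x."""

    def __init__(self, markdown=None):
        self.markdown = markdown

    def __getattr__(self, name):
        raise AttributeError(
            f"The '{name}' attribute is deprecated and has been removed."
        )


class PickMarkdownObjectTest(unittest.TestCase):
    def test_prefers_markdown_on_new_builds(self):
        payload = SimpleNamespace(raw_markdown="# new")
        self.assertIs(_pick_markdown_object(RemovedAttrResult(payload)), payload)

    def test_falls_back_to_markdown_v2_on_old_builds(self):
        payload = SimpleNamespace(raw_markdown="# old")
        legacy = SimpleNamespace(markdown_v2=payload)
        self.assertIs(_pick_markdown_object(legacy), payload)

    def test_returns_none_when_nothing_is_available(self):
        self.assertIsNone(_pick_markdown_object(RemovedAttrResult(None)))


# Run the suite programmatically so the snippet does not call sys.exit().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PickMarkdownObjectTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

This avoids AsyncWebCrawler and the network entirely, at the cost of only covering the attribute selection and not the scraping path itself.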

`crawl4ai` 0.5.x deletes the legacy `markdown_v2` attribute and raises
an `AttributeError` from `__getattr__` whenever it is accessed, which
breaks the `no_extraction` / `cosine` scraping path in
`WebScraper.extract` — `content` stays `None` and the subsequent
`len(result.markdown_v2.raw_markdown)` re-raises.

Read the markdown payload from `result.markdown` (the current
`MarkdownGenerationResult`) and fall back to `markdown_v2` via
`getattr` so installations on older crawl4ai builds keep working.
Guard the length bookkeeping against missing attributes too.

Fixes sentient-agi#34