Use asynchronous crawling #3

Merged
umpirsky merged 11 commits into master from feature/async
Dec 28, 2014

Conversation

@umpirsky
Owner

Currently, the crawler runs synchronously, meaning that it spends most of its time waiting on I/O.
Using an asynchronous HTTP client would make it much more efficient, because HTTP requests could be sent in parallel.

Note that the asynchronous behavior can be implemented in a fully backward-compatible (BC) way: $crawler->crawl() could still block until all URLs have been crawled, and the callback-based API can be kept the same.

This means that instead of using Goutte to send the requests, you would need to use Guzzle directly, so that you can access its asynchronous API. To keep BC, you could then convert the response back to a BrowserKit response before calling the callback (or you could decide to break BC and pass a Guzzle response instead).
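
The shape of this proposal can be sketched independently of Guzzle. The following is a minimal, language-neutral illustration in Python, not the project's PHP code and not Guzzle's API: it uses a thread pool rather than a curl-based asynchronous client, and the names `crawl`, `fetch`, `fetcher`, and `concurrency` are hypothetical. It only shows that downloads can run in parallel while `crawl()` still blocks until every callback has been called.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, Optional
from urllib.request import urlopen

# Callback signature (an assumption for this sketch): url, body, error.
Callback = Callable[[str, Optional[bytes], Optional[Exception]], None]


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Blocking download of a single URL (hypothetical default fetcher)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(
    urls: Iterable[str],
    callback: Callback,
    fetcher: Callable[[str], bytes] = fetch,
    concurrency: int = 8,
) -> None:
    """Fetch all URLs in parallel and return only when all are done."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Submit every download up front so the waits on I/O overlap.
        futures = {pool.submit(fetcher, url): url for url in urls}
        # Callbacks run in the calling thread, one at a time, as results arrive.
        for future in as_completed(futures):
            url = futures[future]
            try:
                body = future.result()
            except Exception as error:  # report failures through the callback
                callback(url, None, error)
            else:
                callback(url, body, None)
```

Because the function returns only after the last callback has run, a caller written against a blocking, callback-based API would not need to change; only the internals become concurrent.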

@umpirsky
Owner

@stof I was thinking about this, and my initial idea was to use ReactPHP.

Your proposal sounds better to me and is definitely one of the next things I want to do on this project.

Thanks!

@stof
Contributor Author

stof commented Dec 15, 2014

ReactPHP would not be a good fit here:

  • its event loop is probably overkill here
  • its HTTP client does not fully support HTTP (gzip, etc.) because it implements HTTP itself on top of its asynchronous socket client. Guzzle, on the other hand, relies on curl for the HTTP part, which is feature-complete.

@umpirsky
Owner

@stof Thanks for sharing, I was aware of the 2nd point. 👍

@umpirsky umpirsky self-assigned this Dec 19, 2014
@umpirsky
Owner

@stof Here is the initial implementation. Feedback is welcome, thanks.

umpirsky added a commit that referenced this pull request Dec 28, 2014
@umpirsky umpirsky merged commit 3c4a72b into master Dec 28, 2014
@stof
Contributor Author

stof commented Dec 29, 2014

@umpirsky Turning an issue opened by me into a PR from your repo is really weird (and it makes things hard to understand when looking at them later). It would be much better to open a separate PR referencing the issue.

Contributor

This does not work, because links can be relative.
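
The code under review is not shown in this thread, so the following is only a generic illustration of the point, written in Python rather than the project's PHP. It assumes a hypothetical helper, `resolve_link`, that turns an `href` found on a page into an absolute URL by resolving it against the URL of the page it came from. It uses the standard library's `urljoin`.

```python
from urllib.parse import urljoin


def resolve_link(page_url: str, href: str) -> str:
    """Resolve a possibly relative href against the page it was found on."""
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # (path-relative, root-relative, and scheme-relative) against page_url.
    return urljoin(page_url, href)


if __name__ == "__main__":
    page = "http://example.com/docs/index.html"  # placeholder URL
    for href in ("../about.html", "/contact", "page2.html", "https://other.org/x"):
        print(href, "->", resolve_link(page, href))
```

A crawler that compares or requests the raw `href` value would miss or mis-request the first three links above, which is why resolution has to happen before a link is queued.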

@umpirsky
Owner

@stof Yes, I know, sorry. I merged another branch into this one, so I created a mess. You can focus on 346f2a2.
