Conversation
Owner
|
@stof I was thinking about this, and my initial idea was to use ReactPHP. Your proposal sounds better to me and is definitely one of the next things I want to do on this project. Thanks! |
Contributor
Author
|
ReactPHP would not be a good fit here:
|
Owner
|
@stof Thanks for sharing, I was aware of the second point. 👍 |
Owner
|
@stof Here is the initial implementation. Feedback is welcome, thanks. |
Contributor
Author
|
@umpirsky Turning an issue opened by me into a PR from your repo is really weird (and it makes it hard to understand things when looking at them later). It would be much better to open a separate PR referencing the issue.
Contributor
This does not work, because links can be relative.
Currently, the crawler is running synchronously, meaning that it spends most of its time waiting for IO.
Using an asynchronous HTTP client would allow being much more efficient, by doing HTTP calls in parallel.
Note that the asynchronous behavior can be implemented in a fully BC way:
$crawler->crawl() could still block until all URLs have been crawled, and the callback-based API can be kept the same.

This means that instead of using Goutte to send the calls, you would need to use Guzzle directly, to be able to access its asynchronous API. You could then convert the response back to a BrowserKit response before calling the callback if you want to keep BC (or you could decide to break BC and pass a Guzzle response instead).
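As a rough sketch of what the Guzzle-based approach could look like (the crawler class names and callback signature here are hypothetical, not the project's actual API; only the Guzzle `Pool` usage is real), requests can be dispatched concurrently while the outer call still blocks until everything is done:

```php
<?php
// Sketch only: assumes guzzlehttp/guzzle is installed via Composer.
// The URLs and the shape of the user callback are illustrative assumptions.

require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
use Psr\Http\Message\ResponseInterface;

$client = new Client(['timeout' => 10]);

$urls = ['https://example.com/a', 'https://example.com/b'];

// Build one request per URL lazily.
$requests = function (array $urls) {
    foreach ($urls as $url) {
        yield new Request('GET', $url);
    }
};

// Pool sends up to `concurrency` requests in parallel; the `fulfilled`
// callback plays the role of the crawler's existing callback-based API.
$pool = new Pool($client, $requests($urls), [
    'concurrency' => 5,
    'fulfilled' => function (ResponseInterface $response, $index) use ($urls) {
        // At this point the crawler could convert $response back to a
        // BrowserKit response before invoking the user callback (keeping BC),
        // or pass the Guzzle/PSR-7 response directly (a BC break).
        echo $urls[$index], ' => ', $response->getStatusCode(), PHP_EOL;
    },
    'rejected' => function ($reason, $index) use ($urls) {
        echo $urls[$index], ' failed: ', $reason->getMessage(), PHP_EOL;
    },
]);

// wait() blocks until every request has completed, which is how a method
// like $crawler->crawl() can keep its synchronous contract while doing
// the HTTP calls in parallel internally.
$pool->promise()->wait();
```

This keeps the public behavior unchanged: callers still see a blocking call and per-page callbacks, but the time spent waiting on IO overlaps across requests instead of accumulating serially.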