Use asynchronous crawling by umpirsky · Pull Request #3 · umpirsky/centipede-crawler

umpirsky · 2014-12-20T08:54:01Z

Currently, the crawler runs synchronously, so it spends most of its time waiting on I/O.
Using an asynchronous HTTP client would make it much more efficient by issuing HTTP requests in parallel.

Note that the asynchronous behavior can be implemented in a fully BC way: $crawler->crawl() could still block until all URLs have been crawled, and the callback-based API can be kept the same.

This means that instead of using Goutte to send the requests, you would need to use Guzzle directly to access its asynchronous API. You could then convert the response back to a BrowserKit response before calling the callback if you want to keep BC (or you could decide to break BC and pass a Guzzle response instead).
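To make the idea above concrete, here is a minimal sketch of a blocking, callback-based crawl built on Guzzle's request pool. It is not the code from this PR: it assumes the Guzzle 6+ API (the Guzzle 5 release current in December 2014 differs), and the function name `crawlAll`, its parameters, and the concurrency default are hypothetical. The callback receives the PSR-7 response, which is the BC-breaking option mentioned above.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // assumes guzzlehttp/guzzle installed via Composer

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
use Psr\Http\Message\ResponseInterface;

/**
 * Hypothetical helper: fetches all URLs concurrently, but only returns once
 * every request has completed, so callers still see a blocking call.
 */
function crawlAll(array $urls, callable $onResponse, ?Client $client = null, int $concurrency = 5): void
{
    $client = $client ?? new Client(['timeout' => 10]);
    $urls = array_values($urls); // integer keys, so the pool index maps back to a URL

    // Lazily yield requests; the pool keeps at most $concurrency in flight.
    $requests = function () use ($urls) {
        foreach ($urls as $index => $url) {
            yield $index => new Request('GET', $url);
        }
    };

    $pool = new Pool($client, $requests(), [
        'concurrency' => $concurrency,
        // Called as soon as an individual response arrives.
        'fulfilled' => function (ResponseInterface $response, $index) use ($urls, $onResponse) {
            $onResponse($urls[$index], $response);
        },
        // Connection failures and, with default options, 4xx/5xx responses end up here.
        'rejected' => function ($reason, $index) use ($urls) {
            $message = $reason instanceof \Throwable ? $reason->getMessage() : 'unknown error';
            fwrite(STDERR, sprintf("Failed: %s (%s)\n", $urls[$index], $message));
        },
    ]);

    // Block until the whole pool has been processed.
    $pool->promise()->wait();
}
```

A caller would pass the list of URLs and a closure such as `function ($url, $response) { echo $url, ' ', $response->getStatusCode(), "\n"; }`. Keeping BC would mean converting each response to a BrowserKit response inside the `fulfilled` handler before invoking the callback.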

umpirsky · 2014-12-15T11:15:47Z

@stof I was thinking about this, and my initial idea was to use ReactPHP.

Your proposal sounds better to me and is definitely one of the next things I want to do on this project.

Thanks!

stof · 2014-12-15T11:45:36Z

ReactPHP would not be a good fit here:

- its event loop is probably overkill here
- its HTTP client does not fully support HTTP (gzip, etc.) because they implement HTTP themselves on top of their asynchronous socket client. Guzzle, on the other hand, relies on curl for the HTTP part, which is feature-complete.

umpirsky · 2014-12-16T14:40:01Z

@stof Thanks for sharing, I was aware of the 2nd point. 👍

umpirsky · 2014-12-20T10:56:59Z

@stof Here is the initial implementation. Feedback is welcome, thanks.

stof · 2014-12-29T11:21:47Z

@umpirsky turning an issue opened by me into a PR from your repo is really weird (and it makes things hard to understand when looking at them later). It would be much better to open a separate PR referencing the issue.

stof · 2014-12-29T11:24:54Z

this does not work, because links can be relative
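For illustration, relative links can be resolved against the URL of the page they were found on. The sketch below uses Symfony's DomCrawler, whose `Link::getUri()` returns an absolute URI. It is not the `UrlExtractor` under review: the function name `extractAbsoluteUrls` and its parameters are hypothetical, and it assumes `symfony/dom-crawler` is installed.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // assumes symfony/dom-crawler installed via Composer

use Symfony\Component\DomCrawler\Crawler;

/**
 * Hypothetical helper: returns the absolute URLs of all links in a page.
 * $pageUrl must be the absolute URL the HTML was fetched from.
 */
function extractAbsoluteUrls(string $html, string $pageUrl): array
{
    // Passing the page URL lets the crawler resolve relative hrefs.
    $crawler = new Crawler($html, $pageUrl);

    $urls = [];
    // XPath is used here so that symfony/css-selector is not required.
    foreach ($crawler->filterXPath('//a[@href]')->links() as $link) {
        // getUri() resolves "/about" or "next" against $pageUrl.
        $urls[] = $link->getUri();
    }

    // Drop duplicates and reindex.
    return array_values(array_unique($urls));
}
```

With the page URL `http://example.com/blog/post`, an `href` of `/about` resolves against the host root and an `href` of `next` resolves against the `/blog/` directory, while absolute links pass through unchanged.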

umpirsky · 2014-12-29T11:34:45Z

@stof Yes, I know, sorry. I merged another branch into this one, so I created a mess. You can focus on 346f2a2.

umpirsky added the enhancement label Dec 15, 2014

umpirsky self-assigned this Dec 19, 2014

umpirsky added 3 commits December 20, 2014 09:53

- Use Guzzle async (346f2a2)
- Fix type hints (f8420af)
- Early callback (97a59ab)

umpirsky added 8 commits December 20, 2014 12:16

- UrlFilter (0642f34)
- FilterInterface (02eef15)
- Remove request method (9b1c849)
- UrlExtractor (c5a36df)
- Error handling (72397d8)
- HostChecker (38c6940)
- Error handling (e078ad3)
- Merge pull request #6 from umpirsky/fix/refactoring: Refactoring (d56ad21)

umpirsky added a commit that referenced this pull request Dec 28, 2014

- Merge pull request #3 from umpirsky/feature/async: Use asynchronous crawling (3c4a72b)

umpirsky merged commit 3c4a72b into master Dec 28, 2014

stof reviewed Dec 29, 2014
Comment thread src/Centipede/Extractor/UrlExtractor.php

this does not work, because links can be relative

umpirsky merged 11 commits into master from feature/async
