Question: crawling in similar domain #612

@FcrbPeter

Hi Pascal,

I am working on a website that spans several domains, such as:

// Domains listed in the start URL section
www.rthk.hk
app3.rthk.hk
app4.rthk.hk
programme.rthk.hk
news.rthk.hk
podcast.rthk.hk
// Domains that need to be crawled but are not listed above
app1.rthk.hk
app2.rthk.hk
... and more ending in "rthk.hk"

In config.xml, I did something like this:

// stayOnDomain = false, because there are other similar domains
// stayOnPort & stayOnProtocol = false, because there are both http and https URLs
<startURLs stayOnDomain="false" stayOnPort="false" stayOnProtocol="false">
<url>http://app3.rthk.hk/search/google/start.php</url>
<url>http://programme.rthk.hk/archivelist_gsa.php?channel=dtt31</url>
<url>https://www.rthk.hk/</url>
<url>https://news.rthk.hk/</url>
<url>http://podcast.rthk.hk/</url>
<url>http://app4.rthk.hk/special/rthkmemory/</url>
<url>http://app4.rthk.hk/elearning/healthpedia/</url>
</startURLs>

<referenceFilters>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
        .*rthk\.hk/.*
</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
        .*rthk\.org\.hk/.*
</filter>

... other exclude filters
</referenceFilters>

I found this solution in past issues.
However, it does not seem to work in my case.

The log below shows an unwanted URL being fetched:

INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html
INFO  [CrawlerEventManager]           REJECTED_FILTER: http://gtob.ningbo.gov.cn/picture/0/1d916bc2a14c46e2999138ed408fecb9.jpg (ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpeg,jpg,svg,gif,png,ico,caseSensitive=false])
INFO  [CrawlerEventManager]           REJECTED_FILTER: http://gtob.ningbo.gov.cn/picture/0/04dd334f9961456586f017a5c44ce7dc.jpg (ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpeg,jpg,svg,gif,png,ico,caseSensitive=false])
INFO  [CrawlerEventManager]           REJECTED_FILTER: http://gtob.ningbo.gov.cn/art/2018/9/5/../../../../module/visitcount/visit.jsp?type=3&i_webid=2&i_columnid=316&i_articleid=944973 (RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=.*\/([^\/]*)\/\1\/\1\/.*])
INFO  [CrawlerEventManager]           REJECTED_FILTER: http://gtob.ningbo.gov.cn/images/10/gmz_dqwz_pic.jpg (ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpeg,jpg,svg,gif,png,ico,caseSensitive=false])
INFO  [CrawlerEventManager]            URLS_EXTRACTED: http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html
INFO  [CrawlerEventManager]           REJECTED_FILTER: http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html (No "include" document filters matched.)
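
As a sanity check on the two include patterns from the config, here is a minimal standalone sketch using plain java.util.regex (not Norconex code). It assumes the filter does a full-string, case-insensitive match, which I have not verified against the RegexReferenceFilter source. The STRICT pattern, the class name, and the example.com and notrthk.hk URLs are made up for illustration only.

```java
import java.util.List;
import java.util.regex.Pattern;

public class IncludeFilterCheck {

    // The two include patterns exactly as written in the config.
    static final List<Pattern> LOOSE = List.of(
            Pattern.compile(".*rthk\\.hk/.*", Pattern.CASE_INSENSITIVE),
            Pattern.compile(".*rthk\\.org\\.hk/.*", Pattern.CASE_INSENSITIVE));

    // Hypothetical stricter alternative: "rthk.hk" or "rthk.org.hk" must be
    // the host (optionally with a subdomain and port), not just appear
    // somewhere in the path or query string.
    static final List<Pattern> STRICT = List.of(
            Pattern.compile("https?://([^/?#]+\\.)?rthk(\\.org)?\\.hk(:\\d+)?/.*",
                    Pattern.CASE_INSENSITIVE));

    // True when at least one pattern matches the whole URL.
    // Assumption: this mirrors how the crawler applies "include" filters.
    static boolean included(List<Pattern> patterns, String url) {
        for (Pattern p : patterns) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://app1.rthk.hk/",
            "https://www.rthk.hk/",
            // Unwanted URL taken from the log.
            "http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html",
            // Made-up URLs that merely contain "rthk.hk/".
            "http://example.com/redirect?to=rthk.hk/page",
            "http://notrthk.hk/"
        };
        // Columns: loose result, strict result, URL.
        for (String url : urls) {
            System.out.printf("%-5b %-5b %s%n",
                    included(LOOSE, url), included(STRICT, url), url);
        }
    }
}
```

Under these assumptions, neither set of patterns matches the gtob.ningbo.gov.cn URL, so the include filters alone do not seem to explain why it was fetched. The loose patterns do accept any URL that contains "rthk.hk/" anywhere, though, which may be broader than intended.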

I would like to ask whether there is anything wrong with my config.

Thanks!
