Jurnalo Crawler Core

This library implements web data extraction based on instance-based learning. Given a set of pages from a web site, the technique takes one labeled page, in which the user marks the items that need to be extracted. The system then stores a limited number of consecutive prefix and suffix tokens (tags) around each item. With these it can extract the target items from any new page of the same site that uses the same template.
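The idea can be sketched in a few lines of plain Ruby. This is only an illustration under stated assumptions, not the library's implementation: the method names tokenize, learn and extract are hypothetical, the "label" is modeled as the index of a single text token, and matching requires the stored prefix and suffix to match exactly.

```ruby
# Split a page into tag tokens and text tokens, dropping whitespace-only text.
def tokenize(html)
  html.scan(/<[^>]+>|[^<]+/).map { |t| t.strip }.reject { |t| t.empty? }
end

# Learn a template from a labeled page: remember up to k tokens
# immediately before and after the labeled item (hypothetical helper).
def learn(tokens, item_index, k)
  first = [item_index - k, 0].max
  {
    :prefix => tokens[first...item_index],
    :suffix => tokens[item_index + 1, k] || []
  }
end

# Find the first token in a new page whose neighbours match the template.
def extract(tokens, template)
  prefix = template[:prefix]
  suffix = template[:suffix]
  tokens.each_index do |i|
    next if i < prefix.length
    before = tokens[i - prefix.length, prefix.length]
    after  = tokens[i + 1, suffix.length] || []
    return tokens[i] if before == prefix && after == suffix
  end
  nil
end

labeled = '<html><body><div class="hd"><h1>Old headline</h1></div><p>Old body</p></body></html>'
fresh   = '<html><body><div class="hd"><h1>New headline</h1></div><p>New body</p></body></html>'

labeled_tokens = tokenize(labeled)
template = learn(labeled_tokens, labeled_tokens.index("Old headline"), 2)
headline = extract(tokenize(fresh), template)
puts headline  # prints "New headline"
```

A larger k makes the template more specific to the page layout, while a smaller k tolerates more variation between pages but may match the wrong position.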

Requirements

Ruby 1.8.6
rspec (1.2.6) [required for specs]
hpricot
ruby-stemmer [required for keywords, has a C extension]
htmlentities [required for keywords and cleaner]
multibyte [required for processing utf8 strings]
activerecord [required for JCore::Story Database Plugin]

Specs

rake specs:all
rake learn:all

rake learn:all creates all the templates in one go.

Scripts

Sources learned: NY Times, Time

NOTE: Run the following commands from the jCore ROOT_DIR.

How to learn?

ruby script/learn -s time
ruby script/learn -s nytimes

These commands learn the templates for Time and NY Times pages respectively. The labeled data is stored in ROOT_DIR/data/labeled_stories. File names follow the convention <source>_ddd.kd+.html, where d is a digit and d+ is one or more digits. For example, nytimes_001.k5.html means that the source is nytimes and the maximum prefix/suffix length for the template is 5.
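As an illustration of the naming convention, the sketch below parses such a file name with a regular expression. The constant LABELED_NAME and the method parse_labeled_name are hypothetical names, not part of jCore, and treating the three digits as a plain number field is an assumption.

```ruby
# <source>_ddd.kd+.html: source name, three digits, then k followed by digits.
LABELED_NAME = /\A(.+)_(\d{3})\.k(\d+)\.html\z/

# Return the parts of a labeled-story file name, or nil if it does not match.
def parse_labeled_name(filename)
  m = LABELED_NAME.match(filename)
  return nil unless m
  { :source => m[1], :number => m[2], :k => m[3].to_i }
end

info = parse_labeled_name("nytimes_001.k5.html")
puts info[:source]  # prints "nytimes"
puts info[:k]       # prints 5
```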

To add more sources, create the labeled data for them and then run the learn script.

How to inspect the template learned?

ruby script/template -s time
ruby script/template -s nytimes

This will show the template structure.

How to extract information?

ruby script/extract -s time -u "http://www.time.com/time/nation/article/0,8599,1915835,00.html"
ruby script/extract -s nytimes -u "http://www.nytimes.com/2009/08/14/opinion/14krugman.html"

This displays the information extracted using the templates. If nothing can be extracted, it reports "Information not found".

How to learn using XPath?

XPath expressions are those supported by Hpricot. In addition, you can apply a few tricks, such as using to_s instead of inner_html, which is the default. You can also delete elements that are not needed. The XPath tag must be a self-closing tag; otherwise it is interpreted as a prefix/suffix extraction label.
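The difference between the two output modes can be shown with a small sketch. To keep it free of gem dependencies it uses REXML rather than Hpricot, so it is an analogy only: in Hpricot, to_s on an element returns the element including its own tags, while inner_html returns only its contents. The helper names outer_markup and inner_markup are hypothetical.

```ruby
require "rexml/document"

# The element together with its own opening and closing tags
# (comparable to Hpricot's to_s).
def outer_markup(node)
  node.to_s
end

# Only what is inside the element (comparable to Hpricot's inner_html).
def inner_markup(node)
  node.children.map { |child| child.to_s }.join
end

doc  = REXML::Document.new("<div><p>Hello <b>world</b></p></div>")
node = REXML::XPath.first(doc, "//p")

puts outer_markup(node)  # prints "<p>Hello <b>world</b></p>"
puts inner_markup(node)  # prints "Hello <b>world</b>"
```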

TODOS: ::delete::, ::select::, match enhancements

How to modify the document before processing?

Sometimes the document produced by a source is malformed and needs to be modified before it can be processed. The modify doc label tells the learner that documents must be altered before they are processed.

Author

Ram Singla
