ohlhaver/jcore

Jurnalo Crawler Core

This library extracts web data using instance-based learning. Given a set of pages from a web site, the technique takes one labeled page (the user labels the items that need to be extracted) and stores a certain number of consecutive prefix and suffix tokens (tags) for each item. It can then extract the target items from any new page on that site that uses the same template.
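The idea can be illustrated with a minimal sketch. This is not JCore's actual code: the names tokenize, learn and extract are hypothetical, and the tokenizer here treats both tags and text runs as tokens, which is a simplifying assumption.

```ruby
# Hypothetical sketch of prefix/suffix template learning; not the JCore API.

# Split HTML into tag tokens and text tokens, dropping whitespace-only runs.
def tokenize(html)
  html.scan(/<[^>]+>|[^<]+/).map { |t| t.strip }.reject { |t| t.empty? }
end

# Learn a template from a labeled page: remember up to k tokens before
# and after the labeled item, which spans tokens[first..last].
def learn(tokens, first, last, k)
  prefix = tokens[[first - k, 0].max...first]
  suffix = tokens[last + 1, k] || []
  { :prefix => prefix, :suffix => suffix }
end

# Extract from a new page: find the stored prefix, then collect tokens
# until the stored suffix appears. Returns nil when nothing matches.
def extract(tokens, template)
  pre = template[:prefix]
  suf = template[:suffix]
  (0..tokens.length - pre.length).each do |i|
    next unless tokens[i, pre.length] == pre
    start = i + pre.length
    (start...tokens.length).each do |j|
      return tokens[start...j].join(" ") if tokens[j, suf.length] == suf
    end
  end
  nil
end

labeled  = tokenize('<html><body><h1 class="t">Old title</h1><p>text</p></body></html>')
template = learn(labeled, 3, 3, 2) # the labeled item is token 3, k = 2
new_page = tokenize('<html><body><h1 class="t">New title</h1><p>other</p></body></html>')
puts extract(new_page, template)   # prints "New title"
```

A larger k makes the template more specific to the page layout, while a smaller k tolerates more variation between pages.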

Requirements

  • Ruby 1.8.6
  • rspec (1.2.6) [required for specs]
  • hpricot
  • ruby-stemmer [required for keywords, has a C extension]
  • htmlentities [required for keywords and cleaner]
  • multibyte [required for processing utf8 strings]
  • activerecord [required for JCore::Story Database Plugin]

Specs

rake specs:all
rake learn:all

rake learn:all creates all the templates in one go.

Scripts

Sources Learned: NY Times, Time

NOTE: Run the following commands from the jCore ROOT_DIR.

How to learn?

ruby script/learn -s time
ruby script/learn -s nytimes

These commands learn templates for Time and NY Times pages respectively. The labeled data is stored in ROOT_DIR/data/labeled_stories. File names follow the convention <source>_ddd.kd+.html, where d represents a digit and d+ represents one or more digits. For example,

nytimes_001.k5.html indicates that the source is nytimes and the maximum prefix/suffix length for the template is 5.

To add more sources, create the labeled data and then run the learn script.
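As an illustration of the naming convention, the following sketch parses such a file name. The helper parse_labeled_filename and the constant LABELED_FILE are hypothetical names, not part of JCore; the regular expression simply mirrors the <source>_ddd.kd+.html pattern described above.

```ruby
# Hypothetical helper, not part of JCore: parses <source>_ddd.kd+.html.
LABELED_FILE = /\A(.+)_(\d{3})\.k(\d+)\.html\z/

# Returns the source name, the three-digit sequence number and k
# (the maximum prefix/suffix length), or nil if the name does not match.
def parse_labeled_filename(path)
  m = LABELED_FILE.match(File.basename(path))
  return nil unless m
  { :source => m[1], :index => m[2].to_i, :k => m[3].to_i }
end

info = parse_labeled_filename("data/labeled_stories/nytimes_001.k5.html")
puts "#{info[:source]} k=#{info[:k]}" # prints "nytimes k=5"
```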

How to inspect the template learned?

ruby script/template -s time
ruby script/template -s nytimes

This will show the template structure.

How to extract information?

ruby script/extract -s time -u "http://www.time.com/time/nation/article/0,8599,1915835,00.html"
ruby script/extract -s nytimes -u "http://www.nytimes.com/2009/08/14/opinion/14krugman.html"

This will display the information extracted using the templates. If nothing can be extracted, it reports "Information not found".

How to learn using XPath?


XPath expressions are those supported by Hpricot. You can also specify a few tricks, such as using to_s rather than inner_html, which is the default. You can also delete elements that are not needed. The XPath tag should be a self-closing tag; otherwise it is interpreted as a prefix/suffix extraction label.
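The difference between the two output modes, and the effect of deleting elements, can be sketched as follows. Note that this example swaps in REXML from the Ruby standard distribution rather than Hpricot, so that it runs without extra gems; the helper names outer_markup, inner_markup and without are hypothetical, and the input must be well-formed XML, which Hpricot does not require.

```ruby
require 'rexml/document'

# Markup of the first match including its own tag (comparable to to_s).
def outer_markup(xml, xpath)
  node = REXML::XPath.first(REXML::Document.new(xml), xpath)
  node && node.to_s
end

# Markup of the first match's children only (comparable to inner_html).
def inner_markup(xml, xpath)
  node = REXML::XPath.first(REXML::Document.new(xml), xpath)
  node && node.children.map { |child| child.to_s }.join
end

# Document markup after deleting every node that matches xpath.
def without(xml, xpath)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, xpath).each { |node| node.remove }
  doc.to_s
end

page = '<div><h1>Hello <b>world</b></h1><script>x()</script></div>'
puts outer_markup(page, '//h1')
puts inner_markup(page, '//h1')
puts without(page, '//script')
```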

TODOS: ::delete::, ::select::, match enhancements

How to modify the document before processing?

Sometimes the document generated by a source is malformed and needs to be modified before we can work on it. The modify doc label notifies the learner that documents need to be altered before they can be processed.

Author

Ram Singla

Copyright

Jurnalo.com (c) 2009

About

experimental webdata extraction
