This library implements web data extraction using instance-based learning. Given a set of pages from a web site, the technique takes a labeled page (the user labels the items that need to be extracted). The system then stores a certain number of consecutive prefix and suffix tokens (tags) of each item. After that, it can extract the target items from any new page on the same web site that uses the same template.
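The idea can be illustrated with a minimal sketch. This is not jCore's actual code: the method names tokenize, learn_template and extract are hypothetical, the tokenizer is a simple regular expression rather than a real HTML parser, and the sketch assumes a single labeled item per page.

```ruby
# Hypothetical sketch of prefix/suffix learning; not the jCore implementation.

# Split HTML into tag tokens and text tokens (naive, regex based).
def tokenize(html)
  html.scan(/<[^>]+>|[^<]+/).map { |t| t.strip }.reject { |t| t.empty? }
end

# Remember up to k tokens before and after the labeled item,
# which spans tokens[first..last] on the labeled page.
def learn_template(tokens, first, last, k)
  from = [first - k, 0].max
  { :prefix => tokens[from...first], :suffix => tokens[last + 1, k] || [] }
end

# Find the stored prefix on a new page, then return the tokens up to
# the first following occurrence of the stored suffix (nil if absent).
def extract(tokens, template)
  prefix = template[:prefix]
  suffix = template[:suffix]
  (0..tokens.length - prefix.length).each do |i|
    next unless tokens[i, prefix.length] == prefix
    start = i + prefix.length
    (start..tokens.length - suffix.length).each do |j|
      return tokens[start...j].join(" ") if tokens[j, suffix.length] == suffix
    end
  end
  nil
end

labeled  = tokenize('<html><body><div class="hd"><h1>Old title</h1></div><p>text</p></body></html>')
item     = labeled.index("Old title")
template = learn_template(labeled, item, item, 2)

new_page = tokenize('<html><body><div class="hd"><h1>New title</h1></div><p>other</p></body></html>')
puts extract(new_page, template) || "Information not found"
```

A larger k makes the template more specific to the page layout, while a smaller k makes it more tolerant of small differences between pages.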
- Ruby 1.8.6
- rspec (1.2.6) [required for specs]
- hpricot
- ruby-stemmer [required for keywords, has a c extension]
- htmlentities [required for keywords and cleaner]
- multibyte [required for processing utf8 strings]
- activerecord [required for JCore::Story Database Plugin]
rake specs:all
rake learn:all
rake learn:all creates the templates for all sources in one go.
Sources Learned: NY Times, Time
NOTE: Run the following commands from the jCore ROOT_DIR.
ruby script/learn -s time
ruby script/learn -s nytimes
This will learn the Time and NY Times pages respectively.
The labeled data is stored in ROOT_DIR/data/labeled_stories.
File names follow the convention <source>_ddd.kd+.html, where
d is a digit and d+ is one or more digits. For example,
nytimes_001.k5.html means that the source is nytimes and the maximum
prefix/suffix length for the template is 5.
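As an illustration of the naming convention, a file name could be parsed as in the sketch below. The constant LABELED_FILE and the method parse_labeled_filename are hypothetical and not part of jCore, and the sketch assumes source names consist of lowercase letters and digits only.

```ruby
# Hypothetical helper; not part of jCore.
# Matches <source>_ddd.kd+.html, e.g. nytimes_001.k5.html
LABELED_FILE = /\A([a-z0-9]+)_(\d{3})\.k(\d+)\.html\z/

# Return the source name, file index and maximum prefix/suffix
# length encoded in the file name, or nil if it does not match.
def parse_labeled_filename(name)
  m = LABELED_FILE.match(name)
  return nil unless m
  { :source => m[1], :index => m[2].to_i, :k => m[3].to_i }
end

info = parse_labeled_filename("nytimes_001.k5.html")
puts "#{info[:source]}: file #{info[:index]}, max prefix/suffix length #{info[:k]}"
```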
To add more sources, create the labeled data for them and then run the learn script.
ruby script/template -s time
ruby script/template -s nytimes
This will show the template structure.
ruby script/extract -s time -u "http://www.time.com/time/nation/article/0,8599,1915835,00.html"
ruby script/extract -s nytimes -u "http://www.nytimes.com/2009/08/14/opinion/14krugman.html"
This will display the information extracted using the templates. If nothing can be extracted, it reports "Information not found".
XPath expressions are the expressions supported by Hpricot. In addition, you can apply some tricks, such as using to_s rather than inner_html, which is the default. You can also delete elements that are not needed. The XPath tag must be a self-closing tag; otherwise it is interpreted as a prefix/suffix extraction label.
TODOS: ::delete::, ::select::, match enhancements
Sometimes the document generated by a source is not well formed and needs to be modified before we can work on it. The modify doc label tells the learner that documents need to be altered before they can be processed.
Ram Singla
Jurnalo.com (c) 2009