Diffstat (limited to 'vendor/fguillot/picofeed/docs/grabber.markdown')
-rw-r--r-- | vendor/fguillot/picofeed/docs/grabber.markdown | 97 |
1 files changed, 97 insertions, 0 deletions
diff --git a/vendor/fguillot/picofeed/docs/grabber.markdown b/vendor/fguillot/picofeed/docs/grabber.markdown
new file mode 100644
index 000000000..6a7dd2ada
--- /dev/null
+++ b/vendor/fguillot/picofeed/docs/grabber.markdown
@@ -0,0 +1,97 @@
+Web scraper
+===========
+
+The web scraper is useful for feeds that display only a summary of articles, the scraper can download and parse the full content from the original website.
+
+How the content grabber works?
+------------------------------
+
+1. Try with rules first (xpath patterns) for the domain name (see `PicoFeed\Rules\`)
+2. Try to find the text content by using common attributes for class and id
+3. Finally, if nothing is found, the feed content is displayed
+
+**The best results are obtained with Xpath rules file.**
+
+How to use the content scraper?
+-------------------------------
+
+```php
+use PicoFeed\Reader;
+
+$reader = new Reader;
+$reader->download('http://www.egscomics.com/rss.php');
+
+$parser = $reader->getParser();
+
+if ($parser !== false) {
+
+    $parser->enableContentGrabber(); // <= Enable the content grabber
+    $feed = $parser->execute();
+    // ...
+}
+```
+
+When the content scraper is enabled, everything will be slower.
+For each item a new HTTP request is made and the HTML downloaded is parsed with XML/Xpath.
+
+Configuration
+-------------
+
+### Enable content grabber for items
+
+- Method name: `enableContentGrabber()`
+- Default value: false (content grabber is disabled by default)
+- Argument value: none
+
+```php
+$parser->enableContentGrabber();
+```
+
+### Ignore item urls for the content grabber
+
+- Method name: `setGrabberIgnoreUrls()`
+- Default value: empty (fetch all item urls)
+- Argument value: array (list of item urls to ignore)
+
+```php
+$parser->setGrabberIgnoreUrls(['http://foo', 'http://bar']);
+```
+
+How to write a grabber rules file?
+----------------------------------
+
+Add a PHP file to the directory `PicoFeed\Rules`, the filename must be the same as the domain name:
+
+Example with the BBC website, `www.bbc.co.uk.php`:
+
+```php
+<?php
+return array(
+    'test_url' => 'http://www.bbc.co.uk/news/world-middle-east-23911833',
+    'body' => array(
+        '//div[@class="story-body"]',
+    ),
+    'strip' => array(
+        '//script',
+        '//form',
+        '//style',
+        '//*[@class="story-date"]',
+        '//*[@class="story-header"]',
+        '//*[@class="story-related"]',
+        '//*[contains(@class, "byline")]',
+        '//*[contains(@class, "story-feature")]',
+        '//*[@id="video-carousel-container"]',
+        '//*[@id="also-related-links"]',
+        '//*[contains(@class, "share") or contains(@class, "hidden") or contains(@class, "hyper")]',
+    )
+);
+```
+
+Actually, only `body`, `strip` and `test_url` are supported.
+
+Don't forget to send a pull request or a ticket to share your contribution with everybody,
+
+List of content grabber rules
+-----------------------------
+
+Rules are stored inside the directory [lib/PicoFeed/Rules](https://github.com/fguillot/picoFeed/tree/master/lib/PicoFeed/Rules)