Diffstat (limited to 'vendor/fguillot/picofeed/docs/grabber.markdown')
-rw-r--r-- vendor/fguillot/picofeed/docs/grabber.markdown 97
1 files changed, 97 insertions, 0 deletions
diff --git a/vendor/fguillot/picofeed/docs/grabber.markdown b/vendor/fguillot/picofeed/docs/grabber.markdown
new file mode 100644
index 000000000..6a7dd2ada
--- /dev/null
+++ b/vendor/fguillot/picofeed/docs/grabber.markdown
@@ -0,0 +1,97 @@
+Web scraper
+===========
+
+The web scraper is useful for feeds that display only a summary of each article: it can download and parse the full content from the original website.
+
+How does the content grabber work?
+----------------------------------
+
+1. Try the rules (XPath patterns) defined for the domain name first (see `PicoFeed\Rules\`)
+2. Otherwise, try to find the text content by using common values of the `class` and `id` attributes
+3. Finally, if nothing is found, the feed content is displayed
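+
+The order above can be sketched as follows. This is only an illustration under stated assumptions: `grabContent()` and the generic `class`/`id` queries are hypothetical and are not part of PicoFeed's API.
+
+```php
+<?php
+// Hypothetical helper, NOT PicoFeed's implementation: it only shows the fallback order.
+function grabContent($html, array $ruleQueries, $feedContent)
+{
+    // Collect libxml warnings instead of printing them (real-world HTML is rarely valid).
+    libxml_use_internal_errors(true);
+
+    $dom = new DOMDocument();
+    if ($html === '' || !$dom->loadHTML($html)) {
+        return $feedContent;
+    }
+    libxml_clear_errors();
+
+    $xpath = new DOMXPath($dom);
+
+    // Step 1: domain-specific XPath rules come first.
+    // Step 2: generic guesses on class and id values (assumed list, for illustration).
+    $candidates = array_merge($ruleQueries, array(
+        '//*[@id="content"]',
+        '//*[contains(@class, "post-content")]',
+    ));
+
+    foreach ($candidates as $query) {
+        $nodes = $xpath->query($query);
+        if ($nodes !== false && $nodes->length > 0) {
+            return $dom->saveHTML($nodes->item(0));
+        }
+    }
+
+    // Step 3: nothing matched, so keep the content provided by the feed.
+    return $feedContent;
+}
+```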
+
+**The best results are obtained with an XPath rules file.**
+
+How to use the content scraper?
+-------------------------------
+
+```php
+use PicoFeed\Reader;
+
+$reader = new Reader;
+$reader->download('http://www.egscomics.com/rss.php');
+
+$parser = $reader->getParser();
+
+if ($parser !== false) {
+
+    $parser->enableContentGrabber(); // <= Enable the content grabber
+    $feed = $parser->execute();
+    // ...
+}
+```
+
+When the content scraper is enabled, processing a feed is noticeably slower.
+For each item, a new HTTP request is made and the downloaded HTML is parsed with XML/XPath.
+
+Configuration
+-------------
+
+### Enable content grabber for items
+
+- Method name: `enableContentGrabber()`
+- Default value: false (content grabber is disabled by default)
+- Argument value: none
+
+```php
+$parser->enableContentGrabber();
+```
+
+### Ignore item URLs for the content grabber
+
+- Method name: `setGrabberIgnoreUrls()`
+- Default value: empty (fetch all item URLs)
+- Argument value: array (list of item URLs to ignore)
+
+```php
+$parser->setGrabberIgnoreUrls(['http://foo', 'http://bar']);
+```
+
+How to write a grabber rules file?
+----------------------------------
+
+Add a PHP file to the directory `PicoFeed\Rules`. The filename must be the domain name followed by the `.php` extension.
+
+Example for the BBC website, `www.bbc.co.uk.php`:
+
+```php
+<?php
+return array(
+    'test_url' => 'http://www.bbc.co.uk/news/world-middle-east-23911833',
+    'body' => array(
+        '//div[@class="story-body"]',
+    ),
+    'strip' => array(
+        '//script',
+        '//form',
+        '//style',
+        '//*[@class="story-date"]',
+        '//*[@class="story-header"]',
+        '//*[@class="story-related"]',
+        '//*[contains(@class, "byline")]',
+        '//*[contains(@class, "story-feature")]',
+        '//*[@id="video-carousel-container"]',
+        '//*[@id="also-related-links"]',
+        '//*[contains(@class, "share") or contains(@class, "hidden") or contains(@class, "hyper")]',
+    )
+);
+```
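+
+As a rough illustration of the naming convention, the sketch below derives the rules filename from an item URL. `rulesFilenameForUrl()` is a hypothetical helper, and it assumes the lookup uses the exact host name of the URL; PicoFeed's own lookup may differ.
+
+```php
+<?php
+// Hypothetical helper: maps an item URL to a rules filename named after its host.
+function rulesFilenameForUrl($url)
+{
+    $host = parse_url($url, PHP_URL_HOST);
+
+    // parse_url() returns null or false when no host can be extracted.
+    if (!is_string($host) || $host === '') {
+        return null;
+    }
+
+    return strtolower($host) . '.php';
+}
+
+echo rulesFilenameForUrl('http://www.bbc.co.uk/news/world-middle-east-23911833');
+// www.bbc.co.uk.php
+```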
+
+Currently, only `body`, `strip` and `test_url` are supported.
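+
+To give an idea of what the `body` and `strip` keys mean, here is a minimal sketch that applies a rules array with PHP's DOM extension. `applyRules()` is a hypothetical function written for this example; it is an assumption about how such rules can be applied, not PicoFeed's actual code.
+
+```php
+<?php
+// Hypothetical helper, NOT PicoFeed's implementation.
+function applyRules($html, array $rules)
+{
+    // Collect libxml warnings instead of printing them.
+    libxml_use_internal_errors(true);
+
+    $dom = new DOMDocument();
+    if ($html === '' || !$dom->loadHTML($html)) {
+        return '';
+    }
+    libxml_clear_errors();
+
+    $xpath = new DOMXPath($dom);
+    $strip = isset($rules['strip']) ? $rules['strip'] : array();
+    $body = isset($rules['body']) ? $rules['body'] : array();
+
+    // 'strip': remove every node matched by these queries.
+    foreach ($strip as $query) {
+        $nodes = $xpath->query($query);
+        if ($nodes === false) {
+            continue;
+        }
+        // Copy the node list to an array before removing nodes from the tree.
+        foreach (iterator_to_array($nodes) as $node) {
+            if ($node->parentNode !== null) {
+                $node->parentNode->removeChild($node);
+            }
+        }
+    }
+
+    // 'body': keep the nodes matched by the first query that returns something.
+    foreach ($body as $query) {
+        $nodes = $xpath->query($query);
+        if ($nodes === false || $nodes->length === 0) {
+            continue;
+        }
+        $content = '';
+        foreach ($nodes as $node) {
+            $content .= $dom->saveHTML($node);
+        }
+        return $content;
+    }
+
+    return '';
+}
+```
+
+In this sketch, `test_url` is not used: it only records a page against which the rules can be checked.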
+
+Don't forget to send a pull request or open a ticket to share your contribution with everybody.
+
+List of content grabber rules
+-----------------------------
+
+Rules are stored inside the directory [lib/PicoFeed/Rules](https://github.com/fguillot/picoFeed/tree/master/lib/PicoFeed/Rules).