Web scraper
===========

The web scraper is useful for feeds that provide only a summary of each article: it can download and parse the full content from the original website.

How does the content grabber work?
----------------------------------

1. Try the XPath rules defined for the domain name first (see `PicoFeed\Rules\`)
2. Try to find the text content by using common values of the `class` and `id` attributes
3. Finally, if nothing is found, fall back to the content provided by the feed

**The best results are obtained with an XPath rules file.** The lookup order is sketched below.

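The fallback chain can be pictured roughly as follows. This is a simplified sketch for illustration only: `findContentByRules()` and `findContentByHeuristics()` are hypothetical names, not part of the library's API.

```php
<?php

// 1. Domain-specific XPath rules, when a rule file exists (hypothetical helper)
$content = findContentByRules($url, $html);

// 2. Heuristics based on common class/id attribute values (hypothetical helper)
if ($content === '') {
    $content = findContentByHeuristics($html);
}

// 3. Fall back to the content shipped inside the feed itself
if ($content === '') {
    $content = $feedItemContent;
}
```
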
Standalone usage
----------------

Fetch remote content:

```php
<?php

use PicoFeed\Config\Config;
use PicoFeed\Scraper\Scraper;

$config = new Config;

// $url is the address of the web page to scrape
$grabber = new Scraper($config);
$grabber->setUrl($url);
$grabber->execute();

// Get raw HTML content
echo $grabber->getRawContent();

// Get relevant content
echo $grabber->getRelevantContent();

// Get filtered relevant content
echo $grabber->getFilteredContent();

// Return true if there is relevant content
var_dump($grabber->hasRelevantContent());
```

Parse HTML content that has already been downloaded:

```php
<?php

use PicoFeed\Config\Config;
use PicoFeed\Scraper\Scraper;

$config = new Config;

// $html contains the HTML document to analyze
$grabber = new Scraper($config);
$grabber->setRawContent($html);
$grabber->execute();
```

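In this case `$html` can come from any HTTP client; a minimal sketch using PHP's built-in `file_get_contents()` (not a PicoFeed API):

```php
<?php

// Fetch the page yourself, then hand the markup to the scraper
$html = file_get_contents('http://www.bbc.co.uk/news/world-middle-east-23911833');
```
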
Fetch full item contents during feed parsing
--------------------------------------------

Before parsing the items, just call the method `$parser->enableContentGrabber()`:

```php
<?php

use PicoFeed\Reader\Reader;
use PicoFeed\PicoFeedException;

try {
    $reader = new Reader;

    // Return a resource
    $resource = $reader->download('http://www.egscomics.com/rss.php');

    // Return the right parser instance according to the feed format
    $parser = $reader->getParser(
        $resource->getUrl(),
        $resource->getContent(),
        $resource->getEncoding()
    );

    // Enable the content grabber before parsing the items
    $parser->enableContentGrabber();

    // Return a Feed object
    $feed = $parser->execute();
}
catch (PicoFeedException $e) {
    // Do something...
}
```
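
Once parsed, the grabbed text is exposed through each item's content. A short usage sketch, assuming the `Feed::getItems()` and `Item::getContent()` accessors:

```php
foreach ($feed->getItems() as $item) {
    // Full article content when the grabber succeeded,
    // otherwise the summary shipped with the feed
    echo $item->getContent();
}
```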

When the content grabber is enabled, parsing is much slower:
**a new HTTP request is made for each item**, and the downloaded HTML is parsed with XML/XPath.

Configuration
-------------

### Enable content grabber for items

- Method name: `enableContentGrabber()`
- Default value: `false` (fetch content even if no rule file exists)
- Argument value: bool (`true` = scrape only web pages that have a rule file)

```php
$parser->enableContentGrabber(false);
```

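For example, to restrict scraping to pages covered by a rule file, pass `true`:

```php
$parser->enableContentGrabber(true);
```
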
### Ignore item URLs for the content grabber

- Method name: `setGrabberIgnoreUrls()`
- Default value: empty (fetch all item URLs)
- Argument value: array (list of item URLs to ignore)

```php
$parser->setGrabberIgnoreUrls(['http://foo', 'http://bar']);
```

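Both settings can be combined with the reader flow shown above; a minimal sketch:

```php
<?php

use PicoFeed\Reader\Reader;

$reader = new Reader;
$resource = $reader->download('http://www.egscomics.com/rss.php');

$parser = $reader->getParser(
    $resource->getUrl(),
    $resource->getContent(),
    $resource->getEncoding()
);

// Scrape only pages that have a rule file, and skip these URLs entirely
$parser->enableContentGrabber(true);
$parser->setGrabberIgnoreUrls(['http://foo', 'http://bar']);

$feed = $parser->execute();
```
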
How to write a grabber rules file
---------------------------------

Add a PHP file to the directory `PicoFeed\Rules`; the filename must be the same as the domain name.

Example with the BBC website, `www.bbc.co.uk.php`:

```php
<?php
return array(
    'grabber' => array(
        '%.*%' => array(
            'test_url' => 'http://www.bbc.co.uk/news/world-middle-east-23911833',
            'body' => array(
                '//div[@class="story-body"]',
            ),
            'strip' => array(
                '//script',
                '//form',
                '//style',
                '//*[@class="story-date"]',
                '//*[@class="story-header"]',
                '//*[@class="story-related"]',
                '//*[contains(@class, "byline")]',
                '//*[contains(@class, "story-feature")]',
                '//*[@id="video-carousel-container"]',
                '//*[@id="also-related-links"]',
                '//*[contains(@class, "share") or contains(@class, "hidden") or contains(@class, "hyper")]',
            )
        )
    )
);
```

Each rule file can contain multiple rules, so links to different URLs of the same website can be handled differently. The first-level key is a regex that is matched against the full path of the URL using **preg_match()**; e.g. for **http://www.bbc.co.uk/news/world-middle-east-23911833?test=1** the string that is matched is **/news/world-middle-east-23911833?test=1**.

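To illustrate the matching, a standalone sketch (plain PHP, not a library call):

```php
<?php

// The string handed to preg_match() is the path plus the query string
$path = '/news/world-middle-east-23911833?test=1';

var_dump((bool) preg_match('%.*%', $path));          // true: the catch-all rule matches
var_dump((bool) preg_match('%^/videos.*%', $path));  // false: the path does not start with /videos
```
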
Each rule has the following keys:

* **body**: an array of XPath expressions selecting the content to extract from the page
* **strip**: an array of XPath expressions selecting elements to remove from the matched content
* **test_url**: the URL of a matching page, used to test the rule

Don't forget to send a pull request or open a ticket to share your contribution with everybody.

**A more complex example:**

Let's say you want to extract the div with the id **video** when the article points to a URL like **http://comix.com/videos/423**, the div with the id **audio** when it points to a URL like **http://comix.com/podcasts/5**, and the div with the id **content** for all other links. The following rule file fits that requirement and would be stored as **lib/PicoFeed/Rules/comix.com.php**:

```php
<?php
return array(
    'grabber' => array(
        '%^/videos.*%' => array(
            'test_url' => 'http://comix.com/videos/423',
            'body' => array(
                '//div[@id="video"]',
            ),
            'strip' => array()
        ),
        '%^/podcasts.*%' => array(
            'test_url' => 'http://comix.com/podcasts/5',
            'body' => array(
                '//div[@id="audio"]',
            ),
            'strip' => array()
        ),
        '%.*%' => array(
            'test_url' => 'http://comix.com/blog/1',
            'body' => array(
                '//div[@id="content"]',
            ),
            'strip' => array()
        )
    )
);
```

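To check a new rule file, you can run the standalone scraper against the rule's `test_url`; a minimal sketch reusing the API shown earlier:

```php
<?php

use PicoFeed\Config\Config;
use PicoFeed\Scraper\Scraper;

$config = new Config;

$grabber = new Scraper($config);
$grabber->setUrl('http://comix.com/videos/423'); // the rule's test_url
$grabber->execute();

// Should print the content selected by //div[@id="video"]
echo $grabber->getFilteredContent();
```
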
List of content grabber rules
-----------------------------

Rules are stored inside the directory [lib/PicoFeed/Rules](https://github.com/fguillot/picoFeed/tree/master/lib/PicoFeed/Rules).