path: root/vendor/fguillot/picofeed/docs/grabber.markdown
Diffstat (limited to 'vendor/fguillot/picofeed/docs/grabber.markdown')
-rw-r--r-- vendor/fguillot/picofeed/docs/grabber.markdown 109
1 files changed, 84 insertions, 25 deletions
diff --git a/vendor/fguillot/picofeed/docs/grabber.markdown b/vendor/fguillot/picofeed/docs/grabber.markdown
index b99b756ed..4ac83068f 100644
--- a/vendor/fguillot/picofeed/docs/grabber.markdown
+++ b/vendor/fguillot/picofeed/docs/grabber.markdown
@@ -15,23 +15,41 @@ How the content grabber works?
Standalone usage
----------------
+Fetch remote content:
+
```php
<?php
-use PicoFeed\Client\Grabber;
+use PicoFeed\Config\Config;
+use PicoFeed\Scraper\Scraper;
+
+$config = new Config;
-$grabber = new Grabber($item_url);
-$grabber->download();
-$grabber->parse();
+$grabber = new Scraper($config);
+$grabber->setUrl($url);
+$grabber->execute();
// Get raw HTML content
echo $grabber->getRawContent();
// Get relevant content
-echo $grabber->getContent();
+echo $grabber->getRelevantContent();
// Get filtered relevant content
echo $grabber->getFilteredContent();
+
+// Return true if there is relevant content
+var_dump($grabber->hasRelevantContent());
+```
+
+Parse HTML content:
+
+```php
+<?php
+
+$grabber = new Scraper($config);
+$grabber->setRawContent($html);
+$grabber->execute();
```
Fetch full item contents during feed parsing
@@ -79,11 +97,11 @@ Configuration
### Enable content grabber for items
- Method name: `enableContentGrabber()`
-- Default value: false (content grabber is disabled by default)
-- Argument value: none
+- Default value: false (content is also fetched if no rule file exists)
+- Argument value: bool (if true, scrape only webpages that have a rule file)
```php
-$parser->enableContentGrabber();
+$parser->enableContentGrabber(false);
```
### Ignore item urls for the content grabber
@@ -106,30 +124,71 @@ Example with the BBC website, `www.bbc.co.uk.php`:
```php
<?php
return array(
- 'test_url' => 'http://www.bbc.co.uk/news/world-middle-east-23911833',
- 'body' => array(
- '//div[@class="story-body"]',
- ),
- 'strip' => array(
- '//script',
- '//form',
- '//style',
- '//*[@class="story-date"]',
- '//*[@class="story-header"]',
- '//*[@class="story-related"]',
- '//*[contains(@class, "byline")]',
- '//*[contains(@class, "story-feature")]',
- '//*[@id="video-carousel-container"]',
- '//*[@id="also-related-links"]',
- '//*[contains(@class, "share") or contains(@class, "hidden") or contains(@class, "hyper")]',
+ 'grabber' => array(
+ '%.*%' => array(
+ 'test_url' => 'http://www.bbc.co.uk/news/world-middle-east-23911833',
+ 'body' => array(
+ '//div[@class="story-body"]',
+ ),
+ 'strip' => array(
+ '//script',
+ '//form',
+ '//style',
+ '//*[@class="story-date"]',
+ '//*[@class="story-header"]',
+ '//*[@class="story-related"]',
+ '//*[contains(@class, "byline")]',
+ '//*[contains(@class, "story-feature")]',
+ '//*[@id="video-carousel-container"]',
+ '//*[@id="also-related-links"]',
+ '//*[contains(@class, "share") or contains(@class, "hidden") or contains(@class, "hyper")]',
+ )
+ )
)
);
```
+Each rule file can contain multiple rules, so that links to different URLs of the same website can be handled differently. The first-level key is a regex, which is matched against the full path of the URL (including the query string) using **preg_match**. For example, for **http://www.bbc.co.uk/news/world-middle-east-23911833?test=1** the string that is matched is **/news/world-middle-east-23911833?test=1**
-Actually, only `body`, `strip` and `test_url` are supported.
+Each rule has the following keys:
+* **body**: An array of XPath expressions which will be extracted from the page
+* **strip**: An array of XPath expressions which will be removed from the matched content
+* **test_url**: A test URL of a matching page, used to test the grabber
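
To illustrate the roles of `body` and `strip`, here is a minimal sketch using PHP's standard DOM extension (`DOMDocument`, `DOMXPath`). It is not PicoFeed's actual implementation: the `$html` sample and the variable names are made up for the example, and for simplicity it strips unwanted nodes from the whole document before extracting the body, which gives the same result for this input.

```php
<?php

// Hypothetical sample page, not taken from a real website
$html = '<html><body>'
    .'<div class="story-body"><p>Hello</p><script>track();</script></div>'
    .'<div class="footer">Footer</div>'
    .'</body></html>';

// Same shape as a rule: XPath expressions to keep and to remove
$rule = array(
    'body' => array('//div[@class="story-body"]'),
    'strip' => array('//script'),
);

libxml_use_internal_errors(true); // do not emit warnings for imperfect HTML
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Remove every node matched by a "strip" expression
foreach ($rule['strip'] as $query) {
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        $node->parentNode->removeChild($node);
    }
}

// Concatenate the HTML of every node matched by a "body" expression
$content = '';
foreach ($rule['body'] as $query) {
    foreach ($xpath->query($query) as $node) {
        $content .= $dom->saveHTML($node);
    }
}

echo $content, "\n";
```

The script element is dropped and the footer div is never selected, so only the story paragraph and its wrapping div remain.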
Don't forget to send a pull request or a ticket to share your contribution with everybody,
+**A more complex example**:
+
+Let's say you want to extract the div with the id **video** if the article points to a URL like **http://comix.com/videos/423**, the div with the id **audio** if it points to a URL like **http://comix.com/podcasts/5**, and the div with the id **content** for all other links to the site. The following rule file would fit that requirement and would be stored in a file called **lib/PicoFeed/Rules/comix.com.php**:
+
+
+```php
+return array(
+ 'grabber' => array(
+ '%^/videos.*%' => array(
+ 'test_url' => 'http://comix.com/videos/423',
+ 'body' => array(
+ '//div[@id="video"]',
+ ),
+ 'strip' => array()
+ ),
+ '%^/podcasts.*%' => array(
+ 'test_url' => 'http://comix.com/podcasts/5',
+ 'body' => array(
+ '//div[@id="audio"]',
+ ),
+ 'strip' => array()
+ ),
+ '%.*%' => array(
+ 'test_url' => 'http://comix.com/blog/1',
+ 'body' => array(
+ '//div[@id="content"]',
+ ),
+ 'strip' => array()
+ )
+ )
+);
+```
+
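
The rule selection described above can be sketched roughly as follows. This is an illustration only, not PicoFeed's code: `selectRule()` is a hypothetical helper, and it assumes that patterns are tried in array order and that the first match wins, which is why the catch-all `%.*%` pattern comes last.

```php
<?php

// Hypothetical helper, not part of PicoFeed: returns the first rule whose
// regex key matches the URL path (plus query string), or null if none match.
function selectRule(array $ruleFile, $url)
{
    $parts = parse_url($url);

    // Build the string to match: path, then "?query" when a query is present
    $subUrl = isset($parts['path']) ? $parts['path'] : '/';
    if (isset($parts['query'])) {
        $subUrl .= '?'.$parts['query'];
    }

    // Assumption: patterns are tested in order, first match wins
    foreach ($ruleFile['grabber'] as $pattern => $rule) {
        if (preg_match($pattern, $subUrl) === 1) {
            return $rule;
        }
    }

    return null;
}

// Trimmed-down version of the comix.com rule file shown above
$ruleFile = array(
    'grabber' => array(
        '%^/videos.*%' => array('body' => array('//div[@id="video"]'), 'strip' => array()),
        '%^/podcasts.*%' => array('body' => array('//div[@id="audio"]'), 'strip' => array()),
        '%.*%' => array('body' => array('//div[@id="content"]'), 'strip' => array()),
    ),
);

$video = selectRule($ruleFile, 'http://comix.com/videos/423');
$podcast = selectRule($ruleFile, 'http://comix.com/podcasts/5?page=2');
$other = selectRule($ruleFile, 'http://comix.com/blog/1');

echo $video['body'][0], "\n";   // //div[@id="video"]
echo $podcast['body'][0], "\n"; // //div[@id="audio"]
echo $other['body'][0], "\n";   // //div[@id="content"]
```

Because the regex is applied to the path and query only, the scheme and host never need to appear in a pattern; the host is already implied by the rule file name.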
List of content grabber rules
-----------------------------