update deps

author: Bernhard Posselt <dev@bernhard-posselt.com> 2015-04-30 18:30:11 +0200
committer: Bernhard Posselt <dev@bernhard-posselt.com> 2015-04-30 18:30:11 +0200
commit: eb28c3b137c8a0d61377087c9a04b820151b0b7c (patch)
tree: c1ebf149f43fa653a4ef1c3f33df04557094e834 /vendor/fguillot/picofeed/docs
parent: 2e54780c1496bfa39cd035b9ac40ed851d2198f1 (diff)
2 files changed, 105 insertions, 25 deletions
diff --git a/vendor/fguillot/picofeed/docs/feed-parsing.markdown b/vendor/fguillot/picofeed/docs/feed-parsing.markdown
index 1ee21451d..8ab2dac01 100644
--- a/vendor/fguillot/picofeed/docs/feed-parsing.markdown
+++ b/vendor/fguillot/picofeed/docs/feed-parsing.markdown
@@ -215,6 +215,27 @@ catch (PicoFeedException $e) {
 }
 ```
 
+Custom regex filters
+--------------------
+If you want to modify the content with a simple regex, you can create a rule file named after the domain of the feed's link attribute. For a feed whose link points to **http://www.twogag.com/**, the file is stored under **Rules/twogag.com.php**.
+
+For filtering, only the array with the key **filter** is considered. Each first-level key is a regex that is matched with preg\_match against the sub-URL, e.g. to match only a feed whose link attribute points to **twogag.com/test**, the regex could look like **%/test.*%**. The second-level array contains search and replace pairs, which are passed to the preg\_replace function. In each pair, the key is the pattern to match and the value is the replacement.
+
+To replace all occurrences of links to smaller images for twogag, the following rule can be used:
+
+
+```php
+<?php
+return array(
+    'filter' => array(
+        '%.*%' => array(
+            "%http://www.twogag.com/comics-rss/([^.]+)\\.jpg%" =>
+            "http://www.twogag.com/comics/$1.jpg"
+        )
+    )
+);
+```
+
 Feed and item properties
 ------------------------
 
diff --git a/vendor/fguillot/picofeed/docs/grabber.markdown b/vendor/fguillot/picofeed/docs/grabber.markdown
index b99b756ed..4ac83068f 100644
--- a/vendor/fguillot/picofeed/docs/grabber.markdown
+++ b/vendor/fguillot/picofeed/docs/grabber.markdown
@@ -15,23 +15,41 @@ How the content grabber works?
 Standalone usage
 ----------------
 
+Fetch remote content:
+
 ```php
 <?php
 
-use PicoFeed\Client\Grabber;
+use PicoFeed\Config\Config;
+use PicoFeed\Scraper\Scraper;
+
+$config = new Config;
 
-$grabber = new Grabber($item_url);
-$grabber->download();
-$grabber->parse();
+$grabber = new Scraper($config);
+$grabber->setUrl($url);
+$grabber->execute();
 
 // Get raw HTML content
 echo $grabber->getRawContent();
 
 // Get relevant content
-echo $grabber->getContent();
+echo $grabber->getRelevantContent();
 
 // Get filtered relevant content
 echo $grabber->getFilteredContent();
+
+// Return true if there is relevant content
+var_dump($grabber->hasRelevantContent());
+```
+
+Parse HTML content:
+
+```php
+<?php
+
+$grabber = new Scraper($config);
+$grabber->setRawContent($html);
+$grabber->execute();
 ```
 
 Fetch full item contents during feed parsing
@@ -79,11 +97,11 @@ Configuration
 ### Enable content grabber for items
 
 - Method name: `enableContentGrabber()`
-- Default value: false (content grabber is disabled by default)
-- Argument value: none
+- Default value: false (content is also fetched when no rule file exists)
+- Argument value: bool (if true, only webpages that have a rule file are scraped)
 
 ```php
-$parser->enableContentGrabber();
+$parser->enableContentGrabber(false);
 ```
 
 ### Ignore item urls for the content grabber
@@ -106,30 +124,71 @@ Example with the BBC website, `www.bbc.co.uk.php`:
 ```php
 <?php
 return array(
-    'test_url' => 'http://www.bbc.co.uk/news/world-middle-east-23911833',
-    'body' => array(
-        '//div[@class="story-body"]',
-    ),
-    'strip' => array(
-        '//script',
-        '//form',
-        '//style',
-        '//*[@class="story-date"]',
-        '//*[@class="story-header"]',
-        '//*[@class="story-related"]',
-        '//*[contains(@class, "byline")]',
-        '//*[contains(@class, "story-feature")]',
-        '//*[@id="video-carousel-container"]',
-        '//*[@id="also-related-links"]',
-        '//*[contains(@class, "share") or contains(@class, "hidden") or contains(@class, "hyper")]',
+    'grabber' => array(
+        '%.*%' => array(
+            'test_url' => 'http://www.bbc.co.uk/news/world-middle-east-23911833',
+            'body' => array(
+                '//div[@class="story-body"]',
+            ),
+            'strip' => array(
+                '//script',
+                '//form',
+                '//style',
+                '//*[@class="story-date"]',
+                '//*[@class="story-header"]',
+                '//*[@class="story-related"]',
+                '//*[contains(@class, "byline")]',
+                '//*[contains(@class, "story-feature")]',
+                '//*[@id="video-carousel-container"]',
+                '//*[@id="also-related-links"]',
+                '//*[contains(@class, "share") or contains(@class, "hidden") or contains(@class, "hyper")]',
+            )
+        )
     )
 );
 ```
+Each rule file can contain multiple rules, so that links to different URLs on the same website can be handled differently. Each first-level key is a regex, which is matched against the full path of the URL using **preg_match**, e.g. for **http://www.bbc.co.uk/news/world-middle-east-23911833?test=1** the string that is matched is **/news/world-middle-east-23911833?test=1**.
 
-Actually, only `body`, `strip` and `test_url` are supported.
+Each rule has the following keys:
+* **body**: An array of XPath expressions selecting the elements to extract from the page
+* **strip**: An array of XPath expressions selecting the elements to remove from the matched content
+* **test_url**: The URL of a matching page, used to test the grabber
 
 Don't forget to send a pull request or a ticket to share your contribution with everybody.
 
+**A more complex example**:
+
+Let's say you want to extract the div with the id **video** if the article points to a URL like **http://comix.com/videos/423**, and the div with the id **audio** if it points to a URL like **http://comix.com/podcasts/5**. All other links to the site should instead use the div with the id **content**. The following rule file meets that requirement and would be stored in a file called **lib/PicoFeed/Rules/comix.com.php**:
+
+
+```php
+return array(
+    'grabber' => array(
+        '%^/videos.*%' => array(
+            'test_url' => 'http://comix.com/videos/423',
+            'body' => array(
+                '//div[@id="video"]',
+            ),
+            'strip' => array()
+        ),
+        '%^/podcasts.*%' => array(
+            'test_url' => 'http://comix.com/podcasts/5',
+            'body' => array(
+                '//div[@id="audio"]',
+            ),
+            'strip' => array()
+        ),
+        '%.*%' => array(
+            'test_url' => 'http://comix.com/blog/1',
+            'body' => array(
+                '//div[@id="content"]',
+            ),
+            'strip' => array()
+        )
+    )
+);
+```
+
 List of content grabber rules
 -----------------------------