author | Bernhard Posselt <dev@bernhard-posselt.com> | 2015-04-30 18:30:11 +0200 |
---|---|---|
committer | Bernhard Posselt <dev@bernhard-posselt.com> | 2015-04-30 18:30:11 +0200 |
commit | eb28c3b137c8a0d61377087c9a04b820151b0b7c (patch) | |
tree | c1ebf149f43fa653a4ef1c3f33df04557094e834 /vendor/fguillot/picofeed/docs | |
parent | 2e54780c1496bfa39cd035b9ac40ed851d2198f1 (diff) |
update deps
Diffstat (limited to 'vendor/fguillot/picofeed/docs')
-rw-r--r-- | vendor/fguillot/picofeed/docs/feed-parsing.markdown | 21 |
-rw-r--r-- | vendor/fguillot/picofeed/docs/grabber.markdown | 109 |
2 files changed, 105 insertions, 25 deletions
diff --git a/vendor/fguillot/picofeed/docs/feed-parsing.markdown b/vendor/fguillot/picofeed/docs/feed-parsing.markdown
index 1ee21451d..8ab2dac01 100644
--- a/vendor/fguillot/picofeed/docs/feed-parsing.markdown
+++ b/vendor/fguillot/picofeed/docs/feed-parsing.markdown
@@ -215,6 +215,27 @@ catch (PicoFeedException $e) {
 }
 ```
 
+Custom regex filters
+--------------------
+In case you want modify the content with a simple regex, you can create a rule file named after the domain of the feed's link attribute. For the feed pointing to **http://www.twogag.com/** the file is stored under **Rules/twogag.com.php**
+
+For filtering, only the array with the key **filter** will be considered. The first level key is a preg_match regex that will match the sub url, e.g. to only match a feed whose link attribute points to **twogag.com/test**, the regex could look like **%/test.*%**. The second level array contains a list of search and replace strings, which will be passed to the preg\_replace function. The first string is the argument that should be matched, the second is the replacement.
+
+To replace all occurences of links to smaller images for twogag, the following rule can be used:
+
+
+```php
+<?php
+return array(
+    'filter' => array(
+        '%.*%' => array(
+            "%http://www.twogag.com/comics-rss/([^.]+)\\.jpg%" =>
+                "http://www.twogag.com/comics/$1.jpg"
+        )
+    )
+);
+```
+
 Feed and item properties
 ------------------------
 
diff --git a/vendor/fguillot/picofeed/docs/grabber.markdown b/vendor/fguillot/picofeed/docs/grabber.markdown
index b99b756ed..4ac83068f 100644
--- a/vendor/fguillot/picofeed/docs/grabber.markdown
+++ b/vendor/fguillot/picofeed/docs/grabber.markdown
@@ -15,23 +15,41 @@ How the content grabber works?
 Standalone usage
 ----------------
 
+Fetch remote content:
+
 ```php
 <?php
 
-use PicoFeed\Client\Grabber;
+use PicoFeed\Config\Config;
+use PicoFeed\Scraper\Scraper;
+
+$config = new Config;
 
-$grabber = new Grabber($item_url);
-$grabber->download();
-$grabber->parse();
+$grabber = new Scraper($config)
+$grabber->setUrl($url);
+$grabber->execute();
 
 // Get raw HTML content
 echo $grabber->getRawContent();
 
 // Get relevant content
-echo $grabber->getContent();
+echo $grabber->getRelevantContent();
 
 // Get filtered relevant content
 echo $grabber->getFilteredContent();
+
+// Return true if there is relevant content
+var_dump($grabber->hasRelevantContent());
+```
+
+Parse HTML content:
+
+```php
+<?php
+
+$grabber = new Scraper($config);
+$grabber->setRawContent($html);
+$grabber->execute();
 ```
 
 Fetch full item contents during feed parsing
@@ -79,11 +97,11 @@ Configuration
 ### Enable content grabber for items
 
 - Method name: `enableContentGrabber()`
-- Default value: false (content grabber is disabled by default)
-- Argument value: none
+- Default value: false (also fetch content if no rule file exist)
+- Argument value: bool (true scrape only webpages which have a rule file)
 
 ```php
-$parser->enableContentGrabber();
+$parser->enableContentGrabber(false);
 ```
 
 ### Ignore item urls for the content grabber
@@ -106,30 +124,71 @@ Example with the BBC website, `www.bbc.co.uk.php`:
 ```php
 <?php
 return array(
-    'test_url' => 'http://www.bbc.co.uk/news/world-middle-east-23911833',
-    'body' => array(
-        '//div[@class="story-body"]',
-    ),
-    'strip' => array(
-        '//script',
-        '//form',
-        '//style',
-        '//*[@class="story-date"]',
-        '//*[@class="story-header"]',
-        '//*[@class="story-related"]',
-        '//*[contains(@class, "byline")]',
-        '//*[contains(@class, "story-feature")]',
-        '//*[@id="video-carousel-container"]',
-        '//*[@id="also-related-links"]',
-        '//*[contains(@class, "share") or contains(@class, "hidden") or contains(@class, "hyper")]',
+    'grabber' => array(
+        '%.*%' => array(
+            'test_url' => 'http://www.bbc.co.uk/news/world-middle-east-23911833',
+            'body' => array(
+                '//div[@class="story-body"]',
+            ),
+            'strip' => array(
+                '//script',
+                '//form',
+                '//style',
+                '//*[@class="story-date"]',
+                '//*[@class="story-header"]',
+                '//*[@class="story-related"]',
+                '//*[contains(@class, "byline")]',
+                '//*[contains(@class, "story-feature")]',
+                '//*[@id="video-carousel-container"]',
+                '//*[@id="also-related-links"]',
+                '//*[contains(@class, "share") or contains(@class, "hidden") or contains(@class, "hyper")]',
+            )
+        )
     )
 );
 ```
+Each rule file can contain multiple rules, based so links to different website URLs can be handled differently. The first level key is a regex, which will be matched against the full path of the URL using **preg_match**, e.g. for **http://www.bbc.co.uk/news/world-middle-east-23911833?test=1** the URL that would be matched is **/news/world-middle-east-23911833?test=1**
 
-Actually, only `body`, `strip` and `test_url` are supported.
+Each rule has the following keys:
+* **body**: An array of xpath expressions which will be extracted from the page
+* **strip**: An array of xpath expressions which will be removed from the matched content
+* **test_url**: A test url to a matching page to test the grabber
 
 Don't forget to send a pull request or a ticket to share your contribution with everybody,
 
+**A more complex example**:
+
+Let's say you wanted to extract a div with the id **video** if the article points to an URL like **http://comix.com/videos/423**, **audio** if the article points to an URL like **http://comix.com/podcasts/5** and all other links to the page should instead take the div with the id **content**. The following rulefile would fit that requirement and would be stored in a file called **lib/PicoFeed/Rules/comix.com.php**:
+
+
+```php
+return array(
+    'grabber' => array(
+        '%^/videos.*%' => array(
+            'test_url' => 'http://comix.com/videos/423',
+            'body' => array(
+                '//div[@id="video"]',
+            ),
+            'strip' => array()
+        ),
+        '%^/podcasts.*%' => array(
+            'test_url' => 'http://comix.com/podcasts/5',
+            'body' => array(
+                '//div[@id="audio"]',
+            ),
+            'strip' => array()
+        ),
+        '%.*%' => array(
+            'test_url' => 'http://comix.com/blog/1',
+            'body' => array(
+                '//div[@id="content"]',
+            ),
+            'strip' => array()
+        )
+    )
+);
+```
+
 List of content grabber rules
 -----------------------------
 
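
The **filter** rules documented in feed-parsing.markdown pair a first-level regex, matched against the sub URL with preg_match, with search and replace strings passed to preg_replace. The following is only a minimal sketch of that matching logic, assuming PHP 7 or later; it is not PicoFeed's actual implementation. `subUrl()` and `applyFilterRules()` are hypothetical helpers written for illustration, and reducing a URL to path plus query string is an assumption drawn from the BBC example in grabber.markdown.

```php
<?php

// Hypothetical helper (not PicoFeed API): reduce a URL to its path plus
// query string, e.g. "/news/world-middle-east-23911833?test=1".
function subUrl(string $url): string
{
    $parts = parse_url($url);
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $query = isset($parts['query']) ? '?' . $parts['query'] : '';

    return $path . $query;
}

// Hypothetical helper (not PicoFeed API): for every first-level regex that
// matches the sub URL, run each search/replace pair through preg_replace.
function applyFilterRules(array $rules, string $url, string $content): string
{
    if (!isset($rules['filter'])) {
        return $content;
    }

    $subUrl = subUrl($url);

    foreach ($rules['filter'] as $urlPattern => $replacements) {
        if (preg_match($urlPattern, $subUrl) !== 1) {
            continue;
        }

        foreach ($replacements as $search => $replace) {
            $content = preg_replace($search, $replace, $content);
        }
    }

    return $content;
}

// Same rule as the twogag example, written with single quotes so that the
// backslashes and "$1" reach the regex engine unchanged.
$rules = array(
    'filter' => array(
        '%.*%' => array(
            '%http://www\.twogag\.com/comics-rss/([^.]+)\.jpg%' =>
                'http://www.twogag.com/comics/$1.jpg',
        ),
    ),
);

// The feed link and image name below are made-up sample values.
$html = '<img src="http://www.twogag.com/comics-rss/2015-04-30.jpg">';

echo applyFilterRules($rules, 'http://www.twogag.com/archives/1', $html), PHP_EOL;
```

In this sketch every matching first-level key is applied in array order rather than stopping at the first match; whether the library stops at the first match is not stated in the documentation above.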