Web scraper
===========
The web scraper is useful for feeds that display only a summary of each article: it can download and parse the full content from the original website.

How does the content grabber work?
----------------------------------
1. Try the rules (XPath patterns) defined for the domain name first (see `PicoFeed\Rules\`)
2. Try to find the text content by using common values of the `class` and `id` attributes
3. Finally, if nothing is found, the feed content is displayed

**The best results are obtained with an XPath rules file.**
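
The three steps above can be sketched with PHP's built-in DOM extension. This is only an illustration of the fallback order, not PicoFeed's actual implementation: the function name `grabContent()` and the list of generic `class`/`id` patterns are assumptions made for the example.

```php
<?php

/**
 * Hypothetical sketch of the three-step fallback (not PicoFeed's real code).
 *
 * @param string   $html        HTML downloaded from the original website
 * @param string[] $rules       XPath patterns for the domain (may be empty)
 * @param string   $feedContent Content found in the feed item itself
 */
function grabContent(string $html, array $rules, string $feedContent): string
{
    $dom = new DOMDocument();

    // Real-world HTML is rarely well-formed, so parser warnings are silenced
    if ($html === '' || !@$dom->loadHTML($html)) {
        return $feedContent;
    }

    $xpath = new DOMXPath($dom);

    // Step 1: domain rules come first.
    // Step 2: generic guesses on common class and id values (assumed list).
    $candidates = array_merge($rules, [
        '//*[contains(@class, "entry-content")]',
        '//*[contains(@class, "post-content")]',
        '//*[@id="content"]',
    ]);

    foreach ($candidates as $pattern) {
        $nodes = $xpath->query($pattern);

        if ($nodes !== false && $nodes->length > 0) {
            $content = $dom->saveHTML($nodes->item(0));

            if ($content !== false) {
                return $content;
            }
        }
    }

    // Step 3: nothing matched, keep the content provided by the feed
    return $feedContent;
}
```

Because the domain rules are tried before the generic patterns, a precise rule always wins over a guess, which is why a rules file tends to give the best results.
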
How to use the content scraper?
-------------------------------
```php
use PicoFeed\Reader;

$reader = new Reader;
$reader->download('http://www.egscomics.com/rss.php');

$parser = $reader->getParser();

if ($parser !== false) {
    $parser->enableContentGrabber(); // <= Enable the content grabber
    $feed = $parser->execute();
    // ...
}
```

Enabling the content scraper slows down feed processing: for each item, a new HTTP request is made and the downloaded HTML is parsed with XML/XPath.

Configuration
-------------
### Enable content grabber for items
- Method name: `enableContentGrabber()`
- Default value: false (content grabber is disabled by default)
- Argument value: none
```php
$parser->enableContentGrabber();
```
### Ignore item URLs for the content grabber
- Method name: `setGrabberIgnoreUrls()`
- Default value: empty (fetch all item URLs)
- Argument value: array (list of item URLs to ignore)
```php
$parser->setGrabberIgnoreUrls(['http://foo', 'http://bar']);
```
How to write a grabber rules file?
----------------------------------
Add a PHP file to the directory `PicoFeed\Rules`. The filename must be the same as the domain name.

Example with the BBC website, `www.bbc.co.uk.php`:
```php
<?php
return array(
    'test_url' => 'http://www.bbc.co.uk/news/world-middle-east-23911833',
    'body' => array(
        '//div[@class="story-body"]',
    ),
    'strip' => array(
        '//script',
        '//form',
        '//style',
        '//*[@class="story-date"]',
        '//*[@class="story-header"]',
        '//*[@class="story-related"]',
        '//*[contains(@class, "byline")]',
        '//*[contains(@class, "story-feature")]',
        '//*[@id="video-carousel-container"]',
        '//*[@id="also-related-links"]',
        '//*[contains(@class, "share") or contains(@class, "hidden") or contains(@class, "hyper")]',
    )
);
```

Currently, only `body`, `strip` and `test_url` are supported.

Don't forget to send a pull request or open a ticket to share your contribution with everybody.
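
To make the role of the `body` and `strip` keys more concrete, here is a minimal sketch of how such a rules array could be applied to a page with `DOMXPath`. The helpers `ruleFilename()` and `applyRules()` are hypothetical names invented for this example, and stripping before extracting is an assumption; PicoFeed's own implementation may differ.

```php
<?php

/**
 * Hypothetical helper: derive the rules filename from an item URL,
 * following the "filename equals domain name" convention.
 */
function ruleFilename(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);

    return is_string($host) ? strtolower($host) . '.php' : null;
}

/**
 * Hypothetical helper: apply the `strip` and `body` patterns to a page.
 * Returns the extracted HTML, or an empty string if no `body` pattern matches.
 */
function applyRules(string $html, array $rules): string
{
    $dom = new DOMDocument();

    // Parser warnings are silenced because real pages are rarely valid HTML
    if ($html === '' || !@$dom->loadHTML($html)) {
        return '';
    }

    $xpath = new DOMXPath($dom);

    // Remove every node matched by a `strip` pattern (assumed to run first)
    foreach ($rules['strip'] ?? [] as $pattern) {
        $nodes = $xpath->query($pattern);

        if ($nodes === false) {
            continue; // invalid XPath expression, skip it
        }

        // Work on an array copy so that removals cannot disturb the iteration
        foreach (iterator_to_array($nodes) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Return the first node matched by a `body` pattern
    foreach ($rules['body'] ?? [] as $pattern) {
        $nodes = $xpath->query($pattern);

        if ($nodes !== false && $nodes->length > 0) {
            $content = $dom->saveHTML($nodes->item(0));

            return $content === false ? '' : $content;
        }
    }

    return '';
}
```

A quick way to check a new rules file with these helpers is to download its `test_url`, pass the HTML and the rules array to `applyRules()`, and verify that the result contains the article text without the stripped elements.
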
List of content grabber rules
-----------------------------
Rules are stored inside the directory [lib/PicoFeed/Rules](https://github.com/fguillot/picoFeed/tree/master/lib/PicoFeed/Rules)