vendor/fguillot/picofeed/docs/grabber.markdown


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136

Web scraper
===========

The web scraper is useful for feeds that display only a summary of articles, the scraper can download and parse the full content from the original website.

How the content grabber works?
------------------------------

1. Try with rules first (XPath queries) for the domain name (see `PicoFeed\Rules\`)
2. Try to find the text content by using common attributes for class and id
3. Finally, if nothing is found, the feed content is displayed

**The best results are obtained with XPath rules file.**

Standalone usage
----------------

```php
<?php

use PicoFeed\Client\Grabber;

$grabber = new Grabber($item_url);
$grabber->download();
$grabber->parse();

// Get raw HTML content
echo $grabber->getRawContent();

// Get relevant content
echo $grabber->getContent();

// Get filtered relevant content
echo $grabber->getFilteredContent();
```

Fetch full item contents during feed parsing
--------------------------------------------

Before parsing all items, just call the method `$parser->enableContentGrabber()`:

```php
<?php

use PicoFeed\Reader\Reader;
use PicoFeed\PicoFeedException;

try {

    $reader = new Reader;

    // Return a resource
    $resource = $reader->download('http://www.egscomics.com/rss.php');

    // Return the right parser instance according to the feed format
    $parser = $reader->getParser(
        $resource->getUrl(),
        $resource->getContent(),
        $resource->getEncoding()
    );

    // Enable content grabber before parsing items
    $parser->enableContentGrabber();

    // Return a Feed object
    $feed = $parser->execute();
}
catch (PicoFeedException $e) {
    // Do Something...
}
```

When the content scraper is enabled, everything will be slower.
**For each item a new HTTP request is made** and the HTML downloaded is parsed with XML/XPath.

Configuration
-------------

### Enable content grabber for items

- Method name: `enableContentGrabber()`
- Default value: false (content grabber is disabled by default)
- Argument value: none

```php
$parser->enableContentGrabber();
```

### Ignore item urls for the content grabber

- Method name: `setGrabberIgnoreUrls()`
- Default value: empty (fetch all item urls)
- Argument value: array (list of item urls to ignore)

```php
$parser->setGrabberIgnoreUrls(['http://foo', 'http://bar']);
```

How to write a grabber rules file?
----------------------------------

Add a PHP file to the directory `PicoFeed\Rules`, the filename must be the same as the domain name:

Example with the BBC website, `www.bbc.co.uk.php`:

```php
<?php
return array(
    'test_url' => 'http://www.bbc.co.uk/news/world-middle-east-23911833',
    'body' => array(
        '//div[@class="story-body"]',
    ),
    'strip' => array(
        '//script',
        '//form',
        '//style',
        '//*[@class="story-date"]',
        '//*[@class="story-header"]',
        '//*[@class="story-related"]',
        '//*[contains(@class, "byline")]',
        '//*[contains(@class, "story-feature")]',
        '//*[@id="video-carousel-container"]',
        '//*[@id="also-related-links"]',
        '//*[contains(@class, "share") or contains(@class, "hidden") or contains(@class, "hyper")]',
    )
);
```

Actually, only `body`, `strip` and `test_url` are supported.

Don't forget to send a pull request or a ticket to share your contribution with everybody,

List of content grabber rules
-----------------------------

Rules are stored inside the directory [lib/PicoFeed/Rules](https://github.com/fguillot/picoFeed/tree/master/lib/PicoFeed/Rules)