Full-Text RSS — from FiveFilters.org

Create full-text feed from feed or webpage URL

Quick start

  1. Check server compatibility to make sure this server meets the requirements
  2. Enter a feed or article URL in the form above and click 'Create Feed'
  3. If the generated full-text feed looks okay, copy the URL from your browser's address bar and use it in your news reader or application
  4. That's it! (Although see below if you'd like to customise further.)

Configure

In addition to the options above, Full-Text RSS comes with a configuration file which allows you to control how the application works. Find out more.

Features include:

  • Site patterns for better control over extraction (more info)
  • Restrict access to those with an access key and/or to a pre-defined set of URLs
  • Restrict the maximum number of feed items to be processed
  • Prepend or append an HTML fragment to each feed item processed
  • Caching

To change the configuration, save a copy of config.php as custom_config.php and make any changes you like to it. If custom_config.php already exists, edit that file instead.

Manage and update site config files

For best results, we suggest you update the site config files bundled with Full-Text RSS.

The easiest way to update these is via the admin area. (For advanced users, you'll also be able to edit and test the extraction rules contained in the site config files from the admin area.)

Customise this page

If everything works fine, feel free to modify this page by following the steps below:

  1. Save a copy of index.php as custom_index.php
  2. Edit custom_index.php

Next time you load this page, it will automatically load custom_index.php instead.

Support

Check our help centre if you need help. You can also email us at help@fivefilters.org.

Thank you!

Thanks for downloading and setting up Full-Text RSS. This software is developed and maintained by FiveFilters.org. If you find it useful, but have not purchased this from us, please consider supporting us by purchasing from FiveFilters.org.

Request and Response

The details on this page are mainly intended for developers who'd like to use Full-Text RSS for article extraction and feed conversion. News enthusiasts who simply want to subscribe to a full-text feed in their news reading application can safely ignore the details here and use the form above.

This page describes the two endpoints offered by Full-Text RSS: Article Extraction and Feed Conversion. If you've restricted access to Full-Text RSS, the final section on API keys will tell you how to pass your key along in the request.


1. Article Extraction

To extract article content from a web page and get a simple JSON response, use the following endpoint:

  • /extract.php?url=[url]

Request Parameters

When making HTTP requests, you can pass the following parameters to extract.php in a GET or POST request.

Note: for many of these parameters, the configuration file will ultimately determine if and how they can be used.

url
  Value: string (URL)
  Description: This is the only required parameter. It should be the URL to a standard HTML page. You can omit the 'http://' prefix if you like.

inputhtml
  Value: string (HTML)
  Description: If you already have the HTML, you can pass it here. We will not make any HTTP requests for the content if this parameter is used. Note: The input HTML should be UTF-8 encoded. You will still need to give us the URL associated with the content (the URL may determine how the content is extracted, if we have extraction rules associated with it).

content
  Value: 0, 1 (default), html5
  Description: If set to 0, the extracted content will not be included in the output. If set to html5, we'll output HTML5.

links
  Value: preserve (default), footnotes, remove
  Description: Links can either be preserved, made into footnotes, or removed. None of these options affect the link text, only the hyperlink itself.

images
  Value: 1 (default), 0
  Description: If set to 0, images and associated elements (img, figure, figcaption) will be removed from the output.

xss
  Value: 0, 1 (default)
  Description: Use this to enable/disable XSS filtering. It is enabled by default, but if your application/framework/CMS already filters HTML for XSS vulnerabilities, you can disable XSS filtering here.

  If enabled, we'll pass retrieved HTML content through htmLawed (safe flag on and style attributes denied). Note: when enabled this will remove certain elements you may want to preserve, such as iframes.

lang
  Value: 0, 1 (default), 2, 3
  Description: Language detection. If you'd like Full-Text RSS to find the language of the articles it processes, you can use one of the following values:

    0: Ignore language
    1: Use article metadata (e.g. HTML lang attribute) (default value)
    2: As above, but guess the language if it's not specified.
    3: Always guess the language, whether it's specified or not.

debug
  Value: [no value], rawhtml, parsedhtml
  Description: If this parameter is present, Full-Text RSS will output the steps it is taking behind the scenes to help you debug problems.

  If the parameter value is rawhtml, Full-Text RSS will output the HTTP response (headers and body) of the first response after redirects.

  If the parameter value is parsedhtml, Full-Text RSS will output the reconstructed HTML (after its own parsing). This version is what the extraction rules are applied to, and it may differ from the original (rawhtml) output. If your extraction rules are not picking out any elements, this will likely help identify the problem.

  Note: Full-Text RSS will stop execution after the HTML output if either rawhtml or parsedhtml is passed. Otherwise it will continue showing debug output until the end.

parser
  Value: html5php, libxml
  Description: The default parser is libxml as it's the fastest. HTML5-PHP is an HTML5 parser implemented in PHP. It's slower than libxml, but can often produce better results. You can request HTML5-PHP be used as the parser in a site-specific config file (to ensure it gets used for all URLs for that site), or explicitly via this request parameter. Note: if the Gumbo PHP extension is available, that will be used regardless of this parameter or site config file directives.

siteconfig
  Value: string
  Description: Site-specific extraction rules are usually stored in text files in the site_config folder. You can also submit extraction rules directly in your request using this parameter.

proxy
  Value: 0, 1, string (proxy name)
  Description: This parameter has no effect if proxy servers have not been entered in the config file. If they have been entered and enabled, you can pass the following values: 0 to disable proxy use (uses a direct connection), 1 for default proxy behaviour (whatever is set in the config), or a string to identify a specific proxy server (it has to match the name given to the proxy in the config file).
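
As a rough sketch of how the extract.php parameters above might be combined, the Python snippet below builds (but does not send) either a GET request or, when you already have the HTML, a POST request carrying inputhtml. The base URL and the function name build_extract_request are assumptions made for illustration; substitute the location of your own installation.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical install location; replace with your own Full-Text RSS URL.
BASE_URL = "http://example.org/full-text-rss/"


def build_extract_request(page_url, html=None, **options):
    """Build (but do not send) a request for extract.php.

    Without `html`, the parameters go in the query string of a GET request.
    With `html`, they are form-encoded into a POST body together with
    `inputhtml`, so the server does not have to fetch the page itself.
    """
    params = {"url": page_url, **options}
    if html is None:
        return Request(BASE_URL + "extract.php?" + urlencode(params))
    params["inputhtml"] = html  # should be UTF-8 encoded HTML
    body = urlencode(params).encode("utf-8")
    return Request(BASE_URL + "extract.php", data=body, method="POST")


# Example: ask for content with hyperlinks removed and images stripped.
req = build_extract_request(
    "http://example.org/article.html", links="remove", images=0
)
print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would perform the actual call. Whether a given option is honoured still depends on the configuration file.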

Response (example)

Simple JSON output containing extracted article title, content, and more. It was produced from the following input URL: http://www.truthdig.com/report/print/make_america_ungovernable_20170205

{
    "title": "Make America Ungovernable",
    "excerpt": "By Chris Hedges Mr. Fish / Truthdig Donald Trump’s regime is rapidl…",
    "date": "2017-02-05T23:34:57+00:00",
    "author": null,
    "language": "en",
    "url": "http://www.truthdig.com/report/item/make_america_ungovernable_20170…",
    "effective_url": "http://www.truthdig.com/report/print/make_america_ungovernable_2017…",
    "domain": "truthdig.com",
    "word_count": 2284,
    "og_url": "http://www.truthdig.com/report/print/make_america_ungovernable_2017…",
    "og_title": "Make America Ungovernable: Chris Hedges",
    "og_description": "The window to overthrow the Trump regime is rapidly closing. We mus…",
    "og_image": null,
    "og_type": "article",
    "twitter_card": null,
    "twitter_site": "@truthdig",
    "twitter_creator": "@truthdig",
    "twitter_image": null,
    "twitter_title": "Make America Ungovernable | Truthdig: Drilling Beneath the Headline…",
    "twitter_description": "The window to overthrow the Trump regime is rapidly closing. We mus…",
    "content": "<h4 class=\"date\">Posted on Feb 5</h4>…"
}

Note: For brevity the output above is truncated.
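
A minimal sketch of reading such a response in Python follows. The field names come from the example output above; the sample values, the fallback for a null author, and the helper name summarise_article are illustrative assumptions.

```python
import json


def summarise_article(response_text):
    """Pick a few fields out of an extract.php JSON response."""
    article = json.loads(response_text)
    return {
        "title": article.get("title"),
        # "author" can be null in the response, so fall back to a label.
        "author": article.get("author") or "unknown",
        "language": article.get("language"),
        "word_count": article.get("word_count", 0),
        "has_content": bool(article.get("content")),
    }


# Stand-in for a real response body (placeholder values).
sample_response = json.dumps({
    "title": "Make America Ungovernable",
    "author": None,
    "language": "en",
    "word_count": 2284,
    "content": "<h4 class=\"date\">Posted on Feb 5</h4>",
})

print(summarise_article(sample_response))
```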


2. Feed Conversion

To transform a partial feed to a full-text feed, pass the URL (encoded) in the querystring to the following URL:

  • /makefulltextfeed.php?url=[url]

All the parameters in the form at the top of this page can be passed in this way. Examine the URL in the address bar after you click 'Create Feed' to see the values.

Request Parameters

When making HTTP requests, you can pass the following parameters to makefulltextfeed.php in a GET request. Most of these parameters have default values suitable for news enthusiasts who simply want to subscribe to a full-text feed in their news reading application. If that's what you're doing, you can safely ignore the details here. For developers, or others who need more control over the output produced by Full-Text RSS, this section should give you an idea of what you can do.

We do not provide form fields for all of these parameters, but you can modify the URL in your browser after clicking 'Create Feed' to use them.

Note: for many of these parameters, the configuration file will ultimately determine if and how they can be used.

url
  Value: string (URL)
  Description: This is the only required parameter. It should be the URL to a partial feed or a standard HTML page. You can omit the 'http://' prefix if you like.

format
  Value: rss (default), json
  Description: The default Full-Text RSS output is RSS. The only other valid output format is JSON. To get JSON output, pass format=json in the querystring. Exclude it from the URL (or set it to ‘rss’) if you’d like RSS.

summary
  Value: 0 (default), 1
  Description: If set to 1, an excerpt will be included for each item in the output.

content
  Value: 0, 1 (default), html5
  Description: If set to 0, the extracted content will not be included in the output. If set to html5, we'll output HTML5.

links
  Value: preserve (default), footnotes, remove
  Description: Links can either be preserved, made into footnotes, or removed. None of these options affect the link text, only the hyperlink itself.

images
  Value: 1 (default), 0
  Description: If set to 0, images and associated elements (img, figure, figcaption) will be removed from the output.

exc
  Value: 0 (default), 1
  Description: If Full-Text RSS fails to extract the article body, the generated feed item will include a message saying extraction failed, followed by the original item description (if present in the original feed). You can ask Full-Text RSS to remove such items from the generated feed completely by passing 1 in this parameter.

accept
  Value: auto (default), feed, html
  Description: Tell Full-Text RSS what it should expect when fetching the input URL. By default Full-Text RSS tries to guess whether the response is a feed or a regular HTML page. It's a good idea to be explicit by passing the appropriate type in this parameter. This is useful if, for example, a feed stops working and begins to return HTML or redirects to an HTML page as a result of site changes. In such a scenario, if you've been explicit about the URL being a feed, Full-Text RSS will not parse HTML returned in response. If you pass accept=html (previously html=1), Full-Text RSS will not attempt to parse the response as a feed. This increases performance slightly and should be used if you know that the URL is not a feed.

  Note: If excluded, or set to auto, Full-Text RSS first tries to parse the server's response as a feed, and only if it fails to parse as a feed will it revert to HTML parsing. In the default parse-as-feed-first mode, Full-Text RSS will identify itself as PHP first and only if a valid feed is returned will it identify itself as a browser in subsequent requests to fetch the feed items. In parse-as-html mode, Full-Text RSS will identify itself as a browser from the very first request.

xss
  Value: 0 (default), 1
  Description: Use this to enable XSS filtering. We have not enabled this by default because we assume the majority of our users do not display the HTML retrieved by Full-Text RSS in a web page without further processing. If you subscribe to our generated feeds in your news reader application, it should, if it's good software, already filter the resulting HTML for XSS attacks, making it redundant for Full-Text RSS to do the same. The same applies to frameworks/CMSs which display feed content: the content should be treated like any other user-submitted content.

  If you are writing an application yourself which is processing feeds generated by Full-Text RSS, you can either filter the HTML yourself to remove potential XSS attacks or enable this option. This might be useful if you are processing our generated feeds with JavaScript on the client side, although client-side XSS filtering is available too.

  If enabled, we'll pass retrieved HTML content through htmLawed (safe flag on and style attributes denied). Note: if enabled this will also remove certain elements you may want to preserve, such as iframes.

callback
  Value: string
  Description: This is for JSONP use. If you're requesting JSON output, you can also specify a callback function (a client-side JavaScript function) to receive the Full-Text RSS JSON output.

lang
  Value: 0, 1 (default), 2, 3
  Description: Language detection. If you'd like Full-Text RSS to find the language of the articles it processes, you can use one of the following values:

    0: Ignore language
    1: Use article metadata (e.g. HTML lang attribute) or feed metadata. (default value)
    2: As above, but guess the language if it's not specified.
    3: Always guess the language, whether it's specified or not.

  If language detection is enabled and a match is found, the language code will be returned in the <dc:language> element inside the <item> element.

debug
  Value: [no value], rawhtml, parsedhtml
  Description: If this parameter is present, Full-Text RSS will output the steps it is taking behind the scenes to help you debug problems.

  If the parameter value is rawhtml, Full-Text RSS will output the HTTP response (headers and body) of the first response after redirects.

  If the parameter value is parsedhtml, Full-Text RSS will output the reconstructed HTML (after its own parsing). This version is what the extraction rules are applied to, and it may differ from the original (rawhtml) output. If your extraction rules are not picking out any elements, this will likely help identify the problem.

  Note: Full-Text RSS will stop execution after the HTML output if either rawhtml or parsedhtml is passed. Otherwise it will continue showing debug output until the end.

parser
  Value: html5php, libxml
  Description: The default parser is libxml as it's the fastest. HTML5-PHP is an HTML5 parser implemented in PHP. It's slower than libxml, but can often produce better results. You can request HTML5-PHP be used as the parser in a site-specific config file (to ensure it gets used for all URLs for that site), or explicitly via this request parameter. Note: if the Gumbo PHP extension is available, that will be used regardless of this parameter or site config file directives.

siteconfig
  Value: string
  Description: Site-specific extraction rules are usually stored in text files in the site_config folder. You can also submit extraction rules directly in your request using this parameter.

proxy
  Value: 0, 1, string (proxy name)
  Description: This parameter has no effect if proxy servers have not been entered in the config file. If they have been entered and enabled, you can pass the following values: 0 to disable proxy use (uses a direct connection), 1 for default proxy behaviour (whatever is set in the config), or a string to identify a specific proxy server (it has to match the name given to the proxy in the config file).

Feed-only parameters — These parameters only apply to web feeds. They have no effect when the input URL points to a web page.

use_extracted_title
  Value: 0 (default), 1
  Description: By default, if the input URL points to a feed, item titles in the generated feed will not be changed, because we assume item titles in feeds are not truncated. If you'd like them to be replaced with titles Full-Text RSS extracts, use this parameter in the request. To enable/disable this for all feeds, see the config file, specifically $options->favour_feed_titles

use_effective_url
  Value: 0 (default), 1
  Description: When we extract content for feed items, we often end up at a different URL than the one in the original feed. This is often a result of URL shorteners or tracking services being used by the feed publisher. We include the final (effective) URL we reached to get the content inside the dc:identifier field. If you enable this, we'll also use this URL in place of the original item URL in the new feed we produce. To enable/disable this for all feeds, see the config file, specifically $options->favour_effective_url

max
  Value: number
  Description: The maximum number of feed items to process. (The default and upper limit will be found in the configuration file.)
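
To illustrate how a makefulltextfeed.php URL might be assembled with the parameters described in this section, here is a small Python sketch. The base URL and the helper build_feed_url are assumptions for illustration only. The checks on format and accept simply mirror the values listed above, and the server's configuration still decides what is actually allowed.

```python
from urllib.parse import urlencode

# Hypothetical install location; replace with your own Full-Text RSS URL.
BASE_URL = "http://example.org/full-text-rss/"

# Accepted values as listed in the parameter descriptions above.
ALLOWED = {
    "format": {"rss", "json"},
    "accept": {"auto", "feed", "html"},
}


def build_feed_url(feed_url, **options):
    """Return a makefulltextfeed.php URL with an encoded query string."""
    for name, allowed in ALLOWED.items():
        if name in options and options[name] not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}")
    params = {"url": feed_url, **options}
    return BASE_URL + "makefulltextfeed.php?" + urlencode(params)


# Example: JSON output, at most 3 items, input declared to be a feed,
# and item titles replaced with the extracted ones.
print(build_feed_url(
    "http://feeds.bbci.co.uk/news/rss.xml",
    format="json", max=3, accept="feed", use_extracted_title=1,
))
```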

Response (example)

JSON output produced for the BBC feed http://feeds.bbci.co.uk/news/rss.xml. You can also request regular RSS.

{
    "rss": {
        "@attributes": {
            "version": "2.0"
        },
        "channel": {
            "title": "BBC News - Home",
            "link": "http://www.bbc.co.uk/news/#sa-ns_mchannel=rss&amp;ns_source=PublicR…",
            "description": "The latest stories from the Home section of the BBC News web site.",
            "ttl": 15,
            "image": {
                "title": "BBC News - Home",
                "link": "http://www.bbc.co.uk/news/#sa-ns_mchannel=rss&amp;ns_source=PublicR…",
                "url": "http://news.bbcimg.co.uk/nol/shared/img/bbc_news_120x60.gif"
            },
            "item": [
                {
                    "title": "Russia's Putin visits annexed Crimea",
                    "link": "http://www.bbc.co.uk/news/world-europe-27344029#sa-ns_mchannel=rss&…",
                    "guid": "http://www.bbc.co.uk/news/world-europe-27344029#sa-ns_mchannel=rss&…",
                    "description": "President Putin: \"[Crimeans have] proved their loyalty to a histor…",
                    "content_encoded": "<!-- Adding hypertab -->&#13;\n&#13;\n&#13;\n<!-- end of hypertab -…",
                    "pubDate": "Fri, 09 May 2014 15:02:04 +0000",
                    "dc_language": "en-gb",
                    "dc_format": "text/html",
                    "dc_identifier": "http://www.bbc.co.uk/news/world-europe-27344029",
                    "media_thumbnail": [
                        {
                            "@attributes": {
                                "url": "http://news.bbcimg.co.uk/media/images/74751000/jpg/_74751301_ycst2i…"
                            }
                        },
                        {
                            "@attributes": {
                                "url": "http://news.bbcimg.co.uk/media/images/74751000/jpg/_74751302_ycst2i…"
                            }
                        }
                    ]
                },
                {
                    "title": "Harris 'assaulted daughter's friend'",
                    "link": "http://www.bbc.co.uk/news/uk-27340134#sa-ns_mchannel=rss&ns_source=…",
                    "guid": "http://www.bbc.co.uk/news/uk-27340134#sa-ns_mchannel=rss&amp;ns_sou…",
                    "description": "Rolf Harris arrives at court flanked by his wife and daughter Rolf …",
                    "content_encoded": "<!-- Embedding the video player -->&#13;\n<!-- This is the embedd…",
                    "pubDate": "Fri, 09 May 2014 15:21:52 +0000",
                    "dc_language": "en-gb",
                    "dc_format": "text/html",
                    "dc_identifier": "http://www.bbc.co.uk/news/uk-27340134",
                    "media_thumbnail": [
                        {
                            "@attributes": {
                                "url": "http://news.bbcimg.co.uk/media/images/74740000/jpg/_74740642_hi0221…"
                            }
                        },
                        {
                            "@attributes": {
                                "url": "http://news.bbcimg.co.uk/media/images/74740000/jpg/_74740643_hi0221…"
                            }
                        }
                    ]
                },
                {
                    "title": "Nigeria 'ignored' school warning",
                    "link": "http://www.bbc.co.uk/news/world-africa-27344863#sa-ns_mchannel=rss&…",
                    "guid": "http://www.bbc.co.uk/news/world-africa-27344863#sa-ns_mchannel=rss&…",
                    "description": "Nigeria's military had advance warning of the attack on a school at…",
                    "content_encoded": "<div class=\"caption full-width\">&#13;\n <img src=\"http://news.b…",
                    "pubDate": "Fri, 09 May 2014 15:48:34 +0000",
                    "dc_language": "en-gb",
                    "dc_format": "text/html",
                    "dc_identifier": "http://www.bbc.co.uk/news/world-africa-27344863",
                    "media_thumbnail": [
                        {
                            "@attributes": {
                                "url": "http://news.bbcimg.co.uk/media/images/74749000/jpg/_74749855_747495…"
                            }
                        },
                        {
                            "@attributes": {
                                "url": "http://news.bbcimg.co.uk/media/images/74749000/jpg/_74749856_747495…"
                            }
                        }
                    ]
                }
            ]
        }
    }
}

Note: For brevity the output above is truncated.
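
A hedged sketch of walking this JSON structure in Python is shown below. The key names (rss, channel, item, dc_identifier, dc_language) are taken from the example above. The helper name list_items and the handling of a lone item that is not wrapped in a list are defensive assumptions, not documented behaviour.

```python
import json


def list_items(response_text):
    """Return (title, url, language) tuples for each item in a JSON feed."""
    feed = json.loads(response_text)
    items = feed["rss"]["channel"].get("item", [])
    if isinstance(items, dict):
        # Assumption: guard against a single item arriving as an object.
        items = [items]
    return [
        (
            item.get("title"),
            # Prefer the effective URL in dc_identifier, else the item link.
            item.get("dc_identifier") or item.get("link"),
            item.get("dc_language"),
        )
        for item in items
    ]


# Stand-in for a real response body (placeholder values).
sample_feed = json.dumps({
    "rss": {"channel": {"title": "Example", "item": [
        {"title": "A", "link": "http://example.org/a#x",
         "dc_identifier": "http://example.org/a", "dc_language": "en-gb"},
        {"title": "B", "link": "http://example.org/b"},
    ]}}
})

for title, url, language in list_items(sample_feed):
    print(title, url, language)
```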


API Keys

To restrict access to your copy of Full-Text RSS, you can specify API keys in the config file.

Note: Full-text feeds produced by Full-Text RSS are intended to be publicly accessible to work with feed readers. As such, the API key should not appear in the final URL for feeds.

key
  Value: string or number
  Description: This parameter has two functions.

  If you're calling Full-Text RSS programmatically, it's better to use this parameter to provide the API key index number together with the hash parameter (see below) so that the actual API key does not get sent in the HTTP request.

  If you pass the actual API key in this parameter, the hash parameter is not required. Full-Text RSS will find the index number, generate the hash value automatically, and redirect to a new URL to hide the API key. If you'd like to link to a generated feed publicly while protecting your API key, make sure you copy and paste the URL that results after the redirect.

  If you've configured Full-Text RSS to require a key, an invalid key will result in an error message.

hash
  Value: string
  Description: A SHA-1 hash value of the API key (actual key, not index number) and the URL supplied in the url parameter, concatenated. This parameter must be passed along with the API key's index number using the key parameter (see above). In PHP, for example: $hash = sha1($api_key.$url);

key_redirect
  Value: 0 or 1 (default)
  Description: When supplying the API key with the key parameter, Full-Text RSS will generate a new URL and issue an HTTP redirect to the new URL to hide the API key (see description above). If you'd like to avoid an HTTP redirect, you can pass 0 in this parameter. We do not recommend you subscribe to feeds generated in this way.
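
For callers not using PHP, the hash described above can be computed the same way in other languages. The Python sketch below mirrors sha1($api_key.$url) and builds a URL carrying the key index and hash instead of the key itself. The base URL, the sample key, the index value, and the helper name are all hypothetical, and the sketch assumes the hash is computed over the URL exactly as supplied, before query-string encoding.

```python
import hashlib
from urllib.parse import urlencode

# Hypothetical install location; replace with your own Full-Text RSS URL.
BASE_URL = "http://example.org/full-text-rss/"


def build_signed_feed_url(feed_url, api_key, key_index):
    """Build a feed URL that carries the key index and hash, not the key.

    The hash is SHA-1 over the API key followed by the URL, matching the
    PHP expression sha1($api_key.$url). Assumption: the URL is hashed in
    its plain (not percent-encoded) form.
    """
    digest = hashlib.sha1((api_key + feed_url).encode("utf-8")).hexdigest()
    query = urlencode({"url": feed_url, "key": key_index, "hash": digest})
    return BASE_URL + "makefulltextfeed.php?" + query


# Placeholder key and index, for illustration only.
print(build_signed_feed_url("http://example.org/feed.xml", "secret-key", 1))
```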

About

This is a free software project to enable article extraction from web pages. It can extract content from a standard HTML page and return a 1-item feed or it can transform an existing feed into a full-text feed. It is being developed as part of the Five Filters project to promote independent, non-corporate media.

Bookmarklet

Rather than copying and pasting URLs into this form, you can add the bookmarklet on this page to your browser. Simply drag the link below to your browser's bookmarks toolbar. Then whenever you'd like a full-text feed, click the bookmarklet.

Drag this:

Note: This uses the default options and does not include your access key (if configured).

Free Software

Note: 'Free' as in 'free speech' (see the free software definition)

If you're the owner of this site and you plan to offer this service to others through your hosted copy, please keep a download link so users can grab a copy of the code if they want it (you can either offer a free download yourself, or link to the purchase option on fivefilters.org to support us).

For full details, please refer to the license.

If you're not the owner of this site (i.e. you're not hosting this yourself), you do not have to rely on an external service if you don't want to. You can download your own copy of Full-Text RSS under the AGPL license.

Software Components

Full-Text RSS is written in PHP and relies on the following primary components:

Depending on your configuration, these secondary components may also be used:

System Requirements

PHP 5.3 or above is required. A simple shared web hosting account should work fine, but we recommend a VPS with 1GB RAM. The code has been tested on Windows and Linux using the Apache web server. If you're a Windows user, you can try it on your own machine using Uniform Server. It has also been reported as working under IIS, but we have not tested this ourselves.

Download

Download from fivefilters.org — old versions are available in our code repository.

Your version of Full-Text RSS:
Your version of Site Patterns:

To see if you have the latest versions, check for updates.

If you've purchased this from FiveFilters.org, you'll receive notification when we release a new version or update the site patterns.


Full-Text RSS is licensed under the AGPL version 3 — more information about why we use this license can be found on FiveFilters.org

The software components in this application are licensed as follows...