Full-Text RSS 2.7

This commit is contained in:
Keyvan 2011-11-04 18:40:29 +01:00
parent e2a9b81740
commit f9f03f14c0
20 changed files with 1146 additions and 119 deletions

View File

@ -1,6 +1,13 @@
Full-Text RSS
=============
NOTE
----
This is a our public version of Full-Text RSS available to download for free from http://code.fivefilters.org.
To sustain the project we sell copies of the most up-to-date version at http://fivefilters.org/content-only/#download - so if you like this, please consider supporting us by purchasing the latest release.
About
-----
@ -14,7 +21,7 @@ Installation
2. FTP the files up to your server
3. Access index.php through your browser. E.g. http://my-host.com/full-text-rss/index.php
3. Access index.php through your browser. E.g. http://example.org/full-text-rss/index.php
4. Enter a URL in the form field to test the code

View File

@ -9,7 +9,7 @@ To update your copy of Full-Text RSS to ensure feeds continue to be processed as
3. FTP the new folder up to your server
4. Access index.php in the new folder through your browser -- for example http://my-host.com/full-text-rss-updated/index.php
4. Access index.php in the new folder through your browser -- for example http://example.org/full-text-rss-updated/index.php
5. Enter a URL in the form field to test the updated code

View File

@ -2,6 +2,16 @@ FiveFilters.org: Full-Text RSS
http://fivefilters.org/content-only/
CHANGELOG
------------------------------------
2.7 (2011-03-21)
- Site patterns for better control over extraction (see site_config/README.txt)
- hNews support (improves content extraction for sites using hNews microformatting)
- Cookie Jar now used to store and sends cookies when following HTTP redirects
- Better handling of certain cases where HTML Tidy fails to clean up properly
- Bug fix: curl_multi_select() timing out in certain environments (fixed in HumbleHttpAgent.php)
- Bug fix: broken HTTP header parsing in some environments (fixed in SimplePie_HumbleHttpAgent.php)
- Bug fix: invalid API URL shown (fixed in index.php)
- Plus other minor fixes...
2.6 (2011-03-02)
- Rewriting of hash-bang (#!) URLs (see http://www.tbray.org/ongoing/When/201x/2011/02/09/Hash-Blecch for an explanation)
- Improved parallel fetching support (HumbleHttpAgent uses curl_multi_* functions if PECL HTTP extension is not present)

View File

@ -11,6 +11,9 @@
// options you'd like to override in custom_config.php.
// .....................................................
// Create config object
$options = new stdClass();
// Enable service
// ----------------------
// Set this to false if you want to disable the service.
@ -185,7 +188,7 @@ $options->error_message_with_key = '[unable to retrieve full-text content]';
/// DO NOT CHANGE ANYTHING BELOW THIS ///////////
/////////////////////////////////////////////////
if (!defined('_FF_FTR_VERSION')) define('_FF_FTR_VERSION', '2.6');
if (!defined('_FF_FTR_VERSION')) define('_FF_FTR_VERSION', '2.7');
if ((basename(__FILE__) == 'config.php') && (file_exists(dirname(__FILE__).'/custom_config.php'))) {
require_once(dirname(__FILE__).'/custom_config.php');

View File

@ -13,7 +13,7 @@ SimplePie.org. We have kept most of their checks intact as we use SimplePie in o
http://github.com/simplepie/simplepie/tree/master/compatibility_test/
*/
$app_name = 'Full-Text RSS 2.6';
$app_name = 'Full-Text RSS 2.7';
$php_ok = (function_exists('version_compare') && version_compare(phpversion(), '5.2.0', '>='));
$pcre_ok = extension_loaded('pcre');
@ -24,6 +24,7 @@ $tidy_ok = function_exists('tidy_parse_string');
$curl_ok = function_exists('curl_exec');
$parallel_ok = ((extension_loaded('http') && class_exists('HttpRequestPool')) || ($curl_ok && function_exists('curl_multi_init')));
$allow_url_fopen_ok = (bool)ini_get('allow_url_fopen');
$filter_ok = extension_loaded('filter');
if (extension_loaded('xmlreader')) {
$xml_ok = true;
@ -217,6 +218,11 @@ div.chunk {
<td>Enabled</td>
<td><?php echo ($iconv_ok) ? 'Enabled' : 'Disabled'; ?></td>
</tr>
<tr class="<?php echo ($filter_ok) ? 'enabled' : 'disabled'; ?>">
<td><a href="http://uk.php.net/manual/en/book.filter.php">Data filtering</a></td>
<td>Enabled</td>
<td><?php echo ($filter_ok) ? 'Enabled' : 'Disabled'; ?></td>
</tr>
<tr class="<?php echo ($tidy_ok) ? 'enabled' : 'disabled'; ?>">
<td><a href="http://php.net/tidy">Tidy</a></td>
<td>Enabled</td>
@ -244,7 +250,7 @@ div.chunk {
<div class="chunk">
<h3>What does this mean?</h3>
<ol>
<?php if ($php_ok && $xml_ok && $pcre_ok && $mbstring_ok && $iconv_ok && $zlib_ok && $tidy_ok && $curl_ok && $parallel_ok && $allow_url_fopen_ok): ?>
<?php if ($php_ok && $xml_ok && $pcre_ok && $mbstring_ok && $iconv_ok && $filter_ok && $zlib_ok && $tidy_ok && $curl_ok && $parallel_ok && $allow_url_fopen_ok): ?>
<li><em>You have everything you need to run <?php echo $app_name; ?> properly! Congratulations!</em></li>
<?php else: ?>
<?php if ($php_ok): ?>
@ -257,38 +263,45 @@ div.chunk {
<?php if ($allow_url_fopen_ok): ?>
<li><strong>allow_url_fopen:</strong> You have allow_url_fopen enabled. <em>No problems here.</em></li>
<?php if ($zlib_ok): ?>
<li><strong>Zlib:</strong> You have <code>Zlib</code> enabled. This allows SimplePie to support GZIP-encoded feeds. <em>No problems here.</em></li>
<?php else: ?>
<li><strong>Zlib:</strong> The <code>Zlib</code> extension is not available. SimplePie will ignore any GZIP-encoding, and instead handle feeds as uncompressed text.</li>
<?php endif; ?>
<?php if ($filter_ok): ?>
<li><strong>Data filtering:</strong> You have the PHP filter extension enabled. <em>No problems here.</em></li>
<?php if ($mbstring_ok && $iconv_ok): ?>
<li><strong>mbstring and iconv:</strong> You have both <code>mbstring</code> and <code>iconv</code> installed! This will allow <?php echo $app_name; ?> to handle the greatest number of languages. <em>No problems here.</em></li>
<?php elseif ($mbstring_ok): ?>
<li><strong>mbstring:</strong> <code>mbstring</code> is installed, but <code>iconv</code> is not.</li>
<?php elseif ($iconv_ok): ?>
<li><strong>iconv:</strong> <code>iconv</code> is installed, but <code>mbstring</code> is not.</li>
<?php else: ?>
<li><strong>mbstring and iconv:</strong> <em>You do not have either of the extensions installed.</em> This will significantly impair your ability to read non-English feeds, as well as even some English ones.</li>
<?php endif; ?>
<?php if ($zlib_ok): ?>
<li><strong>Zlib:</strong> You have <code>Zlib</code> enabled. This allows SimplePie to support GZIP-encoded feeds. <em>No problems here.</em></li>
<?php else: ?>
<li><strong>Zlib:</strong> The <code>Zlib</code> extension is not available. SimplePie will ignore any GZIP-encoding, and instead handle feeds as uncompressed text.</li>
<?php endif; ?>
<?php if ($tidy_ok): ?>
<li><strong>Tidy:</strong> You have <code>Tidy</code> support installed. <em>No problems here.</em></li>
<?php else: ?>
<li><strong>Tidy:</strong> The <code>Tidy</code> extension is not available. <?php echo $app_name; ?> should still work with most feeds, but you may experience problems with some.</li>
<?php endif; ?>
<?php if ($mbstring_ok && $iconv_ok): ?>
<li><strong>mbstring and iconv:</strong> You have both <code>mbstring</code> and <code>iconv</code> installed! This will allow <?php echo $app_name; ?> to handle the greatest number of languages. <em>No problems here.</em></li>
<?php elseif ($mbstring_ok): ?>
<li><strong>mbstring:</strong> <code>mbstring</code> is installed, but <code>iconv</code> is not.</li>
<?php elseif ($iconv_ok): ?>
<li><strong>iconv:</strong> <code>iconv</code> is installed, but <code>mbstring</code> is not.</li>
<?php else: ?>
<li><strong>mbstring and iconv:</strong> <em>You do not have either of the extensions installed.</em> This will significantly impair your ability to read non-English feeds, as well as even some English ones.</li>
<?php endif; ?>
<?php if ($curl_ok): ?>
<li><strong>cURL:</strong> You have <code>cURL</code> support installed. <em>No problems here.</em></li>
<?php else: ?>
<li><strong>cURL:</strong> The <code>cURL</code> extension is not available. SimplePie will use <code>fsockopen()</code> instead.</li>
<?php endif; ?>
<?php if ($tidy_ok): ?>
<li><strong>Tidy:</strong> You have <code>Tidy</code> support installed. <em>No problems here.</em></li>
<?php else: ?>
<li><strong>Tidy:</strong> The <code>Tidy</code> extension is not available. <?php echo $app_name; ?> should still work with most feeds, but you may experience problems with some.</li>
<?php endif; ?>
<?php if ($curl_ok): ?>
<li><strong>cURL:</strong> You have <code>cURL</code> support installed. <em>No problems here.</em></li>
<?php else: ?>
<li><strong>cURL:</strong> The <code>cURL</code> extension is not available. SimplePie will use <code>fsockopen()</code> instead.</li>
<?php endif; ?>
<?php if ($parallel_ok): ?>
<li><strong>Parallel URL fetching:</strong> You have <code>HttpRequestPool</code> or <code>curl_multi</code> support installed. <em>No problems here.</em></li>
<?php else: ?>
<li><strong>Parallel URL fetching:</strong> <code>HttpRequestPool</code> or <code>curl_multi</code> support is not available. <?php echo $app_name; ?> will use <code>file_get_contents()</code> instead to fetch URLs sequentially rather than in parallel.</li>
<?php endif; ?>
<?php if ($parallel_ok): ?>
<li><strong>Parallel URL fetching:</strong> You have <code>HttpRequestPool</code> or <code>curl_multi</code> support installed. <em>No problems here.</em></li>
<?php else: ?>
<li><strong>Parallel URL fetching:</strong> <code>HttpRequestPool</code> or <code>curl_multi</code> support is not available. <?php echo $app_name; ?> will use <code>file_get_contents()</code> instead to fetch URLs sequentially rather than in parallel.</li>
<li><strong>Data filtering:</strong> Your PHP configuration has the filter extension disabled. <em><?php echo $app_name; ?> will not work here.</em></li>
<?php endif; ?>
<?php else: ?>
@ -309,19 +322,19 @@ div.chunk {
</div>
<div class="chunk">
<?php if ($php_ok && $xml_ok && $pcre_ok && $mbstring_ok && $iconv_ok && $allow_url_fopen_ok) { ?>
<?php if ($php_ok && $xml_ok && $pcre_ok && $mbstring_ok && $iconv_ok && $filter_ok && $allow_url_fopen_ok) { ?>
<h3>Bottom Line: Yes, you can!</h3>
<p><em>Your webhost has its act together!</em></p>
<p>You can download the latest version of <?php echo $app_name; ?> from <a href="http://fivefilters.org/content-only/#download">FiveFilters.org</a>.</p>
<p><strong>Note</strong>: Passing this test does not guarantee that <?php echo $app_name; ?> will run on your webhost &mdash; it only ensures that the basic requirements have been addressed. If you experience any problems, please let us know.</p>
<?php } else if ($php_ok && $xml_ok && $pcre_ok && $allow_url_fopen_ok) { ?>
<?php } else if ($php_ok && $xml_ok && $pcre_ok && $allow_url_fopen_ok && $filter_ok) { ?>
<h3>Bottom Line: Yes, you can!</h3>
<p><em>For most feeds, it'll run with no problems.</em> There are certain languages that you might have a hard time with though.</p>
<p>You can download the latest version of <?php echo $app_name; ?> from <a href="http://fivefilters.org/content-only/#download">FiveFilters.org</a>.</p>
<p><strong>Note</strong>: Passing this test does not guarantee that <?php echo $app_name; ?> will run on your webhost &mdash; it only ensures that the basic requirements have been addressed. If you experience any problems, please let us know.</p>
<?php } else { ?>
<h3>Bottom Line: We're sorry…</h3>
<p><em>Your webhost does not support the minimum requirements for <?php echo $app_name; ?>.</em> It may be a good idea to contact your webhost, and ask them to install a more recent version of PHP as well as the <code>xmlreader</code>, <code>xml</code>, <code>mbstring</code>, <code>iconv</code>, <code>curl</code>, and <code>zlib</code> extensions. And ask them to enable allow_url_fopen.</p>
<p><em>Your webhost does not support the minimum requirements for <?php echo $app_name; ?>.</em> It may be a good idea to contact your webhost and point them to the results of this test. They may be able to enable/install the required components.</p>
<?php } ?>
</div>

View File

@ -1,9 +1,5 @@
<?php
require_once(dirname(__FILE__).'/config.php');
// check for custom config.php (custom_config.php)
if (file_exists(dirname(__FILE__).'/custom_config.php')) {
require_once(dirname(__FILE__).'/custom_config.php');
}
// check for custom index.php (custom_index.php)
if (!defined('_FF_FTR_INDEX')) {
define('_FF_FTR_INDEX', true);
@ -26,7 +22,7 @@ if (!defined('_FF_FTR_INDEX')) {
<!--<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.4/jquery.min.js"></script>-->
<script type="text/javascript" src="js/jquery-1.4.4.min.js"></script>
<script type="text/javascript">
var baseUrl = 'http://'+window.location.host+window.location.pathname.replace(/(\/index\.html|\/)$/, '');
var baseUrl = 'http://'+window.location.host+window.location.pathname.replace(/(\/index\.php|\/)$/, '');
$(document).ready(function() {
$('#form').submit(function() {
$('#url').val($('#url').val().replace(/^http:\/\//i, ''));
@ -102,6 +98,7 @@ if (!defined('_FF_FTR_INDEX')) {
<h3>Configure</h3>
<p>In addition to the options above, Full-Text RSS comes with a configuration file which allows you to control how the application works. Features include:</p>
<ul>
<li>Site patterns for better control over extraction (<a href="site_config/README.txt">more info</a>)</li>
<li>Restrict access to a pre-defined set of URLs or block certain URLs</li>
<li>Restrict the maximum number of feed items to be processed</li>
<li>Prepend or append an HTML fragment to each feed item processed</li>

View File

@ -0,0 +1,454 @@
<?php
/**
* Content Extractor
*
* Uses patterns specified in site config files and auto detection (hNews/PHP Readability)
* to extract content from HTML files.
*
* @version 0.5
* @date 2011-03-07
* @author Keyvan Minoukadeh
* @copyright 2011 Keyvan Minoukadeh
* @license http://www.gnu.org/licenses/agpl-3.0.html AGPL v3
*/
class ContentExtractor
{
const HOSTNAME_REGEX = '/^(([a-zA-Z0-9-]*[a-zA-Z0-9])\.)*([A-Za-z0-9-]*[A-Za-z0-9])$/';
protected static $config_cache = array();
protected static $tidy_config = array(
'clean' => true,
'output-xhtml' => true,
'logical-emphasis' => true,
'show-body-only' => false,
'wrap' => 0,
'drop-empty-paras' => true,
'drop-proprietary-attributes' => false,
'enclose-text' => true,
'enclose-block-text' => true,
'merge-divs' => true,
'merge-spans' => true,
'char-encoding' => 'utf8',
'hide-comments' => true
);
protected $config_path;
protected $html;
protected $config;
protected $title;
protected $body;
protected $success = false;
protected $fallback;
public $readability;
public $debug = false;
function __construct($config_path=null, ContentExtractor $config_fallback=null) {
$this->config_path = $config_path;
$this->fallback = $config_fallback;
}
protected function debug($msg) {
if ($this->debug) {
$mem = round(memory_get_usage()/1024, 2);
$memPeak = round(memory_get_peak_usage()/1024, 2);
echo '* ',$msg;
echo ' - mem used: ',$mem," (peak: $memPeak)\n";
ob_flush();
flush();
}
}
public function reset() {
$this->html = null;
$this->readability = null;
$this->config = null;
$this->title = null;
$this->body = null;
$this->success = false;
}
// returns SiteConfig instance if an appropriate one is found, false otherwise
public function get_site_config($host) {
$host = strtolower($host);
if (substr($host, 0, 4) == 'www.') $host = substr($host, 4);
if (!$host || (strlen($host) > 200) || !preg_match(self::HOSTNAME_REGEX, $host)) return false;
// check for site configuration
$try = array($host);
$split = explode('.', $host);
if (count($split) > 1) {
array_shift($split);
$try[] = '.'.implode('.', $split);
}
foreach ($try as $h) {
if (array_key_exists($h, self::$config_cache)) {
$this->debug("... cached ($h)");
return self::$config_cache[$h];
} elseif (file_exists($this->config_path."/$h.txt")) {
$this->debug("... from file ($h)");
$file = $this->config_path."/$h.txt";
break;
}
}
if (!isset($file)) {
if (isset($this->fallback)) {
$this->debug("... trying fallback ($host)");
return $this->fallback->get_site_config($host);
} else {
$this->debug("... no match ($host)");
return false;
}
}
$config_file = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
if (!$config_file || !is_array($config_file)) return false;
$config = new SiteConfig();
foreach ($config_file as $line) {
$line = trim($line);
// skip comments, empty lines
if ($line == '' || $line[0] == '#') continue;
// get command
$command = explode(':', $line, 2);
// if there's no colon ':', skip this line
if (count($command) != 2) continue;
$val = trim($command[1]);
$command = trim($command[0]);
if ($command == '' || $val == '') continue;
// check for commands where we accept multiple statements
if (in_array($command, array('title', 'body', 'strip', 'strip_id_or_class', 'strip_image_src'))) {
array_push($config->$command, $val);
// check for single statement commands that evaluate to true or false
} elseif (in_array($command, array('tidy', 'prune', 'autodetect_on_failure'))) {
$config->$command = ($val == 'yes');
// check for single statement commands stored as strings
} elseif (in_array($command, array('test_url'))) {
$config->$command = $val;
}
}
// store copy of config in our static cache array in case we need to process another URL
self::$config_cache[$h] = $config;
return $config;
}
// returns true on success, false on failure
// $smart_tidy indicates that if tidy is used and no results are produced, we will
// try again without it. Tidy helps us deal with PHP's patchy HTML parsing most of the time
// but it has problems of its own which we try to avoid with this option.
public function process($html, $url, $smart_tidy=true) {
$this->reset();
// extract host name
$host = @parse_url($url, PHP_URL_HOST);
if (!($this->config = $this->get_site_config($host))) {
// no match, so use defaults
$this->config = new SiteConfig();
self::$config_cache[$host] = $this->config;
}
// use tidy (if it exists)?
// This fixes problems with some sites which would otherwise
// trouble DOMDocument's HTML parsing. (Although sometimes it
// makes matters worse, which is why you can override it in site config files.)
$tidied = false;
if ($this->config->tidy && function_exists('tidy_parse_string') && $smart_tidy) {
$this->debug('Using Tidy');
$tidy = tidy_parse_string($html, self::$tidy_config, 'UTF8');
if (tidy_clean_repair($tidy)) {
$original_html = $html;
$tidied = true;
$html = $tidy->value;
}
unset($tidy);
}
// load and parse html
$this->readability = new Readability($html, $url);
// we use xpath to find elements in the given HTML document
// see http://en.wikipedia.org/wiki/XPath_1.0
$xpath = new DOMXPath($this->readability->dom);
// strip elements (using xpath expressions)
foreach ($this->config->strip as $pattern) {
$elems = @$xpath->query($pattern, $this->readability->dom);
// check for matches
if ($elems && $elems->length > 0) {
$this->debug('Stripping '.$elems->length.' elements (strip)');
for ($i=$elems->length-1; $i >= 0; $i--) {
$elems->item($i)->parentNode->removeChild($elems->item($i));
}
}
}
// strip elements (using id and class attribute values)
foreach ($this->config->strip_id_or_class as $string) {
$string = strtr($string, array("'"=>'', '"'=>''));
$elems = @$xpath->query("//*[contains(@class, '$string') or contains(@id, '$string')]", $this->readability->dom);
// check for matches
if ($elems && $elems->length > 0) {
$this->debug('Stripping '.$elems->length.' elements (strip_id_or_class)');
for ($i=$elems->length-1; $i >= 0; $i--) {
$elems->item($i)->parentNode->removeChild($elems->item($i));
}
}
}
// strip images (using src attribute values)
foreach ($this->config->strip_image_src as $string) {
$string = strtr($string, array("'"=>'', '"'=>''));
$elems = @$xpath->query("//img[contains(@src, '$string')]", $this->readability->dom);
// check for matches
if ($elems && $elems->length > 0) {
$this->debug('Stripping '.$elems->length.' image elements');
for ($i=$elems->length-1; $i >= 0; $i--) {
$elems->item($i)->parentNode->removeChild($elems->item($i));
}
}
}
// strip elements using Readability.com and Instapaper.com ignore class names
// .entry-unrelated and .instapaper_ignore
// See https://www.readability.com/publishers/guidelines/#view-plainGuidelines
// and http://blog.instapaper.com/post/730281947
$elems = @$xpath->query("//*[contains(concat(' ',normalize-space(@class),' '),' entry-unrelated ') or contains(concat(' ',normalize-space(@class),' '),' instapaper_ignore ')]", $this->readability->dom);
// check for matches
if ($elems && $elems->length > 0) {
$this->debug('Stripping '.$elems->length.' .entry-unrelated,.instapaper_ignore elements');
for ($i=$elems->length-1; $i >= 0; $i--) {
$elems->item($i)->parentNode->removeChild($elems->item($i));
}
}
// strip elements that contain style="display: none;"
$elems = @$xpath->query("//*[contains(@style,'display:none')]", $this->readability->dom);
// check for matches
if ($elems && $elems->length > 0) {
$this->debug('Stripping '.$elems->length.' elements with inline display:none style');
for ($i=$elems->length-1; $i >= 0; $i--) {
$elems->item($i)->parentNode->removeChild($elems->item($i));
}
}
// try to get title
foreach ($this->config->title as $pattern) {
$elems = @$xpath->evaluate($pattern, $this->readability->dom);
if (is_string($elems)) {
$this->debug('Title expression evaluated as string');
$this->title = trim($elems);
break;
} elseif ($elems instanceof DOMNodeList && $elems->length > 0) {
$this->debug('Title matched');
$this->title = $elems->item(0)->textContent;
break;
}
}
// try to get body
foreach ($this->config->body as $pattern) {
$elems = @$xpath->query($pattern, $this->readability->dom);
// check for matches
if ($elems && $elems->length > 0) {
$this->debug('Body matched');
if ($elems->length == 1) {
$this->body = $elems->item(0);
// prune (clean up elements that may not be content)
if ($this->config->prune) {
$this->debug('Pruning content');
$this->readability->prepArticle($this->body);
}
break;
} else {
$this->body = $this->readability->dom->createElement('div');
$this->debug($elems->length.' body elems found');
foreach ($elems as $elem) {
$isDescendant = false;
foreach ($this->body->childNodes as $parent) {
if ($this->isDescendant($parent, $elem)) {
$isDescendant = true;
break;
}
}
if ($isDescendant) {
$this->debug('Element is child of another body element, skipping.');
} else {
// prune (clean up elements that may not be content)
if ($this->config->prune) {
$this->debug('Pruning content');
$this->readability->prepArticle($elem);
}
$this->debug('Element added to body');
$this->body->appendChild($elem);
}
}
}
}
}
// auto detect?
$detect_title = $detect_body = false;
// detect title?
if (!isset($this->title)) {
if (empty($this->config->title) || (!empty($this->config->title) && $this->config->autodetect_on_failure)) {
$detect_title = true;
}
}
// detect body?
if (!isset($this->body)) {
if (empty($this->config->body) || (!empty($this->config->body) && $this->config->autodetect_on_failure)) {
$detect_body = true;
}
}
// check for hNews
if ($detect_title || $detect_body) {
// check for hentry
$elems = @$xpath->query("//*[contains(concat(' ',normalize-space(@class),' '),' hentry ')]", $this->readability->dom);
if ($elems && $elems->length > 0) {
$this->debug('hNews: found hentry');
$hentry = $elems->item(0);
if ($detect_title) {
// check for entry-title
$elems = @$xpath->query(".//*[contains(concat(' ',normalize-space(@class),' '),' entry-title ')]", $hentry);
if ($elems && $elems->length > 0) {
$this->debug('hNews: found entry-title');
$this->title = $elems->item(0)->textContent;
$detect_title = false;
}
}
// check for entry-content.
// according to hAtom spec, if there are multiple elements marked entry-content,
// we include all of these in the order they appear - see http://microformats.org/wiki/hatom#Entry_Content
if ($detect_body) {
$elems = @$xpath->query(".//*[contains(concat(' ',normalize-space(@class),' '),' entry-content ')]", $hentry);
if ($elems && $elems->length > 0) {
$this->debug('hNews: found entry-content');
if ($elems->length == 1) {
// what if it's empty? (some sites misuse hNews - place their content outside an empty entry-content element)
$e = $elems->item(0);
if (($e->tagName == 'img') || (trim($e->textContent) != '')) {
$this->body = $elems->item(0);
// prune (clean up elements that may not be content)
if ($this->config->prune) {
$this->debug('Pruning content');
$this->readability->prepArticle($this->body);
}
$detect_body = false;
} else {
$this->debug('hNews: skipping entry-content - appears not to contain content');
}
unset($e);
} else {
$this->body = $this->readability->dom->createElement('div');
$this->debug($elems->length.' entry-content elems found');
foreach ($elems as $elem) {
$isDescendant = false;
foreach ($this->body->childNodes as $parent) {
if ($this->isDescendant($parent, $elem)) {
$isDescendant = true;
break;
}
}
if ($isDescendant) {
$this->debug('Element is child of another body element, skipping.');
} else {
// prune (clean up elements that may not be content)
if ($this->config->prune) {
$this->debug('Pruning content');
$this->readability->prepArticle($elem);
}
$this->debug('Element added to body');
$this->body->appendChild($elem);
}
}
$detect_body = false;
}
}
}
}
}
// check for elements marked with instapaper_title
if ($detect_title) {
// check for instapaper_title
$elems = @$xpath->query("//*[contains(concat(' ',normalize-space(@class),' '),' instapaper_title ')]", $this->readability->dom);
if ($elems && $elems->length > 0) {
$this->debug('title found (.instapaper_title)');
$this->title = $elems->item(0)->textContent;
$detect_title = false;
}
}
// check for elements marked with instapaper_body
if ($detect_body) {
$elems = @$xpath->query("//*[contains(concat(' ',normalize-space(@class),' '),' instapaper_body ')]", $this->readability->dom);
if ($elems && $elems->length > 0) {
$this->debug('body found (.instapaper_body)');
$this->body = $elems->item(0);
// prune (clean up elements that may not be content)
if ($this->config->prune) {
$this->debug('Pruning content');
$this->readability->prepArticle($this->body);
}
$detect_body = false;
}
}
// still missing title or body, so we detect using Readability
if ($detect_title || $detect_body) {
$this->debug('Using Readability');
// clone body if we're only using Readability for title (otherwise it may interfere with body element)
if (isset($this->body)) $this->body = $this->body->cloneNode(true);
$success = $this->readability->init();
}
if ($detect_title) {
$this->debug('Detecting title');
$this->title = $this->readability->getTitle()->textContent;
}
if ($detect_body && $success) {
$this->debug('Detecting body');
$this->body = $this->readability->getContent();
if ($this->body->childNodes->length == 1 && $this->body->firstChild->nodeType === XML_ELEMENT_NODE) {
$this->body = $this->body->firstChild;
}
// prune (clean up elements that may not be content)
if ($this->config->prune) {
$this->debug('Pruning content');
$this->readability->prepArticle($this->body);
}
}
if (isset($this->body)) {
// remove scripts
$this->readability->removeScripts($this->body);
$this->success = true;
}
// if we've had no success and we've used tidy, there's a chance
// that tidy has messed up. So let's try again without tidy...
if (!$this->success && $tidied && $smart_tidy) {
$this->debug('Trying again without tidy');
$this->process($original_html, $url, false);
}
return $this->success;
}
private function isDescendant(DOMElement $parent, DOMElement $child) {
$node = $child->parentNode;
while ($node != null) {
if ($node->isSameNode($parent)) return true;
$node = $node->parentNode;
}
return false;
}
public function getContent() {
return $this->body;
}
public function getTitle() {
return $this->title;
}
public function getSiteConfig() {
return $this->config;
}
}
?>

View File

@ -0,0 +1,51 @@
<?php
/**
* Site Config
*
* Each instance of this class should hold extraction patterns and other directives
* for a website. See ContentExtractor class to see how it's used.
*
* @version 0.5
* @date 2011-03-08
* @author Keyvan Minoukadeh
* @copyright 2011 Keyvan Minoukadeh
* @license http://www.gnu.org/licenses/agpl-3.0.html AGPL v3
*/
class SiteConfig
{
// Use first matching element as title (0 or more xpath expressions)
public $title = array();
// Use first matching element as body (0 or more xpath expressions)
public $body = array();
// Strip elements matching these xpath expressions (0 or more)
public $strip = array();
// Strip elements which contain these strings (0 or more) in the id or class attribute
public $strip_id_or_class = array();
// Strip images which contain these strings (0 or more) in the src attribute
public $strip_image_src = array();
// Process HTML with tidy before creating DOM
public $tidy = true;
// Autodetect title/body if xpath expressions fail to produce results.
// Note that this applies to title and body separately, ie.
// * if we get a body match but no title match, this option will determine whether we autodetect title
// * if neither match, this determines whether we autodetect title and body.
// Also note that this only applies when there is at least one xpath expression in title or body, ie.
// * if title and body are both empty (no xpath expressions), this option has no effect (both title and body will be auto-detected)
// * if there's an xpath expression for title and none for body, body will be auto-detected and this option will determine whether we auto-detect title if the xpath expression for it fails to produce results.
// Usage scenario: you want to extract something specific from a set of URLs, e.g. a table, and if the table is not found, you want to ignore the entry completely. Auto-detection is unlikely to succeed here, so you construct your patterns and set this option to false. Another scenario may be a site where auto-detection has proven to fail (or worse, picked up the wrong content).
public $autodetect_on_failure = true;
// Clean up content block - attempt to remove elements that appear to be superfluous
public $prune = true;
// Test URL - if present, can be used to test the config above
public $test_url = null;
}
?>

View File

@ -0,0 +1,404 @@
<?php
/**
* Cookie Jar
*
* PHP class for handling cookies, as defined by the Netscape spec:
* <http://curl.haxx.se/rfc/cookie_spec.html>
*
* This class should be used to handle cookies (storing cookies from HTTP response messages, and
* sending out cookies in HTTP request messages). This has been adapted for FiveFilters.org
* from the original version used in HTTP Navigator. See http://www.keyvan.net/code/http-navigator/
*
* This class is mainly based on Cookies.pm <http://search.cpan.org/author/GAAS/libwww-perl-5.65/
* lib/HTTP/Cookies.pm> from the libwww-perl collection <http://www.linpro.no/lwp/>.
* Unlike Cookies.pm, this class only supports the Netscape cookie spec, not RFC 2965.
*
* @version 0.5
* @date 2011-03-15
* @see http://php.net/HttpRequestPool
* @author Keyvan Minoukadeh
* @copyright 2011 Keyvan Minoukadeh
* @license http://www.gnu.org/licenses/agpl-3.0.html AGPL v3
*/
class CookieJar
{
/**
* Cookies - array containing all cookies.
*
* <pre>
* Cookies are stored like this:
* [domain][path][name] = array
* where array is:
* 0 => value, 1 => secure, 2 => expires
* </pre>
* @var array
* @access private
*/
public $cookies = array();
public $debug = false;
/**
* Constructor
*/
function __construct() {
}
protected function debug($msg, $file=null, $line=null) {
if ($this->debug) {
$mem = round(memory_get_usage()/1024, 2);
$memPeak = round(memory_get_peak_usage()/1024, 2);
echo '* ',$msg;
if (isset($file, $line)) echo " ($file line $line)";
echo ' - mem used: ',$mem," (peak: $memPeak)\n";
ob_flush();
flush();
}
}
/**
* Get matching cookies
*
* Only use this method if you cannot use add_cookie_header(), for example, if you want to use
* this cookie jar class without using the request class.
*
* @param array $param associative array containing 'domain', 'path', 'secure' keys
* @return string
* @see add_cookie_header()
*/
public function getMatchingCookies($url)
{
if (($parts = @parse_url($url)) && isset($parts['scheme'], $parts['host'], $parts['path'])) {
$param['domain'] = $parts['host'];
$param['path'] = $parts['path'];
$param['secure'] = (strtolower($parts['scheme']) == 'https');
unset($parts);
} else {
return false;
}
// RFC 2965 notes:
// If multiple cookies satisfy the criteria above, they are ordered in
// the Cookie header such that those with more specific Path attributes
// precede those with less specific. Ordering with respect to other
// attributes (e.g., Domain) is unspecified.
$domain = $param['domain'];
if (strpos($domain, '.') === false) $domain .= '.local';
$request_path = $param['path'];
if ($request_path == '') $request_path = '/';
$request_secure = $param['secure'];
$now = time();
$matched_cookies = array();
// domain - find matching domains
$this->debug('Finding matching domains for '.$domain, __FILE__, __LINE__);
while (strpos($domain, '.') !== false) {
if (isset($this->cookies[$domain])) {
$this->debug(' domain match found: '.$domain);
$cookies =& $this->cookies[$domain];
} else {
$domain = $this->_reduce_domain($domain);
continue;
}
// paths - find matching paths starting from most specific
$this->debug(' - Finding matching paths for '.$request_path);
$paths = array_keys($cookies);
usort($paths, array($this, '_cmp_length'));
foreach ($paths as $path) {
// continue to next cookie if request path does not path-match cookie path
if (!$this->_path_match($request_path, $path)) continue;
// loop through cookie names
$this->debug(' path match found: '.$path);
foreach ($cookies[$path] as $name => $values) {
// if this cookie is secure but request isn't, continue to next cookie
if ($values[1] && !$request_secure) continue;
// if cookie is not a session cookie and has expired, continue to next cookie
if (is_int($values[2]) && ($values[2] < $now)) continue;
// cookie matches request
$this->debug(' cookie match: '.$name.'='.$values[0]);
$matched_cookies[] = $name.'='.$values[0];
}
}
$domain = $this->_reduce_domain($domain);
}
// return cookies
return implode('; ', $matched_cookies);
}
/**
* Parse Set-Cookie values.
*
* Only use this method if you cannot use extract_cookies(), for example, if you want to use
* this cookie jar class without using the response class.
*
* @param array $set_cookies array holding 1 or more "Set-Cookie" header values
* @param array $param associative array containing 'host', 'path' keys
* @return void
* @see extract_cookies()
*/
public function storeCookies($url, $set_cookies)
{
if (count($set_cookies) == 0) return;
$param = @parse_url($url);
if (!is_array($param) || !isset($param['host'])) return;
$request_host = $param['host'];
if (strpos($request_host, '.') === false) $request_host .= '.local';
$request_path = @$param['path'];
if ($request_path == '') $request_path = '/';
//
// loop through set-cookie headers
//
foreach ($set_cookies as $set_cookie) {
$this->debug('Parsing: '.$set_cookie);
// temporary cookie store (before adding to jar)
$tmp_cookie = array();
$param = explode(';', $set_cookie);
// loop through params
for ($x=0; $x<count($param); $x++) {
$key_val = explode('=', $param[$x], 2);
if (count($key_val) != 2) {
// if the first param isn't a name=value pair, continue to the next set-cookie
// header
if ($x == 0) continue 2;
// check for secure flag
if (strtolower(trim($key_val[0])) == 'secure') $tmp_cookie['secure'] = true;
// continue to next param
continue;
}
list($key, $val) = array_map('trim', $key_val);
// first name=value pair is the cookie name and value
// the name and value are stored under 'name' and 'value' to avoid conflicts
// with later parameters.
if ($x == 0) {
$tmp_cookie = array('name'=>$key, 'value'=>$val);
continue;
}
$key = strtolower($key);
if (in_array($key, array('expires', 'path', 'domain', 'secure'))) {
$tmp_cookie[$key] = $val;
}
}
//
// set cookie
//
// check domain
if (isset($tmp_cookie['domain']) && ($tmp_cookie['domain'] != $request_host) &&
($tmp_cookie['domain'] != ".$request_host")) {
$domain = $tmp_cookie['domain'];
if ((strpos($domain, '.') === false) && ($domain != 'local')) {
$this->debug(' - domain "'.$domain.'" has no dot and is not a local domain');
continue;
}
if (preg_match('/\.[0-9]+$/', $domain)) {
$this->debug(' - domain "'.$domain.'" appears to be an ip address');
continue;
}
if (substr($domain, 0, 1) != '.') $domain = ".$domain";
if (!$this->_domain_match($request_host, $domain)) {
$this->debug(' - request host "'.$request_host.'" does not domain-match "'.$domain.'"');
continue;
}
} else {
// if domain is not specified in the set-cookie header, domain will default to
// the request host
$domain = $request_host;
}
// check path
if (isset($tmp_cookie['path']) && ($tmp_cookie['path'] != '')) {
$path = urldecode($tmp_cookie['path']);
if (!$this->_path_match($request_path, $path)) {
$this->debug(' - request path "'.$request_path.'" does not path-match "'.$path.'"');
continue;
}
} else {
$path = $request_path;
$path = substr($path, 0, strrpos($path, '/'));
if ($path == '') $path = '/';
}
// check if secure
$secure = (isset($tmp_cookie['secure'])) ? true : false;
// check expiry
if (isset($tmp_cookie['expires'])) {
if (($expires = strtotime($tmp_cookie['expires'])) < 0) {
$expires = null;
}
} else {
$expires = null;
}
// set cookie
$this->set_cookie($domain, $path, $tmp_cookie['name'], $tmp_cookie['value'], $secure, $expires);
}
}
// return array of set-cookie values extracted from HTTP response headers (string $h)
public function extractCookies($h) {
$x = 0;
$lines = 0;
$headers = array();
$last_match = false;
$h = explode("\n", $h);
foreach ($h as $line) {
$line = rtrim($line);
$lines++;
$trimmed_line = trim($line);
if (isset($line_last)) {
// check if we have \r\n\r\n (indicating the end of headers)
// some servers will not use CRLF (\r\n), so we make CR (\r) optional.
// if (preg_match('/\015?\012\015?\012/', $line_last.$line)) {
// break;
// }
// As an alternative, we can check if the current trimmed line is empty
if ($trimmed_line == '') {
break;
}
// check for continuation line...
// RFC 2616 Section 2.2 "Basic Rules":
// HTTP/1.1 header field values can be folded onto multiple lines if the
// continuation line begins with a space or horizontal tab. All linear
// white space, including folding, has the same semantics as SP. A
// recipient MAY replace any linear white space with a single SP before
// interpreting the field value or forwarding the message downstream.
if ($last_match && preg_match('/^\s+(.*)/', $line, $match)) {
// append to previous header value
$headers[$x-1] .= ' '.rtrim($match[1]);
continue;
}
}
$line_last = $line;
// split header name and value
if (preg_match('/^Set-Cookie\s*:\s*(.*)/i', $line, $match)) {
$headers[$x++] = rtrim($match[1]);
$last_match = true;
} else {
$last_match = false;
}
}
return $headers;
}
/**
* Set Cookie
* @param string $domain
* @param string $path
* @param string $name cookie name
* @param string $value cookie value
* @param bool $secure
* @param int $expires expiry time (null if session cookie, <= 0 will delete cookie)
* @return void
*/
function set_cookie($domain, $path, $name, $value, $secure=false, $expires=null)
{
if ($domain == '') return;
if ($path == '') return;
if ($name == '') return;
// check if cookie needs to go
if (isset($expires) && ($expires <= 0)) {
if (isset($this->cookies[$domain][$path][$name])) unset($this->cookies[$domain][$path][$name]);
return;
}
if ($value == '') return;
$this->cookies[$domain][$path][$name] = array($value, $secure, $expires);
return;
}
/**
* Clear cookies - [domain [,path [,name]]] - call method with no arguments to clear all cookies.
* @param string $domain
* @param string $path
* @param string $name
* @return void
*/
function clear($domain=null, $path=null, $name=null)
{
if (!isset($domain)) {
$this->cookies = array();
} elseif (!isset($path)) {
if (isset($this->cookies[$domain])) unset($this->cookies[$domain]);
} elseif (!isset($name)) {
if (isset($this->cookies[$domain][$path])) unset($this->cookies[$domain][$path]);
} elseif (isset($name)) {
if (isset($this->cookies[$domain][$path][$name])) unset($this->cookies[$domain][$path][$name]);
}
}
/**
* Compare string length - used for sorting
* @access private
* @return int
*/
function _cmp_length($a, $b)
{
$la = strlen($a); $lb = strlen($b);
if ($la == $lb) return 0;
return ($la > $lb) ? -1 : 1;
}
/**
* Reduce domain
* @param string $domain
* @return string
* @access private
*/
function _reduce_domain($domain)
{
if ($domain == '') return '';
if (substr($domain, 0, 1) == '.') return substr($domain, 1);
return substr($domain, strpos($domain, '.'));
}
/**
* Path match - check if path1 path-matches path2
*
* From RFC 2965:
* <i>For two strings that represent paths, P1 and P2, P1 path-matches P2
* if P2 is a prefix of P1 (including the case where P1 and P2 string-
* compare equal). Thus, the string /tec/waldo path-matches /tec.</i>
* @param string $path1
* @param string $path2
* @return bool
* @access private
*/
function _path_match($path1, $path2)
{
return (substr($path1, 0, strlen($path2)) == $path2);
}
/**
* Domain match - check if domain1 domain-matches domain2
*
* A few extracts from RFC 2965:
* - A Set-Cookie2 from request-host y.x.foo.com for Domain=.foo.com
* would be rejected, because H is y.x and contains a dot.
*
* - A Set-Cookie2 from request-host x.foo.com for Domain=.foo.com
* would be accepted.
*
* - A Set-Cookie2 with Domain=.com or Domain=.com., will always be
* rejected, because there is no embedded dot.
*
* - A Set-Cookie2 from request-host example for Domain=.local will
* be accepted, because the effective host name for the request-
* host is example.local, and example.local domain-matches .local.
*
* I'm ignoring the first point for now (must check to see how other browsers handle
* this rule for Set-Cookie headers)
*
* @param string $domain1
* @param string $domain2
* @return bool
* @access private
*/
function _domain_match($domain1, $domain2)
{
$domain1 = strtolower($domain1);
$domain2 = strtolower($domain2);
while (strpos($domain1, '.') !== false) {
if ($domain1 == $domain2) return true;
$domain1 = $this->_reduce_domain($domain1);
continue;
}
return false;
}
}
?>

View File

@ -30,6 +30,7 @@ class HumbleHttpAgent
protected $minimiseMemoryUse = false; //TODO
protected $debug = false;
protected $method;
protected $cookieJar;
public $rewriteHashbangFragment = true; // see http://code.google.com/web/ajaxcrawling/docs/specification.html
public $maxRedirects = 5;
@ -53,6 +54,9 @@ class HumbleHttpAgent
if ($this->method == self::METHOD_CURL_MULTI) {
require_once(dirname(__FILE__).'/RollingCurl.php');
}
// create cookie jar
$this->cookieJar = new CookieJar();
// set request options (redirect must be 0)
$this->requestOptions = array(
'timeout' => 10,
'redirect' => 0 // we handle redirects manually so we can rewrite the new hashbang URLs that are creeping up over the web
@ -61,7 +65,7 @@ class HumbleHttpAgent
if (is_array($requestOptions)) {
$this->requestOptions = array_merge($this->requestOptions, $requestOptions);
}
$this->httpContext = stream_context_create(array(
$this->httpContext = array(
'http' => array(
'ignore_errors' => true,
'timeout' => $this->requestOptions['timeout'],
@ -69,8 +73,7 @@ class HumbleHttpAgent
'header' => "User-Agent: PHP/".phpversion()."\r\n".
"Accept: */*\r\n"
)
)
);
);
}
protected function debug($msg) {
@ -209,6 +212,11 @@ class HumbleHttpAgent
$this->debug("......adding to pool");
$req_url = ($this->rewriteHashbangFragment) ? $this->rewriteHashbangFragment($url) : $url;
$httpRequest = new HttpRequest($req_url, HttpRequest::METH_GET, $this->requestOptions);
// send cookies, if we have any
if ($cookies = $this->cookieJar->getMatchingCookies($req_url)) {
$this->debug("......sending cookies: $cookies");
$httpRequest->addHeaders(array('Cookie' => $cookies));
}
$this->requests[$orig] = array('headers'=>null, 'body'=>null, 'httpRequest'=>$httpRequest);
$this->requests[$orig]['original_url'] = $orig;
$pool->attach($httpRequest);
@ -229,11 +237,16 @@ class HumbleHttpAgent
$this->requests[$orig]['body'] = $request->getResponseBody();
$this->requests[$orig]['effective_url'] = $request->getResponseInfo('effective_url');
$this->requests[$orig]['status_code'] = $status_code = $request->getResponseCode();
// is redirect?
if ((in_array($status_code, array(300, 301, 302, 303, 307)) || $status_code > 307 && $status_code < 400) && $request->getResponseHeader('location')) {
$redirectURL = $request->getResponseHeader('location');
$redirectURL = SimplePie_Misc::absolutize_url($redirectURL, $url);
if ($this->validateURL($redirectURL)) {
$this->debug('Redirect detected. Valid URL: '.$redirectURL);
// store any cookies
$cookies = $request->getResponseHeader('set-cookie');
if ($cookies && !is_array($cookies)) $cookies = array($cookies);
if ($cookies) $this->cookieJar->storeCookies($url, $cookies);
$this->redirectQueue[$orig] = $redirectURL;
} else {
$this->debug('Redirect detected. Invalid URL: '.$redirectURL);
@ -285,8 +298,13 @@ class HumbleHttpAgent
} else {
$this->debug("......adding to pool");
$req_url = ($this->rewriteHashbangFragment) ? $this->rewriteHashbangFragment($url) : $url;
$httpRequest = new RollingCurlRequest($req_url, 'GET', null, null, array(
$headers = array();
// send cookies, if we have any
if ($cookies = $this->cookieJar->getMatchingCookies($req_url)) {
$this->debug("......sending cookies: $cookies");
$headers[] = 'Cookie: '.$cookies;
}
$httpRequest = new RollingCurlRequest($req_url, 'GET', null, $headers, array(
CURLOPT_CONNECTTIMEOUT => $this->requestOptions['timeout'],
CURLOPT_TIMEOUT => $this->requestOptions['timeout']
));
@ -312,6 +330,9 @@ class HumbleHttpAgent
$redirectURL = SimplePie_Misc::absolutize_url($redirectURL, $url);
if ($this->validateURL($redirectURL)) {
$this->debug('Redirect detected. Valid URL: '.$redirectURL);
// store any cookies
$cookies = $this->cookieJar->extractCookies($this->requests[$orig]['headers']);
if (!empty($cookies)) $this->cookieJar->storeCookies($url, $cookies);
$this->redirectQueue[$orig] = $redirectURL;
} else {
$this->debug('Redirect detected. Invalid URL: '.$redirectURL);
@ -346,7 +367,13 @@ class HumbleHttpAgent
$this->debug("Sending request for $url");
$this->requests[$orig]['original_url'] = $orig;
$req_url = ($this->rewriteHashbangFragment) ? $this->rewriteHashbangFragment($url) : $url;
if (false !== ($html = @file_get_contents($req_url, false, $this->httpContext))) {
// send cookies, if we have any
$httpContext = $this->httpContext;
if ($cookies = $this->cookieJar->getMatchingCookies($req_url)) {
$this->debug("......sending cookies: $cookies");
$httpContext['http']['header'] .= 'Cookie: '.$cookies."\r\n";
}
if (false !== ($html = @file_get_contents($req_url, false, stream_context_create($httpContext)))) {
$this->debug('Received response');
// get status code
if (!isset($http_response_header[0]) || !preg_match('!^HTTP/\d+\.\d+\s+(\d+)!', trim($http_response_header[0]), $match)) {
@ -367,6 +394,9 @@ class HumbleHttpAgent
$redirectURL = SimplePie_Misc::absolutize_url($redirectURL, $url);
if ($this->validateURL($redirectURL)) {
$this->debug('Redirect detected. Valid URL: '.$redirectURL);
// store any cookies
$cookies = $this->cookieJar->extractCookies($this->requests[$orig]['headers']);
if (!empty($cookies)) $this->cookieJar->storeCookies($url, $cookies);
$this->redirectQueue[$orig] = $redirectURL;
} else {
$this->debug('Redirect detected. Invalid URL: '.$redirectURL);

View File

@ -337,8 +337,9 @@ class RollingCurl implements Countable {
}
// Block for data in / output; error handling is done by curl_multi_exec
if ($running)
curl_multi_select($master, $this->timeout);
//if ($running) curl_multi_select($master, $this->timeout);
// removing timeout as it causes problems on Windows with PHP 5.3.5 and Curl 7.20.0
if ($running) curl_multi_select($master);
} while ($running);
curl_multi_close($master);
@ -365,6 +366,10 @@ class RollingCurl implements Countable {
// $options[CURLOPT_MAXREDIRS] = 5;
//}
$headers = $this->__get('headers');
// append custom headers for this specific request
if ($request->headers) {
$headers = $headers + $request->headers;
}
// append custom options for this specific request
if ($request->options) {

View File

@ -57,7 +57,10 @@ class SimplePie_HumbleHttpAgent extends SimplePie_File
$this->error = 'failed to fetch URL';
$this->success = false;
} else {
$parser = new SimplePie_HTTP_Parser($response['headers']);
// The extra lines at the end are there to satisfy SimplePie's HTTP parser.
// The class expects a full HTTP message, whereas we're giving it only
// headers - the new lines indicate the start of the body.
$parser = new SimplePie_HTTP_Parser($response['headers']."\r\n\r\n");
if ($parser->parse()) {
$this->headers = $parser->headers;
//$this->body = $parser->body;

View File

@ -109,13 +109,13 @@ class Readability
function __construct($html, $url=null)
{
/* Turn all double br's into p's */
/* Note, this is pretty costly as far as processing goes. Maybe optimize later. */
$html = preg_replace($this->regexps['replaceBrs'], '</p><p>', $html);
$html = preg_replace($this->regexps['replaceFonts'], '<$1span>', $html);
$html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
$this->dom = new DOMDocument();
$this->dom->preserveWhiteSpace = false;
$this->dom->registerNodeClass('DOMElement', 'JSLikeHTMLElement');
if (trim($html) == '') $html = '<html></html>';
@$this->dom->loadHTML($html);
$this->url = $url;
}

View File

@ -9846,7 +9846,7 @@ class SimplePie_Misc
case 'ksc56011987':
case 'ksc56011989':
case 'windows949':
return 'windows-949';
return 'EUC-KR';
case 'windows1250':
return 'windows-1250';

View File

@ -3,8 +3,8 @@
// Author: Keyvan Minoukadeh
// Copyright (c) 2011 Keyvan Minoukadeh
// License: AGPLv3
// Version: 2.6
// Date: 2011-03-02
// Version: 2.7
// Date: 2011-03-21
/*
This program is free software: you can redistribute it and/or modify
@ -49,11 +49,14 @@ function __autoload($class_name) {
// Include FeedCreator for RSS/Atom creation
'FeedWriter' => 'feedwriter/FeedWriter.php',
'FeedItem' => 'feedwriter/FeedItem.php',
// Include Readability for identifying and extracting content from URLs
// Include ContentExtractor and Readability for identifying and extracting content from URLs
'ContentExtractor' => 'content-extractor/ContentExtractor.php',
'SiteConfig' => 'content-extractor/SiteConfig.php',
'Readability' => 'readability/Readability.php',
// Include Humble HTTP Agent to allow parallel requests and response caching
'HumbleHttpAgent' => 'humble-http-agent/HumbleHttpAgent.php',
'SimplePie_HumbleHttpAgent' => 'humble-http-agent/SimplePie_HumbleHttpAgent.php',
'CookieJar' => 'humble-http-agent/CookieJar.php',
// Include IRI class for resolving relative URLs
'IRI' => 'iri/iri.php',
// Include Zend Cache to improve performance (cache results)
@ -413,6 +416,11 @@ if ($valid_key) {
//////////////////////////////////
$http = new HumbleHttpAgent();
//////////////////////////////////
// Set up Content Extractor
//////////////////////////////////
$extractor = new ContentExtractor(dirname(__FILE__).'/site_config/custom', new ContentExtractor(dirname(__FILE__).'/site_config/standard'));
/*
if ($options->caching) {
$frontendOptions = array(
@ -437,27 +445,6 @@ if ($options->caching) {
}
*/
////////////////////////////////
// Tidy config
////////////////////////////////
if (function_exists('tidy_parse_string')) {
$tidy_config = array(
'clean' => true,
'output-xhtml' => true,
'logical-emphasis' => true,
'show-body-only' => false,
'wrap' => 0,
'drop-empty-paras' => true,
'drop-proprietary-attributes' => false,
'enclose-text' => true,
'enclose-block-text' => true,
'merge-divs' => true,
'merge-spans' => true,
'char-encoding' => 'utf8',
'hide-comments' => true
);
}
////////////////////////////////
// Get RSS/Atom feed
////////////////////////////////
@ -495,28 +482,22 @@ if ($html_only || !$result) {
// remove strange things here
$html = str_replace('</[>', '', $html);
$html = convert_to_utf8($html, $response['headers']);
} else {
}
if (!$response || $response['status_code'] >= 300) {
die('Error retrieving '.$url);
}
if ($auto_extract) {
// Run through Tidy (if it exists).
// This fixes problems with some sites which would otherwise
// trouble DOMDocument's HTML parsing.
if (function_exists('tidy_parse_string')) {
$tidy = tidy_parse_string($html, $tidy_config, 'UTF8');
if (tidy_clean_repair($tidy)) {
$html = $tidy->value;
}
}
$readability = new Readability($html, $effective_url);
if ($links == 'footnotes') $readability->convertLinksToFootnotes = true;
if (!$readability->init() && $exclude_on_fail) die('Sorry, could not extract content');
// content block is detected element
$content_block = $readability->getContent();
$extract_result = $extractor->process($html, $effective_url);
if (!$extract_result) die($options->error_message);
$readability = $extractor->readability;
$content_block = $extractor->getContent();
$title = $extractor->getTitle();
} else {
$readability = new Readability($html, $effective_url);
// content block is entire document
$content_block = $readability->dom;
//TODO: get title
$title = '';
}
if ($extract_pattern) {
$xpath = new DOMXPath($readability->dom);
@ -529,13 +510,16 @@ if ($html_only || !$result) {
$readability->removeScripts($content_block);
$readability->prepArticle($content_block);
} else {
if ($exclude_on_fail) die('Sorry, could not extract content');
$content_block = $readability->dom->createElement('p', 'Sorry, could not extract content');
die($options->error_message);
//$content_block = $readability->dom->createElement('p', 'Sorry, could not extract content');
}
}
$readability->clean($content_block, 'select');
if ($options->rewrite_relative_urls) makeAbsolute($effective_url, $content_block);
$title = $readability->getTitle()->textContent;
// footnotes
if (($links == 'footnotes') && (strpos($effective_url, 'wikipedia.org') === false)) {
$readability->addFootnotes($content_block);
}
if ($extract_pattern) {
// get outerHTML
$content = $content_block->ownerDocument->saveXML($content_block);
@ -637,7 +621,7 @@ foreach ($items as $key => $item) {
$newitem->setLink($item->get_permalink());
}
}
if ($permalink && $response = $http->get($permalink, true)) {
if ($permalink && ($response = $http->get($permalink, true)) && $response['status_code'] < 300) {
$effective_url = $response['effective_url'];
if (!url_allowed($effective_url)) continue;
$html = $response['body'];
@ -645,26 +629,15 @@ foreach ($items as $key => $item) {
$html = str_replace('</[>', '', $html);
$html = convert_to_utf8($html, $response['headers']);
if ($auto_extract) {
// Run through Tidy (if it exists).
// This fixes problems with some sites which would otherwise
// trouble DOMDocument's HTML parsing. (Although sometimes it fails
// to return anything, so it's a bit of tradeoff.)
if (function_exists('tidy_parse_string')) {
$tidy = tidy_parse_string($html, $tidy_config, 'UTF8');
$tidy->cleanRepair();
$html = $tidy->value;
}
$readability = new Readability($html, $effective_url);
if ($links == 'footnotes') $readability->convertLinksToFootnotes = true;
$extract_result = $readability->init();
// content block is detected element
$content_block = $readability->getContent();
$extract_result = $extractor->process($html, $effective_url);
$readability = $extractor->readability;
$content_block = ($extract_result) ? $extractor->getContent() : null;
} else {
$readability = new Readability($html, $effective_url);
// content block is entire document (for now...)
$content_block = $readability->dom;
}
if ($extract_pattern) {
if ($extract_pattern && isset($content_block)) {
$xpath = new DOMXPath($readability->dom);
$elems = @$xpath->query($extract_pattern, $content_block);
// check if our custom extraction pattern matched
@ -691,6 +664,10 @@ foreach ($items as $key => $item) {
} else {
$readability->clean($content_block, 'select');
if ($options->rewrite_relative_urls) makeAbsolute($effective_url, $content_block);
// footnotes
if (($links == 'footnotes') && (strpos($effective_url, 'wikipedia.org') === false)) {
$readability->addFootnotes($content_block);
}
if ($extract_pattern) {
// get outerHTML
$html = $content_block->ownerDocument->saveXML($content_block);

59
site_config/README.txt Normal file
View File

@ -0,0 +1,59 @@
Full-Text RSS Site Patterns
---------------------------
NOTE: The information here is not up to date, but probably covers what you need for this version. For the most up to date information on Full-Text RSS site patterns, please see http://help.fivefilters.org/customer/portal/articles/223153-site-patterns
We recommend using the latest release of Full-Text RSS - available for purchase at http://fivefilters.org/content-only/#download - which also comes bundled with hundreds of site patterns to improve extraction.
Site patterns allow you to specify what should be extracted from specific sites.
How it works
------------
After we fetch the contents of a URL, we use the hostname (e.g. example.org) and check to see if a config file exists for that hostname. If there is a matching file (e.g. example.org.txt) in one of the config folders, we will fetch the rules it contains. If no such file is found, we attempt to detect the appropriate content block and title automatically. When there is a matching site config, we first try to use the patterns within to extract the content. If these patterns fail to match, we will, by default, revert to auto-detection - this gives us another chance to get at the content (useful when site redesigns invalidate our stored patterns).
The 'standard' folder contains site config files bundled with Full-Text RSS. Users may contribute their own patterns and we will try to update with each release.
The 'custom' folder can be used for sites not listed in standard, or to override sites in standard. If a site has an entry in both folders, only the one in 'custom' will be used. The custom folder allows you to separate your entries from the bundled ones, which also makes the task of upgrading to a new release easier (you benefit from the updated patterns in standard and copy over your existing patterns to custom).
The pattern format has been borrowed from Instapaper. Please see http://blog.instapaper.com/post/730281947 and http://www.instapaper.com/bodytext (requires login). We make use of the patterns provided by Instapaper and, in the same spirit, will soon make available our own additions.
Command reference (based on Instapaper)
---------------------------------------
title: [XPath]
The page title.
XPaths evaluating to strings are also accepted.
Multiple statements accepted.
Will evaluate in order until a result is found.
body: [XPath]
The body-text container. Auto-detected by default.
Multiple statements accepted.
Will evaluate in order until a result is found.
strip: [XPath]
Strip any matching element and its children.
Multiple statements accepted.
strip_id_or_class: [string]
Strip any element whose @id or @class contains this substring.
Multiple statements accepted.
strip_image_src: [string]
Strip any <IMG> whose @src contains this substring.
Multiple statements accepted.
tidy: [yes|no] (default: yes)
Preprocess with Tidy. May cause "no text" errors.
prune: [yes|no] (default: yes)
Strip elements within body that do not resemble content elements.
autodetect_on_failure: [yes|no] (default: yes)
If set to no, we will not attempt to auto-detect the title or content block.
test_url: [string]
Must be URL of an article from this site, not the site's front page.
# comments
Lines beginning with # are ignored.

View File

@ -0,0 +1,3 @@
<?php
// this is here to prevent directory listing over the web
?>

3
site_config/index.php Normal file
View File

@ -0,0 +1,3 @@
<?php
// this is here to prevent directory listing over the web
?>

View File

@ -0,0 +1,5 @@
body: //div[@id = 'content']
strip_id_or_class: editsection
strip_id_or_class: toc
prune: no
test_url: http://en.wikipedia.org/wiki/Christopher_Lloyd

View File

@ -0,0 +1,3 @@
<?php
// this is here to prevent directory listing over the web
?>