Full-Text RSS

Tom Lee's picture

This service is no longer in operation.

Partial-text RSS feeds are a pet peeve of mine. I'm not alone: I've read about Dave Winer and Steve Rubel's dislike of the practice. I'm sure there are a lot of other RSS users who are similarly irked by it.

So, after having a post-workout algorithmic epiphany (it's the best time for them), I started work on a little project to fix this annoyance — and ended up quite pleased with the result. You might find it useful, too: it's a little script that creates full-text RSS feeds from partial feeds. Just enter the URL of a partial feed in the box below and hit submit. You'll be directed to a URL that will (hopefully) provide a full-text version of the feed you specified.

I've been through a few different versions of the algorithm, but this one seems to be fairly universal and stable. It won't work for every partial-text feed, but it seems to work for a lot of them. I'm sure it could be better, which tempts me to open source the algorithm and invite people to improve upon it. But I won't — not yet, anyway.

I'm sensitive to the pressures that make bloggers use partial text feeds — some of my friends depend on selling advertising to support their sites. Unfortunately, RSS simply isn't respected by marketers and their clients. Offering a full text feed means fewer page views, which means less revenue — I've been told this bluntly by a friend who wanted to offer full text, did so, then noticed his revenues were shrinking. It's hard to fault him for returning to partial-text feeds.

But this situation isn't a problem with RSS; it's a problem with the ad industry. It's long past time for people to realize that if they give content away on the web they'll be unable to control how others choose to consume it. Inconveniencing users is not an acceptable solution to advertisers' inability to adopt new metrics.

Still, I wouldn't want to offer a feature that middlemen can resell at the expense of bloggers. So while I do want to open this up, I don't want to make things easy for the unscrupulous. This feature does need to pass out of my hands — its proper place is in the RSS reader, both for performance reasons and in order to eliminate one class of countermeasures that bloggers could take. Maybe I'll try my hand at adapting the code for Vienna.

A few technical notes: depending on the site, some entries may come back with comments or other cruft attached. Fellow geeks can trim those off by specifying URL-encoded regexes, passed in the querystring as parameters regex0 through regex9 (note that an outstanding issue with PHP magic quotes means that the + character doesn't work; use {1,} instead). I'd encourage users who create regexes for feeds to share them by tagging the URL with "fulltextrss" on del.icio.us. There are already a few examples available here.
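For anyone assembling these URLs by hand, here is a rough Python sketch of the encoding step. The endpoint and the url parameter are taken from an example URL in the comments below; the function name build_url and the example.com feed address are made up for illustration, and percent-encoding the feed URL itself is my assumption rather than something the service is known to require.

```python
from urllib.parse import quote


def build_url(service, feed_url, regexes):
    """Assemble a query string with the feed URL and up to ten cleanup regexes."""
    if len(regexes) > 10:
        raise ValueError("only regex0 through regex9 are available")
    # Assumption: the feed URL is percent-encoded like any other parameter value.
    parts = ["url=" + quote(feed_url, safe="")]
    for index, pattern in enumerate(regexes):
        if "+" in pattern:
            # The post says '+' does not survive; {1,} is the suggested stand-in.
            raise ValueError("'+' is not usable here; write {1,} instead")
        parts.append("regex%d=%s" % (index, quote(pattern, safe="")))
    return service + "?" + "&".join(parts)
```

Called as build_url("http://labs.echoditto.com/projects/fulltextrss/", "http://example.com/feed.xml", ["/<img.*?>/i"]), this yields a regex0 value of %2F%3Cimg.%2A%3F%3E%2Fi.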

Finally, please note that the service employs PEAR's function caching on a 15 minute timeout. If the results you're getting aren't up-to-date, just be patient (or alter one of the regex parameters).
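To see why changing a regex parameter gets you fresh results, here is a small Python stand-in for argument-keyed caching with a 15 minute timeout. It is not PEAR's API and not the service's code; the class name TimedCache and its call method are hypothetical, and the only point is that the cache key includes every argument.

```python
import time


class TimedCache:
    """Cache function results per argument tuple, expiring after ttl_seconds."""

    def __init__(self, ttl_seconds=15 * 60, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the timeout can be tested
        self._store = {}

    def call(self, func, *args):
        # Arguments must be hashable; they form part of the cache key.
        key = (func.__name__, args)
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: reuse the stored result
        value = func(*args)
        self._store[key] = (now, value)
        return value
```

Under this scheme the same feed with a different regex value is a different key, so it misses the cache and is fetched again.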

Comments

Anonymous's picture

Hi
I love this tool.
What regex expression would I use to remove all images?
Thanks

admin's picture
Member since:
7 February 2007
Last activity:
32 weeks 1 day

You should be able to use one like the following:

regex0=%2F%3Cimg.%2A%3F%3E%2Fi

for example, here's this blog's feed without images (not that there are many):

http://labs.echoditto.com/projects/fulltextrss/?url=http://labs.echoditt...
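To check what a parameter like that does before submitting it, the value can be decoded and tried locally. The Python sketch below assumes the service treats the value as a PHP-style delimited pattern and deletes whatever matches; the function name strip_matches is hypothetical, and Python's re module is only an approximation of PHP's regex engine.

```python
import re
from urllib.parse import unquote


def strip_matches(literal, html):
    """Apply a PHP-style /pattern/modifiers literal as a deletion, approximately."""
    delimiter = literal[0]
    # Split "/<img.*?>/i" into the pattern body and its trailing modifiers.
    body, _, modifiers = literal[1:].rpartition(delimiter)
    flags = re.IGNORECASE if "i" in modifiers else 0
    return re.sub(body, "", html, flags=flags)


decoded = unquote("%2F%3Cimg.%2A%3F%3E%2Fi")  # gives "/<img.*?>/i"
cleaned = strip_matches(decoded, '<p>Hi <IMG src="a.png"> there</p>')
```

The non-greedy .*? stops at the first closing bracket, so only the image tag itself is removed and the surrounding text is left alone.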

Adam Lipkin's picture

Thanks, Tom! This worked like a charm on the feed I was subscribing to.

Nitin's picture

Awesome!

Anonymous's picture

freakonomics http://freakonomics.blogs.nytimes.com/ has managed to beat your tool. Is there any fix?

admin's picture

Well, no, there's no fix to the issue -- they've stopped putting excerpts in the description field, which prevents the general-purpose tool from being used.

But I think someone's taken my advice and produced a dedicated full-text feed:

http://feeds.feedburner.com/freakonomics-full

Todd Sawicki's picture

Tom -
Blogs that use the [read more...] links in their feed seem to defeat your web service. Ars Technica's is a good example - their feedlink here: http://feeds.arstechnica.com/arstechnica/BAaf

Great work.
- todd

zek's picture

after facing some difficulties, finally it works for my 'test blog'.
However it is disturbing my adsense block i.e no ads shown at provided place
pls check my blog and provide some feedback

admin's picture

Hmm. You might have to provide more detail -- it's plausible that it'd strip out adsense, but I'm not certain enough about what you're referring to to comment intelligently about it.

Prakash's picture

Hi,

I need same system which will produce Clean Full Text RSS. I need an ability to produce text only or skipping something like Image etc on the system. Can anyone code for me. I am ready to pay upto 30$

Please contact me at info@rapidshareonline.com

Anonymous's picture

how can I have plain text full rss, means no html tags
i.e

Anonymous's picture

great tool!!! This will be useful for my rome accommodations site. Thanks a lot for this!

Ozh's picture

Very interesting. Any chance you release the code for this ?

admin's picture

I'm happy to share my code on a case-by-case basis, but I'm wary of releasing it completely into the wild for the reasons mentioned in the post -- it could be used to divert revenue from content authors to rent-seeking third parties.

Shoot me an email (tom (at) echoditto (dot) com) and I'll be happy to talk to you about how I got this thing working.

admin's picture

I'm sorry Ramesh, I'm afraid I don't really understand what you're asking. Is the issue the high-ascii characters? Those are admittedly a consistent problem with PHP -- which this is. Maybe you could try passing the feed through Yahoo Pipes? I'm afraid I'm not prepared to tackle unicode support.

Anonymous's picture

Cool tool! I'm using it with www.Feedity.com for custom RSS web feeds.... awesome combo :)

Anonymous's picture

Thanks a bunch for this. Very useful tool

ysamjo's picture

hi Tom,
this is an awesome tool!

It searches for an update every 15 minutes? Did I get this right?

The server is very slow at the moment - you have really to give this out of hands.

I don't need to know the algorithm for stripping all unwanted tags, but maybe you can explain to us how to set up such a service. I've been looking for a "homemade Yahoo Pipes" for a long time. Did you use SimplePie?

tserj's picture

Could you send the source code via e-mail? I'll look at it and make some upgrades, then send the results to you.

Srdj's picture

hi, could i also get the source code via e-mail? i really want to update some features etc... thanks! And thanks alot for all your work!

Stefan's picture

Interesting thing. Yet, it doesn't seem to work with Yahoo Groups. At least not with mine. Try: http://rss.groups.yahoo.com/group/thing-frankfurt/rss

Stefan

Annie's picture

I know this is an old post but I love you. That is all.

Hersey's picture

Vooovv Super ! It's a workink thanks my friend very good.

Anonymous's picture

Hehe, nice, although partial rsses are useful for ppl with limited traffic

Hersey var's picture

Vayyy thank you very much. Good job...

Soul.Trader's picture

Wow, this tool is perfect. Exactly what I was looking for. A++ from a FeedJournal user.

Ivan's picture

Hi Tom!
My name is Ivan. I'm from Russia.
May I buy this script? How much is it?
I shall use this script only for my own purposes.
Please contact me at vanno@list.ru
Thanks!

Geert's picture

Hi Tom!

Could you send me this script please?
I shall use this script only for my own purposes.
Please contact me at gjerutten (dot) hotmail (dot) com

Thanks

admin's picture

Hi folks. A few things:

- the script is not for sale or able to be used toward for-profit ends, regardless of whether I have distributed it to you or not

- if you'd like a copy of the script I need you to email me: tom (at) echoditto (dot) com. I can't keep track of the requests via comments -- please email.

@nks's picture

superb .... just thing i was looking for ... since long ...

Oleh's picture

Thanks for this great service. I only have a problem with French - it is not displayed correctly - is it possible to fix?

Also Cyrillic is problematic - no visibility at all.

admin's picture

I apologize for the limitation, Oleh. Unfortunately PHP (and in particular PHP4) is quite bad at handling extended character sets, and I have no plans to resolve the situation.

If anyone would like to volunteer to work on improving unicode support (or porting the algorithm to a more unicode-friendly language), I'd of course be happy to share the source.

nguoiquangngai's picture

Hello Tom,

This is great script that I can see. I love your codes. I would appreciated if could get a copy of this via my email.

Thank you very much for great share

Cyndy Aleo-Carreira's picture

Found a weird bug. Please email me and I can show you the issue. Don't want to post it out in the open due to the blog I wanted to use it on. ;)

admin's picture

Cyndy: you're welcome to email me about the problem at the address mentioned above. But you should note that this service is known to not work with every blog or character set, and isn't supported in any official way. So the odds that your difficulty is going to be resolved are fairly low, I'm afraid.

Anonymous's picture

Tom -

I'm working on an application that allows me to compile large amounts of text from RSS feeds and save it all to a .txt file. I'm writing applescript to help me do this. Could I look at your source code to see if I could implement something similar?

Thanks,

Andrew

flack dot andrew at gmail dot com

admin's picture

Hey NX, sorry about that -- we moved some things around today and it affected the tool. Everything should be fixed now, though -- please let me know if you continue to have problems with it.

admin's picture

BoD: As you might imagine, the algorithm relies on clues within the feed to extract the full text of the entry from the actual page. One of the most important of those clues is the RSS entry text, which is assumed to be present on the page. In cases like this IBM feed, where the feed text is simply an RSS-only summary of the content, the algorithm will fail.
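A quick way to tell whether a feed is likely to fail for this reason is to test whether the start of the feed's excerpt actually appears in the linked page. The Python sketch below does only that check; it is not the extraction algorithm, and the names page_text and excerpt_on_page as well as the 40-character prefix are illustrative choices of mine.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def page_text(html):
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    # Drop all whitespace and case so markup differences matter less.
    return re.sub(r"\s+", "", "".join(collector.chunks)).lower()


def excerpt_on_page(excerpt_html, page_html, prefix_chars=40):
    # Compare only a prefix, since excerpts are often cut off mid-sentence.
    needle = page_text(excerpt_html)[:prefix_chars]
    return bool(needle) and needle in page_text(page_html)
```

A feed whose description is a hand-written summary returns False here, which is the situation described above. Note that this simple collector also picks up script and style contents, which a more careful version would skip.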

Oleh's picture

Dear Tom!

Thanks for Cyrillic! It works now!

Reader's picture

Hi there, I've been using this site and have found it extremely useful.

However, it doesn't seem to work when the original article is presented on multiple pages. For example, I tried the New Yorker rss and there wasn't any problem with shorter articles, but the longer ones were only retrieved for the first page.

Is there a way to automatically get the rest of a multi-page article?

Jon's picture

Run my yahoo pipe for the new yorker through the full-text rss algorithm and it should work: http://pipes.yahoo.com/pipes/pipe.info?_id=15a49d24d2cd11f45225dca98aa34560

If you use the regex feature in yahoo pipes you can change the link to the printable version of the article, which doesn't have the page problem.

admin's picture

I'm glad to hear you've found the site useful, but no, I'm afraid there's no way to automatically remove pagination from sites. It would be possible to write scripts to retrieve that content, but it would have to be on a site-by-site basis. That's not a direction in which I want to take the tool, I'm afraid.

Austin's picture

Amazingly useful tool. I'm looking to incorporate this on my financial website, with all sources cited, to present multiple sources in an organized manner. Would I be able to use your code?

admin's picture

I'm afraid I don't know what a "java converted url feed reader" is. I can say with confidence that the tool works fine with a variety of other newsreaders, though.

Sumedh's picture

Hi...

I tried a URL of Google news RSS feed...

But it spit kind of an empty RSS? :)

admin's picture

Sumedh: the script works by examining the HTML structure of the pages linked in the feed and looking for similarities between them. A source like Google News, which points at entirely different sites, is never going to work.
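A rough pre-check for this case is to see how many of a feed's item links share one hostname. The Python sketch below is only that heuristic, not part of the script; the function name dominant_host_share is hypothetical, and what share counts as "one site" is left to the caller.

```python
from collections import Counter
from urllib.parse import urlsplit


def dominant_host_share(links):
    """Return the fraction of links that point at the most common hostname."""
    if not links:
        return 0.0
    hosts = Counter(urlsplit(link).hostname for link in links)
    # most_common(1) gives [(hostname, count)] for the top host.
    return hosts.most_common(1)[0][1] / len(links)
```

A single-site blog feed scores 1.0, while an aggregator whose items each point somewhere different scores close to zero, leaving no shared page structure to compare.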

admin's picture

Sorry flatluigi, but we can't support individual feeds. If you want to solve the issue yourself, you should read up on "regular expressions" and examine the URLs of some of the sample feeds provided in the original post.

Drew Loika's picture

Wow, AWESOME script! I can't tell you how much it means to me (though you probably already know, that's why you wrote it!) to finally have full feeds for those few frustrating partials in my reader. THANK YOU for your hard work writing this, and THANK YOU again for hosting it.

Abel's picture

Tom,
This script is Great. Thanks a million!

I have a question, on one of my sites, they're rss feeds have the following filename/variables "rss.php?cat=CATEGORY&subcat=SUB+CATEGORY". Notice the '+' between sub and category? Well, that breaks the script. If the subcat variable is a single word, it works flawlessly. The problem just exists when the subcat is multiple words connected by '+'. Any ideas?
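One possible explanation, offered as a guess since the script's internals aren't public: in a query string a bare '+' is conventionally decoded as a space, so the feed address may be arriving with "SUB CATEGORY" in place of "SUB+CATEGORY". The Python sketch below shows that decoding and how percent-encoding the feed URL first would protect the '+'; the example.com address is a placeholder.

```python
from urllib.parse import quote, unquote_plus

feed_url = "http://example.com/rss.php?cat=CATEGORY&subcat=SUB+CATEGORY"

# Decoding the raw value the way form data is decoded turns '+' into a space.
decoded_raw = unquote_plus(feed_url)

# Percent-encoding first turns '+' into %2B, which decodes back to '+'.
encoded = quote(feed_url, safe="")
round_tripped = unquote_plus(encoded)
```

If that is the cause, passing the encoded form as the url parameter might be worth trying, though whether the service decodes it this way is untested.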

Abel's picture

*their*

;)