Merging RSS and Atom feeds from various sources

I have a lot of Python RSS/Atom feeds in my aggregator, and entries are duplicated all over the place.

I couldn't find any tool out there that would merge entries from several sources in a smart way, by trying to detect duplicates.

I wrote a little script, extending Mark Pilgrim's feedparser we use in CPSRSS, to merge several sources, using the difflib module and the rss rendering we have in CPSBlog.

It calculates the diff ratio on the title and content of each entry to decide whether
it's the same entry. When the ratio is <= 0.2, it's the same entry (hopefully :) )
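
To make the idea concrete, here is a minimal sketch of that kind of duplicate check, not the actual script. It makes a few assumptions: entries are plain dicts with hypothetical "title" and "content" keys (the real script works on feedparser entries), title and content are compared as one string, and the "diff ratio" is taken as 1 minus difflib's similarity ratio, since SequenceMatcher.ratio() returns 1.0 for identical strings.

```python
from difflib import SequenceMatcher

# Threshold from the post: a difference of 0.2 or less means "same entry".
THRESHOLD = 0.2


def difference(a, b):
    """Return 0.0 for identical strings, approaching 1.0 for unrelated ones.

    Assumption: the "diff ratio" is 1 - SequenceMatcher.ratio().
    """
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def same_entry(e1, e2, threshold=THRESHOLD):
    """Compare title and content together (an assumption of this sketch)."""
    text1 = e1["title"] + "\n" + e1["content"]
    text2 = e2["title"] + "\n" + e2["content"]
    return difference(text1, text2) <= threshold


def merge(*feeds):
    """Merge lists of entries, keeping the first copy of each duplicate."""
    merged = []
    for feed in feeds:
        for entry in feed:
            if not any(same_entry(entry, kept) for kept in merged):
                merged.append(entry)
    return merged


# Hypothetical sample data: the first entry of feed_b is a near-duplicate.
feed_a = [
    {"title": "Python 2.4.2 released",
     "content": "The Python 2.4.2 bugfix release is out."},
]
feed_b = [
    {"title": "Python 2.4.2 released!",
     "content": "The Python 2.4.2 bugfix release is out."},
    {"title": "Zope sprint in Paris",
     "content": "Join us for a week of hacking on CPS."},
]

for entry in merge(feed_a, feed_b):
    print(entry["title"])
```

Every new entry is compared against every entry already kept, so the cost grows quadratically with the number of entries. That is probably fine for a handful of feeds, but would need something smarter for tons of them.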

Here's an example run on these:

The result is here
(It's a one-shot XML file, made today, so it's not a real feed;
 it is still readable by any client, though.)

Now, I've been told that this is pretty useless, and that I'd be better off cleaning up my feeds and doing more interesting stuff in my spare time.

But I can't help it: every time I see a feed related to Python, I just add the stuff
 to my client :'). So for an unorganized person like me, a personal CPSRSS website with this merging capability, where I can drop tons of feeds, would be perfect.

Posted by Tarek Ziadé @ 10/16/2005 10:14 AM. - Categories: python, semantic_web, web
