Keith Devens .com

Friday, January 24, 2003

RSS aggregators

My current RSS aggregator has outgrown itself. I have something like 100 feeds... ok, just checked: 114 active feeds (with 14 in there that I don't read anymore).

My aggregator is web-based. The way it currently works is that I click a "Check all feeds" button on my web site. I can check individual feeds too, but I hardly ever use that. So, I click the "Check all feeds" button, and it churns away downloading everyone's feeds for a minute or two. Unfortunately, PHP doesn't have threads, so I have to download the feeds serially instead of multiple at a time.
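
As a rough illustration of what downloading "multiple at a time" could look like, here is a minimal Python sketch using a thread pool. The names `fetch_feed` and `check_all_feeds`, the timeout, and the worker count are assumptions made for the example; none of this is the aggregator's actual code.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_feed(url, timeout=30):
    """Download one feed and return its raw bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def check_all_feeds(urls, fetch=fetch_feed, max_workers=8):
    """Fetch every URL concurrently; return {url: bytes or Exception}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all downloads first so they overlap instead of running serially.
        futures = {url: pool.submit(fetch, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as error:  # one broken feed should not stop the rest
                results[url] = error
    return results
```

Passing the fetch function in as a parameter keeps the sketch easy to try without a network connection, and storing the exception per URL means a single dead feed only affects its own entry.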

So after it returns, I click "Display all feeds". It takes a few seconds to return a page because there's usually so much content there and the code I use to display the feed is really slow. It only shows new items since the last time you got the feed. I check about once a day (sometimes I miss a day), and it takes me probably over an hour to go through everything (keep in mind, not just reading what's in the aggregator, but following links too). If I blog things I see it takes longer.
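
The post doesn't say how the aggregator decides which items are new. One common approach, sketched below in Python, is to remember an identifier for every item already shown and filter against that set. The dictionary keys (`guid`, `link`, `title`) and the function names are assumptions for illustration only.

```python
def item_key(item):
    """Identify an item by its guid, falling back to its link, then its title."""
    return item.get("guid") or item.get("link") or item.get("title", "")


def new_items(items, seen):
    """Return the items whose key is not yet in `seen`, and mark them as seen.

    `seen` is a set that the caller would keep (per feed) between checks.
    """
    fresh = []
    for item in items:
        key = item_key(item)
        if key not in seen:
            seen.add(key)
            fresh.append(item)
    return fresh
```

Because the check is by identifier rather than by position, an item that merely moves within the feed is not reported again.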

I've been wanting to improve my aggregator, but that project has always been lowest priority because it does work, even though it's messy. However, after reading Mark Pilgrim's Parse at all costs article on XML.com, I decided I'd like to use his ultra-liberal RSS parser because he's right, there are many broken feeds, and I'd also not have to do the work of improving my RSS-version-neutral RSS parser if Mark's already done a good job at it.

So I decided I'd write a new version in Python. Then I just read J.D. Lasica's article on News That Comes to You (via Scripting News, and wow, full transcripts), and thought that it'd be nice if I didn't have to write anything. :) In the past I thought I really wanted my aggregator to be web-based so that I could read it "on the road" if I wanted to. But since I have so many feeds and it takes me forever to read them, it's not something I actually do when I'm not home. If I really want to check a particular web site, I just go to that web site rather than checking in my aggregator.

My one major concern is that a "regular" aggregator won't do a good job of only showing the posts that are new. Anyway, I'm about to start trying some Windows aggregators out, so if I get anywhere I'll let you know my results.

Oh yeah, here's a list of my must-have features:

  • Reads RSS feeds :)
  • Does a good job not making me go through what I've already read
  • Lets me categorize feeds in multiple categories, such as:
    • Politics: RWN, InstaPundit, LGF, ScrappleFace
    • High volume, where if I don't read the feeds more frequently I may miss things (Slashdot and lots of other news sites)
    • Programming (I find it's much easier for my brain to read a common subject at a time - if I go through all the programming-related weblogs at once I can probably move faster than if my brain has to do a context switch every time I hit a new blog, which is what I do now with my aggregator)

As you can see, while some of the categories are topical and supposedly could be exclusive (I mean that a particular news source would only be in one category), some of the categories are clearly not topical, like "High volume". So I'd need an aggregator that can put a feed in multiple categories.
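
A feed belonging to several flat categories is a many-to-many relationship. The Python sketch below shows one way to model it; the class name `FeedCategories` and its methods are made up for this example, and putting InstaPundit under "High volume" is a hypothetical assignment used only to show a feed in two categories.

```python
from collections import defaultdict


class FeedCategories:
    """Many-to-many mapping between feeds and flat (non-nested) categories."""

    def __init__(self):
        self._by_category = defaultdict(set)
        self._by_feed = defaultdict(set)

    def add(self, feed, category):
        """Put `feed` in `category`; a feed may be added to any number of them."""
        self._by_category[category].add(feed)
        self._by_feed[feed].add(category)

    def feeds_in(self, category):
        """Return the feeds in a category, sorted by name."""
        return sorted(self._by_category.get(category, ()))

    def categories_of(self, feed):
        """Return every category a feed belongs to, sorted by name."""
        return sorted(self._by_feed.get(feed, ()))


cats = FeedCategories()
cats.add("InstaPundit", "Politics")
cats.add("Slashdot", "High volume")
cats.add("InstaPundit", "High volume")  # hypothetical: same feed, second category
print(cats.feeds_in("High volume"))
print(cats.categories_of("InstaPundit"))
```

Keeping both directions of the mapping makes "show me this category" and "where does this feed live" equally cheap to answer.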

Those are the main things I guess. I'd like to be able to get a list of my feeds in some machine readable format out of the program so that I can use it for a blogroll... EffNews also has a cool feature:

To add a channel, use your favourite web browser to look for nice sites that support RSS. Look for links to the RSS feed... or a link that mentions RSS or Syndication or something similar.

When you've found something that might be an RSS link, use the mouse to drag it from the browser and drop it anywhere in the effnews application window. If successful, the site should appear in the effnews window after a few seconds.

If you like what you see, click on the Add button. This adds the site to the channel list.

Anyway, time to start the search. List of potential readers:

I'm going to try Syndirella first. What's funny is that the first thing I noticed when looking at the screenshots was that there was no count of new items. Lo and behold, I go to his weblog and he just implemented it. Oh, and Syndirella exports the feed list to OPML, which I can do whatever I want with. Shit, it doesn't support categories though. However, given how eager the author seems to support new features and how hard he tries ("It now passes all the tests from the test suite of Mark Pilgrim's ultra-liberal RSS parser") to do the right things ("It tries harder"), I suspect he'll get that in there eventually, so I'd be willing to do without it for now. Sigh, now I have to download the .NET framework...

Holy crap dude! I just rebooted after installing the .NET framework and ran Syndirella for the first time, and it used up ALL my system resources. And I mean all of them. And that was after taking like a minute to start with my disk thrashing and everything. Not sure whether this is .NET's fault, the program's fault, or just an error in this particular build.

Ok, I just ran it one more time, and, while it still started up a little sluggishly (which is probably .NET's fault), it's chugging along now using not-too-many resources. That was really strange. Maybe .NET, or Syndirella, was doing something strange for initialization the first time it ran.

Hey, there we go with the categories:

After that, I'm going to finally implement a decent way of sorting and organizing feeds. This is probably the most important shortcoming of the current Syndirella UI. It would be easy to replace the listview with a treeview - but I don't really like the idea of using the treeview here. I think there is really no need to have multiple levels in the hierarchy - one level of grouping should be enough. Thus, I'm thinking of implementing collapsible groups looking somewhat like the Opera 7 mail client.

Also, I'll add some way to sort and/or reorder feeds within a group. Nothing fancy is needed here - just a popup menu option to sort the feeds in a group alphabetically, and drag & drop for manual reordering.

I agree, nested categories aren't needed, which I think is a great observation. I'm really looking forward to seeing what he produces. Ok, now to export my current feed list... let's see what formats this thing can import... oops, one bug - "back" doesn't work correctly. Ok, it can import OPML, so now to export my feed list in OPML... now where the heck is the documentation for this particular OPML format? There we go.

Grr, that page wasn't quite what I needed, so I just mimicked the format that Syndirella itself output. Here's a little script I used to produce the OPML file I needed:

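<?php
// Export the active feeds from the aggregator database as an OPML file.
// Site-local includes: XML_serialize() and the DB helpers are defined there.
require_once('../include_xmlrpc.php');
require_once('../include_db.php');

$conn = init_db_connection($conn);
$sql = "SELECT * FROM Aggregator WHERE Feed_Active = 1 ORDER BY Feed_Id";
$rs_id = mysql_query($sql, $conn) or die("Couldn't get your list of feeds");

// $body is bound by reference so outlines added below show up inside $opml.
$body = array();
$opml = array(
    'opml' => array(
        'body' => array(
            'outline' => &$body
        )
    )
);

// One empty <outline> element per feed; its attributes go under the
// "N attr" key, which is the layout XML_serialize() appears to expect.
$count = 0;
while ($row = mysql_fetch_assoc($rs_id)) {
    $body[$count] = NULL;
    $body["$count attr"] = array(
        'title'   => $row['Feed_Name'],
        'xmlUrl'  => $row['Feed_URL'],
        'htmlUrl' => $row['Feed_Homepage']
    );
    $count++;
}

header("Content-type: text/xml");
echo XML_serialize($opml);

close_db_connection($conn);
?>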

Now let's see if Syndirella imports it! Cool, worked, except I got the XML format wrong at first and had to spend a few minutes looking at it to realize I had "xmlURL" and "htmlURL" instead of "xmlUrl" and "htmlUrl" (I updated the script above to match).
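
Going the other direction, an exported OPML file can be turned into a blogroll. The Python sketch below reads the same three attributes the script writes (`title`, `xmlUrl`, `htmlUrl`) and renders a plain HTML list. The function names and the HTML layout are assumptions for illustration, not anything Syndirella or this site provides.

```python
import xml.etree.ElementTree as ET
from html import escape


def read_opml(opml_text):
    """Return a list of {title, xmlUrl, htmlUrl} dicts from an OPML string."""
    root = ET.fromstring(opml_text)
    feeds = []
    for outline in root.iter("outline"):
        # Attribute names are case-sensitive: "xmlURL" would not match here.
        if outline.get("xmlUrl"):
            feeds.append({
                "title": outline.get("title", ""),
                "xmlUrl": outline.get("xmlUrl"),
                "htmlUrl": outline.get("htmlUrl", ""),
            })
    return feeds


def blogroll_html(feeds):
    """Render feeds that have a site URL as a simple HTML list of links."""
    items = [
        '<li><a href="{}">{}</a></li>'.format(
            escape(feed["htmlUrl"], quote=True), escape(feed["title"]))
        for feed in feeds
        if feed["htmlUrl"]
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

An outline written with the wrong attribute case is silently skipped by this reader, which is the same kind of mismatch described above.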

Ok, I'm now a Syndirella user. It rocks. Now for feature requests: :)

  • Categories, or some way of organizing feeds. Seems to be in the works (see above), so I'm happy.
  • Statistics on feeds. How often they update, etc. If it turns out a feed doesn't update often, you can increase the amount of time between checks (or maybe Syndirella could do this automatically if you set a preference allowing it to).
  • I thought I had more, but I can't remember them now. Guess Syndirella is pretty good then.
  • Ok, I just did some testing and Syndirella doesn't support ETags or whatever it's supposed to do to save bandwidth, so that's a feature request. Update: it actually does support ETags and such. When you update only an individual feed it always downloads the entire feed, but when it normally checks feeds it makes use of them.
  • Hey, how freaking cool would it be if when a post was updated, Syndirella did a diff on it so you could easily see what was new?! That would be awesome, but this is more a "wish list" item than a "feature request". To be honest, that would really help with blogs like Erik's and Doc's.
  • I want an easier way to copy the feed's site's URL to the clipboard (this is a very minor nit - you can already copy the RSS feed location easily by going to the feed properties, but not the site's location)
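
The ETag item in the list above refers to HTTP conditional GET: the client sends back the validators it saved from the previous response, and the server answers 304 Not Modified with no body if the feed hasn't changed. Below is a minimal Python sketch of that exchange using the standard library. The function names and the `opener` parameter are assumptions for the example, and this is not a description of how Syndirella implements it.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def conditional_headers(etag=None, last_modified=None):
    """Build the validator headers for an HTTP conditional GET."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None, opener=urlopen):
    """Return (body, etag, last_modified); body is None on 304 Not Modified."""
    request = Request(url, headers=conditional_headers(etag, last_modified))
    try:
        with opener(request) as response:
            return (response.read(),
                    response.headers.get("ETag"),
                    response.headers.get("Last-Modified"))
    except HTTPError as error:
        if error.code == 304:  # unchanged: keep the validators we already had
            return None, etag, last_modified
        raise
```

The caller would store the returned ETag and Last-Modified values alongside each feed and pass them back on the next check.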

One of the nice things is that because each feed is separated from all the others, one feed's bad HTML can't break the rest. And I had thought that not having all feeds on one big page together would be an inconvenience to me, but I don't think it is at all.

I wonder how Syndirella handles things if entries are reordered within a feed...

Comments

Si wrote:

You should try FeedReader. That's mighty funky.

∴ Si | 24-Jan-2003 6:38am est | #1326

Daniel Nolan (http://www.bleedingego.co.uk/) wrote:

Syndrella
There's also Newsgator if you use Outlook.

∴ Daniel Nolan | 24-Jan-2003 7:12am est | http://www.bleedingego.co.uk/ | #1328

Keith (http://www.keithdevens.com/) wrote:

Awesome, thanks. Google doesn't yet know about Syndrella.

Keith | 24-Jan-2003 7:15am est | http://www.keithdevens.com/ | #1329

M. Bean wrote:

Yeah, by default, the first time a .NET program is run, the .NET architecture compiles it, or the majority of it anyway, so that's why your system went batshit for a little bit. This happens on even the simplest of programs I've noticed, and from there on out, load times are generally dependent on the program itself. Once .NET gets all the prep stuff out of the way on the first run (or the first couple of runs, or the install, depending on how you configure the compiler) it actually does a pretty good job of keeping things running smoothly.

∴ M. Bean | 24-Jan-2003 4:54pm est | #1331

Keith (http://www.keithdevens.com/) wrote:

Yeah, I figured .NET was JITing something. It's funny, because even the first time I opened some dialogues I could tell .NET took over and JITed something. It was strange because it gobbled up all available memory.

Ever have that happen, and all your fonts get big and weird, and instead of scrollbar arrows in programs you get 6's?

Keith | 24-Jan-2003 6:02pm est | http://www.keithdevens.com/ | #1332

M. Bean wrote:

Nope. Never.

However, I don't really worry about system resources anymore, and even when .NET does have to JIT something I just notice a momentary stutter, and it has to JIT a lot on my machine, as you know I've been developing with C# and .NET a lot recently. Plus, I tend to tweak my .NET programs to do all the system-specific compiling at installation time, at least once I'm ready to build something in release mode.

That's another thing, the JIT compiler will kind of compile once something is needed so long as you are using a debug build, so if this guy is releasing debug mode builds on his site, you're running unoptimised (by my unscientific tests there is a pretty big performance difference between debug and release mode with .NET too) builds.

∴ M. Bean | 24-Jan-2003 7:01pm est | #1333

Oliver Tseng (http://http:/www.otweb.com/blog) wrote:

Very informative, good post.

∴ Oliver Tseng | 27-Jan-2003 9:32am est | http://http:/www.otweb.com/blog | #1338
