KBD

Keith Devens .com

Thursday, December 4, 2008 Flag waving
All non-trivial abstractions, to some degree, are leaky. – Joel Spolsky (The Law of Leaky Abstractions)
← Saddam and Dubya debateRandom good newsie stuff I don't have time to read fully or write about fully →

Daily link icon Wednesday, February 26, 2003

TagSoup

Man could I have used this about a year and a half ago! Via Bill, TagSoup.

For the last year I have been working on a new parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.

TagSoup is a parser, not a whole application; it isn't intended to permanently clean up bad HTML, as HTML Tidy does, only to parse it on the fly.

The semantics of TagSoup are as far as practical those of actual HTML browsers. In particular, never, never will it throw any sort of syntax error: the TagSoup motto is "Just Keep On Truckin'". But there's much, much more. For example, if the first tag is LI, it will supply the application with enclosing HTML, BODY, and UL tags. Why UL? Because that's what browsers assume in this situation.

I have to keep this great-sounding tool in the back of my mind.

← Saddam and Dubya debateRandom good newsie stuff I don't have time to read fully or write about fully →

Comments XML gif


Feel free to post a comment below. Please see my comment policy.

Formatting Rules (No HTML):

  • **bold**, *italic*, _underlined_, --strikeout--
  • "text"="url" creates a link, and URLs are auto-highlighted
  • Blockquote: Like e-mail, begin paragraph with > (greater-than sign)
  • Lists: begin paragraph with *,-, or + (unordered), or # (ordered)
  • Code block: ?!code:language=perl|php|sql|javascript|etc.{\n}...{\n}?!/code

:
(will be your IP address if blank)
: (optional)
(Will not be shown on site)

: (optional)
:

December 2008
SunMonTueWedThuFriSat
 123456
78910111213
14151617181920
21222324252627
28293031 



RSS feed RSS feed for Keith's Weblog
Atom feed Atom feed for Keith's Weblog
Weblog archive
Recent comments
  on 4 posts

Recent comments XML

Girls, please don't get breast implants

I have 34 A breast but at 22 years​old they seem to be growing again​which ...

76.64.120.153: Dec 3, 10:00am

Perl 6 1.0 in March?

Doh, my mistake. I'm aware of the​relation between Parrot and Rakudo​but I'...

Keith: Dec 2, 1:03am

Free image hosting sites

Well, TinyPic has this in its​FAQ:

> Images and videos is in​your accoun...

Keith: Dec 1, 1:13am

Join a NameValueCollection into a querystring in C#

Well with a lamba expression, this​is what I came up​with:

?!code:csharp...

Gustaf Lindqvist: Nov 30, 4:38pm

Generated in about 0.249s.

(Used 8 db queries)

mobile phone