
Keith Devens .com

Monday, June 14, 2004

BeautifulSoup

Wonderful! The universe gives me exactly what I need right when I need it :) I'd previously linked this somewhere along the way, but I'd forgotten about BeautifulSoup (via Ben, via the Daily Python URL). This should allow me to use Python instead of Perl for a spidering project I'm working on.

Update: Hmm... my only question is... why not pipe the page through Tidy and then parse it with a real XML parser? I'm not asking rhetorically either.

Comments

Simon Willison (http://simon.incutio.com/) wrote:

Tidy isn't infallible - it's possible to give it input which it can't convert to XHTML on its own, at which point it will spit out fatal errors along with tips on how to correct them. This is fine if you're using it as part of a manual document fixing process but is no good at all for random spidering, unless you're sure that the HTML you are crawling is "good enough" for Tidy to be able to convert it on its own.

∴ Simon Willison | 14-Jun-2004 7:37pm est | http://simon.incutio.com/ | #4780

Erik Hummel (http://ehummel.net) wrote:

I would think that trying Tidy, as you suggest, would be a good experiment. It would give us a decent idea of how many sites have terrible (as opposed to just non-compliant) HTML markup. Terrible sites won't be parsed by Tidy (and simple error handling can catch the return values spit out by Tidy so you can log the pages that couldn't be parsed).

Thus it would give you a chance to write some elegant code to handle the XHTML output from Tidy for the pages that do work, and as we all know, writing elegant, well-thought-out code is much better than writing code full of hacks to handle non-compliant pages (not that it can't be done, it just would take a lot more effort... and depending on how much value you put on your free time for pet projects, it may also not be worth the effort).

∴ Erik Hummel | 14-Jun-2004 8:18pm est | http://ehummel.net | #4781
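
For illustration, the Tidy-then-XML-parser idea from this thread, with the failure logging Erik suggests, might look roughly like the sketch below. It assumes a `tidy` executable is on the PATH and uses Python's standard `xml.etree.ElementTree` as the "real XML parser". The names `tidy_to_xhtml`, `parse_page`, and `TidyError` are made up for this example and are not anyone's actual spider code.

```python
import logging
import subprocess
import xml.etree.ElementTree as ET

log = logging.getLogger("spider")

# Tidy's XHTML output puts every element in this namespace.
XHTML = "{http://www.w3.org/1999/xhtml}"


class TidyError(Exception):
    """Raised when Tidy cannot repair a page on its own (hypothetical name)."""


def tidy_to_xhtml(html):
    """Pipe a page through the `tidy` command-line tool and return XHTML bytes.

    Assumes `tidy` is installed. Tidy exits with 0 on success, 1 if it only
    issued warnings, and 2 if it hit errors it could not fix.
    """
    proc = subprocess.run(
        # -numeric emits numeric entities, so the XML parser never sees
        # HTML-only names such as &nbsp;
        ["tidy", "-quiet", "-asxhtml", "-numeric", "-utf8"],
        input=html.encode("utf-8"),
        capture_output=True,
    )
    if proc.returncode >= 2:
        raise TidyError(proc.stderr.decode("utf-8", "replace").strip())
    return proc.stdout


def parse_page(url, html, tidy=tidy_to_xhtml):
    """Return the parsed XHTML root element, or None if the page was skipped."""
    try:
        return ET.fromstring(tidy(html))
    except (TidyError, ET.ParseError) as exc:
        # Log the pages that could not be parsed instead of stopping the crawl.
        log.warning("skipping %s: %s", url, exc)
        return None
```

A caller would then walk the result with the namespace prefix, for example `[a.get("href") for a in root.iter(XHTML + "a")]`. The `tidy` argument is only there so the Tidy step can be swapped out, say for a stub when the executable isn't available.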

Leonard Richardson (http://www.crummy.com/) wrote:

The thing that frustrated me about existing parsers was not really anything about the parsers per se, but the fact that for a screenscraping application, parsing is only half the story. Once you've parsed the document you need to extract the part of the document you actually care about, and to do that you have to write a sort of special-purpose tree traversal. I wrote Beautiful Soup partially so I could parse anything that looked like HTML, but mainly to abstract away the tree traversal. It's not as fast as a handwritten traversal, but it's faster to write and (in my experience) more reliable and maintainable.

If you need to handle arbitrary pages that have nothing in common, where you don't get to analyze a representative sample ahead of time, I recommend considering Tidy + an XML parser. For such a project I'd think you'd be doing mostly parsing and very little extraction.

The niche for which I designed Beautiful Soup was where you had an ugly web page full of fascinating information, where if you looked at the HTML code you'd be able to intuit how to get out the good data. It's intended to make it really easy to transform that intuition into code.

∴ Leonard Richardson | 14-Jun-2004 9:10pm est | http://www.crummy.com/ | #4782
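
To make the "special-purpose tree traversal" Leonard describes concrete, here is a minimal sketch of the handwritten kind, using only Python's standard `html.parser` module rather than Beautiful Soup itself. The page snippet, the `class="title"` convention, and the `TitleLinkExtractor` name are all invented for the example.

```python
from html.parser import HTMLParser


class TitleLinkExtractor(HTMLParser):
    """Collect (href, text) for every <a> whose class list includes "title"."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished (href, text) pairs
        self._inside = False  # True while we are inside a matching <a>
        self._href = None     # href of the link currently being read
        self._text = []       # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "title" in (attrs.get("class") or "").split():
            self._inside = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.links.append((self._href, "".join(self._text).strip()))
            self._inside = False


# A made-up scrap of sloppy markup: unquoted attribute, unclosed <font>.
page = """
<table><tr><td><font size=2>
<a class="title" href="/post/1">First post</a><br>
<a href="/about">About</a>
<a class="title big" href="/post/2">Second <b>post</b></a>
</td></tr></table>
"""

parser = TitleLinkExtractor()
parser.feed(page)
parser.close()
print(parser.links)
```

All of the state tracking above is the part a library like Beautiful Soup is meant to hide. With the current `bs4` package (not the 2004 release discussed here), the same extraction would be roughly a single call such as `BeautifulSoup(page, "html.parser").find_all("a", class_="title")`.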

Keith (http://keithdevens.com/) wrote:

Hi Leonard. I'll probably be using BeautifulSoup tomorrow. I'll comment on how it goes :) Thanks for making it available.

Keith | 15-Jun-2004 12:25am est | http://keithdevens.com/ | #4790

Keith (http://keithdevens.com/) wrote:

Leonard, your code was a Godsend. Thanks to your library (and thanks to Python) I got stuff done today that I just wouldn't have been able to get done otherwise. Thank you!

Keith | 15-Jun-2004 8:58pm est | http://keithdevens.com/ | #4799

Leonard Richardson (http://www.crummy.com/) wrote:

You're welcome! I'm glad you found it useful. Would you be willing to give some details about how you used Beautiful Soup so I can add you to my 'real world examples' section?

∴ Leonard Richardson | 16-Jun-2004 12:14am est | http://www.crummy.com/ | #4801

Keith (http://keithdevens.com/) wrote:

Hmm... I'll try to write something up and e-mail it to you. It may be a few days before I get to it.

Keith | 16-Jun-2004 6:50pm est | http://keithdevens.com/ | #4808

Leonard Richardson (http://www.crummy.com/) wrote:

No problem.

∴ Leonard Richardson | 17-Jun-2004 1:26pm est | http://www.crummy.com/ | #4812
