
Keith Devens .com

Wednesday, July 9, 2008
What you are thunders so loudly that I cannot hear what you say to the contrary. – Ralph Waldo Emerson
← Super Mario Brothers 3... in 11 minutes | Cox & Forkum are very talented →

Wednesday, November 26, 2003

Spidering Hacks, or Template::Extract

O'Reilly Network: Spidering Hacks

From the "Holy crap, I never knew this existed, but it's awesome!" department, check out Template::Extract:

One day, I was fiddling about with the Template Toolkit (http://www.template-toolkit.com/) and it dawned on me that all these sites were, at some level, generated with some templating engine. The Template Toolkit takes a template and some data and produces HTML output.

Okay, you might think, very interesting, but how does this relate to scraping web pages for RSS? Well, we know what the HTML looks like, and we can make a reasonable guess at what the template ought to look like, but we want only the data. If only I could apply the Template Toolkit backward somehow. Taking HTML output and a template that could conceivably generate the output, I could retrieve the original data structure and, from then on, generating RSS from the data structure would be a piece of cake.

Like most brilliant ideas, this is hardly original, and an equally brilliant man named Autrijus Tang not only had the idea a long time before me, but—and this is the hard part—actually worked out how to implement it. His Template::Extract Perl module (http://search.cpan.org/author/AUTRIJUS/Template-Extract/) does precisely this: extract a data structure from its template and output.
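
To make the "template applied backward" idea concrete, here is a minimal sketch in Python. It is not Template::Extract and does not reflect how that module is implemented; it only assumes a toy placeholder syntax of `[% name %]` and turns the template into a regular expression with one named group per placeholder. The helper names `template_to_regex` and `extract`, and the sample HTML, are made up for illustration.

```python
import re

# Toy placeholder syntax: [% name %]. An assumption for this sketch only.
PLACEHOLDER = re.compile(r"\[%\s*(\w+)\s*%\]")


def template_to_regex(template):
    """Escape the literal text and turn each placeholder into a named group."""
    parts = []
    pos = 0
    for m in PLACEHOLDER.finditer(template):
        parts.append(re.escape(template[pos:m.start()]))  # literal text
        parts.append(f"(?P<{m.group(1)}>.*?)")            # captured value
        pos = m.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts), re.DOTALL)


def extract(template, document):
    """Return one dict per place where the template matches the document."""
    return [m.groupdict() for m in template_to_regex(template).finditer(document)]


item_template = '<li><a href="[% url %]">[% title %]</a></li>'
html = (
    "<ul>"
    '<li><a href="http://example.com/a">First story</a></li>'
    '<li><a href="http://example.com/b">Second story</a></li>'
    "</ul>"
)

items = extract(item_template, html)
for item in items:
    print(item["title"], "->", item["url"])
```

The sketch only handles flat templates in which each placeholder name appears once; loops, nesting, and tolerance for markup that drifts from the template are the hard parts a real tool has to deal with.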

This tip was an excerpt from O'Reilly's new book, Spidering Hacks by Kevin Hemenway, author of AmphetaDesk, and Tara Calishain (of ResearchBuzz fame!)

Also check out Autrijus Tang's Template::Generate, which completes the trifecta of Perl template parsing tools:

 Template:           ($template + $data) ==> $document   # normal
 Template::Extract:  ($document + $template) ==> $data   # tricky
 Template::Generate: ($data + $document) ==> $template   # very tricky
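
The third row is the least intuitive, so here is a deliberately naive Python sketch of it, next to the normal direction for comparison. It is a stand-in for the idea only, not Template::Generate's algorithm or API: it assumes a toy `[% name %]` placeholder syntax and string values, and the function names `render` and `generate` and the sample data are hypothetical.

```python
import re


def render(template, data):
    """Normal direction: fill each [% name %] placeholder from data."""
    return re.sub(r"\[%\s*(\w+)\s*%\]", lambda m: data[m.group(1)], template)


def generate(data, document):
    """Naive reverse direction: swap each known value for its placeholder."""
    template = document
    # Replace longer values first so a short value cannot split a longer one.
    for name, value in sorted(data.items(), key=lambda kv: -len(kv[1])):
        template = template.replace(value, f"[% {name} %]")
    return template


data = {"title": "Spidering Hacks", "publisher": "O'Reilly"}
document = "<h1>Spidering Hacks</h1><p>Published by O'Reilly</p>"

template = generate(data, document)
print(template)

# Rendering the guessed template with the same data reproduces the document.
print(render(template, data) == document)
```

A plain search-and-replace like this breaks as soon as a value also occurs in the surrounding markup or the data contains repeated records, which gives some feel for why that row is labeled "very tricky".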

You've got to be kidding me :)


Comments

Kayode Okeyode (http://www.kayodeok.co.uk/weblog/) wrote:

You may not realise it, but a Tutorial was published in the Perl Advent Calendar today on Template::Extract!

Template::Extract
http://perladvent.org/2003/5th/

...thought you might find it interesting.

∴ Kayode Okeyode | 5-Dec-2003 4:31pm est | http://www.kayodeok.co.uk/weblog/ | #3453
