A. Sinan Unur

Parsing HTML with Perl

Efficiently manipulate documents on the Web

The need to extract interesting bits of an HTML document comes up often enough that by now we have all seen many ways of doing it wrong, and some ways of doing it right, for some values of “right”.

One might think that one of the most fascinating answers on Stack Overflow has put an end to the desire to parse HTML using regular expressions, but time and again the temptation proves too strong.

Let’s say you want to check all the links on a page to identify stale ones, using regular expressions:

use strict;
use warnings;
use feature 'say';

my $re = qr/<a\s+href=["']([^"']+)["']/i; # capture the href value of each <a> tag
my $html = do { local $/; <DATA> }; # slurp the __DATA__ section

my @links = ($html =~ m{ $re }gx);

say for @links;

__DATA__
<html><body>

<p><a href="http://example.com/">An Example</a></p>

<!-- <a href="http://invalid.example.com/">An Example</a> -->
</body></html>

In this self-contained example, I put a small document in the __DATA__ section. This example corresponds to a situation where the maintainer of the page commented out a previously broken link and replaced it with the correct one.

When run, this script produces the output:

Read more…

Can One Write Readable and Maintainable Perl?

Perl’s flexibility helps you avoid writing superfluous code.

The answer to this simple but somehow controversial question is an emphatic yes! Unfortunately, there is a lot of bad Perl out there owing to Perl’s history of being the language of getting things done in the 90s. It is easy for a newcomer to feel overwhelmed by such examples.

One can avoid that feeling by learning only from Perl code that does not look like gibberish.

I decided to learn Perl a little late. Or maybe just at the right time: I had all the tools to learn good habits right from the get-go.

Read more…