=head1 NAME

XML::SAX::Intro - An Introduction to SAX Parsing with Perl

=head1 Introduction

XML::SAX is a new way to work with XML parsers in Perl. In this article
we'll discuss why you should be using SAX, why you should be using
XML::SAX, and we'll see some of the finer implementation details. The
text below assumes some familiarity with callback-based, or push-based,
parsing, but if you are unfamiliar with these techniques then a good
place to start is Kip Hampton's excellent series of articles on XML.com.

=head1 Replacing XML::Parser

The de facto way of parsing XML under Perl is to use Larry Wall and
Clark Cooper's XML::Parser. This module is a Perl and XS wrapper around
the expat XML parser library by James Clark. It has been a hugely
successful project, but suffers from a couple of rather major flaws.
Firstly, it is a proprietary API, designed before the SAX API was
conceived, which means that it is not easily replaceable by other
streaming parsers. Secondly, its callbacks are subrefs. This doesn't
sound like much of an issue, but unfortunately leads to code like:

    sub handle_start {
        my ($e, $el, %attrs) = @_;
        if ($el eq 'foo') {
            $e->{inside_foo}++; # BAD! $e is an XML::Parser::Expat object.
        }
    }

As you can see, we're using the $e object to hold our state
information, which is a bad idea because we don't own that object - we
didn't create it. It's an internal object of XML::Parser that happens
to be a hashref. We could all too easily overwrite XML::Parser's internal
state variables by using this, or Clark could change it to an array ref
(not that he would, because it would break so much code, but he could).

Currently the only safe way to maintain state with XML::Parser is to
use a closure:

    my $state = MyState->new();
    $parser->setHandlers(Start => sub { handle_start($state, @_) });

This closure traps the $state variable, which now gets passed as the
first parameter to your callback. Unfortunately very few people use
this technique, as it is not documented in the XML::Parser POD files.

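As a minimal sketch of the closure technique, assuming XML::Parser is
installed, the following script counts C<foo> elements without ever
writing to the expat object. The C<%state> hash and its C<foo_count>
key are names invented for this illustration.

    use strict;
    use warnings;
    use XML::Parser;

    # Our own state, owned by us rather than by XML::Parser.
    my %state = (foo_count => 0);

    # The state arrives first, followed by XML::Parser's usual arguments.
    sub handle_start {
        my ($state, $e, $el, %attrs) = @_;
        $state->{foo_count}++ if $el eq 'foo';
    }

    # The anonymous sub closes over %state and forwards the real arguments.
    my $parser = XML::Parser->new(
        Handlers => { Start => sub { handle_start(\%state, @_) } },
    );

    $parser->parse('<doc><foo/><foo/></doc>');
    print "saw $state{foo_count} foo elements\n";   # saw 2 foo elements

Because every handler can close over the same variable, one state
structure can be shared between as many callbacks as you register.
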
Another reason you might not want to use XML::Parser is that you
need some feature that it doesn't provide (such as validation), or you
might need to use a library that doesn't use expat, either because expat
is not installed on your system or because of a restrictive ISP. Using
SAX allows you to work around these restrictions.

=head1 Introducing SAX

SAX stands for the Simple API for XML. And simple it really is.
Constructing a SAX parser and passing events to handlers is done as
simply as:

    use XML::SAX;
    use MySAXHandler;

    my $parser = XML::SAX::ParserFactory->parser(
        Handler => MySAXHandler->new
    );

    $parser->parse_uri("foo.xml");

The important concept to grasp here is that SAX uses a factory class
called XML::SAX::ParserFactory to create a new parser instance. The
reason for this is so that you can support other underlying
parser implementations for different feature sets. This is one thing
that XML::Parser has always sorely lacked.

In the code above we see the parse_uri method used, but we could
equally well have called parse_file, parse_string, or parse(). Please
see XML::SAX::Base for what these methods take as parameters, but don't
be fooled into believing parse_file takes a filename. No, it takes a
file handle, a glob, or a subclass of IO::Handle. Beware.

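The difference between these entry points can be sketched as follows.
This assumes XML::SAX is installed with at least one parser registered;
the temporary file exists only to keep the example self-contained, and
a bare XML::SAX::Base object is used as a handler that ignores every
event.

    use strict;
    use warnings;
    use File::Temp qw(tempfile);
    use XML::SAX::ParserFactory;
    use XML::SAX::Base;

    # Write a small document to a temporary file.
    my ($out, $filename) = tempfile(SUFFIX => '.xml', UNLINK => 1);
    print {$out} '<doc><foo/></doc>';
    close $out or die "Cannot close $filename: $!";

    my $parser = XML::SAX::ParserFactory->parser(
        Handler => XML::SAX::Base->new,
    );

    # parse_file wants an open handle, not a filename.
    open my $fh, '<', $filename or die "Cannot open $filename: $!";
    $parser->parse_file($fh);
    close $fh;

    # parse_uri is the method that accepts a filename or URI.
    $parser->parse_uri($filename);

    # parse_string takes the document text itself.
    $parser->parse_string('<doc><foo/></doc>');

    print "parsed the same document three ways\n";

If you only have a filename, parse_uri is usually the simplest choice;
parse_file is for streams you have already opened yourself.
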
SAX works very similarly to XML::Parser's default callback method,
except it has one major difference: rather than setting individual
callbacks, you create a new class in which to receive the callbacks.
Each callback is called as a method call on an instance of that handler
class. An example will best demonstrate this:

    package MySAXHandler;
    use base qw(XML::SAX::Base);

    sub start_document {
        my ($self, $doc) = @_;
        # process document start event
    }

    sub start_element {
        my ($self, $el) = @_;
        # process element start event
    }

Now, when we instantiate this as above, and parse some XML with this as
the handler, the methods start_document and start_element will be
called as method calls, so this would be the equivalent of directly
calling:

    $object->start_element($el);

Notice how this is different to XML::Parser's calling style, which
calls:

    start_element($e, $name, %attribs);

It's this difference between function calling and method calling that
allows you to subclass SAX handlers, which contributes to SAX being a
powerful solution.

As you can see, unlike XML::Parser, we have to define a new package in
which to do our processing (there are hacks you can do to make this
unnecessary, but I'll leave figuring those out to the experts). The
biggest benefit of this is that you maintain your own state variable
($self in the above example), thus freeing you of the concerns listed
above. It is also an improvement in maintainability - you can place the
code in a separate file if you wish to, and your callback methods are
always called the same thing, rather than having to choose a suitable
name for them as you had to with XML::Parser. This is an obvious win.

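To make the point about state concrete, here is a small sketch of a
complete handler, assuming XML::SAX is installed. The package name
C<ElementCounter> and the C<count> key are invented for this example;
any key that XML::SAX::Base does not use itself would do.

    use strict;
    use warnings;

    package ElementCounter;
    use base qw(XML::SAX::Base);

    # Reset our state at the start of every document.
    sub start_document {
        my ($self, $doc) = @_;
        $self->{count} = {};
    }

    # $self is our own object, so it is a safe place to keep state.
    sub start_element {
        my ($self, $el) = @_;
        $self->{count}{ $el->{Name} }++;
    }

    package main;
    use XML::SAX::ParserFactory;

    my $handler = ElementCounter->new;
    my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
    $parser->parse_string('<doc><foo/><foo/><bar/></doc>');

    for my $name (sort keys %{ $handler->{count} }) {
        print "$name: $handler->{count}{$name}\n";
    }
    # bar: 1
    # doc: 1
    # foo: 2

Keeping a reference to the handler object, as done here with
C<$handler>, is a simple way to get results back out after the parse.
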
SAX parsers are also very flexible in how you pass a handler to them.
You can use a constructor parameter as we saw above, or you can pass the
handler directly in the call to one of the parse methods:

    $parser->parse(Handler => $handler,
                   Source => { SystemId => "foo.xml" });
    # or...
    $parser->parse_file($fh, Handler => $handler);

This flexibility allows for one parser to be used in many different
scenarios throughout your script (though one shouldn't feel pressure to
use this method, as parser construction is generally not a
time-consuming process).

=head1 Callback Parameters

The only other thing you need to know to understand basic SAX is the
structure of the parameters passed to each of the callbacks. In
XML::Parser, parameters are passed to the callbacks as a flat list,
so for example the Start callback would be called as
my_start($e, $name, %attributes), and the PI callback would be called
as my_processing_instruction($e, $target, $data). In SAX, every
callback is passed a hash reference, containing entries that define our
"node". The key callbacks and the structures they receive are:

=head2 start_element

The start_element handler is called whenever a parser sees an opening
tag. It is passed an element structure consisting of:

=over 4

=item LocalName

The name of the element minus any namespace prefix it may
have come with in the document.

=item NamespaceURI

The URI of the namespace associated with this element,
or the empty string for none.

=item Attributes

A set of attributes as described below.

=item Name

The name of the element as it was seen in the document (i.e.
including any prefix associated with it).

=item Prefix

The prefix used to qualify this element's namespace, or the
empty string if none.

=back

The B<Attributes> are a hash reference, keyed by what we have called
"James Clark" notation. This means that the attribute name has been
expanded to include any associated namespace URI, and put together as
{ns}name, where "ns" is the expanded namespace URI of the attribute if
and only if the attribute had a prefix, and "name" is the LocalName of
the attribute.

The value of each entry in the attributes hash is another hash
structure consisting of:

=over 4

=item LocalName

The name of the attribute minus any namespace prefix it may have
come with in the document.

=item NamespaceURI

The URI of the namespace associated with this attribute. If the
attribute had no prefix, then this consists of just the empty string.

=item Name

The attribute's name as it appeared in the document, including any
namespace prefix.

=item Prefix

The prefix used to qualify this attribute's namespace, or the
empty string if none.

=item Value

The value of the attribute.

=back

So a full example, as output by Data::Dumper, might be:

    ....

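If you want to see the structure your own parser produces, a small
handler along the following lines will print it. This is only a sketch,
assuming XML::SAX is installed; the package name C<ElementDumper>, the
sample document and the C<urn:example> namespace are made up for the
illustration, and the exact output depends on the underlying parser.

    use strict;
    use warnings;

    package ElementDumper;
    use base qw(XML::SAX::Base);
    use Data::Dumper;

    sub start_element {
        my ($self, $el) = @_;

        # Dump the whole element structure with stable key order.
        local $Data::Dumper::Sortkeys = 1;
        print Dumper($el);

        # Walk the attribute hash without depending on its key format.
        for my $attr (values %{ $el->{Attributes} }) {
            print "$attr->{Name} = $attr->{Value}\n";
        }
    }

    package main;
    use XML::SAX::ParserFactory;

    my $parser = XML::SAX::ParserFactory->parser(
        Handler => ElementDumper->new,
    );
    $parser->parse_string(
        '<x:foo xmlns:x="urn:example" x:bar="baz" id="1"/>'
    );

Iterating over the values of the attribute hash, as above, avoids having
to build the {ns}name key by hand when you just want every attribute.
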
=head2 end_element

The end_element handler is called either when a parser sees a closing
tag, or after start_element has been called for an empty element. Do
note, however, that a parser may, if it is so inclined, call characters
with an empty string when it sees an empty element. There is no simple
way in SAX to determine whether the parser in fact saw an empty element
or a start and end tag with no content.

The end_element handler receives exactly the same structure as
start_element, minus the Attributes entry. Note, though, that it
should not be a reference to the same data as start_element receives,
so you may change the values in start_element but this will not affect
the values later seen by end_element.

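Because start_element and end_element always arrive in matched pairs, a
stack kept on the handler is enough to know where you are in the
document. The following sketch assumes XML::SAX is installed; the
package name C<PathTracker> and the C<stack> and C<paths> keys are
invented for the example.

    use strict;
    use warnings;

    package PathTracker;
    use base qw(XML::SAX::Base);

    sub start_document {
        my ($self, $doc) = @_;
        $self->{stack} = [];    # names of the currently open elements
        $self->{paths} = [];    # every path seen, in document order
    }

    sub start_element {
        my ($self, $el) = @_;
        push @{ $self->{stack} }, $el->{Name};
        push @{ $self->{paths} }, '/' . join('/', @{ $self->{stack} });
    }

    # Every start_element is balanced by an end_element, empty or not.
    sub end_element {
        my ($self, $el) = @_;
        pop @{ $self->{stack} };
    }

    package main;
    use XML::SAX::ParserFactory;

    my $tracker = PathTracker->new;
    my $parser  = XML::SAX::ParserFactory->parser(Handler => $tracker);
    $parser->parse_string('<doc><a><b/></a><c/></doc>');

    print "$_\n" for @{ $tracker->{paths} };
    # /doc
    # /doc/a
    # /doc/a/b
    # /doc/c

Note that the empty elements b and c are popped just like any other
element, which is why the handler never needs to know how they were
written in the source.
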
=head2 characters

The characters callback may be called in several circumstances. The
most obvious one is when seeing ordinary character data in the markup.
But it is also called for text in a CDATA section, and is also called
in other situations. A SAX parser makes no guarantees whatsoever
about how many times it may call characters for a stretch of text in an
XML document - it may call once, or it may call once for every
character in the text. In order to work around this it is often
important for the SAX developer to use a bundling technique, where text
is gathered up and processed in one of the other callbacks. This is not
always necessary, but it is a worthwhile technique to learn, which we
will cover in XML::SAX::Advanced (when I get around to writing it).

The characters handler is called with a very simple structure - a hash
reference consisting of just one entry:

=over 4

=item Data

The text data that was received.

=back

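One possible shape for the bundling technique mentioned above is to
start a buffer in start_element, append to it in characters, and use the
result in end_element. This is a sketch rather than the approach the
advanced tutorial will take; it assumes XML::SAX is installed, and the
package name C<TitleCollector>, the C<title> element and the C<buffer>
and C<titles> keys are invented for the example.

    use strict;
    use warnings;

    package TitleCollector;
    use base qw(XML::SAX::Base);

    # Open a fresh buffer whenever a <title> element starts.
    sub start_element {
        my ($self, $el) = @_;
        $self->{buffer} = '' if $el->{LocalName} eq 'title';
    }

    # May be called many times per text run, so only ever append.
    sub characters {
        my ($self, $chars) = @_;
        $self->{buffer} .= $chars->{Data} if defined $self->{buffer};
    }

    # The text is complete only once the element ends.
    sub end_element {
        my ($self, $el) = @_;
        return unless $el->{LocalName} eq 'title';
        push @{ $self->{titles} }, delete $self->{buffer};
    }

    package main;
    use XML::SAX::ParserFactory;

    my $collector = TitleCollector->new;
    my $parser    = XML::SAX::ParserFactory->parser(Handler => $collector);
    $parser->parse_string(
        '<list><title>One &amp; Two</title>'
      . '<title><![CDATA[Three]]></title></list>'
    );

    print "$_\n" for @{ $collector->{titles} };
    # One & Two
    # Three

The first title is a typical case where a parser may deliver the text
in several pieces (before, at and after the entity reference), yet the
handler still ends up with one complete string.
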
=head2 comment

The comment callback is called for comment text. Unlike with
C<characters()>, the comment callback B<must> be invoked just once for an
entire comment string. It receives a single simple structure - a hash
reference containing just one entry:

=over 4

=item Data

The text of the comment.

=back

=head2 processing_instruction

The processing instruction handler is called for all processing
instructions in the document. Note that these processing instructions
may appear before the document root element, or after it, or anywhere
where text and elements would normally appear within the document,
according to the XML specification.

The handler is passed a structure containing just two entries:

=over 4

=item Target

The target of the processing instruction.

=item Data

The text data in the processing instruction. Can be an empty
string for a processing instruction that has no data element.
For example E<lt>?wiggle?E<gt> is a perfectly valid processing instruction.

=back

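The comment and processing_instruction callbacks can be handled in the
same way as the others. The sketch below assumes XML::SAX is installed
and that the parser in use reports comments to the handler; the package
name C<MarkupLogger>, the C<log> key and the sample document are
invented for the example.

    use strict;
    use warnings;

    package MarkupLogger;
    use base qw(XML::SAX::Base);

    # Called once per comment, with the whole comment text in Data.
    sub comment {
        my ($self, $comment) = @_;
        push @{ $self->{log} }, "comment: $comment->{Data}";
    }

    # Data may be empty, so guard against an undefined value too.
    sub processing_instruction {
        my ($self, $pi) = @_;
        my $data = defined $pi->{Data} ? $pi->{Data} : '';
        push @{ $self->{log} }, "pi: $pi->{Target} [$data]";
    }

    package main;
    use XML::SAX::ParserFactory;

    my $logger = MarkupLogger->new;
    my $parser = XML::SAX::ParserFactory->parser(Handler => $logger);
    $parser->parse_string(
        '<?wiggle?><doc><!-- hello --><?render mode="fast"?></doc>'
    );

    print "$_\n" for @{ $logger->{log} || [] };

Both structures are plain hash references, so nothing more than a hash
lookup is needed to get at the target or the text.
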
=head1 Tip of the iceberg

What we have discussed above is really the tip of the SAX iceberg. And
so far it looks like there's not much of interest to SAX beyond what we
have seen with XML::Parser. But it does go much further than that, I
promise.

People who hate Object Oriented code for the sake of it may be thinking
here that creating a new package just to parse something is a waste
when they've been parsing things just fine up to now using procedural
code. But there's a reason for all this madness. And that reason is SAX
Filters.

As you saw right at the very start, to let the parser know about our
class, we pass it an instance of our class as the Handler to the
parser. But now imagine what would happen if our class could also take
a Handler option, and simply do some processing and pass on our data
further down the line? That in a nutshell is how SAX filters work. It's
Unix pipes for the 21st century!

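As a rough sketch of that idea, the filter below renames every C<foo>
element to C<bar> and passes everything else through untouched. It
assumes XML::SAX is installed and relies on XML::SAX::Base forwarding an
event to the next Handler when the inherited method is called. The
package names C<RenameFoo> and C<NameCollector> are invented for the
example, and the renaming ignores namespace prefixes for brevity.

    use strict;
    use warnings;

    package RenameFoo;
    use base qw(XML::SAX::Base);

    # Return a modified copy of the element, or the original if no match.
    # Simplification: assumes the element carries no namespace prefix.
    sub _renamed {
        my ($self, $el) = @_;
        return $el unless $el->{LocalName} eq 'foo';
        return { %$el, Name => 'bar', LocalName => 'bar' };
    }

    # Calling the inherited method passes the event down the line.
    sub start_element {
        my ($self, $el) = @_;
        $self->SUPER::start_element($self->_renamed($el));
    }

    sub end_element {
        my ($self, $el) = @_;
        $self->SUPER::end_element($self->_renamed($el));
    }

    package NameCollector;
    use base qw(XML::SAX::Base);

    # Final handler in the chain: record what actually arrives.
    sub start_element {
        my ($self, $el) = @_;
        push @{ $self->{names} }, $el->{Name};
    }

    package main;
    use XML::SAX::ParserFactory;

    my $collector = NameCollector->new;
    my $filter    = RenameFoo->new(Handler => $collector);
    my $parser    = XML::SAX::ParserFactory->parser(Handler => $filter);
    $parser->parse_string('<doc><foo/><baz/></doc>');

    print join(',', @{ $collector->{names} }), "\n";   # doc,bar,baz

Events the filter does not override, such as characters, are forwarded
by XML::SAX::Base without any code on our part.
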
There are two downsides to this. Firstly, writing SAX filters can be
tricky. If you look into the future and read the advanced tutorial I'm
writing, you'll see that Handler can come in several shapes and sizes.
So making sure your filter does the right thing can be tricky.
Secondly, constructing complex filter chains can be difficult, and
simple thinking tells us that we only get one pass at our document,
when often we'll need more than that.

Luckily though, those downsides have been fixed by the release of two
very cool modules. What's even better is that I didn't write either of
them!

The first module is XML::SAX::Base. This is a VITAL SAX module that
acts as a base class for all SAX parsers and filters. It provides an
abstraction away from calling the handler methods, that makes sure your
filter or parser does the right thing, and it does it FAST. So, if you
ever need to write a SAX filter (and if you're processing XML -> XML,
or XML -> HTML, then you probably do), you should write it as
a subclass of XML::SAX::Base. Really - this is advice not to be ignored
lightly. I will not go into the details of writing a SAX filter here.
Kip Hampton, the author of XML::SAX::Base, has covered this nicely in
his article on XML.com here <URI>.

To construct SAX pipelines, Barrie Slaymaker, a long-time Perl hacker
whose modules you will probably have heard of or used, wrote a very
clever module called XML::SAX::Machines. This combines some really
clever SAX filter-type modules with a construction toolkit for filters
that makes building pipelines easy. But before we see how it makes
things easy, first let's see how tricky it looks to build complex SAX
filter pipelines.

    use XML::SAX::ParserFactory;
    use XML::SAX::Filter1;    # Filter1 and Filter2 stand in for your own filters
    use XML::SAX::Filter2;
    use XML::SAX::Writer;

    my $output_string;
    my $writer = XML::SAX::Writer->new(Output => \$output_string);
    my $filter2 = XML::SAX::Filter2->new(Handler => $writer);
    my $filter1 = XML::SAX::Filter1->new(Handler => $filter2);
    my $parser = XML::SAX::ParserFactory->parser(Handler => $filter1);

    $parser->parse_uri("foo.xml");

This is a lot easier with XML::SAX::Machines:

    use XML::SAX::Machines qw(Pipeline);

    my $output_string;
    my $parser = Pipeline(
        XML::SAX::Filter1 => XML::SAX::Filter2 => \$output_string
    );

    $parser->parse_uri("foo.xml");

One of the main benefits of XML::SAX::Machines is that the pipelines
are constructed in natural order, rather than the reverse order we saw
with manual pipeline construction. XML::SAX::Machines takes care of all
the internals of pipe construction, providing you at the end with just
a parser you can use (and you can re-use the same parser as many times
as you need to).

Just a final tip. If you ever get stuck and are confused about what is
being passed from one SAX filter or parser to the next, then
Devel::TraceSAX will come to your rescue. This Perl debugger plugin
will allow you to dump the SAX stream of events as it goes by. Usage is
really very simple: just call your Perl script that uses SAX as follows:

    $ perl -d:TraceSAX <scriptname>

And preferably pipe the output to a pager of some sort, such as more or
less. The output is extremely verbose, but should help clear some
issues up.

=head1 AUTHOR

Matt Sergeant, matt@sergeant.org

$Id: Intro.pod,v 1.3 2002/04/30 07:16:00 matt Exp $

=cut