=head1 NAME

XML::SAX::Intro - An Introduction to SAX Parsing with Perl

=head1 Introduction

XML::SAX is a new way to work with XML Parsers in Perl. In this article
we'll discuss why you should be using SAX, why you should be using
XML::SAX, and we'll see some of the finer implementation details. The
text below assumes some familiarity with callback, or push based
parsing, but if you are unfamiliar with these techniques then a good
place to start is Kip Hampton's excellent series of articles on XML.com.

=head1 Replacing XML::Parser

The de-facto way of parsing XML under Perl is to use Larry Wall and
Clark Cooper's XML::Parser. This module is a Perl and XS wrapper around
the expat XML parser library by James Clark. It has been a hugely
successful project, but suffers from a couple of rather major flaws.
Firstly, it is a proprietary API, designed before the SAX API was
conceived, which means that it is not easily replaceable by other
streaming parsers. Secondly, its callbacks are subrefs. This doesn't
sound like much of an issue, but unfortunately it leads to code like:

    sub handle_start {
        my ($e, $el, %attrs) = @_;
        if ($el eq 'foo') {
            $e->{inside_foo}++; # BAD! $e is an XML::Parser::Expat object.
        }
    }

As you can see, we're using the $e object to hold our state
information, which is a bad idea because we don't own that object - we
didn't create it. It's an internal object of XML::Parser, that happens
to be a hashref. We could all too easily overwrite XML::Parser internal
state variables by using this, or Clark could change it to an array ref
(not that he would, because it would break so much code, but he could).

The only way currently with XML::Parser to safely maintain state is to
use a closure:

    my $state = MyState->new();
    $parser->setHandlers(Start => sub { handle_start($state, @_) });

This closure traps the $state variable, which now gets passed as the
first parameter to your callback. Unfortunately very few people use
this technique, as it is not documented in the XML::Parser POD files.
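
As a minimal sketch of the closure technique, the following script
keeps its state in a plain hash rather than the C<MyState> object used
above. The C<%state> hash, the C<foo_count> key and the sample document
are assumptions made purely for illustration:

    use strict;
    use warnings;
    use XML::Parser;

    # Our own state: we created it, so we are free to modify it.
    my %state = (foo_count => 0);

    sub handle_start {
        # The closure below prepends our state to XML::Parser's arguments.
        my ($state, $expat, $el, %attrs) = @_;
        $state->{foo_count}++ if $el eq 'foo';
    }

    my $parser = XML::Parser->new;
    $parser->setHandlers(Start => sub { handle_start(\%state, @_) });
    $parser->parse('<doc><foo/><bar/><foo/></doc>');

    print "saw $state{foo_count} foo elements\n";

The Expat object is still passed along as the second argument, but the
handler never needs to store anything in it.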

Another reason you might not want to use XML::Parser is because you
need some feature that it doesn't provide (such as validation), or you
might need to use a library that doesn't use expat, due to it not being
installed on your system, or due to having a restrictive ISP. Using SAX
allows you to work around these restrictions.

=head1 Introducing SAX

SAX stands for the Simple API for XML. And simple it really is.
Constructing a SAX parser and passing events to handlers is done as
simply as:

    use XML::SAX::ParserFactory;
    use MySAXHandler;

    my $parser = XML::SAX::ParserFactory->parser(
        Handler => MySAXHandler->new
    );

    $parser->parse_uri("foo.xml");

The important concept to grasp here is that SAX uses a factory class
called XML::SAX::ParserFactory to create a new parser instance. The
reason for this is so that you can support other underlying
parser implementations for different feature sets. This is one thing
that XML::Parser has always sorely lacked.

In the code above we see the parse_uri method used, but we could
equally well have called parse_file, parse_string, or parse(). Please
see XML::SAX::Base for what these methods take as parameters, but don't
be fooled into believing parse_file takes a filename. It takes a file
handle, a glob, or a subclass of IO::Handle. Beware.
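
To make the parse_file caveat concrete, here is a small sketch that
opens the file itself and hands the resulting file handle to the
parser. The C<ElementCounter> class, the temporary file and its
contents are hypothetical and exist only so that the example is
self-contained:

    use strict;
    use warnings;
    use File::Temp qw(tempfile);
    use XML::SAX::ParserFactory;

    package ElementCounter;
    use base qw(XML::SAX::Base);

    # Count every opening tag we are told about.
    sub start_element { $_[0]{elements}++ }

    package main;

    # Write a small throwaway document so the sketch can be run as is.
    my ($out, $filename) = tempfile(SUFFIX => '.xml', UNLINK => 1);
    print {$out} '<doc><foo/><bar/></doc>';
    close $out or die "Cannot close $filename: $!";

    # parse_file wants an open handle, not the file name.
    open my $fh, '<', $filename or die "Cannot open $filename: $!";
    my $handler = ElementCounter->new;
    my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
    $parser->parse_file($fh);
    close $fh;

    print "elements seen: $handler->{elements}\n";

If all you have is a file name, parse_uri is usually the more
convenient call.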

SAX works very similarly to XML::Parser's default callback method,
except it has one major difference: rather than setting individual
callbacks, you create a new class in which to receive the callbacks.
Each callback is called as a method call on an instance of that handler
class. An example will best demonstrate this:

    package MySAXHandler;
    use base qw(XML::SAX::Base);

    sub start_document {
        my ($self, $doc) = @_;
        # process document start event
    }

    sub start_element {
        my ($self, $el) = @_;
        # process element start event
    }

Now, when we instantiate this as above, and parse some XML with this as
the handler, the methods start_document and start_element will be
called as method calls, so this would be the equivalent of directly
calling:

    $object->start_element($el);

Notice how this is different to XML::Parser's calling style, which
calls:

    start_element($e, $name, %attribs);

It's the difference between function calling and method calling. Method
calls allow you to subclass SAX handlers, which contributes to SAX
being a powerful solution.

As you can see, unlike XML::Parser, we have to define a new package in
which to do our processing (there are hacks you can do to make this
unnecessary, but I'll leave figuring those out to the experts). The
biggest benefit of this is that you maintain your own state variable
($self in the above example) thus freeing you of the concerns listed
above. It is also an improvement in maintainability - you can place the
code in a separate file if you wish to, and your callback methods are
always called the same thing, rather than having to choose a suitable
name for them as you had to with XML::Parser. This is an obvious win.
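
The following sketch shows a handler keeping its state in $self. The
C<CountingHandler> class, the C<count> key and the sample document are
made up for this example; it also assumes that storing your own keys in
the handler object is acceptable for your purposes:

    package CountingHandler;
    use strict;
    use warnings;
    use base qw(XML::SAX::Base);

    sub start_document {
        my ($self, $doc) = @_;
        $self->{count} = {};    # our own state, reset for each document
    }

    sub start_element {
        my ($self, $el) = @_;
        $self->{count}{ $el->{Name} }++;
    }

    package main;
    use XML::SAX::ParserFactory;

    my $handler = CountingHandler->new;
    my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
    $parser->parse_string('<doc><foo/><foo/><bar/></doc>');

    for my $name (sort keys %{ $handler->{count} }) {
        print "$name: $handler->{count}{$name}\n";
    }

Because the state lives in the handler object, nothing is written into
the parser's own data structures.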

SAX parsers are also very flexible in how you pass a handler to them.
You can use a constructor parameter as we saw above, or we can pass the
handler directly in the call to one of the parse methods:

    $parser->parse(Handler => $handler,
                   Source => { SystemId => "foo.xml" });
    # or...
    $parser->parse_file($fh, Handler => $handler);

This flexibility allows for one parser to be used in many different
scenarios throughout your script (though one shouldn't feel pressure to
use this method, as parser construction is generally not a time
consuming process).

=head1 Callback Parameters

The only other thing you need to know to understand basic SAX is the
structure of the parameters passed to each of the callbacks. In
XML::Parser, all parameters are passed as multiple options to the
callbacks, so for example the Start callback would be called as
my_start($e, $name, %attributes), and the PI callback would be called
as my_processing_instruction($e, $target, $data). In SAX, every
callback is passed a hash reference, containing entries that define our
"node". The key callbacks and the structures they receive are:

=head2 start_element

The start_element handler is called whenever a parser sees an opening
tag. It is passed an element structure consisting of:

=over 4

=item LocalName

The name of the element minus any namespace prefix it may
have come with in the document.

=item NamespaceURI

The URI of the namespace associated with this element,
or the empty string for none.

=item Attributes

A set of attributes as described below.

=item Name

The name of the element as it was seen in the document (i.e.
including any prefix associated with it).

=item Prefix

The prefix used to qualify this element's namespace, or the
empty string if none.

=back

The B<Attributes> are a hash reference, keyed by what we have called
"James Clark" notation. This means that the attribute name has been
expanded to include any associated namespace URI, and put together as
{ns}name, where "ns" is the expanded namespace URI of the attribute if
and only if the attribute had a prefix, and "name" is the LocalName of
the attribute.

The value of each entry in the attributes hash is another hash
structure consisting of:

=over 4

=item LocalName

The name of the attribute minus any namespace prefix it may have
come with in the document.

=item NamespaceURI

The URI of the namespace associated with this attribute. If the
attribute had no prefix, then this consists of just the empty string.

=item Name

The attribute's name as it appeared in the document, including any
namespace prefix.

=item Prefix

The prefix used to qualify this attribute's namespace, or the
empty string if none.

=item Value

The value of the attribute.

=back

So a full example, as output by Data::Dumper might be:

....
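
As a rough illustration of the shape described above, here is a
hand-written structure for a hypothetical C<x:foo> element. It is not
captured parser output; the element, its attributes and the
C<http://example.com/ns> namespace are all invented for this sketch:

    use strict;
    use warnings;

    # What a start_element handler might receive for
    #   <x:foo xmlns:x="http://example.com/ns" id="42" x:lang="en">
    # (namespace declarations are left out to keep the sketch short).
    my $el = {
        Name         => 'x:foo',
        LocalName    => 'foo',
        Prefix       => 'x',
        NamespaceURI => 'http://example.com/ns',
        Attributes   => {
            # No prefix, so the namespace part of the key is empty.
            '{}id' => {
                Name         => 'id',
                LocalName    => 'id',
                Prefix       => '',
                NamespaceURI => '',
                Value        => '42',
            },
            # Prefixed, so the key carries the expanded namespace URI.
            '{http://example.com/ns}lang' => {
                Name         => 'x:lang',
                LocalName    => 'lang',
                Prefix       => 'x',
                NamespaceURI => 'http://example.com/ns',
                Value        => 'en',
            },
        },
    };

    # Attributes are looked up by their {ns}name key.
    print $el->{Attributes}{'{}id'}{Value}, "\n";
    print $el->{Attributes}{'{http://example.com/ns}lang'}{Value}, "\n";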

=head2 end_element

The end_element handler is called either when a parser sees a closing
tag, or after start_element has been called for an empty element. Do
note, however, that a parser may, if it is so inclined, call characters
with an empty string when it sees an empty element. There is no simple
way in SAX to determine whether the parser in fact saw an empty element
or a start and end element with no content.

The end_element handler receives exactly the same structure as
start_element, minus the Attributes entry. One must note though that it
should not be a reference to the same data as start_element receives,
so you may change the values in start_element but this will not affect
the values later seen by end_element.

=head2 characters

The characters callback may be called in several circumstances. The
most obvious one is when seeing ordinary character data in the markup.
But it is also called for text in a CDATA section, and is also called
in other situations. A SAX parser has to make no guarantees whatsoever
about how many times it may call characters for a stretch of text in an
XML document - it may call once, or it may call once for every
character in the text. In order to work around this it is often
important for the SAX developer to use a bundling technique, where text
is gathered up and processed in one of the other callbacks. This is not
always necessary, but it is a worthwhile technique to learn, which we
will cover in XML::SAX::Advanced (when I get around to writing it).

The characters handler is called with a very simple structure - a hash
reference consisting of just one entry:

=over 4

=item Data

The text data that was received.

=back
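
One possible form of the bundling technique mentioned above is
sketched below: text is appended to a buffer in characters and only
looked at in end_element. The C<TitleCollector> class, the C<buffer>
and C<titles> keys and the sample document are assumptions for this
example, and the buffer handling is deliberately naive (it is reset at
every opening tag, so it does not cope with mixed content):

    package TitleCollector;
    use strict;
    use warnings;
    use base qw(XML::SAX::Base);

    sub start_element {
        my ($self, $el) = @_;
        $self->{buffer} = '';               # start a fresh run of text
    }

    sub characters {
        my ($self, $data) = @_;
        $self->{buffer} .= $data->{Data};   # may be called many times
    }

    sub end_element {
        my ($self, $el) = @_;
        # Only now is the text for this element known to be complete.
        push @{ $self->{titles} }, $self->{buffer}
            if $el->{Name} eq 'title';
        $self->{buffer} = '';
    }

    package main;
    use XML::SAX::ParserFactory;

    my $handler = TitleCollector->new;
    my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
    $parser->parse_string(
        '<doc><title>Hello, &amp; welcome</title><title>Second</title></doc>'
    );

    print "$_\n" for @{ $handler->{titles} };

However the parser chooses to split the text, the handler sees each
title as one string.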

=head2 comment

The comment callback is called for comment text. Unlike with
C<characters()>, the comment callback *must* be invoked just once for an
entire comment string. It receives a single simple structure - a hash
reference containing just one entry:

=over 4

=item Data

The text of the comment.

=back

=head2 processing_instruction

The processing instruction handler is called for all processing
instructions in the document. Note that these processing instructions
may appear before the document root element, or after it, or anywhere
where text and elements would normally appear within the document,
according to the XML specification.

The handler is passed a structure containing just two entries:

=over 4

=item Target

The target of the processing instruction.

=item Data

The text data in the processing instruction. Can be an empty
string for a processing instruction that has no data element.
For example E<lt>?wiggle?E<gt> is a perfectly valid processing instruction.

=back

=head1 Tip of the iceberg

What we have discussed above is really the tip of the SAX iceberg. And
so far it looks like there's not much of interest to SAX beyond what we
have seen with XML::Parser. But it does go much further than that, I
promise.

People who hate Object Oriented code for the sake of it may be thinking
here that creating a new package just to parse something is a waste
when they've been parsing things just fine up to now using procedural
code. But there's reason to all this madness. And that reason is SAX
Filters.

As you saw right at the very start, to let the parser know about our
class, we pass it an instance of our class as the Handler to the
parser. But now imagine what would happen if our class could also take
a Handler option, and simply do some processing and pass on our data
further down the line? That in a nutshell is how SAX filters work. It's
Unix pipes for the 21st century!
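
A very small sketch of the idea follows. It assumes both classes
inherit from XML::SAX::Base, so that events a class does not override
are passed on to its Handler. The C<UpperCaseText> and C<TextCollector>
classes and the sample document are invented for illustration:

    package UpperCaseText;
    use strict;
    use warnings;
    use base qw(XML::SAX::Base);

    # The filter: change the event, then pass it down the line.
    sub characters {
        my ($self, $data) = @_;
        $self->SUPER::characters({ %$data, Data => uc $data->{Data} });
    }

    package TextCollector;
    use base qw(XML::SAX::Base);

    # The end of the line: just remember the text we are given.
    sub characters {
        my ($self, $data) = @_;
        $self->{text} .= $data->{Data};
    }

    package main;
    use XML::SAX::ParserFactory;

    my $collector = TextCollector->new;
    my $filter    = UpperCaseText->new(Handler => $collector);
    my $parser    = XML::SAX::ParserFactory->parser(Handler => $filter);
    $parser->parse_string('<doc>hello <b>world</b></doc>');

    print $collector->{text}, "\n";

The parser only knows about the filter, and the filter only knows about
its own Handler, which is what lets the pieces be chained freely.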

There are two downsides to this. Firstly, writing SAX filters can be
tricky. If you look into the future and read the advanced tutorial I'm
writing, you'll see that Handler can come in several shapes and sizes,
so making sure your filter does the right thing takes some care.
Secondly, constructing complex filter chains can be difficult, and
simple thinking tells us that we only get one pass at our document,
when often we'll need more than that.

Luckily though, those downsides have been fixed by the release of two
very cool modules. What's even better is that I didn't write either of
them!

The first module is XML::SAX::Base. This is a VITAL SAX module that
acts as a base class for all SAX parsers and filters. It provides an
abstraction away from calling the handler methods, that makes sure your
filter or parser does the right thing, and it does it FAST. So, if you
ever need to write a SAX filter, which you probably do if you're
processing XML -> XML or XML -> HTML, then you need to write it as
a subclass of XML::SAX::Base. Really - this is advice not to ignore
lightly. I will not go into the details of writing a SAX filter here.
Kip Hampton, the author of XML::SAX::Base has covered this nicely in
his article on XML.com here <URI>.

To construct SAX pipelines, Barrie Slaymaker, a long time Perl hacker
whose modules you will probably have heard of or used, wrote a very
clever module called XML::SAX::Machines. This combines some really
clever SAX filter-type modules, with a construction toolkit for filters
that makes building pipelines easy. But before we see how it makes
things easy, first let's see how tricky it looks to build complex SAX
filter pipelines.

    use XML::SAX::ParserFactory;
    use XML::SAX::Filter1;
    use XML::SAX::Filter2;
    use XML::SAX::Writer;

    my $output_string;
    my $writer = XML::SAX::Writer->new(Output => \$output_string);
    my $filter2 = XML::SAX::Filter2->new(Handler => $writer);
    my $filter1 = XML::SAX::Filter1->new(Handler => $filter2);
    my $parser = XML::SAX::ParserFactory->parser(Handler => $filter1);

    $parser->parse_uri("foo.xml");

This is a lot easier with XML::SAX::Machines:

    use XML::SAX::Machines qw(Pipeline);

    my $output_string;
    my $parser = Pipeline(
        'XML::SAX::Filter1' => 'XML::SAX::Filter2' => \$output_string
    );

    $parser->parse_uri("foo.xml");

One of the main benefits of XML::SAX::Machines is that the pipelines
are constructed in natural order, rather than the reverse order we saw
with manual pipeline construction. XML::SAX::Machines takes care of all
the internals of pipe construction, providing you at the end with just
a parser you can use (and you can re-use the same parser as many times
as you need to).

Just a final tip. If you ever get stuck and are confused about what is
being passed from one SAX filter or parser to the next, then
Devel::TraceSAX will come to your rescue. This Perl debugger plugin
will allow you to dump the SAX stream of events as it goes by. Usage is
very simple: just call your Perl script that uses SAX as follows:

    $ perl -d:TraceSAX <scriptname>

And preferably pipe the output to a pager of some sort, such as more or
less. The output is extremely verbose, but should help clear some
issues up.

=head1 AUTHOR

Matt Sergeant, matt@sergeant.org

$Id: Intro.pod,v 1.3 2002/04/30 07:16:00 matt Exp $

=cut