=head1 NAME

XML::SAX::Intro - An Introduction to SAX Parsing with Perl

=head1 Introduction

XML::SAX is a new way to work with XML Parsers in Perl. In this article
we'll discuss why you should be using SAX, why you should be using
XML::SAX, and we'll see some of the finer implementation details. The
text below assumes some familiarity with callback, or push-based,
parsing, but if you are unfamiliar with these techniques then a good
place to start is Kip Hampton's excellent series of articles on XML.com.

=head1 Replacing XML::Parser

The de-facto way of parsing XML under Perl is to use Larry Wall and
Clark Cooper's XML::Parser. This module is a Perl and XS wrapper around
the expat XML parser library by James Clark. It has been a hugely
successful project, but suffers from a couple of rather major flaws.
Firstly, it is a proprietary API, designed before the SAX API was
conceived, which means that it is not easily replaceable by other
streaming parsers. Secondly, its callbacks are subrefs. This doesn't
sound like much of an issue, but unfortunately leads to code like:

  sub handle_start {
      my ($e, $el, %attrs) = @_;
      if ($el eq 'foo') {
          $e->{inside_foo}++; # BAD! $e is an XML::Parser::Expat object.
      }
  }

As you can see, we're using the $e object to hold our state
information, which is a bad idea because we don't own that object - we
didn't create it. It's an internal object of XML::Parser that happens
to be a hashref. We could all too easily overwrite XML::Parser's
internal state variables by using this, or Clark could change it to an
array ref (not that he would, because it would break so much code, but
he could).

The only way currently to maintain state safely with XML::Parser is to
use a closure:

  my $state = MyState->new();
  $parser->setHandlers(Start => sub { handle_start($state, @_) });

This closure traps the $state variable, which now gets passed as the
first parameter to your callback. Unfortunately very few people use
this technique, as it is not documented in the XML::Parser POD files.

Another reason you might not want to use XML::Parser is that you need
some feature it doesn't provide (such as validation), or you need a
parser that doesn't depend on expat, either because expat is not
installed on your system or because you have a restrictive ISP. Using
SAX allows you to work around these restrictions.

=head1 Introducing SAX

SAX stands for the Simple API for XML. And simple it really is.
Constructing a SAX parser and passing events to handlers is done as
simply as:

  use XML::SAX;
  use MySAXHandler;

  my $parser = XML::SAX::ParserFactory->parser(
      Handler => MySAXHandler->new
  );

  $parser->parse_uri("foo.xml");

The important concept to grasp here is that SAX uses a factory class
called XML::SAX::ParserFactory to create a new parser instance. This
lets you swap in other underlying parser implementations when you need
a different feature set. This is one thing that XML::Parser has always
sorely lacked.

In the code above we see the parse_uri method used, but we could
equally well have called parse_file, parse_string, or parse(). Please
see XML::SAX::Base for what these methods take as parameters, but don't
be fooled into believing parse_file takes a filename. No, it takes a
file handle, a glob, or a subclass of IO::Handle. Beware.
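
As a minimal sketch of that warning, the following opens a file itself
and hands the resulting handle to parse_file. The throwaway temporary
document and the use of a bare XML::SAX::Base object as a do-nothing
handler are assumptions made only to keep the example self-contained.

  use strict;
  use warnings;
  use File::Temp qw(tempfile);
  use XML::SAX::ParserFactory;
  use XML::SAX::Base;

  # Write a small throwaway document so the sketch can run on its own.
  my ($out, $filename) = tempfile(UNLINK => 1);
  print {$out} '<doc><foo/></doc>';
  close($out) or die "Cannot close $filename: $!";

  # A bare XML::SAX::Base object ignores every event, so it stands in
  # for a real handler here.
  my $parser = XML::SAX::ParserFactory->parser(
      Handler => XML::SAX::Base->new
  );

  # parse_file wants an open handle, not a path.
  open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
  $parser->parse_file($fh);
  close($fh);

If all you have is a path, parse_uri (as used earlier) is the more
direct choice.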

SAX works very similarly to XML::Parser's default callback method,
except it has one major difference: rather than setting individual
callbacks, you create a new class in which to receive the callbacks.
Each callback is called as a method call on an instance of that handler
class. An example will best demonstrate this:

  package MySAXHandler;
  use base qw(XML::SAX::Base);

  sub start_document {
      my ($self, $doc) = @_;
      # process document start event
  }

  sub start_element {
      my ($self, $el) = @_;
      # process element start event
  }

Now, when we instantiate this as above, and parse some XML with this as
the handler, the methods start_document and start_element will be
called as method calls, so this would be the equivalent of directly
calling:

  $object->start_element($el);

Notice how this is different to XML::Parser's calling style, which
calls:

  start_element($e, $name, %attribs);

It's this difference between function calling and method calling that
allows you to subclass SAX handlers, which contributes to SAX being a
powerful solution.

As you can see, unlike XML::Parser, we have to define a new package in
which to do our processing (there are hacks you can do to make this
unnecessary, but I'll leave figuring those out to the experts). The
biggest benefit of this is that you maintain your own state variable
($self in the above example), thus freeing you of the concerns listed
above. It is also an improvement in maintainability - you can place the
code in a separate file if you wish to, and your callback methods are
always called the same thing, rather than having to choose a suitable
name for them as you had to with XML::Parser. This is an obvious win.
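
To illustrate keeping state on $self, here is a small sketch that
redoes the earlier inside_foo bookkeeping the SAX way. The FooTracker
package name, the foo_seen counter and the sample document are invented
for this example.

  package FooTracker;
  use strict;
  use warnings;
  use base qw(XML::SAX::Base);

  sub start_element {
      my ($self, $el) = @_;
      # $self is our own object, so keeping state on it is safe.
      if ($el->{LocalName} eq 'foo') {
          $self->{inside_foo}++;
          $self->{foo_seen}++;
      }
  }

  sub end_element {
      my ($self, $el) = @_;
      $self->{inside_foo}-- if $el->{LocalName} eq 'foo';
  }

  package main;
  use XML::SAX::ParserFactory;

  my $handler = FooTracker->new;
  my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
  $parser->parse_string('<doc><foo><foo/></foo><bar/></doc>');

  print "foo elements seen: $handler->{foo_seen}\n";
  print "still inside foo:  $handler->{inside_foo}\n";

Because the handler object outlives the parse, the caller can simply
read the collected state back from it afterwards.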

SAX parsers are also very flexible in how you pass a handler to them.
You can use a constructor parameter as we saw above, or you can pass
the handler directly in the call to one of the parse methods:

  $parser->parse(Handler => $handler,
                 Source => { SystemId => "foo.xml" });
  # or...
  $parser->parse_file($fh, Handler => $handler);

This flexibility allows for one parser to be used in many different
scenarios throughout your script (though one shouldn't feel pressure to
use this method, as parser construction is generally not a
time-consuming process).

=head1 Callback Parameters

The only other thing you need to know to understand basic SAX is the
structure of the parameters passed to each of the callbacks. In
XML::Parser, all parameters are passed as separate arguments to the
callbacks, so for example the Start callback would be called as
my_start($e, $name, %attributes), and the PI callback would be called
as my_processing_instruction($e, $target, $data). In SAX, every
callback is passed a hash reference, containing entries that define our
"node". The key callbacks and the structures they receive are:

=head2 start_element

The start_element handler is called whenever a parser sees an opening
tag. It is passed an element structure consisting of:

=over 4

=item LocalName

The name of the element minus any namespace prefix it may
have come with in the document.

=item NamespaceURI

The URI of the namespace associated with this element,
or the empty string for none.

=item Attributes

A set of attributes as described below.

=item Name

The name of the element as it was seen in the document (i.e.
including any prefix associated with it).

=item Prefix

The prefix used to qualify this element's namespace, or the
empty string if none.

=back

The B<Attributes> are a hash reference, keyed by what we have called
"James Clark" notation. This means that the attribute name has been
expanded to include any associated namespace URI, and put together as
{ns}name, where "ns" is the expanded namespace URI of the attribute if
and only if the attribute had a prefix, and "name" is the LocalName of
the attribute.

The value of each entry in the attributes hash is another hash
structure consisting of:

=over 4

=item LocalName

The name of the attribute minus any namespace prefix it may have
come with in the document.

=item NamespaceURI

The URI of the namespace associated with this attribute. If the
attribute had no prefix, then this consists of just the empty string.

=item Name

The attribute's name as it appeared in the document, including any
namespace prefix.

=item Prefix

The prefix used to qualify this attribute's namespace, or the
empty string if none.

=item Value

The value of the attribute.

=back

So a full example, as output by Data::Dumper, might be:

  ....
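
Separately from that dump, here is a hand-built sketch of the kind of
structure the descriptions above imply. Everything in it is an
assumption made for illustration: the element, the namespace URI, the
values and the attr_value helper are invented, and the "{}class" key
assumes an unprefixed attribute is keyed with an empty namespace part.
A real parser may also report the namespace declaration itself as an
attribute, which is left out here.

  use strict;
  use warnings;

  # Hand-built stand-in for what start_element might receive for
  #   <x:item xmlns:x="http://example.com/ns" x:id="42" class="big"/>
  my $el = {
      Name         => 'x:item',
      LocalName    => 'item',
      Prefix       => 'x',
      NamespaceURI => 'http://example.com/ns',
      Attributes   => {
          '{http://example.com/ns}id' => {
              Name         => 'x:id',
              LocalName    => 'id',
              Prefix       => 'x',
              NamespaceURI => 'http://example.com/ns',
              Value        => '42',
          },
          '{}class' => {
              Name         => 'class',
              LocalName    => 'class',
              Prefix       => '',
              NamespaceURI => '',
              Value        => 'big',
          },
      },
  };

  # Hypothetical helper: look an attribute up by namespace URI and
  # local name by building the {ns}name key described above.
  sub attr_value {
      my ($element, $ns, $local) = @_;
      my $attr = $element->{Attributes}{"{$ns}$local"};
      return defined $attr ? $attr->{Value} : undef;
  }

  print attr_value($el, 'http://example.com/ns', 'id'), "\n";
  print attr_value($el, '', 'class'), "\n";

Looking attributes up by namespace URI rather than by prefix means the
code keeps working when a document picks a different prefix for the
same namespace.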

=head2 end_element

The end_element handler is called either when a parser sees a closing
tag, or after start_element has been called for an empty element (do
note however that a parser may, if it is so inclined, call characters
with an empty string when it sees an empty element). There is no simple
way in SAX to determine whether the parser in fact saw an empty element
or a start and end element with no content.

The end_element handler receives exactly the same structure as
start_element, minus the Attributes entry. One must note though that it
should not be a reference to the same data as start_element receives,
so you may change the values in start_element but this will not affect
the values later seen by end_element.

=head2 characters

The characters callback may be called in several circumstances. The
most obvious one is when seeing ordinary character data in the markup.
But it is also called for text in a CDATA section, and is also called
in other situations. A SAX parser has to make no guarantees whatsoever
about how many times it may call characters for a stretch of text in an
XML document - it may call once, or it may call once for every
character in the text. In order to work around this it is often
important for the SAX developer to use a bundling technique, where text
is gathered up and processed in one of the other callbacks. This is not
always necessary, but it is a worthwhile technique to learn, which we
will cover in XML::SAX::Advanced (when I get around to writing it).

The characters handler is called with a very simple structure - a hash
reference consisting of just one entry:

=over 4

=item Data

The text data that was received.

=back
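
The bundling technique mentioned above can be sketched roughly as
follows. This is only one possible shape, not the treatment promised
for XML::SAX::Advanced: the TextCollector package is invented, it does
not inherit from XML::SAX::Base, and the events are delivered by hand
instead of by a parser so that the split text is easy to see.

  package TextCollector;
  use strict;
  use warnings;

  sub new {
      my $class = shift;
      return bless { buffer => '', texts => [] }, $class;
  }

  sub start_element {
      my ($self, $el) = @_;
      $self->{buffer} = '';                # start a fresh run of text
  }

  sub characters {
      my ($self, $chars) = @_;
      $self->{buffer} .= $chars->{Data};   # gather, don't process yet
  }

  sub end_element {
      my ($self, $el) = @_;
      # Only now is the text known to be complete.
      push @{ $self->{texts} }, $self->{buffer};
      $self->{buffer} = '';
  }

  package main;

  # Deliver "Hello, world" in three pieces, as a parser is allowed to.
  my $h = TextCollector->new;
  $h->start_element({ Name => 'greeting', LocalName => 'greeting' });
  $h->characters({ Data => $_ }) for 'Hel', 'lo, ', 'world';
  $h->end_element({ Name => 'greeting', LocalName => 'greeting' });

  print "$h->{texts}[0]\n";

Resetting the buffer in start_element keeps the sketch short but means
it only suits elements that contain text and nothing else; mixed
content would need a stack of buffers or similar.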

=head2 comment

The comment callback is called for comment text. Unlike with
C<characters()>, the comment callback B<must> be invoked just once for an
entire comment string. It receives a single simple structure - a hash
reference containing just one entry:

=over 4

=item Data

The text of the comment.

=back

=head2 processing_instruction

The processing instruction handler is called for all processing
instructions in the document. Note that these processing instructions
may appear before the document root element, or after it, or anywhere
where text and elements would normally appear within the document,
according to the XML specification.

The handler is passed a structure containing just two entries:

=over 4

=item Target

The target of the processing instruction.

=item Data

The text data in the processing instruction. Can be an empty
string for a processing instruction that has no data element.
For example E<lt>?wiggle?E<gt> is a perfectly valid processing instruction.

=back

=head1 Tip of the iceberg

What we have discussed above is really the tip of the SAX iceberg. And
so far it looks like there's not much of interest to SAX beyond what we
have seen with XML::Parser. But it does go much further than that, I
promise.

People who hate Object Oriented code for the sake of it may be thinking
here that creating a new package just to parse something is a waste
when they've been parsing things just fine up to now using procedural
code. But there's a reason for all this madness. And that reason is SAX
Filters.

As you saw right at the very start, to let the parser know about our
class, we pass it an instance of our class as the Handler to the
parser. But now imagine what would happen if our class could also take
a Handler option, and simply do some processing and pass on our data
further down the line? That in a nutshell is how SAX filters work. It's
Unix pipes for the 21st century!
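
To make the idea concrete, here is a minimal sketch of such a
pass-through class. It assumes XML::SAX::Base as the base class, whose
inherited methods forward each event to the Handler given at
construction. The UpperCaseNames and NameLogger packages and the sample
document are invented for illustration.

  package UpperCaseNames;
  use strict;
  use warnings;
  use base qw(XML::SAX::Base);

  # Change the element name on a copy, then let the base class pass
  # the event on to whatever Handler this filter was given.
  sub start_element {
      my ($self, $el) = @_;
      my %copy = (%$el, Name      => uc $el->{Name},
                        LocalName => uc $el->{LocalName});
      $self->SUPER::start_element(\%copy);
  }

  sub end_element {
      my ($self, $el) = @_;
      my %copy = (%$el, Name      => uc $el->{Name},
                        LocalName => uc $el->{LocalName});
      $self->SUPER::end_element(\%copy);
  }

  package NameLogger;
  use base qw(XML::SAX::Base);

  # End of the chain: just remember the names that arrive.
  sub start_element {
      my ($self, $el) = @_;
      push @{ $self->{names} }, $el->{Name};
  }

  package main;
  use XML::SAX::ParserFactory;

  my $logger = NameLogger->new;
  my $filter = UpperCaseNames->new(Handler => $logger);
  my $parser = XML::SAX::ParserFactory->parser(Handler => $filter);
  $parser->parse_string('<doc><foo/></doc>');

  print join(',', @{ $logger->{names} }), "\n";

The filter works on a copy of each structure so that it never modifies
data it was handed by the stage before it.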

There are two downsides to this. Firstly, writing SAX filters can be
tricky. If you look into the future and read the advanced tutorial I'm
writing, you'll see that Handler can come in several shapes and sizes,
so making sure your filter does the right thing takes care.
Secondly, constructing complex filter chains can be difficult, and
simple thinking tells us that we only get one pass at our document,
when often we'll need more than that.

Luckily though, those downsides have been fixed by the release of two
very cool modules. What's even better is that I didn't write either of
them!

The first module is XML::SAX::Base. This is a VITAL SAX module that
acts as a base class for all SAX parsers and filters. It provides an
abstraction away from calling the handler methods, that makes sure your
filter or parser does the right thing, and it does it FAST. So, if you
ever need to write a SAX filter (and if you're processing XML -> XML,
or XML -> HTML, then you probably do), you need to write it as a
subclass of XML::SAX::Base. Really - this is not advice to ignore
lightly. I will not go into the details of writing a SAX filter here.
Kip Hampton, the author of XML::SAX::Base, has covered this nicely in
his article on XML.com here <URI>.

To construct SAX pipelines, Barrie Slaymaker, a long time Perl hacker
whose modules you will probably have heard of or used, wrote a very
clever module called XML::SAX::Machines. This combines some really
clever SAX filter-type modules with a construction toolkit for filters
that makes building pipelines easy. But before we see how it makes
things easy, first let's see how tricky it looks to build complex SAX
filter pipelines.

  use XML::SAX::ParserFactory;
  use XML::SAX::Filter1;
  use XML::SAX::Filter2;
  use XML::SAX::Writer;

  my $output_string;
  my $writer = XML::SAX::Writer->new(Output => \$output_string);
  my $filter2 = XML::SAX::Filter2->new(Handler => $writer);
  my $filter1 = XML::SAX::Filter1->new(Handler => $filter2);
  my $parser = XML::SAX::ParserFactory->parser(Handler => $filter1);

  $parser->parse_uri("foo.xml");

This is a lot easier with XML::SAX::Machines:

  use XML::SAX::Machines qw(Pipeline);

  my $output_string;
  my $parser = Pipeline(
      'XML::SAX::Filter1' => 'XML::SAX::Filter2' => \$output_string
  );

  $parser->parse_uri("foo.xml");

One of the main benefits of XML::SAX::Machines is that the pipelines
are constructed in natural order, rather than the reverse order we saw
with manual pipeline construction. XML::SAX::Machines takes care of all
the internals of pipe construction, providing you at the end with just
a parser you can use (and you can re-use the same parser as many times
as you need to).

Just a final tip. If you ever get stuck and are confused about what is
being passed from one SAX filter or parser to the next, then
Devel::TraceSAX will come to your rescue. This Perl debugger plugin
will allow you to dump the SAX stream of events as it goes by. Usage is
really very simple: just call your Perl script that uses SAX as follows:

  $ perl -d:TraceSAX <scriptname>

And preferably pipe the output to a pager of some sort, such as more or
less. The output is extremely verbose, but should help clear some
issues up.

=head1 AUTHOR

Matt Sergeant, matt@sergeant.org

$Id: Intro.pod,v 1.3 2002/04/30 07:16:00 matt Exp $

=cut