=head1 NAME

XML::SAX::Intro - An Introduction to SAX Parsing with Perl

=head1 Introduction

XML::SAX is a new way to work with XML Parsers in Perl. In this article
we'll discuss why you should be using SAX, why you should be using
XML::SAX, and we'll see some of the finer implementation details. The
text below assumes some familiarity with callback, or push-based,
parsing, but if you are unfamiliar with these techniques then a good
place to start is Kip Hampton's excellent series of articles on XML.com.

=head1 Replacing XML::Parser

The de facto way of parsing XML under Perl is to use Larry Wall and
Clark Cooper's XML::Parser. This module is a Perl and XS wrapper around
the expat XML parser library by James Clark. It has been a hugely
successful project, but suffers from a couple of rather major flaws.
Firstly, it is a proprietary API, designed before the SAX API was
conceived, which means that it is not easily replaceable by other
streaming parsers. Secondly, its callbacks are subrefs. This doesn't
sound like much of an issue, but unfortunately leads to code like:

  sub handle_start {
    my ($e, $el, %attrs) = @_;
    if ($el eq 'foo') {
      $e->{inside_foo}++; # BAD! $e is an XML::Parser::Expat object.
    }
  }

As you can see, we're using the $e object to hold our state
information, which is a bad idea because we don't own that object - we
didn't create it. It's an internal object of XML::Parser that happens
to be a hashref. We could all too easily overwrite XML::Parser internal
state variables by using this, or Clark could change it to an array ref
(not that he would, because it would break so much code, but he could).

Currently, the only safe way to maintain state with XML::Parser is to
use a closure:

  my $state = MyState->new();
  $parser->setHandlers(Start => sub { handle_start($state, @_) });

This closure traps the $state variable, which now gets passed as the
first parameter to your callback. Unfortunately very few people use
this technique, as it is not documented in the XML::Parser POD files.
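
The closure technique can be tried without a parser at all. The
following is a minimal sketch, assuming a plain hash for the state and
a hand-made stand-in for the expat object; C<$fake_expat> and the
element names are hypothetical and not part of XML::Parser.

```perl
use strict;
use warnings;

# Our own state lives in a hash that we created and control.
my $state = { inside_foo => 0 };

# Receives the state first, then the usual Start callback arguments.
sub handle_start {
    my ($state, $e, $el, %attrs) = @_;
    $state->{inside_foo}++ if $el eq 'foo';
}

# The closure captures $state and prepends it to the callback arguments.
my $start = sub { handle_start($state, @_) };

# Stand-in for the parser invoking the callback (hypothetical object).
my $fake_expat = {};
$start->($fake_expat, 'foo', id => '1');
$start->($fake_expat, 'bar');
$start->($fake_expat, 'foo');

print "foo seen $state->{inside_foo} time(s)\n";
```

The parser's own object is never written to, so nothing in it can be
clobbered by our bookkeeping.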

Another reason you might not want to use XML::Parser is because you
need some feature that it doesn't provide (such as validation), or you
might need to use a library that doesn't use expat, due to it not being
installed on your system, or due to having a restrictive ISP. Using SAX
allows you to work around these restrictions.

=head1 Introducing SAX

SAX stands for the Simple API for XML. And simple it really is.
Constructing a SAX parser and passing events to handlers is done as
simply as:

  use XML::SAX;
  use MySAXHandler;

  my $parser = XML::SAX::ParserFactory->parser(
    Handler => MySAXHandler->new
  );

  $parser->parse_uri("foo.xml");

The important concept to grasp here is that SAX uses a factory class
called XML::SAX::ParserFactory to create a new parser instance. The
reason for this is so that you can support other underlying
parser implementations for different feature sets. This is one thing
that XML::Parser has always sorely lacked.

In the code above we see the parse_uri method used, but we could
equally well have called parse_file, parse_string, or parse(). Please
see XML::SAX::Base for what these methods take as parameters, but don't
be fooled into believing parse_file takes a filename. No, it takes a
file handle, a glob, or a subclass of IO::Handle. Beware.

SAX works very similarly to XML::Parser's default callback method,
except it has one major difference: rather than setting individual
callbacks, you create a new class in which to receive the callbacks.
Each callback is called as a method call on an instance of that handler
class. An example will best demonstrate this:

  package MySAXHandler;
  use base qw(XML::SAX::Base);

  sub start_document {
    my ($self, $doc) = @_;
    # process document start event
  }

  sub start_element {
    my ($self, $el) = @_;
    # process element start event
  }

Now, when we instantiate this as above, and parse some XML with this as
the handler, the methods start_document and start_element will be
called as method calls, so this would be the equivalent of directly
calling:

  $object->start_element($el);

Notice how this is different to XML::Parser's calling style, which
calls:

  start_element($e, $name, %attribs);

It's this difference between function calling and method calling that
allows you to subclass SAX handlers, and that is a large part of what
makes SAX a powerful solution.
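
To illustrate what method calling buys you, here is a small sketch of
one handler subclassing another. The class names C<CountingHandler> and
C<FooCountingHandler> are hypothetical, and the events are fed in by
hand rather than by a real parser, so the sketch runs with core Perl
only.

```perl
use strict;
use warnings;

package CountingHandler;    # hypothetical base handler

sub new {
    my ($class) = @_;
    return bless { count => 0 }, $class;
}

# Counts every element start event it is given.
sub start_element {
    my ($self, $el) = @_;
    $self->{count}++;
}

package FooCountingHandler;    # hypothetical subclass
our @ISA = ('CountingHandler');

# Overrides the callback, delegating to the parent only for <foo>.
sub start_element {
    my ($self, $el) = @_;
    return unless $el->{Name} eq 'foo';
    $self->SUPER::start_element($el);
}

package main;

my $h = FooCountingHandler->new;

# Events are delivered by hand here; a parser would normally do this.
$h->start_element({ Name => $_ }) for qw(foo bar foo);

print "counted $h->{count} foo element(s)\n";
```

With plain subrefs there is no equivalent of C<SUPER::>; each variation
would need its own wiring.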

As you can see, unlike XML::Parser, we have to define a new package in
which to do our processing (there are hacks you can do to make this
unnecessary, but I'll leave figuring those out to the experts). The
biggest benefit of this is that you maintain your own state variable
($self in the above example) thus freeing you of the concerns listed
above. It is also an improvement in maintainability - you can place the
code in a separate file if you wish to, and your callback methods are
always called the same thing, rather than having to choose a suitable
name for them as you had to with XML::Parser. This is an obvious win.

SAX parsers are also very flexible in how you pass a handler to them.
You can use a constructor parameter as we saw above, or you can pass
the handler directly in the call to one of the parse methods:

  $parser->parse(Handler => $handler,
                 Source => { SystemId => "foo.xml" });
  # or...
  $parser->parse_file($fh, Handler => $handler);

This flexibility allows for one parser to be used in many different
scenarios throughout your script (though one shouldn't feel pressure to
use this method, as parser construction is generally not a
time-consuming process).

=head1 Callback Parameters

The only other thing you need to know to understand basic SAX is the
structure of the parameters passed to each of the callbacks. In
XML::Parser, all parameters are passed as a flat list of arguments to
the callbacks, so for example the Start callback would be called as
my_start($e, $name, %attributes), and the PI callback would be called
as my_processing_instruction($e, $target, $data). In SAX, every
callback is passed a hash reference, containing entries that define our
"node". The key callbacks and the structures they receive are:

=head2 start_element

The start_element handler is called whenever a parser sees an opening
tag. It is passed an element structure consisting of:

=over 4

=item LocalName

The name of the element minus any namespace prefix it may
have come with in the document.

=item NamespaceURI

The URI of the namespace associated with this element,
or the empty string for none.

=item Attributes

A set of attributes as described below.

=item Name

The name of the element as it was seen in the document (i.e.
including any prefix associated with it).

=item Prefix

The prefix used to qualify this element's namespace, or the
empty string if none.

=back

The B<Attributes> are a hash reference, keyed by what we have called
"James Clark" notation. This means that the attribute name has been
expanded to include any associated namespace URI, and put together as
{ns}name, where "ns" is the expanded namespace URI of the attribute if
and only if the attribute had a prefix, and "name" is the LocalName of
the attribute.

The value of each entry in the attributes hash is another hash
structure consisting of:

=over 4

=item LocalName

The name of the attribute minus any namespace prefix it may have
come with in the document.

=item NamespaceURI

The URI of the namespace associated with this attribute. If the
attribute had no prefix, then this consists of just the empty string.

=item Name

The attribute's name as it appeared in the document, including any
namespace prefix.

=item Prefix

The prefix used to qualify this attribute's namespace, or the
empty string if none.

=item Value

The value of the attribute.

=back
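
To make the two structures concrete, here is a hand-built sketch of
what a start_element handler might receive for a prefixed element with
one prefixed and one unprefixed attribute. The element, the namespace
URI C<http://example.com/ns> and the C<attr_value> helper are all
hypothetical, and the C<{}id> key for the unprefixed attribute is an
assumption that follows from the {ns}name rule described above.

```perl
use strict;
use warnings;

# Hand-built stand-in for the structure passed to start_element.
my $el = {
    Name         => 'x:para',
    Prefix       => 'x',
    LocalName    => 'para',
    NamespaceURI => 'http://example.com/ns',
    Attributes   => {
        # Prefixed attribute: key carries the expanded namespace URI.
        '{http://example.com/ns}role' => {
            Name         => 'x:role',
            Prefix       => 'x',
            LocalName    => 'role',
            NamespaceURI => 'http://example.com/ns',
            Value        => 'note',
        },
        # Unprefixed attribute: empty namespace part (assumed key form).
        '{}id' => {
            Name         => 'id',
            Prefix       => '',
            LocalName    => 'id',
            NamespaceURI => '',
            Value        => 'p1',
        },
    },
};

# Hypothetical helper: look up an attribute value by namespace and name.
sub attr_value {
    my ($element, $ns, $local) = @_;
    my $attr = $element->{Attributes}{ '{' . $ns . '}' . $local };
    return $attr ? $attr->{Value} : undef;
}

print "role=", attr_value($el, 'http://example.com/ns', 'role'), "\n";
print "id=",   attr_value($el, '', 'id'), "\n";
```

Looking attributes up by expanded name rather than by prefix keeps the
handler working when a document chooses a different prefix for the same
namespace.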

So a full example, as output by Data::Dumper, might be:

  ....

=head2 end_element

The end_element handler is called either when a parser sees a closing
tag, or after start_element has been called for an empty element. Do
note, however, that a parser may, if it is so inclined, call characters
with an empty string when it sees an empty element. There is no simple
way in SAX to determine whether the parser in fact saw an empty element
or a start and end element with no content.

The end_element handler receives exactly the same structure as
start_element, minus the Attributes entry. One must note though that it
should not be a reference to the same data as start_element receives,
so you may change the values in start_element but this will not affect
the values later seen by end_element.

=head2 characters

The characters callback may be called in several circumstances. The
most obvious one is when seeing ordinary character data in the markup.
But it is also called for text in a CDATA section, and is also called
in other situations. A SAX parser makes no guarantees whatsoever
about how many times it may call characters for a stretch of text in an
XML document - it may call once, or it may call once for every
character in the text. In order to work around this it is often
important for the SAX developer to use a bundling technique, where text
is gathered up and processed in one of the other callbacks. This is not
always necessary, but it is a worthwhile technique to learn, which we
will cover in XML::SAX::Advanced (when I get around to writing it).
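
As a rough illustration of the bundling idea, the sketch below gathers
text in characters and only acts on it in end_element. The
C<TextCollector> class and the C<title> element are hypothetical, and
the events are fed in by hand so that the text deliberately arrives in
three pieces.

```perl
use strict;
use warnings;

package TextCollector;    # hypothetical handler class

sub new {
    my ($class) = @_;
    return bless { buffer => '', titles => [] }, $class;
}

# Start each element with an empty buffer.
sub start_element {
    my ($self, $el) = @_;
    $self->{buffer} = '';
}

# May be called many times for one run of text, so only append here.
sub characters {
    my ($self, $chars) = @_;
    $self->{buffer} .= $chars->{Data};
}

# The complete text is only known once the element closes.
sub end_element {
    my ($self, $el) = @_;
    push @{ $self->{titles} }, $self->{buffer} if $el->{Name} eq 'title';
    $self->{buffer} = '';
}

package main;

my $h = TextCollector->new;

# Simulate a parser that splits one text run into three events.
$h->start_element({ Name => 'title' });
$h->characters({ Data => $_ }) for ('SAX ', 'in ', 'Perl');
$h->end_element({ Name => 'title' });

print "title: $h->{titles}[0]\n";
```

This simple version resets the buffer at every start tag, so it suits
elements that contain only text; mixed content would need a stack of
buffers instead.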

The characters handler is called with a very simple structure - a hash
reference consisting of just one entry:

=over 4

=item Data

The text data that was received.

=back

=head2 comment

The comment callback is called for comment text. Unlike with
C<characters()>, the comment callback B<must> be invoked just once for an
entire comment string. It receives a single simple structure - a hash
reference containing just one entry:

=over 4

=item Data

The text of the comment.

=back

=head2 processing_instruction

The processing instruction handler is called for all processing
instructions in the document. Note that these processing instructions
may appear before the document root element, or after it, or anywhere
where text and elements would normally appear within the document,
according to the XML specification.

The handler is passed a structure containing just two entries:

=over 4

=item Target

The target of the processing instruction.

=item Data

The text data in the processing instruction. Can be an empty
string for a processing instruction that has no data.
For example E<lt>?wiggle?E<gt> is a perfectly valid processing instruction.

=back

=head1 Tip of the iceberg

What we have discussed above is really the tip of the SAX iceberg. And
so far it looks like there's not much of interest to SAX beyond what we
have seen with XML::Parser. But it does go much further than that, I
promise.

People who hate Object Oriented code for the sake of it may be thinking
here that creating a new package just to parse something is a waste
when they've been parsing things just fine up to now using procedural
code. But there's reason to all this madness. And that reason is SAX
Filters.

As you saw right at the very start, to let the parser know about our
class, we pass it an instance of our class as the Handler to the
parser. But now imagine what would happen if our class could also take
a Handler option, and simply do some processing and pass on our data
further down the line? That in a nutshell is how SAX filters work. It's
Unix pipes for the 21st century!
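
The idea can be sketched with two hand-rolled classes: a filter that
accepts a Handler option and forwards events to it, and a final handler
that records what it receives. C<UpcaseFilter> and C<Recorder> are
hypothetical names, the forwarding is written out by hand purely for
illustration, and the events are fed in manually rather than by a
parser.

```perl
use strict;
use warnings;

package UpcaseFilter;    # hypothetical filter: upper-cases text

sub new {
    my ($class, %args) = @_;
    return bless { Handler => $args{Handler} }, $class;
}

# Element events pass through unchanged.
sub start_element {
    my ($self, $el) = @_;
    $self->{Handler}->start_element($el);
}

sub end_element {
    my ($self, $el) = @_;
    $self->{Handler}->end_element($el);
}

# Text is transformed, then handed on as a fresh structure.
sub characters {
    my ($self, $chars) = @_;
    $self->{Handler}->characters({ Data => uc($chars->{Data}) });
}

package Recorder;    # hypothetical end-of-chain handler

sub new {
    my ($class) = @_;
    return bless { events => [] }, $class;
}

sub start_element {
    my ($self, $el) = @_;
    push @{ $self->{events} }, "<$el->{Name}>";
}

sub end_element {
    my ($self, $el) = @_;
    push @{ $self->{events} }, "</$el->{Name}>";
}

sub characters {
    my ($self, $chars) = @_;
    push @{ $self->{events} }, $chars->{Data};
}

package main;

my $recorder = Recorder->new;
my $filter   = UpcaseFilter->new(Handler => $recorder);

# Drive the chain by hand; a parser would sit in front of $filter.
$filter->start_element({ Name => 'p' });
$filter->characters({ Data => 'hello' });
$filter->end_element({ Name => 'p' });

print join('', @{ $recorder->{events} }), "\n";
```

Because the filter exposes the same callbacks it consumes, any number
of such stages can be chained in front of the final handler.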

There are two downsides to this. First, writing SAX filters can be
tricky. If you look into the future and read the advanced tutorial I'm
writing, you'll see that Handler can come in several shapes and sizes,
so making sure your filter does the right thing takes care.
Second, constructing complex filter chains can be difficult, and
simple thinking tells us that we only get one pass at our document,
when often we'll need more than that.

Luckily though, those downsides have been fixed by the release of two
very cool modules. What's even better is that I didn't write either of
them!

The first module is XML::SAX::Base. This is a VITAL SAX module that
acts as a base class for all SAX parsers and filters. It provides an
abstraction away from calling the handler methods, that makes sure your
filter or parser does the right thing, and it does it FAST. So, if you
ever need to write a SAX filter (and if you're processing XML -> XML
or XML -> HTML, you probably do), then you need to be writing it as
a subclass of XML::SAX::Base. Really - this is not advice to ignore
lightly. I will not go into the details of writing a SAX filter here.
Kip Hampton, the author of XML::SAX::Base, has covered this nicely in
his article on XML.com here <URI>.

To construct SAX pipelines, Barrie Slaymaker, a long time Perl hacker
whose modules you will probably have heard of or used, wrote a very
clever module called XML::SAX::Machines. This combines some really
clever SAX filter-type modules, with a construction toolkit for filters
that makes building pipelines easy. But before we see how it makes
things easy, first let's see how tricky it looks to build complex SAX
filter pipelines.

  use XML::SAX::ParserFactory;
  use XML::Filter::Filter1;   # Filter1 and Filter2 are placeholder filters
  use XML::Filter::Filter2;
  use XML::SAX::Writer;

  my $output_string;
  my $writer = XML::SAX::Writer->new(Output => \$output_string);
  my $filter2 = XML::Filter::Filter2->new(Handler => $writer);
  my $filter1 = XML::Filter::Filter1->new(Handler => $filter2);
  my $parser = XML::SAX::ParserFactory->parser(Handler => $filter1);

  $parser->parse_uri("foo.xml");

This is a lot easier with XML::SAX::Machines:

  use XML::SAX::Machines qw(Pipeline);

  my $output_string;
  my $parser = Pipeline(
    "XML::Filter::Filter1" => "XML::Filter::Filter2" => \$output_string
  );

  $parser->parse_uri("foo.xml");

One of the main benefits of XML::SAX::Machines is that the pipelines
are constructed in natural order, rather than the reverse order we saw
with manual pipeline construction. XML::SAX::Machines takes care of all
the internals of pipe construction, providing you at the end with just
a parser you can use (and you can re-use the same parser as many times
as you need to).

Just a final tip. If you ever get stuck and are confused about what is
being passed from one SAX filter or parser to the next, then
Devel::TraceSAX will come to your rescue. This Perl debugger plugin
will allow you to dump the SAX stream of events as it goes by. Usage is
really very simple: just call your Perl script that uses SAX as follows:

  $ perl -d:TraceSAX <scriptname>

Preferably, pipe the output to a pager of some sort, such as more or
less. The output is extremely verbose, but should help clear some
issues up.

=head1 AUTHOR

Matt Sergeant, matt@sergeant.org

$Id: Intro.pod,v 1.3 2002/04/30 07:16:00 matt Exp $

=cut
407 =cut |