uh_parser/XML/SAX/Intro.pod
changeset 177 6d3c3db11e72
176:266a7e9b9237 177:6d3c3db11e72
       
     1 =head1 NAME
       
     2 
       
     3 XML::SAX::Intro - An Introduction to SAX Parsing with Perl
       
     4 
       
     5 =head1 Introduction
       
     6 
       
     7 XML::SAX is a new way to work with XML Parsers in Perl. In this article
       
     8 we'll discuss why you should be using SAX, why you should be using
       
     9 XML::SAX, and we'll see some of the finer implementation details. The
       
    10 text below assumes some familiarity with callback, or push based
       
    11 parsing, but if you are unfamiliar with these techniques then a good
       
    12 place to start is Kip Hampton's excellent series of articles on XML.com.
       
    13 
       
    14 =head1 Replacing XML::Parser
       
    15 
       
    16 The de-facto way of parsing XML under perl is to use Larry Wall and
       
    17 Clark Cooper's XML::Parser. This module is a Perl and XS wrapper around
       
    18 the expat XML parser library by James Clark. It has been a hugely
       
    19 successful project, but suffers from a couple of rather major flaws.
       
    20 Firstly it is a proprietary API, designed before the SAX API was
       
    21 conceived, which means that it is not easily replaceable by other
       
    22 streaming parsers. Secondly its callbacks are subrefs. This doesn't
       
    23 sound like much of an issue, but unfortunately leads to code like:
       
    24 
       
    25   sub handle_start {
       
    26     my ($e, $el, %attrs) = @_;
       
    27     if ($el eq 'foo') {
       
    28       $e->{inside_foo}++; # BAD! $e is an XML::Parser::Expat object.
       
    29     }
       
    30   }
       
    31 
       
    32 As you can see, we're using the $e object to hold our state
       
    33 information, which is a bad idea because we don't own that object - we
       
    34 didn't create it. It's an internal object of XML::Parser, that happens
       
    35 to be a hashref. We could all too easily overwrite XML::Parser internal
       
    36 state variables by using this, or Clark could change it to an array ref
       
    37 (not that he would, because it would break so much code, but he could).
       
    38 
       
    39 The only way currently with XML::Parser to safely maintain state is to
       
    40 use a closure:
       
    41 
       
    42   my $state = MyState->new();
       
    43   $parser->setHandlers(Start => sub { handle_start($state, @_) });
       
    44 
       
    45 This closure traps the $state variable, which now gets passed as the
       
    46 first parameter to your callback. Unfortunately very few people use
       
    47 this technique, as it is not documented in the XML::Parser POD files.
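The closure technique can be tried without a parser at all. The following is a minimal sketch, not XML::Parser itself: the C<$fake_expat> hash and the hand-made calls to the handler are stand-ins (assumptions for illustration) for what XML::Parser would do when it invokes a Start callback.

```perl
use strict;
use warnings;

# Our own state, kept apart from the parser object.
my $state = { inside_foo => 0 };

sub handle_start {
    my ($state, $e, $el, %attrs) = @_;
    $state->{inside_foo}++ if $el eq 'foo';
}

# The closure captures $state and prepends it to the callback arguments.
my $start = sub { handle_start($state, @_) };

# Stand-in for XML::Parser invoking the Start handler; the first argument
# would normally be the XML::Parser::Expat object.
my $fake_expat = {};
$start->($fake_expat, 'foo', id => 1);
$start->($fake_expat, 'bar');

print "foo elements seen: $state->{inside_foo}\n";
```

The parser object is never written to, so nothing the handler does can collide with XML::Parser's internals.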
       
    48 
       
    49 Another reason you might not want to use XML::Parser is because you
       
    50 need some feature that it doesn't provide (such as validation), or you
       
    51 might need to use a library that doesn't use expat, due to it not being
       
    52 installed on your system, or due to having a restrictive ISP. Using SAX
       
    53 allows you to work around these restrictions.
       
    54 
       
    55 =head1 Introducing SAX
       
    56 
       
    57 SAX stands for the Simple API for XML. And simple it really is.
       
    58 Constructing a SAX parser and passing events to handlers is done as
       
    59 simply as:
       
    60 
       
    61   use XML::SAX;
       
    62   use MySAXHandler;
       
    63   
       
    64   my $parser = XML::SAX::ParserFactory->parser(
       
    65   	Handler => MySAXHandler->new
       
    66   );
       
    67   
       
    68   $parser->parse_uri("foo.xml");
       
    69 
       
    70 The important concept to grasp here is that SAX uses a factory class
       
    71 called XML::SAX::ParserFactory to create a new parser instance. The
       
    72 reason for this is so that you can support other underlying
       
    73 parser implementations for different feature sets. This is one thing
       
    74 that XML::Parser has always sorely lacked.
       
    75 
       
    76 In the code above we see the parse_uri method used, but we could
       
    77 have equally well
       
    78 called parse_file, parse_string, or parse(). Please see XML::SAX::Base
       
    79 for what these methods take as parameters, but don't be fooled into
       
    80 believing parse_file takes a filename. No, it takes a file handle, a
       
    81 glob, or a subclass of IO::Handle. Beware.
       
    82 
       
    83 SAX works very similarly to XML::Parser's default callback method,
       
    84 except it has one major difference: rather than setting individual
       
    85 callbacks, you create a new class in which to receive the callbacks.
       
    86 Each callback is called as a method call on an instance of that handler
       
    87 class. An example will best demonstrate this:
       
    88 
       
    89   package MySAXHandler;
       
    90   use base qw(XML::SAX::Base);
       
    91   
       
    92   sub start_document {
       
    93     my ($self, $doc) = @_;
       
    94     # process document start event
       
    95   }
       
    96   
       
    97   sub start_element {
       
    98     my ($self, $el) = @_;
       
    99     # process element start event
       
   100   }
       
   101 
       
   102 Now, when we instantiate this as above, and parse some XML with this as
       
   103 the handler, the methods start_document and start_element will be
       
   104 called as method calls, so this would be the equivalent of directly
       
   105 calling:
       
   106 
       
   107   $object->start_element($el);
       
   108 
       
   109 Notice how this is different to XML::Parser's calling style, which
       
   110 calls:
       
   111 
       
   112   start_element($e, $name, %attribs);
       
   113 
       
   114 It's the difference between function calling and method calling which
       
   115 allows you to subclass SAX handlers, which contributes to SAX being a
       
   116 powerful solution.
       
   117 
       
   118 As you can see, unlike XML::Parser, we have to define a new package in
       
   119 which to do our processing (there are hacks you can do to make this
       
   120 unnecessary, but I'll leave figuring those out to the experts). The
       
   121 biggest benefit of this is that you maintain your own state variable
       
   122 ($self in the above example) thus freeing you of the concerns listed
       
   123 above. It is also an improvement in maintainability - you can place the
       
   124 code in a separate file if you wish to, and your callback methods are
       
   125 always called the same thing, rather than having to choose a suitable
       
   126 name for them as you had to with XML::Parser. This is an obvious win.
       
   127 
       
   128 SAX parsers are also very flexible in how you pass a handler to them.
       
   129 You can use a constructor parameter as we saw above, or we can pass the
       
   130 handler directly in the call to one of the parse methods:
       
   131 
       
   132   $parser->parse(Handler => $handler, 
       
   133                  Source => { SystemId => "foo.xml" });
       
   134   # or...
       
   135   $parser->parse_file($fh, Handler => $handler);
       
   136 
       
   137 This flexibility allows for one parser to be used in many different
       
   138 scenarios throughout your script (though one shouldn't feel pressure to
       
   139 use this method, as parser construction is generally not a time
       
   140 consuming process).
       
   141 
       
   142 =head1 Callback Parameters
       
   143 
       
   144 The only other thing you need to know to understand basic SAX is the
       
   145 structure of the parameters passed to each of the callbacks. In
       
   146 XML::Parser, all parameters are passed as multiple options to the
       
   147 callbacks, so for example the Start callback would be called as
       
   148 my_start($e, $name, %attributes), and the PI callback would be called
       
   149 as my_processing_instruction($e, $target, $data). In SAX, every
       
   150 callback is passed a hash reference, containing entries that define our
       
   151 "node". The key callbacks and the structures they receive are:
       
   152 
       
   153 =head2 start_element
       
   154 
       
   155 The start_element handler is called whenever a parser sees an opening
       
   156 tag. It is passed an element structure consisting of:
       
   157 
       
   158 =over 4
       
   159 
       
   160 =item LocalName
       
   161 
       
   162 The name of the element minus any namespace prefix it may
       
   163 have come with in the document.
       
   164 
       
   165 =item NamespaceURI
       
   166 
       
   167 The URI of the namespace associated with this element,
       
   168 or the empty string for none.
       
   169 
       
   170 =item Attributes
       
   171 
       
   172 A set of attributes as described below.
       
   173 
       
   174 =item Name
       
   175 
       
   176 The name of the element as it was seen in the document (i.e.
       
   177 including any prefix associated with it)
       
   178 
       
   179 =item Prefix
       
   180 
       
   181 The prefix used to qualify this element's namespace, or the 
       
   182 empty string if none.
       
   183 
       
   184 =back
       
   185 
       
   186 The B<Attributes> are a hash reference, keyed by what we have called
       
   187 "James Clark" notation. This means that the attribute name has been
       
   188 expanded to include any associated namespace URI, and put together as
       
   189 {ns}name, where "ns" is the expanded namespace URI of the attribute if
       
   190 and only if the attribute had a prefix, and "name" is the LocalName of
       
   191 the attribute.
       
   192 
       
   193 The value of each entry in the attributes hash is another hash
       
   194 structure consisting of:
       
   195 
       
   196 =over 4
       
   197 
       
   198 =item LocalName
       
   199 
       
   200 The name of the attribute minus any namespace prefix it may have
       
   201 come with in the document.
       
   202 
       
   203 =item NamespaceURI
       
   204 
       
   205 The URI of the namespace associated with this attribute. If the 
       
   206 attribute had no prefix, then this consists of just the empty string.
       
   207 
       
   208 =item Name
       
   209 
       
   210 The attribute's name as it appeared in the document, including any 
       
   211 namespace prefix.
       
   212 
       
   213 =item Prefix
       
   214 
       
   215 The prefix used to qualify this attribute's namespace, or the
       
   216 empty string if none.
       
   217 
       
   218 =item Value
       
   219 
       
   220 The value of the attribute.
       
   221 
       
   222 =back
       
   223 
       
   224 So a full example, as output by Data::Dumper might be:
       
   225 
       
   226   ....
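As a hand-written illustration (not output captured from a real parser), the structure described above might look like the following. The element name, the namespace URI http://example.com/ns and the attribute are all made up for the example, and the prefix is assumed to have been declared on an ancestor element.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hand-built illustration of the hash a start_element callback might
# receive for <foo:bar id="1">, assuming the prefix "foo" is bound to
# the made-up namespace URI http://example.com/ns on an ancestor.
my $element = {
    Name         => 'foo:bar',
    LocalName    => 'bar',
    Prefix       => 'foo',
    NamespaceURI => 'http://example.com/ns',
    Attributes   => {
        # Unprefixed attribute: empty namespace part in the {ns}name key.
        '{}id' => {
            Name         => 'id',
            LocalName    => 'id',
            Prefix       => '',
            NamespaceURI => '',
            Value        => '1',
        },
    },
};

local $Data::Dumper::Sortkeys = 1;   # stable key order for reading
print Dumper($element);

# Typical lookup inside a handler: fetch an attribute value by its key.
my $id = $element->{Attributes}{'{}id'}{Value};
```

Looking attributes up by the {ns}name key rather than by the Name entry keeps the code independent of whatever prefix the document author happened to choose.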
       
   227 
       
   228 =head2 end_element
       
   229 
       
   230 The end_element handler is called either when a parser sees a closing
       
   231 tag, or after start_element has been called for an empty element (do
       
   232 note however that a parser may, if it is so inclined, call characters
       
   233 with an empty string when it sees an empty element). There is no simple
       
   234 way in SAX to determine whether the parser in fact saw an empty element
       
   235 or a start and end tag with no content.
       
   236 
       
   237 The end_element handler receives exactly the same structure as
       
   238 start_element, minus the Attributes entry. One must note though that it
       
   239 should not be a reference to the same data as start_element receives,
       
   240 so you may change the values in start_element but this will not affect
       
   241 the values later seen by end_element.
       
   242 
       
   243 =head2 characters
       
   244 
       
   245 The characters callback may be called in several circumstances. The
       
   246 most obvious one is when seeing ordinary character data in the markup.
       
   247 But it is also called for text in a CDATA section, and is also called
       
   248 in other situations. A SAX parser has to make no guarantees whatsoever
       
   249 about how many times it may call characters for a stretch of text in an
       
   250 XML document - it may call once, or it may call once for every
       
   251 character in the text. In order to work around this it is often
       
   252 important for the SAX developer to use a bundling technique, where text
       
   253 is gathered up and processed in one of the other callbacks. This is not
       
   254 always necessary, but it is a worthwhile technique to learn, which we
       
   255 will cover in XML::SAX::Advanced (when I get around to writing it).
       
   256 
       
   257 The characters handler is called with a very simple structure - a hash
       
   258 reference consisting of just one entry:
       
   259 
       
   260 =over 4
       
   261 
       
   262 =item Data
       
   263 
       
   264 The text data that was received.
       
   265 
       
   266 =back
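One possible shape for the bundling technique mentioned above is sketched here. It is only an illustration: the TitleCollector class and the "title" element are hypothetical, the handler is a plain class rather than a subclass of XML::SAX::Base, and the events are sent by hand to mimic a parser that delivers one run of text in three pieces.

```perl
use strict;
use warnings;

package TitleCollector;   # hypothetical handler name

sub new { return bless { buffer => '', titles => [] }, shift }

sub start_element {
    my ($self, $el) = @_;
    # Start with an empty buffer for each element we care about.
    $self->{buffer} = '' if $el->{LocalName} eq 'title';
}

sub characters {
    my ($self, $chars) = @_;
    # May be called many times for one run of text, so only append.
    $self->{buffer} .= $chars->{Data};
}

sub end_element {
    my ($self, $el) = @_;
    # The text is complete only once the closing tag has been seen.
    push @{ $self->{titles} }, $self->{buffer}
        if $el->{LocalName} eq 'title';
}

package main;

# Simulate a parser that splits "Hello, SAX" into three chunks.
my $handler = TitleCollector->new;
$handler->start_element({ LocalName => 'title' });
$handler->characters({ Data => $_ }) for 'Hel', 'lo, ', 'SAX';
$handler->end_element({ LocalName => 'title' });

print "$handler->{titles}[0]\n";
```

The text is only acted upon in end_element, so it makes no difference how many times the parser chooses to call characters. This sketch does not cope with elements nested inside the one being collected.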
       
   267 
       
   268 =head2 comment
       
   269 
       
   270 The comment callback is called for comment text. Unlike with
       
   271 C<characters()>, the comment callback B<must> be invoked just once for an
       
   272 entire comment string. It receives a single simple structure - a hash
       
   273 reference containing just one entry:
       
   274 
       
   275 =over 4
       
   276 
       
   277 =item Data
       
   278 
       
   279 The text of the comment.
       
   280 
       
   281 =back
       
   282 
       
   283 =head2 processing_instruction
       
   284 
       
   285 The processing instruction handler is called for all processing
       
   286 instructions in the document. Note that these processing instructions
       
   287 may appear before the document root element, or after it, or anywhere
       
   288 where text and elements would normally appear within the document,
       
   289 according to the XML specification.
       
   290 
       
   291 The handler is passed a structure containing just two entries:
       
   292 
       
   293 =over 4
       
   294 
       
   295 =item Target
       
   296 
       
   297 The target of the processing instruction.
       
   298 
       
   299 =item Data
       
   300 
       
   301 The text data in the processing instruction. Can be an empty
       
   302 string for a processing instruction that has no data element. 
       
   303 For example E<lt>?wiggle?E<gt> is a perfectly valid processing instruction.
       
   304 
       
   305 =back
       
   306 
       
   307 =head1 Tip of the iceberg
       
   308 
       
   309 What we have discussed above is really the tip of the SAX iceberg. And
       
   310 so far it looks like there's not much of interest to SAX beyond what we
       
   311 have seen with XML::Parser. But it does go much further than that, I
       
   312 promise.
       
   313 
       
   314 People who hate Object Oriented code for the sake of it may be thinking
       
   315 here that creating a new package just to parse something is a waste
       
   316 when they've been parsing things just fine up to now using procedural
       
   317 code. But there's reason to all this madness. And that reason is SAX
       
   318 Filters.
       
   319 
       
   320 As you saw right at the very start, to let the parser know about our
       
   321 class, we pass it an instance of our class as the Handler to the
       
   322 parser. But now imagine what would happen if our class could also take
       
   323 a Handler option, and simply do some processing and pass on our data
       
   324 further down the line? That in a nutshell is how SAX filters work. It's
       
   325 Unix pipes for the 21st century!
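To make the idea concrete, here is a deliberately tiny sketch of a handler that takes a Handler of its own. Both class names are hypothetical, only the characters event is forwarded, and a single event is sent by hand in place of a parser. A real filter would need to pass on every kind of event.

```perl
use strict;
use warnings;

package UpcaseFilter;   # hypothetical filter: upper-cases text

sub new {
    my ($class, %args) = @_;
    return bless { Handler => $args{Handler} }, $class;
}

sub characters {
    my ($self, $chars) = @_;
    # Pass on a modified copy rather than altering the caller's data.
    $self->{Handler}->characters({ Data => uc $chars->{Data} });
}

package TextCollector;  # hypothetical end-of-chain handler

sub new { return bless { text => '' }, shift }

sub characters {
    my ($self, $chars) = @_;
    $self->{text} .= $chars->{Data};
}

package main;

my $collector = TextCollector->new;
my $filter    = UpcaseFilter->new(Handler => $collector);

# Stand in for the parser by sending one event by hand.
$filter->characters({ Data => 'hello' });

print "$collector->{text}\n";
```

Because the filter looks like a handler to whatever is upstream of it, and simply calls handler methods on whatever is downstream, any number of such objects can be chained together.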
       
   326 
       
   327 There are two downsides to this. Number 1 - writing SAX filters can be
       
   328 tricky. If you look into the future and read the advanced tutorial I'm
       
   329 writing, you'll see that Handler can come in several shapes and sizes.
       
   330 So making sure your filter does the right thing can be tricky.
       
   331 Secondly, constructing complex filter chains can be difficult, and
       
   332 simple thinking tells us that we only get one pass at our document,
       
   333 when often we'll need more than that.
       
   334 
       
   335 Luckily though, those downsides have been fixed by the release of two
       
   336 very cool modules. What's even better is that I didn't write either of
       
   337 them!
       
   338 
       
   339 The first module is XML::SAX::Base. This is a VITAL SAX module that
       
   340 acts as a base class for all SAX parsers and filters. It provides an
       
   341 abstraction away from calling the handler methods, that makes sure your
       
   342 filter or parser does the right thing, and it does it FAST. So, if you
       
   343 ever need to write a SAX filter (and if you're processing XML -> XML
       
   344 or XML -> HTML, you probably do), then you need to be writing it as
       
   345 a subclass of XML::SAX::Base. Really - this is advice not to ignore
       
   346 lightly. I will not go into the details of writing a SAX filter here.
       
   347 Kip Hampton, the author of XML::SAX::Base, has covered this nicely in
       
   348 his article on XML.com here <URI>.
       
   349 
       
   350 To construct SAX pipelines, Barrie Slaymaker, a long time Perl hacker
       
   351 whose modules you will probably have heard of or used, wrote a very
       
   352 clever module called XML::SAX::Machines. This combines some really
       
   353 clever SAX filter-type modules, with a construction toolkit for filters
       
   354 that makes building pipelines easy. But before we see how it makes
       
   355 things easy, first let's see how tricky it looks to build complex SAX
       
   356 filter pipelines.
       
   357 
       
   358   use XML::SAX::ParserFactory;
       
   359   use XML::SAX::Filter1;
       
   360   use XML::SAX::Filter2;
       
   361   use XML::SAX::Writer;
       
   362   
       
   363   my $output_string;
       
   364   my $writer = XML::SAX::Writer->new(Output => \$output_string);
       
   365   my $filter2 = XML::SAX::Filter2->new(Handler => $writer);
       
   366   my $filter1 = XML::SAX::Filter1->new(Handler => $filter2);
       
   367   my $parser = XML::SAX::ParserFactory->parser(Handler => $filter1);
       
   368   
       
   369   $parser->parse_uri("foo.xml");
       
   370 
       
   371 This is a lot easier with XML::SAX::Machines:
       
   372 
       
   373   use XML::SAX::Machines qw(Pipeline);
       
   374   
       
   375   my $output_string;
       
   376   my $parser = Pipeline(
       
   377   	XML::SAX::Filter1 => XML::SAX::Filter2 => \$output_string
       
   378   	);
       
   379   
       
   380   $parser->parse_uri("foo.xml");
       
   381 
       
   382 One of the main benefits of XML::SAX::Machines is that the pipelines
       
   383 are constructed in natural order, rather than the reverse order we saw
       
   384 with manual pipeline construction. XML::SAX::Machines takes care of all
       
   385 the internals of pipe construction, providing you at the end with just
       
   386 a parser you can use (and you can re-use the same parser as many times
       
   387 as you need to).
       
   388 
       
   389 Just a final tip. If you ever get stuck and are confused about what is
       
   390 being passed from one SAX filter or parser to the next, then
       
   391 Devel::TraceSAX will come to your rescue. This perl debugger plugin
       
   392 will allow you to dump the SAX stream of events as it goes by. Usage is
       
   393 really very simple: just call your perl script that uses SAX as follows:
       
   394 
       
   395   $ perl -d:TraceSAX <scriptname>
       
   396 
       
   397 And preferably pipe the output to a pager of some sort, such as more or
       
   398 less. The output is extremely verbose, but should help clear some
       
   399 issues up.
       
   400 
       
   401 =head1 AUTHOR
       
   402 
       
   403 Matt Sergeant, matt@sergeant.org
       
   404 
       
   405 $Id: Intro.pod,v 1.3 2002/04/30 07:16:00 matt Exp $
       
   406 
       
   407 =cut