uh_parser/XML/SAX/Intro.pod
author William Roberts <williamr@symbian.org>
Fri, 27 Aug 2010 15:27:06 +0100
changeset 281 c62bd4f9dbce
parent 176 6d3c3db11e72
permissions -rw-r--r--
Add delete_builds.pl - a utility for making space quickly on build machines This Perl script deletes some directories known to contain very large files first, before deleting the rest of the build which contains millions of small files. Given multiple builds, it will do this breadth first, so that lost of space is released quickly.
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
176
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
     1
=head1 NAME
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
     2
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
     3
XML::SAX::Intro - An Introduction to SAX Parsing with Perl
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
     4
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
     5
=head1 Introduction
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
     6
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
     7
XML::SAX is a new way to work with XML Parsers in Perl. In this article
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
     8
we'll discuss why you should be using SAX, why you should be using
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
     9
XML::SAX, and we'll see some of the finer implementation details. The
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    10
text below assumes some familiarity with callback, or push based
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    11
parsing, but if you are unfamiliar with these techniques then a good
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    12
place to start is Kip Hampton's excellent series of articles on XML.com.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    13
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    14
=head1 Replacing XML::Parser
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    15
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    16
The de-facto way of parsing XML under perl is to use Larry Wall and
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    17
Clark Cooper's XML::Parser. This module is a Perl and XS wrapper around
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    18
the expat XML parser library by James Clark. It has been a hugely
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    19
successful project, but suffers from a couple of rather major flaws.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    20
Firstly it is a proprietary API, designed before the SAX API was
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    21
conceived, which means that it is not easily replaceable by other
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    22
streaming parsers. Secondly it's callbacks are subrefs. This doesn't
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    23
sound like much of an issue, but unfortunately leads to code like:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    24
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    25
  sub handle_start {
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    26
    my ($e, $el, %attrs) = @_;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    27
    if ($el eq 'foo') {
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    28
      $e->{inside_foo}++; # BAD! $e is an XML::Parser::Expat object.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    29
    }
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    30
  }
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    31
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    32
As you can see, we're using the $e object to hold our state
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    33
information, which is a bad idea because we don't own that object - we
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    34
didn't create it. It's an internal object of XML::Parser, that happens
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    35
to be a hashref. We could all too easily overwrite XML::Parser internal
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    36
state variables by using this, or Clark could change it to an array ref
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    37
(not that he would, because it would break so much code, but he could).
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    38
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    39
The only way currently with XML::Parser to safely maintain state is to
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    40
use a closure:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    41
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    42
  my $state = MyState->new();
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    43
  $parser->setHandlers(Start => sub { handle_start($state, @_) });
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    44
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    45
This closure traps the $state variable, which now gets passed as the
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    46
first parameter to your callback. Unfortunately very few people use
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    47
this technique, as it is not documented in the XML::Parser POD files.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    48
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    49
Another reason you might not want to use XML::Parser is because you
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    50
need some feature that it doesn't provide (such as validation), or you
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    51
might need to use a library that doesn't use expat, due to it not being
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    52
installed on your system, or due to having a restrictive ISP. Using SAX
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    53
allows you to work around these restrictions.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    54
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    55
=head1 Introducing SAX
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    56
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    57
SAX stands for the Simple API for XML. And simple it really is.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    58
Constructing a SAX parser and passing events to handlers is done as
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    59
simply as:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    60
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    61
  use XML::SAX;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    62
  use MySAXHandler;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    63
  
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    64
  my $parser = XML::SAX::ParserFactory->parser(
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    65
  	Handler => MySAXHandler->new
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    66
  );
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    67
  
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    68
  $parser->parse_uri("foo.xml");
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    69
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    70
The important concept to grasp here is that SAX uses a factory class
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    71
called XML::SAX::ParserFactory to create a new parser instance. The
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    72
reason for this is so that you can support other underlying
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    73
parser implementations for different feature sets. This is one thing
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    74
that XML::Parser has always sorely lacked.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    75
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    76
In the code above we see the parse_uri method used, but we could
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    77
have equally well
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    78
called parse_file, parse_string, or parse(). Please see XML::SAX::Base
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    79
for what these methods take as parameters, but don't be fooled into
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    80
believing parse_file takes a filename. No, it takes a file handle, a
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    81
glob, or a subclass of IO::Handle. Beware.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    82
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    83
SAX works very similarly to XML::Parser's default callback method,
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    84
except it has one major difference: rather than setting individual
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    85
callbacks, you create a new class in which to recieve the callbacks.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    86
Each callback is called as a method call on an instance of that handler
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    87
class. An example will best demonstrate this:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    88
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    89
  package MySAXHandler;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    90
  use base qw(XML::SAX::Base);
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    91
  
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    92
  sub start_document {
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    93
    my ($self, $doc) = @_;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    94
    # process document start event
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    95
  }
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    96
  
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    97
  sub start_element {
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    98
    my ($self, $el) = @_;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    99
    # process element start event
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   100
  }
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   101
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   102
Now, when we instantiate this as above, and parse some XML with this as
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   103
the handler, the methods start_document and start_element will be
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   104
called as method calls, so this would be the equivalent of directly
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   105
calling:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   106
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   107
  $object->start_element($el);
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   108
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   109
Notice how this is different to XML::Parser's calling style, which
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   110
calls:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   111
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   112
  start_element($e, $name, %attribs);
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   113
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   114
It's the difference between function calling and method calling which
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   115
allows you to subclass SAX handlers which contributes to SAX being a
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   116
powerful solution.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   117
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   118
As you can see, unlike XML::Parser, we have to define a new package in
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   119
which to do our processing (there are hacks you can do to make this
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   120
uneccessary, but I'll leave figuring those out to the experts). The
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   121
biggest benefit of this is that you maintain your own state variable
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   122
($self in the above example) thus freeing you of the concerns listed
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   123
above. It is also an improvement in maintainability - you can place the
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   124
code in a separate file if you wish to, and your callback methods are
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   125
always called the same thing, rather than having to choose a suitable
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   126
name for them as you had to with XML::Parser. This is an obvious win.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   127
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   128
SAX parsers are also very flexible in how you pass a handler to them.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   129
You can use a constructor parameter as we saw above, or we can pass the
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   130
handler directly in the call to one of the parse methods:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   131
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   132
  $parser->parse(Handler => $handler, 
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   133
                 Source => { SystemId => "foo.xml" });
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   134
  # or...
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   135
  $parser->parse_file($fh, Handler => $handler);
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   136
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   137
This flexibility allows for one parser to be used in many different
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   138
scenarios throughout your script (though one shouldn't feel pressure to
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   139
use this method, as parser construction is generally not a time
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   140
consuming process).
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   141
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   142
=head1 Callback Parameters
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   143
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   144
The only other thing you need to know to understand basic SAX is the
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   145
structure of the parameters passed to each of the callbacks. In
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   146
XML::Parser, all parameters are passed as multiple options to the
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   147
callbacks, so for example the Start callback would be called as
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   148
my_start($e, $name, %attributes), and the PI callback would be called
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   149
as my_processing_instruction($e, $target, $data). In SAX, every
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   150
callback is passed a hash reference, containing entries that define our
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   151
"node". The key callbacks and the structures they receive are:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   152
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   153
=head2 start_element
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   154
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   155
The start_element handler is called whenever a parser sees an opening
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   156
tag. It is passed an element structure consisting of:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   157
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   158
=over 4
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   159
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   160
=item LocalName
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   161
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   162
The name of the element minus any namespace prefix it may
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   163
have come with in the document.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   164
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   165
=item NamespaceURI
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   166
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   167
The URI of the namespace associated with this element,
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   168
or the empty string for none.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   169
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   170
=item Attributes
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   171
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   172
A set of attributes as described below.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   173
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   174
=item Name
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   175
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   176
The name of the element as it was seen in the document (i.e.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   177
including any prefix associated with it)
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   178
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   179
=item Prefix
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   180
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   181
The prefix used to qualify this element's namespace, or the 
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   182
empty string if none.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   183
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   184
=back
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   185
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   186
The B<Attributes> are a hash reference, keyed by what we have called
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   187
"James Clark" notation. This means that the attribute name has been
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   188
expanded to include any associated namespace URI, and put together as
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   189
{ns}name, where "ns" is the expanded namespace URI of the attribute if
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   190
and only if the attribute had a prefix, and "name" is the LocalName of
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   191
the attribute.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   192
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   193
The value of each entry in the attributes hash is another hash
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   194
structure consisting of:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   195
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   196
=over 4
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   197
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   198
=item LocalName
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   199
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   200
The name of the attribute minus any namespace prefix it may have
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   201
come with in the document.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   202
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   203
=item NamespaceURI
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   204
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   205
The URI of the namespace associated with this attribute. If the 
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   206
attribute had no prefix, then this consists of just the empty string.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   207
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   208
=item Name
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   209
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   210
The attribute's name as it appeared in the document, including any 
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   211
namespace prefix.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   212
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   213
=item Prefix
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   214
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   215
The prefix used to qualify this attribute's namepace, or the 
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   216
empty string if none.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   217
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   218
=item Value
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   219
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   220
The value of the attribute.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   221
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   222
=back
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   223
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   224
So a full example, as output by Data::Dumper might be:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   225
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   226
  ....
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   227
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   228
=head2 end_element
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   229
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   230
The end_element handler is called either when a parser sees a closing
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   231
tag, or after start_element has been called for an empty element (do
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   232
note however that a parser may if it is so inclined call characters
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   233
with an empty string when it sees an empty element. There is no simple
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   234
way in SAX to determine if the parser in fact saw an empty element, a
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   235
start and end element with no content..
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   236
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   237
The end_element handler receives exactly the same structure as
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   238
start_element, minus the Attributes entry. One must note though that it
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   239
should not be a reference to the same data as start_element receives,
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   240
so you may change the values in start_element but this will not affect
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   241
the values later seen by end_element.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   242
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   243
=head2 characters
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   244
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   245
The characters callback may be called in serveral circumstances. The
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   246
most obvious one is when seeing ordinary character data in the markup.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   247
But it is also called for text in a CDATA section, and is also called
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   248
in other situations. A SAX parser has to make no guarantees whatsoever
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   249
about how many times it may call characters for a stretch of text in an
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   250
XML document - it may call once, or it may call once for every
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   251
character in the text. In order to work around this it is often
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   252
important for the SAX developer to use a bundling technique, where text
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   253
is gathered up and processed in one of the other callbacks. This is not
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   254
always necessary, but it is a worthwhile technique to learn, which we
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   255
will cover in XML::SAX::Advanced (when I get around to writing it).
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   256
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   257
The characters handler is called with a very simple structure - a hash
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   258
reference consisting of just one entry:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   259
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   260
=over 4
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   261
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   262
=item Data
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   263
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   264
The text data that was received.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   265
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   266
=back
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   267
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   268
=head2 comment
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   269
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   270
The comment callback is called for comment text. Unlike with
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   271
C<characters()>, the comment callback *must* be invoked just once for an
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   272
entire comment string. It receives a single simple structure - a hash
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   273
reference containing just one entry:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   274
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   275
=over 4
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   276
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   277
=item Data
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   278
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   279
The text of the comment.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   280
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   281
=back
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   282
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   283
=head2 processing_instruction
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   284
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   285
The processing instruction handler is called for all processing
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   286
instructions in the document. Note that these processing instructions
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   287
may appear before the document root element, or after it, or anywhere
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   288
where text and elements would normally appear within the document,
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   289
according to the XML specification.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   290
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   291
The handler is passed a structure containing just two entries:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   292
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   293
=over 4
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   294
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   295
=item Target
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   296
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   297
The target of the processing instrcution
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   298
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   299
=item Data
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   300
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   301
The text data in the processing instruction. Can be an empty
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   302
string for a processing instruction that has no data element. 
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   303
For example E<lt>?wiggle?E<gt> is a perfectly valid processing instruction.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   304
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   305
=back
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   306
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   307
=head1 Tip of the iceberg
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   308
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   309
What we have discussed above is really the tip of the SAX iceberg. And
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   310
so far it looks like there's not much of interest to SAX beyond what we
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   311
have seen with XML::Parser. But it does go much further than that, I
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   312
promise.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   313
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   314
People who hate Object Oriented code for the sake of it may be thinking
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   315
here that creating a new package just to parse something is a waste
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   316
when they've been parsing things just fine up to now using procedural
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   317
code. But there's reason to all this madness. And that reason is SAX
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   318
Filters.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   319
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   320
As you saw right at the very start, to let the parser know about our
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   321
class, we pass it an instance of our class as the Handler to the
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   322
parser. But now imagine what would happen if our class could also take
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   323
a Handler option, and simply do some processing and pass on our data
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   324
further down the line? That in a nutshell is how SAX filters work. It's
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   325
Unix pipes for the 21st century!
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   326
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   327
There are two downsides to this. Number 1 - writing SAX filters can be
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   328
tricky. If you look into the future and read the advanced tutorial I'm
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   329
writing, you'll see that Handler can come in several shapes and sizes.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   330
So making sure your filter does the right thing can be tricky.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   331
Secondly, constructing complex filter chains can be difficult, and
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   332
simple thinking tells us that we only get one pass at our document,
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   333
when often we'll need more than that.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   334
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   335
Luckily though, those downsides have been fixed by the release of two
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   336
very cool modules. What's even better is that I didn't write either of
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   337
them!
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   338
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   339
The first module is XML::SAX::Base. This is a VITAL SAX module that
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   340
acts as a base class for all SAX parsers and filters. It provides an
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   341
abstraction away from calling the handler methods, that makes sure your
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   342
filter or parser does the right thing, and it does it FAST. So, if you
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   343
ever need to write a SAX filter, which if you're processing XML -> XML,
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   344
or XML -> HTML, then you probably do, then you need to be writing it as
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   345
a subclass of XML::SAX::Base. Really - this is advice not to ignore
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   346
lightly. I will not go into the details of writing a SAX filter here.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   347
Kip Hampton, the author of XML::SAX::Base has covered this nicely in
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   348
his article on XML.com here <URI>.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   349
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   350
To construct SAX pipelines, Barrie Slaymaker, a long time Perl hacker
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   351
who's modules you will probably have heard of or used, wrote a very
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   352
clever module called XML::SAX::Machines. This combines some really
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   353
clever SAX filter-type modules, with a construction toolkit for filters
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   354
that makes building pipelines easy. But before we see how it makes
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   355
things easy, first lets see how tricky it looks to build complex SAX
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   356
filter pipelines.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   357
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   358
  use XML::SAX::ParserFactory;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   359
  use XML::Filter::Filter1;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   360
  use XML::Filter::Filter2;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   361
  use XML::SAX::Writer;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   362
  
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   363
  my $output_string;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   364
  my $writer = XML::SAX::Writer->new(Output => \$output_string);
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   365
  my $filter2 = XML::SAX::Filter2->new(Handler => $writer);
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   366
  my $filter1 = XML::SAX::Filter1->new(Handler => $filter2);
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   367
  my $parser = XML::SAX::ParserFactory->parser(Handler => $filter1);
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   368
  
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   369
  $parser->parse_uri("foo.xml");
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   370
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   371
This is a lot easier with XML::SAX::Machines:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   372
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   373
  use XML::SAX::Machines qw(Pipeline);
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   374
  
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   375
  my $output_string;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   376
  my $parser = Pipeline(
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   377
  	XML::SAX::Filter1 => XML::SAX::Filter2 => \$output_string
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   378
  	);
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   379
  
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   380
  $parser->parse_uri("foo.xml");
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   381
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   382
One of the main benefits of XML::SAX::Machines is that the pipelines
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   383
are constructed in natural order, rather than the reverse order we saw
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   384
with manual pipeline construction. XML::SAX::Machines takes care of all
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   385
the internals of pipe construction, providing you at the end with just
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   386
a parser you can use (and you can re-use the same parser as many times
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   387
as you need to).
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   388
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   389
Just a final tip. If you ever get stuck and are confused about what is
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   390
being passed from one SAX filter or parser to the next, then
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   391
Devel::TraceSAX will come to your rescue. This perl debugger plugin
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   392
will allow you to dump the SAX stream of events as it goes by. Usage is
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   393
really very simple just call your perl script that uses SAX as follows:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   394
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   395
  $ perl -d:TraceSAX <scriptname>
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   396
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   397
And preferably pipe the output to a pager of some sort, such as more or
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   398
less. The output is extremely verbose, but should help clear some
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   399
issues up.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   400
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   401
=head1 AUTHOR
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   402
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   403
Matt Sergeant, matt@sergeant.org
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   404
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   405
$Id: Intro.pod,v 1.3 2002/04/30 07:16:00 matt Exp $
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   406
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   407
=cut