uh_parser/XML/SAX/Intro.pod
author Dario Sestito <darios@symbian.org>
Wed, 03 Mar 2010 16:51:26 +0000
changeset 177 6d3c3db11e72
permissions -rw-r--r--
Add Raptor uh parser
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
177
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
     1
=head1 NAME
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
     2
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
     3
XML::SAX::Intro - An Introduction to SAX Parsing with Perl
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
     4
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
     5
=head1 Introduction
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
     6
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
     7
XML::SAX is a new way to work with XML Parsers in Perl. In this article
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
     8
we'll discuss why you should be using SAX, why you should be using
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
     9
XML::SAX, and we'll see some of the finer implementation details. The
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    10
text below assumes some familiarity with callback, or push based
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    11
parsing, but if you are unfamiliar with these techniques then a good
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    12
place to start is Kip Hampton's excellent series of articles on XML.com.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    13
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    14
=head1 Replacing XML::Parser
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    15
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    16
The de-facto way of parsing XML under perl is to use Larry Wall and
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    17
Clark Cooper's XML::Parser. This module is a Perl and XS wrapper around
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    18
the expat XML parser library by James Clark. It has been a hugely
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    19
successful project, but suffers from a couple of rather major flaws.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    20
Firstly it is a proprietary API, designed before the SAX API was
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    21
conceived, which means that it is not easily replaceable by other
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    22
streaming parsers. Secondly it's callbacks are subrefs. This doesn't
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    23
sound like much of an issue, but unfortunately leads to code like:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    24
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    25
  sub handle_start {
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    26
    my ($e, $el, %attrs) = @_;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    27
    if ($el eq 'foo') {
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    28
      $e->{inside_foo}++; # BAD! $e is an XML::Parser::Expat object.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    29
    }
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    30
  }
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    31
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    32
As you can see, we're using the $e object to hold our state
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    33
information, which is a bad idea because we don't own that object - we
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    34
didn't create it. It's an internal object of XML::Parser, that happens
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    35
to be a hashref. We could all too easily overwrite XML::Parser internal
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    36
state variables by using this, or Clark could change it to an array ref
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    37
(not that he would, because it would break so much code, but he could).
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    38
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    39
The only way currently with XML::Parser to safely maintain state is to
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    40
use a closure:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    41
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    42
  my $state = MyState->new();
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    43
  $parser->setHandlers(Start => sub { handle_start($state, @_) });
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    44
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    45
This closure traps the $state variable, which now gets passed as the
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    46
first parameter to your callback. Unfortunately very few people use
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    47
this technique, as it is not documented in the XML::Parser POD files.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    48
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    49
Another reason you might not want to use XML::Parser is because you
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    50
need some feature that it doesn't provide (such as validation), or you
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    51
might need to use a library that doesn't use expat, due to it not being
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    52
installed on your system, or due to having a restrictive ISP. Using SAX
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    53
allows you to work around these restrictions.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    54
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    55
=head1 Introducing SAX
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    56
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    57
SAX stands for the Simple API for XML. And simple it really is.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    58
Constructing a SAX parser and passing events to handlers is done as
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    59
simply as:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    60
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    61
  use XML::SAX;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    62
  use MySAXHandler;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    63
  
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    64
  my $parser = XML::SAX::ParserFactory->parser(
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    65
  	Handler => MySAXHandler->new
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    66
  );
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    67
  
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    68
  $parser->parse_uri("foo.xml");
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    69
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    70
The important concept to grasp here is that SAX uses a factory class
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    71
called XML::SAX::ParserFactory to create a new parser instance. The
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    72
reason for this is so that you can support other underlying
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    73
parser implementations for different feature sets. This is one thing
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    74
that XML::Parser has always sorely lacked.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    75
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    76
In the code above we see the parse_uri method used, but we could
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    77
have equally well
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    78
called parse_file, parse_string, or parse(). Please see XML::SAX::Base
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    79
for what these methods take as parameters, but don't be fooled into
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    80
believing parse_file takes a filename. No, it takes a file handle, a
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    81
glob, or a subclass of IO::Handle. Beware.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    82
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    83
SAX works very similarly to XML::Parser's default callback method,
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    84
except it has one major difference: rather than setting individual
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    85
callbacks, you create a new class in which to recieve the callbacks.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    86
Each callback is called as a method call on an instance of that handler
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    87
class. An example will best demonstrate this:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    88
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    89
  package MySAXHandler;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    90
  use base qw(XML::SAX::Base);
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    91
  
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    92
  sub start_document {
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    93
    my ($self, $doc) = @_;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    94
    # process document start event
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    95
  }
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    96
  
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    97
  sub start_element {
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    98
    my ($self, $el) = @_;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
    99
    # process element start event
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   100
  }
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   101
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   102
Now, when we instantiate this as above, and parse some XML with this as
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   103
the handler, the methods start_document and start_element will be
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   104
called as method calls, so this would be the equivalent of directly
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   105
calling:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   106
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   107
  $object->start_element($el);
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   108
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   109
Notice how this is different to XML::Parser's calling style, which
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   110
calls:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   111
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   112
  start_element($e, $name, %attribs);
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   113
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   114
It's the difference between function calling and method calling which
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   115
allows you to subclass SAX handlers which contributes to SAX being a
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   116
powerful solution.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   117
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   118
As you can see, unlike XML::Parser, we have to define a new package in
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   119
which to do our processing (there are hacks you can do to make this
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   120
uneccessary, but I'll leave figuring those out to the experts). The
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   121
biggest benefit of this is that you maintain your own state variable
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   122
($self in the above example) thus freeing you of the concerns listed
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   123
above. It is also an improvement in maintainability - you can place the
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   124
code in a separate file if you wish to, and your callback methods are
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   125
always called the same thing, rather than having to choose a suitable
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   126
name for them as you had to with XML::Parser. This is an obvious win.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   127
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   128
SAX parsers are also very flexible in how you pass a handler to them.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   129
You can use a constructor parameter as we saw above, or we can pass the
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   130
handler directly in the call to one of the parse methods:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   131
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   132
  $parser->parse(Handler => $handler, 
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   133
                 Source => { SystemId => "foo.xml" });
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   134
  # or...
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   135
  $parser->parse_file($fh, Handler => $handler);
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   136
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   137
This flexibility allows for one parser to be used in many different
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   138
scenarios throughout your script (though one shouldn't feel pressure to
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   139
use this method, as parser construction is generally not a time
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   140
consuming process).
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   141
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   142
=head1 Callback Parameters
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   143
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   144
The only other thing you need to know to understand basic SAX is the
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   145
structure of the parameters passed to each of the callbacks. In
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   146
XML::Parser, all parameters are passed as multiple options to the
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   147
callbacks, so for example the Start callback would be called as
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   148
my_start($e, $name, %attributes), and the PI callback would be called
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   149
as my_processing_instruction($e, $target, $data). In SAX, every
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   150
callback is passed a hash reference, containing entries that define our
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   151
"node". The key callbacks and the structures they receive are:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   152
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   153
=head2 start_element
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   154
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   155
The start_element handler is called whenever a parser sees an opening
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   156
tag. It is passed an element structure consisting of:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   157
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   158
=over 4
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   159
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   160
=item LocalName
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   161
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   162
The name of the element minus any namespace prefix it may
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   163
have come with in the document.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   164
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   165
=item NamespaceURI
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   166
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   167
The URI of the namespace associated with this element,
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   168
or the empty string for none.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   169
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   170
=item Attributes
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   171
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   172
A set of attributes as described below.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   173
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   174
=item Name
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   175
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   176
The name of the element as it was seen in the document (i.e.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   177
including any prefix associated with it)
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   178
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   179
=item Prefix
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   180
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   181
The prefix used to qualify this element's namespace, or the 
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   182
empty string if none.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   183
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   184
=back
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   185
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   186
The B<Attributes> are a hash reference, keyed by what we have called
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   187
"James Clark" notation. This means that the attribute name has been
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   188
expanded to include any associated namespace URI, and put together as
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   189
{ns}name, where "ns" is the expanded namespace URI of the attribute if
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   190
and only if the attribute had a prefix, and "name" is the LocalName of
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   191
the attribute.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   192
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   193
The value of each entry in the attributes hash is another hash
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   194
structure consisting of:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   195
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   196
=over 4
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   197
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   198
=item LocalName
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   199
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   200
The name of the attribute minus any namespace prefix it may have
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   201
come with in the document.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   202
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   203
=item NamespaceURI
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   204
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   205
The URI of the namespace associated with this attribute. If the 
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   206
attribute had no prefix, then this consists of just the empty string.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   207
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   208
=item Name
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   209
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   210
The attribute's name as it appeared in the document, including any 
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   211
namespace prefix.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   212
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   213
=item Prefix
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   214
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   215
The prefix used to qualify this attribute's namepace, or the 
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   216
empty string if none.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   217
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   218
=item Value
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   219
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   220
The value of the attribute.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   221
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   222
=back
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   223
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   224
So a full example, as output by Data::Dumper might be:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   225
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   226
  ....
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   227
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   228
=head2 end_element
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   229
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   230
The end_element handler is called either when a parser sees a closing
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   231
tag, or after start_element has been called for an empty element (do
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   232
note however that a parser may if it is so inclined call characters
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   233
with an empty string when it sees an empty element. There is no simple
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   234
way in SAX to determine if the parser in fact saw an empty element, a
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   235
start and end element with no content..
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   236
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   237
The end_element handler receives exactly the same structure as
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   238
start_element, minus the Attributes entry. One must note though that it
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   239
should not be a reference to the same data as start_element receives,
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   240
so you may change the values in start_element but this will not affect
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   241
the values later seen by end_element.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   242
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   243
=head2 characters
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   244
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   245
The characters callback may be called in serveral circumstances. The
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   246
most obvious one is when seeing ordinary character data in the markup.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   247
But it is also called for text in a CDATA section, and is also called
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   248
in other situations. A SAX parser has to make no guarantees whatsoever
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   249
about how many times it may call characters for a stretch of text in an
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   250
XML document - it may call once, or it may call once for every
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   251
character in the text. In order to work around this it is often
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   252
important for the SAX developer to use a bundling technique, where text
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   253
is gathered up and processed in one of the other callbacks. This is not
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   254
always necessary, but it is a worthwhile technique to learn, which we
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   255
will cover in XML::SAX::Advanced (when I get around to writing it).
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   256
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   257
The characters handler is called with a very simple structure - a hash
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   258
reference consisting of just one entry:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   259
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   260
=over 4
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   261
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   262
=item Data
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   263
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   264
The text data that was received.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   265
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   266
=back
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   267
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   268
=head2 comment
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   269
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   270
The comment callback is called for comment text. Unlike with
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   271
C<characters()>, the comment callback *must* be invoked just once for an
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   272
entire comment string. It receives a single simple structure - a hash
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   273
reference containing just one entry:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   274
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   275
=over 4
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   276
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   277
=item Data
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   278
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   279
The text of the comment.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   280
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   281
=back
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   282
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   283
=head2 processing_instruction
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   284
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   285
The processing instruction handler is called for all processing
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   286
instructions in the document. Note that these processing instructions
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   287
may appear before the document root element, or after it, or anywhere
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   288
where text and elements would normally appear within the document,
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   289
according to the XML specification.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   290
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   291
The handler is passed a structure containing just two entries:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   292
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   293
=over 4
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   294
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   295
=item Target
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   296
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   297
The target of the processing instrcution
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   298
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   299
=item Data
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   300
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   301
The text data in the processing instruction. Can be an empty
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   302
string for a processing instruction that has no data element. 
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   303
For example E<lt>?wiggle?E<gt> is a perfectly valid processing instruction.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   304
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   305
=back
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   306
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   307
=head1 Tip of the iceberg
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   308
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   309
What we have discussed above is really the tip of the SAX iceberg. And
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   310
so far it looks like there's not much of interest to SAX beyond what we
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   311
have seen with XML::Parser. But it does go much further than that, I
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   312
promise.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   313
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   314
People who hate Object Oriented code for the sake of it may be thinking
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   315
here that creating a new package just to parse something is a waste
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   316
when they've been parsing things just fine up to now using procedural
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   317
code. But there's reason to all this madness. And that reason is SAX
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   318
Filters.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   319
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   320
As you saw right at the very start, to let the parser know about our
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   321
class, we pass it an instance of our class as the Handler to the
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   322
parser. But now imagine what would happen if our class could also take
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   323
a Handler option, and simply do some processing and pass on our data
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   324
further down the line? That in a nutshell is how SAX filters work. It's
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   325
Unix pipes for the 21st century!
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   326
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   327
There are two downsides to this. Number 1 - writing SAX filters can be
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   328
tricky. If you look into the future and read the advanced tutorial I'm
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   329
writing, you'll see that Handler can come in several shapes and sizes.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   330
So making sure your filter does the right thing can be tricky.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   331
Secondly, constructing complex filter chains can be difficult, and
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   332
simple thinking tells us that we only get one pass at our document,
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   333
when often we'll need more than that.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   334
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   335
Luckily though, those downsides have been fixed by the release of two
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   336
very cool modules. What's even better is that I didn't write either of
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   337
them!
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   338
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   339
The first module is XML::SAX::Base. This is a VITAL SAX module that
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   340
acts as a base class for all SAX parsers and filters. It provides an
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   341
abstraction away from calling the handler methods, that makes sure your
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   342
filter or parser does the right thing, and it does it FAST. So, if you
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   343
ever need to write a SAX filter, which if you're processing XML -> XML,
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   344
or XML -> HTML, then you probably do, then you need to be writing it as
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   345
a subclass of XML::SAX::Base. Really - this is advice not to ignore
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   346
lightly. I will not go into the details of writing a SAX filter here.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   347
Kip Hampton, the author of XML::SAX::Base has covered this nicely in
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   348
his article on XML.com here <URI>.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   349
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   350
To construct SAX pipelines, Barrie Slaymaker, a long time Perl hacker
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   351
who's modules you will probably have heard of or used, wrote a very
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   352
clever module called XML::SAX::Machines. This combines some really
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   353
clever SAX filter-type modules, with a construction toolkit for filters
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   354
that makes building pipelines easy. But before we see how it makes
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   355
things easy, first lets see how tricky it looks to build complex SAX
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   356
filter pipelines.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   357
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   358
  use XML::SAX::ParserFactory;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   359
  use XML::Filter::Filter1;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   360
  use XML::Filter::Filter2;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   361
  use XML::SAX::Writer;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   362
  
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   363
  my $output_string;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   364
  my $writer = XML::SAX::Writer->new(Output => \$output_string);
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   365
  my $filter2 = XML::SAX::Filter2->new(Handler => $writer);
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   366
  my $filter1 = XML::SAX::Filter1->new(Handler => $filter2);
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   367
  my $parser = XML::SAX::ParserFactory->parser(Handler => $filter1);
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   368
  
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   369
  $parser->parse_uri("foo.xml");
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   370
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   371
This is a lot easier with XML::SAX::Machines:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   372
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   373
  use XML::SAX::Machines qw(Pipeline);
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   374
  
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   375
  my $output_string;
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   376
  my $parser = Pipeline(
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   377
  	XML::SAX::Filter1 => XML::SAX::Filter2 => \$output_string
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   378
  	);
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   379
  
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   380
  $parser->parse_uri("foo.xml");
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   381
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   382
One of the main benefits of XML::SAX::Machines is that the pipelines
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   383
are constructed in natural order, rather than the reverse order we saw
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   384
with manual pipeline construction. XML::SAX::Machines takes care of all
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   385
the internals of pipe construction, providing you at the end with just
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   386
a parser you can use (and you can re-use the same parser as many times
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   387
as you need to).
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   388
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   389
Just a final tip. If you ever get stuck and are confused about what is
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   390
being passed from one SAX filter or parser to the next, then
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   391
Devel::TraceSAX will come to your rescue. This perl debugger plugin
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   392
will allow you to dump the SAX stream of events as it goes by. Usage is
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   393
really very simple just call your perl script that uses SAX as follows:
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   394
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   395
  $ perl -d:TraceSAX <scriptname>
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   396
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   397
And preferably pipe the output to a pager of some sort, such as more or
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   398
less. The output is extremely verbose, but should help clear some
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   399
issues up.
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   400
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   401
=head1 AUTHOR
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   402
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   403
Matt Sergeant, matt@sergeant.org
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   404
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   405
$Id: Intro.pod,v 1.3 2002/04/30 07:16:00 matt Exp $
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   406
6d3c3db11e72 Add Raptor uh parser
Dario Sestito <darios@symbian.org>
parents:
diff changeset
   407
=cut