FCL/sftools/fbf/configs/default: comparison common/tools/raptor/XML/SAX/Intro.pod

equal deleted inserted replaced

-:996297fad800
+:5ccf9d5ab663
-=head1 NAME
-XML::SAX::Intro - An Introduction to SAX Parsing with Perl
-=head1 Introduction
-XML::SAX is a new way to work with XML Parsers in Perl. In this article
-we'll discuss why you should be using SAX, why you should be using
-XML::SAX, and we'll see some of the finer implementation details. The
-text below assumes some familiarity with callback, or push based
-parsing, but if you are unfamiliar with these techniques then a good
-place to start is Kip Hampton's excellent series of articles on XML.com.
-=head1 Replacing XML::Parser
-The de-facto way of parsing XML under perl is to use Larry Wall and
-Clark Cooper's XML::Parser. This module is a Perl and XS wrapper around
-the expat XML parser library by James Clark. It has been a hugely
-successful project, but suffers from a couple of rather major flaws.
-Firstly it is a proprietary API, designed before the SAX API was
-conceived, which means that it is not easily replaceable by other
-streaming parsers. Secondly it's callbacks are subrefs. This doesn't
-sound like much of an issue, but unfortunately leads to code like:
-sub handle_start {
-my ($e, $el, %attrs) = @_;
-if ($el eq 'foo') {
-$e->{inside_foo}++; # BAD! $e is an XML::Parser::Expat object.
-}
-}
-As you can see, we're using the $e object to hold our state
-information, which is a bad idea because we don't own that object - we
-didn't create it. It's an internal object of XML::Parser, that happens
-to be a hashref. We could all too easily overwrite XML::Parser internal
-state variables by using this, or Clark could change it to an array ref
-(not that he would, because it would break so much code, but he could).
-The only way currently with XML::Parser to safely maintain state is to
-use a closure:
-my $state = MyState->new();
-$parser->setHandlers(Start => sub { handle_start($state, @_) });
-This closure traps the $state variable, which now gets passed as the
-first parameter to your callback. Unfortunately very few people use
-this technique, as it is not documented in the XML::Parser POD files.
-Another reason you might not want to use XML::Parser is because you
-need some feature that it doesn't provide (such as validation), or you
-might need to use a library that doesn't use expat, due to it not being
-installed on your system, or due to having a restrictive ISP. Using SAX
-allows you to work around these restrictions.
-=head1 Introducing SAX
-SAX stands for the Simple API for XML. And simple it really is.
-Constructing a SAX parser and passing events to handlers is done as
-simply as:
-use XML::SAX;
-use MySAXHandler;
-my $parser = XML::SAX::ParserFactory->parser(
-	Handler => MySAXHandler->new
-);
-$parser->parse_uri("foo.xml");
-The important concept to grasp here is that SAX uses a factory class
-called XML::SAX::ParserFactory to create a new parser instance. The
-reason for this is so that you can support other underlying
-parser implementations for different feature sets. This is one thing
-that XML::Parser has always sorely lacked.
-In the code above we see the parse_uri method used, but we could
-have equally well
-called parse_file, parse_string, or parse(). Please see XML::SAX::Base
-for what these methods take as parameters, but don't be fooled into
-believing parse_file takes a filename. No, it takes a file handle, a
-glob, or a subclass of IO::Handle. Beware.
-SAX works very similarly to XML::Parser's default callback method,
-except it has one major difference: rather than setting individual
-callbacks, you create a new class in which to recieve the callbacks.
-Each callback is called as a method call on an instance of that handler
-class. An example will best demonstrate this:
-package MySAXHandler;
-use base qw(XML::SAX::Base);
-sub start_document {
-my ($self, $doc) = @_;
-# process document start event
-}
-sub start_element {
-my ($self, $el) = @_;
-# process element start event
-}
-Now, when we instantiate this as above, and parse some XML with this as
-the handler, the methods start_document and start_element will be
-called as method calls, so this would be the equivalent of directly
-calling:
-$object->start_element($el);
-Notice how this is different to XML::Parser's calling style, which
-calls:
-start_element($e, $name, %attribs);
-It's the difference between function calling and method calling which
-allows you to subclass SAX handlers which contributes to SAX being a
-powerful solution.
-As you can see, unlike XML::Parser, we have to define a new package in
-which to do our processing (there are hacks you can do to make this
-uneccessary, but I'll leave figuring those out to the experts). The
-biggest benefit of this is that you maintain your own state variable
-($self in the above example) thus freeing you of the concerns listed
-above. It is also an improvement in maintainability - you can place the
-code in a separate file if you wish to, and your callback methods are
-always called the same thing, rather than having to choose a suitable
-name for them as you had to with XML::Parser. This is an obvious win.
-SAX parsers are also very flexible in how you pass a handler to them.
-You can use a constructor parameter as we saw above, or we can pass the
-handler directly in the call to one of the parse methods:
-$parser->parse(Handler => $handler,
-Source => { SystemId => "foo.xml" });
-# or...
-$parser->parse_file($fh, Handler => $handler);
-This flexibility allows for one parser to be used in many different
-scenarios throughout your script (though one shouldn't feel pressure to
-use this method, as parser construction is generally not a time
-consuming process).
-=head1 Callback Parameters
-The only other thing you need to know to understand basic SAX is the
-structure of the parameters passed to each of the callbacks. In
-XML::Parser, all parameters are passed as multiple options to the
-callbacks, so for example the Start callback would be called as
-my_start($e, $name, %attributes), and the PI callback would be called
-as my_processing_instruction($e, $target, $data). In SAX, every
-callback is passed a hash reference, containing entries that define our
-"node". The key callbacks and the structures they receive are:
-=head2 start_element
-The start_element handler is called whenever a parser sees an opening
-tag. It is passed an element structure consisting of:
-=over 4
-=item LocalName
-The name of the element minus any namespace prefix it may
-have come with in the document.
-=item NamespaceURI
-The URI of the namespace associated with this element,
-or the empty string for none.
-=item Attributes
-A set of attributes as described below.
-=item Name
-The name of the element as it was seen in the document (i.e.
-including any prefix associated with it)
-=item Prefix
-The prefix used to qualify this element's namespace, or the
-empty string if none.
-=back
-The B<Attributes> are a hash reference, keyed by what we have called
-"James Clark" notation. This means that the attribute name has been
-expanded to include any associated namespace URI, and put together as
-{ns}name, where "ns" is the expanded namespace URI of the attribute if
-and only if the attribute had a prefix, and "name" is the LocalName of
-the attribute.
-The value of each entry in the attributes hash is another hash
-structure consisting of:
-=over 4
-=item LocalName
-The name of the attribute minus any namespace prefix it may have
-come with in the document.
-=item NamespaceURI
-The URI of the namespace associated with this attribute. If the
-attribute had no prefix, then this consists of just the empty string.
-=item Name
-The attribute's name as it appeared in the document, including any
-namespace prefix.
-=item Prefix
-The prefix used to qualify this attribute's namepace, or the
-empty string if none.
-=item Value
-The value of the attribute.
-=back
-So a full example, as output by Data::Dumper might be:
-....
-=head2 end_element
-The end_element handler is called either when a parser sees a closing
-tag, or after start_element has been called for an empty element (do
-note however that a parser may if it is so inclined call characters
-with an empty string when it sees an empty element. There is no simple
-way in SAX to determine if the parser in fact saw an empty element, a
-start and end element with no content..
-The end_element handler receives exactly the same structure as
-start_element, minus the Attributes entry. One must note though that it
-should not be a reference to the same data as start_element receives,
-so you may change the values in start_element but this will not affect
-the values later seen by end_element.
-=head2 characters
-The characters callback may be called in serveral circumstances. The
-most obvious one is when seeing ordinary character data in the markup.
-But it is also called for text in a CDATA section, and is also called
-in other situations. A SAX parser has to make no guarantees whatsoever
-about how many times it may call characters for a stretch of text in an
-XML document - it may call once, or it may call once for every
-character in the text. In order to work around this it is often
-important for the SAX developer to use a bundling technique, where text
-is gathered up and processed in one of the other callbacks. This is not
-always necessary, but it is a worthwhile technique to learn, which we
-will cover in XML::SAX::Advanced (when I get around to writing it).
-The characters handler is called with a very simple structure - a hash
-reference consisting of just one entry:
-=over 4
-=item Data
-The text data that was received.
-=back
-=head2 comment
-The comment callback is called for comment text. Unlike with
-C<characters()>, the comment callback *must* be invoked just once for an
-entire comment string. It receives a single simple structure - a hash
-reference containing just one entry:
-=over 4
-=item Data
-The text of the comment.
-=back
-=head2 processing_instruction
-The processing instruction handler is called for all processing
-instructions in the document. Note that these processing instructions
-may appear before the document root element, or after it, or anywhere
-where text and elements would normally appear within the document,
-according to the XML specification.
-The handler is passed a structure containing just two entries:
-=over 4
-=item Target
-The target of the processing instrcution
-=item Data
-The text data in the processing instruction. Can be an empty
-string for a processing instruction that has no data element.
-For example E<lt>?wiggle?E<gt> is a perfectly valid processing instruction.
-=back
-=head1 Tip of the iceberg
-What we have discussed above is really the tip of the SAX iceberg. And
-so far it looks like there's not much of interest to SAX beyond what we
-have seen with XML::Parser. But it does go much further than that, I
-promise.
-People who hate Object Oriented code for the sake of it may be thinking
-here that creating a new package just to parse something is a waste
-when they've been parsing things just fine up to now using procedural
-code. But there's reason to all this madness. And that reason is SAX
-Filters.
-As you saw right at the very start, to let the parser know about our
-class, we pass it an instance of our class as the Handler to the
-parser. But now imagine what would happen if our class could also take
-a Handler option, and simply do some processing and pass on our data
-further down the line? That in a nutshell is how SAX filters work. It's
-Unix pipes for the 21st century!
-There are two downsides to this. Number 1 - writing SAX filters can be
-tricky. If you look into the future and read the advanced tutorial I'm
-writing, you'll see that Handler can come in several shapes and sizes.
-So making sure your filter does the right thing can be tricky.
-Secondly, constructing complex filter chains can be difficult, and
-simple thinking tells us that we only get one pass at our document,
-when often we'll need more than that.
-Luckily though, those downsides have been fixed by the release of two
-very cool modules. What's even better is that I didn't write either of
-them!
-The first module is XML::SAX::Base. This is a VITAL SAX module that
-acts as a base class for all SAX parsers and filters. It provides an
-abstraction away from calling the handler methods, that makes sure your
-filter or parser does the right thing, and it does it FAST. So, if you
-ever need to write a SAX filter, which if you're processing XML -> XML,
-or XML -> HTML, then you probably do, then you need to be writing it as
-a subclass of XML::SAX::Base. Really - this is advice not to ignore
-lightly. I will not go into the details of writing a SAX filter here.
-Kip Hampton, the author of XML::SAX::Base has covered this nicely in
-his article on XML.com here <URI>.
-To construct SAX pipelines, Barrie Slaymaker, a long time Perl hacker
-who's modules you will probably have heard of or used, wrote a very
-clever module called XML::SAX::Machines. This combines some really
-clever SAX filter-type modules, with a construction toolkit for filters
-that makes building pipelines easy. But before we see how it makes
-things easy, first lets see how tricky it looks to build complex SAX
-filter pipelines.
-use XML::SAX::ParserFactory;
-use XML::Filter::Filter1;
-use XML::Filter::Filter2;
-use XML::SAX::Writer;
-my $output_string;
-my $writer = XML::SAX::Writer->new(Output => \$output_string);
-my $filter2 = XML::SAX::Filter2->new(Handler => $writer);
-my $filter1 = XML::SAX::Filter1->new(Handler => $filter2);
-my $parser = XML::SAX::ParserFactory->parser(Handler => $filter1);
-$parser->parse_uri("foo.xml");
-This is a lot easier with XML::SAX::Machines:
-use XML::SAX::Machines qw(Pipeline);
-my $output_string;
-my $parser = Pipeline(
-	XML::SAX::Filter1 => XML::SAX::Filter2 => \$output_string
-	);
-$parser->parse_uri("foo.xml");
-One of the main benefits of XML::SAX::Machines is that the pipelines
-are constructed in natural order, rather than the reverse order we saw
-with manual pipeline construction. XML::SAX::Machines takes care of all
-the internals of pipe construction, providing you at the end with just
-a parser you can use (and you can re-use the same parser as many times
-as you need to).
-Just a final tip. If you ever get stuck and are confused about what is
-being passed from one SAX filter or parser to the next, then
-Devel::TraceSAX will come to your rescue. This perl debugger plugin
-will allow you to dump the SAX stream of events as it goes by. Usage is
-really very simple just call your perl script that uses SAX as follows:
-$ perl -d:TraceSAX <scriptname>
-And preferably pipe the output to a pager of some sort, such as more or
-less. The output is extremely verbose, but should help clear some
-issues up.
-=head1 AUTHOR
-Matt Sergeant, matt@sergeant.org
-$Id: Intro.pod,v 1.3 2002/04/30 07:16:00 matt Exp $
-=cut

changeset 923	5ccf9d5ab663
parent 922	996297fad800
parent 907	bab81256b297
child 924	a5ed0e6ca679