=head1 NAME

XML::SAX::Intro - An Introduction to SAX Parsing with Perl

=head1 Introduction

XML::SAX is a new way to work with XML Parsers in Perl. In this article
we'll discuss why you should be using SAX, why you should be using
XML::SAX, and we'll see some of the finer implementation details. The
text below assumes some familiarity with callback-based, or push-based,
parsing. If you are unfamiliar with these techniques, a good place to
start is Kip Hampton's excellent series of articles on XML.com.

=head1 Replacing XML::Parser

The de facto way of parsing XML under Perl is to use Larry Wall and
Clark Cooper's XML::Parser. This module is a Perl and XS wrapper around
the expat XML parser library by James Clark. It has been a hugely
successful project, but suffers from a couple of rather major flaws.
Firstly, it is a proprietary API, designed before the SAX API was
conceived, which means that it is not easily replaceable by other
streaming parsers. Secondly, its callbacks are subrefs. This doesn't
sound like much of an issue, but unfortunately leads to code like:

  sub handle_start {
    my ($e, $el, %attrs) = @_;
    if ($el eq 'foo') {
      $e->{inside_foo}++; # BAD! $e is an XML::Parser::Expat object.
    }
  }

As you can see, we're using the $e object to hold our state
information, which is a bad idea because we don't own that object - we
didn't create it. It's an internal object of XML::Parser, that happens
to be a hashref. We could all too easily overwrite XML::Parser internal
state variables by using this, or Clark could change it to an array ref
(not that he would, because it would break so much code, but he could).

Currently, the only safe way to maintain state with XML::Parser is to
use a closure:

  my $state = MyState->new();
  $parser->setHandlers(Start => sub { handle_start($state, @_) });

This closure traps the $state variable, which now gets passed as the
first parameter to your callback. Unfortunately, very few people use
this technique, as it is not documented in the XML::Parser POD files.
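
To see the closure in isolation, here is a minimal sketch that calls the
wrapped callback by hand, the way XML::Parser would invoke a Start
handler, so that it runs without the module installed. The body of the
MyState class, the $fake_expat placeholder and the sample element names
are assumptions made for illustration; only the shape of the closure
comes from the text above.

```perl
use strict;
use warnings;

# Hypothetical state class; a plain hash reference would do just as well.
package MyState;

sub new {
    my ($class) = @_;
    return bless { inside_foo => 0 }, $class;
}

package main;

# Our own state object arrives first, followed by the arguments
# XML::Parser passes to a Start handler: the Expat object, the element
# name, and the attributes as a flat key/value list.
sub handle_start {
    my ($state, $e, $el, %attrs) = @_;
    $state->{inside_foo}++ if $el eq 'foo';
}

my $state = MyState->new();

# The closure traps $state; with a real parser this sub would be passed
# to $parser->setHandlers(Start => ...).
my $start = sub { handle_start($state, @_) };

# Stand-in for the parser (an assumption, not the real Expat object).
my $fake_expat = {};
$start->($fake_expat, 'foo', id => '1');
$start->($fake_expat, 'bar');
$start->($fake_expat, 'foo');

print "foo start tags seen: $state->{inside_foo}\n";
```

The parser's own object is never written to: all bookkeeping ends up in
$state, which the calling code created and therefore owns.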

Another reason you might not want to use XML::Parser is that you need
some feature that it doesn't provide (such as validation). You might
also need a library that doesn't use expat, either because expat is not
installed on your system or because your ISP is restrictive. Using SAX
allows you to work around these restrictions.

=head1 Introducing SAX

SAX stands for the Simple API for XML. And simple it really is.
Constructing a SAX parser and passing events to handlers is done as
simply as:

  use XML::SAX;
  use MySAXHandler;

  my $parser = XML::SAX::ParserFactory->parser(
    Handler => MySAXHandler->new
  );

  $parser->parse_uri("foo.xml");

The important concept to grasp here is that SAX uses a factory class
called XML::SAX::ParserFactory to create a new parser instance. The
reason for this is so that you can support other underlying
parser implementations for different feature sets. This is one thing
that XML::Parser has always sorely lacked.

In the code above we see the parse_uri method used, but we could
equally well have called parse_file, parse_string, or parse(). Please
see XML::SAX::Base for what these methods take as parameters, but don't
be fooled into believing parse_file takes a filename. It does not: it
takes a file handle, a glob, or a subclass of IO::Handle. Beware.

SAX works very similarly to XML::Parser's default callback method,
except it has one major difference: rather than setting individual
callbacks, you create a new class in which to receive the callbacks.
Each callback is called as a method call on an instance of that handler
class. An example will best demonstrate this:

  package MySAXHandler;
  use base qw(XML::SAX::Base);

  sub start_document {
    my ($self, $doc) = @_;
    # process document start event
  }

  sub start_element {
    my ($self, $el) = @_;
    # process element start event
  }

Now, when we instantiate this as above, and parse some XML with this as
the handler, the methods start_document and start_element will be
called as method calls, so this would be the equivalent of directly
calling:

  $object->start_element($el);

Notice how this is different to XML::Parser's calling style, which
calls:

  start_element($e, $name, %attribs);

It is this difference between function calls and method calls that
allows you to subclass SAX handlers, which in turn contributes to SAX
being a powerful solution.

As you can see, unlike XML::Parser, we have to define a new package in
which to do our processing (there are hacks you can do to make this
unnecessary, but I'll leave figuring those out to the experts). The
biggest benefit of this is that you maintain your own state variable
($self in the above example), thus freeing you of the concerns listed
above. It is also an improvement in maintainability - you can place the
code in a separate file if you wish to, and your callback methods are
always called the same thing, rather than having to choose a suitable
name for them as you had to with XML::Parser. This is an obvious win.
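
The two points above, state kept in $self and callbacks that can be
overridden in a subclass, can be sketched as follows. This is an
illustration under stated assumptions rather than a recipe: the class
names are made up, the events are delivered by hand instead of by a
parser, and the classes do not inherit from XML::SAX::Base so that the
sketch runs on its own (a real handler would normally add
"use base qw(XML::SAX::Base)" as shown earlier).

```perl
use strict;
use warnings;

# Hypothetical handler that counts every element it is told about.
package CountingHandler;

sub new {
    my ($class) = @_;
    return bless { count => 0 }, $class;
}

# Same method name and argument shape as a SAX start_element callback.
sub start_element {
    my ($self, $el) = @_;
    $self->{count}++;    # state lives in our own object
}

# Hypothetical subclass that only counts <foo> elements.
package FooCountingHandler;
our @ISA = ('CountingHandler');

# Override the callback, delegating to the parent only for <foo>.
sub start_element {
    my ($self, $el) = @_;
    return unless $el->{Name} eq 'foo';
    $self->SUPER::start_element($el);
}

package main;

my $all = CountingHandler->new;
my $foo = FooCountingHandler->new;

# Stand-in for the parser: deliver four start_element events by hand.
for my $name (qw(foo bar foo baz)) {
    $_->start_element({ Name => $name }) for $all, $foo;
}

print "all elements: $all->{count}, foo only: $foo->{count}\n";
```

Because each handler is an ordinary object, the two instances keep
separate counts without any shared or global variables.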

SAX parsers are also very flexible in how you pass a handler to them.
You can use a constructor parameter as we saw above, or you can pass the
handler directly in the call to one of the parse methods:

  $parser->parse(Handler => $handler,
                 Source => { SystemId => "foo.xml" });
  # or...
  $parser->parse_file($fh, Handler => $handler);

This flexibility allows for one parser to be used in many different
scenarios throughout your script (though one shouldn't feel pressure to
use this method, as parser construction is generally not a
time-consuming process).

=head1 Callback Parameters

The only other thing you need to know to understand basic SAX is the
structure of the parameters passed to each of the callbacks. In
XML::Parser, all parameters are passed to the callbacks as a flat list
of arguments, so for example the Start callback would be called as
my_start($e, $name, %attributes), and the PI callback would be called
as my_processing_instruction($e, $target, $data). In SAX, every
callback is passed a hash reference, containing entries that define our
"node". The key callbacks and the structures they receive are:

=head2 start_element

The start_element handler is called whenever a parser sees an opening
tag. It is passed an element structure consisting of:

=over 4

=item LocalName

The name of the element minus any namespace prefix it may
have come with in the document.

=item NamespaceURI

The URI of the namespace associated with this element,
or the empty string for none.

=item Attributes

A set of attributes as described below.

=item Name

The name of the element as it was seen in the document (i.e.
including any prefix associated with it).

=item Prefix

The prefix used to qualify this element's namespace, or the
empty string if none.

=back

The B<Attributes> are a hash reference, keyed by what we have called
"James Clark" notation. This means that the attribute name has been
expanded to include any associated namespace URI, and put together as
{ns}name, where "ns" is the expanded namespace URI of the attribute if
and only if the attribute had a prefix, and "name" is the LocalName of
the attribute.

The value of each entry in the attributes hash is another hash
structure consisting of:

=over 4

=item LocalName

The name of the attribute minus any namespace prefix it may have
come with in the document.

=item NamespaceURI

The URI of the namespace associated with this attribute. If the
attribute had no prefix, then this consists of just the empty string.

=item Name

The attribute's name as it appeared in the document, including any
namespace prefix.

=item Prefix

The prefix used to qualify this attribute's namespace, or the
empty string if none.

=item Value

The value of the attribute.

=back

So a full example, as output by Data::Dumper, might be:

  ....
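
The dump itself is elided above. As a stand-in, the sketch below builds
by hand the kind of structure the field descriptions suggest and prints
it with Data::Dumper. The sample element, the example.org namespace URI
and the attr_value helper are all assumptions made for illustration; a
real parser may report more than this (for instance, the namespace
declaration may also show up as an attribute).

```perl
use strict;
use warnings;
use Data::Dumper;

# Hand-built stand-in for what start_element might receive for
#   <x:foo xmlns:x="http://example.org/x" x:id="1" class="a"/>
# The namespace declaration attribute is left out for brevity.
my $element = {
    Name         => 'x:foo',
    LocalName    => 'foo',
    Prefix       => 'x',
    NamespaceURI => 'http://example.org/x',
    Attributes   => {
        # Prefixed attribute: the key carries the expanded namespace URI.
        '{http://example.org/x}id' => {
            Name         => 'x:id',
            LocalName    => 'id',
            Prefix       => 'x',
            NamespaceURI => 'http://example.org/x',
            Value        => '1',
        },
        # Unprefixed attribute: no namespace, so the braces are empty.
        '{}class' => {
            Name         => 'class',
            LocalName    => 'class',
            Prefix       => '',
            NamespaceURI => '',
            Value        => 'a',
        },
    },
};

# Hypothetical helper: look an attribute up by namespace URI and local
# name, returning its value or undef if it is absent.
sub attr_value {
    my ($el, $ns, $local) = @_;
    my $attr = $el->{Attributes}{"{$ns}$local"};
    return $attr ? $attr->{Value} : undef;
}

local $Data::Dumper::Sortkeys = 1;    # stable key order in the dump
print Dumper($element);

print "x:id  = ", attr_value($element, 'http://example.org/x', 'id'), "\n";
print "class = ", attr_value($element, '', 'class'), "\n";
```

Looking attributes up by the {ns}name key means the code does not depend
on whichever prefix the document author happened to choose.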

=head2 end_element

The end_element handler is called either when a parser sees a closing
tag, or after start_element has been called for an empty element (do
note, however, that a parser may, if it is so inclined, call characters
with an empty string when it sees an empty element. There is no simple
way in SAX to determine whether the parser in fact saw an empty element
or a start and end tag with no content).

The end_element handler receives exactly the same structure as
start_element, minus the Attributes entry. One must note though that it
should not be a reference to the same data as start_element receives,
so you may change the values in start_element but this will not affect
the values later seen by end_element.
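
Since end_element does not receive the Attributes entry, a handler that
needs an element's attributes when the element closes has to keep its
own copy. One possible approach, sketched here with made-up names and
with the events delivered by hand rather than by a parser, is to push a
copy onto a stack in start_element and pop it in end_element.

```perl
use strict;
use warnings;

# Hypothetical handler that remembers attributes until the element ends.
package StackHandler;

sub new {
    my ($class) = @_;
    return bless { stack => [], closed => [] }, $class;
}

sub start_element {
    my ($self, $el) = @_;
    # Keep our own shallow copy of the attributes for later use.
    my %copy = %{ $el->{Attributes} || {} };
    push @{ $self->{stack} }, \%copy;
}

sub end_element {
    my ($self, $el) = @_;
    # The innermost open element is always on top of the stack.
    my $attrs = pop @{ $self->{stack} };
    my $count = keys %$attrs;
    push @{ $self->{closed} }, "$el->{Name}: $count attribute(s)";
}

package main;

my $handler = StackHandler->new;

# Stand-in for the parser handling <a id="1"><b/></a>.
$handler->start_element({
    Name       => 'a',
    Attributes => { '{}id' => { Name => 'id', Value => '1' } },
});
$handler->start_element({ Name => 'b', Attributes => {} });
$handler->end_element({ Name => 'b' });
$handler->end_element({ Name => 'a' });

print "$_\n" for @{ $handler->{closed} };
```

The stack mirrors the nesting of the document, so it is empty again once
every element has been closed.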

=head2 characters

The characters callback may be called in several circumstances. The
most obvious one is when seeing ordinary character data in the markup.
But it is also called for text in a CDATA section, and is also called
in other situations. A SAX parser has to make no guarantees whatsoever
about how many times it may call characters for a stretch of text in an
XML document - it may call once, or it may call once for every
character in the text. In order to work around this it is often
important for the SAX developer to use a bundling technique, where text
is gathered up and processed in one of the other callbacks. This is not
always necessary, but it is a worthwhile technique to learn, which we
will cover in XML::SAX::Advanced (when I get around to writing it).
6d3c3db11e72 Add Raptor uh parser Dario Sestito <darios@symbian.org> parents: diff changeset	256
6d3c3db11e72 Add Raptor uh parser Dario Sestito <darios@symbian.org> parents: diff changeset	257	The characters handler is called with a very simple structure - a hash
6d3c3db11e72 Add Raptor uh parser Dario Sestito <darios@symbian.org> parents: diff changeset	258	reference consisting of just one entry:
6d3c3db11e72 Add Raptor uh parser Dario Sestito <darios@symbian.org> parents: diff changeset	259
6d3c3db11e72 Add Raptor uh parser Dario Sestito <darios@symbian.org> parents: diff changeset	260	=over 4
6d3c3db11e72 Add Raptor uh parser Dario Sestito <darios@symbian.org> parents: diff changeset	261
6d3c3db11e72 Add Raptor uh parser Dario Sestito <darios@symbian.org> parents: diff changeset	262	=item Data
6d3c3db11e72 Add Raptor uh parser Dario Sestito <darios@symbian.org> parents: diff changeset	263
6d3c3db11e72 Add Raptor uh parser Dario Sestito <darios@symbian.org> parents: diff changeset	264	The text data that was received.
6d3c3db11e72 Add Raptor uh parser Dario Sestito <darios@symbian.org> parents: diff changeset	265
6d3c3db11e72 Add Raptor uh parser Dario Sestito <darios@symbian.org> parents: diff changeset	266	=back
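
Until that advanced tutorial exists, here is a minimal sketch of the
bundling technique mentioned above: append in characters, and only look
at the text in end_element. The TitleCollector package and the "title"
element are made up for this example, and the parser is simulated by
hand. The sketch ignores elements nested inside title.

```perl
use strict;
use warnings;

package TitleCollector;   # hypothetical handler class for illustration

sub new { return bless { buffer => '', titles => [] }, shift }

sub start_element {
    my ($self, $el) = @_;
    # Start with an empty buffer for each element we care about.
    $self->{buffer} = '' if $el->{Name} eq 'title';
}

sub characters {
    my ($self, $chars) = @_;
    # May be called many times for one run of text, so only append here.
    $self->{buffer} .= $chars->{Data};
}

sub end_element {
    my ($self, $el) = @_;
    # The whole text is now available, however it was chunked.
    push @{ $self->{titles} }, $self->{buffer} if $el->{Name} eq 'title';
}

package main;

my $handler = TitleCollector->new;

# Simulate a parser that splits "SAX Intro" into three chunks.
$handler->start_element({ Name => 'title', Attributes => {} });
$handler->characters({ Data => $_ }) for 'SAX', ' ', 'Intro';
$handler->end_element({ Name => 'title' });

print "$handler->{titles}[0]\n";   # prints "SAX Intro"
```

The handler gets the same result whether the parser delivers the text
in one call or in twenty.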

=head2 comment

The comment callback is called for comment text. Unlike with
C<characters()>, the comment callback must be invoked just once for an
entire comment string. It receives a single simple structure - a hash
reference containing just one entry:

=over 4

=item Data

The text of the comment.

=back

=head2 processing_instruction

The processing instruction handler is called for all processing
instructions in the document. Note that these processing instructions
may appear before the document root element, or after it, or anywhere
where text and elements would normally appear within the document,
according to the XML specification.

The handler is passed a structure containing just two entries:

=over 4

=item Target

The target of the processing instruction.

=item Data

The text data in the processing instruction. Can be an empty
string for a processing instruction that has no data.
For example E<lt>?wiggle?E<gt> is a perfectly valid processing instruction.

=back
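
To make the comment and processing_instruction structures a little more
concrete, here is a small sketch of a handler that logs both. The
NoteLogger package and the sample events are assumptions for
illustration only; the events are sent by hand rather than by a parser.

```perl
use strict;
use warnings;

package NoteLogger;   # hypothetical handler class for illustration

sub new { return bless { log => [] }, shift }

sub comment {
    my ($self, $comment) = @_;
    # Called once per comment, with the full text in Data.
    push @{ $self->{log} }, "comment: $comment->{Data}";
}

sub processing_instruction {
    my ($self, $pi) = @_;
    # Data may be an empty string, as for <?wiggle?>.
    my $data = length($pi->{Data}) ? $pi->{Data} : '(no data)';
    push @{ $self->{log} }, "pi: $pi->{Target} $data";
}

package main;

my $handler = NoteLogger->new;

# Hand-made events standing in for what a parser would send.
$handler->comment({ Data => 'generated file' });
$handler->processing_instruction(
    { Target => 'xml-stylesheet', Data => 'href="a.xsl"' });
$handler->processing_instruction({ Target => 'wiggle', Data => '' });

print "$_\n" for @{ $handler->{log} };
```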

=head1 Tip of the iceberg

What we have discussed above is really the tip of the SAX iceberg. And
so far it looks like there's not much of interest to SAX beyond what we
have seen with XML::Parser. But it does go much further than that, I
promise.

People who hate Object Oriented code for the sake of it may be thinking
here that creating a new package just to parse something is a waste
when they've been parsing things just fine up to now using procedural
code. But there's a reason for all this madness, and that reason is SAX
Filters.

As you saw right at the very start, to let the parser know about our
class, we pass it an instance of our class as the Handler to the
parser. But now imagine what would happen if our class could also take
a Handler option, and simply do some processing and pass on our data
further down the line? That in a nutshell is how SAX filters work. It's
Unix pipes for the 21st century!
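
Here is a bare-bones sketch of that idea, written by hand so you can
see the moving parts. UpcaseFilter and Recorder are hypothetical
packages invented for this example, the events are sent manually in
place of a parser, and only three event types are forwarded. A real
filter has to pass on every event it receives.

```perl
use strict;
use warnings;

package UpcaseFilter;   # hypothetical filter: upper-cases element names

sub new {
    my ($class, %args) = @_;
    return bless { Handler => $args{Handler} }, $class;
}

sub start_element {
    my ($self, $el) = @_;
    # Pass on a modified copy so the caller's hash is left alone.
    $self->{Handler}->start_element({ %$el, Name => uc $el->{Name} });
}

sub end_element {
    my ($self, $el) = @_;
    $self->{Handler}->end_element({ %$el, Name => uc $el->{Name} });
}

sub characters {
    my ($self, $chars) = @_;
    $self->{Handler}->characters($chars);   # passed through unchanged
}

package Recorder;   # hypothetical end-of-chain handler that records events

sub new { return bless { out => '' }, shift }
sub start_element { $_[0]{out} .= "<$_[1]{Name}>" }
sub end_element   { $_[0]{out} .= "</$_[1]{Name}>" }
sub characters    { $_[0]{out} .= $_[1]{Data} }

package main;

my $recorder = Recorder->new;
my $filter   = UpcaseFilter->new(Handler => $recorder);

# Stand-in for the parser: send events into the front of the chain.
$filter->start_element({ Name => 'p', Attributes => {} });
$filter->characters({ Data => 'hello' });
$filter->end_element({ Name => 'p' });

print "$recorder->{out}\n";   # prints "<P>hello</P>"
```

The filter neither knows nor cares whether its Handler is another
filter or the final consumer, which is what makes chaining possible.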

There are two downsides to this. First, writing SAX filters can be
tricky. If you look into the future and read the advanced tutorial I'm
writing, you'll see that Handler can come in several shapes and sizes,
so making sure your filter does the right thing can be tricky.
Second, constructing complex filter chains can be difficult, and
simple thinking tells us that we only get one pass at our document,
when often we'll need more than that.

Luckily though, those downsides have been fixed by the release of two
very cool modules. What's even better is that I didn't write either of
them!

The first module is XML::SAX::Base. This is a VITAL SAX module that
acts as a base class for all SAX parsers and filters. It provides an
abstraction away from calling the handler methods that makes sure your
filter or parser does the right thing, and it does it FAST. So, if you
ever need to write a SAX filter (and if you're processing XML -> XML
or XML -> HTML, you probably do), then you need to be writing it as
a subclass of XML::SAX::Base. Really, this is advice not to ignore
lightly. I will not go into the details of writing a SAX filter here.
Kip Hampton, the author of XML::SAX::Base, has covered this nicely in
his article on XML.com here <URI>.

To construct SAX pipelines, Barrie Slaymaker, a long-time Perl hacker
whose modules you will probably have heard of or used, wrote a very
clever module called XML::SAX::Machines. This combines some really
clever SAX filter-type modules with a construction toolkit for filters
that makes building pipelines easy. But before we see how it makes
things easy, first let's see how tricky it looks to build complex SAX
filter pipelines.

  use XML::SAX::ParserFactory;
  use XML::SAX::Filter1;
  use XML::SAX::Filter2;
  use XML::SAX::Writer;

  # Build the chain back to front: each stage needs its Handler to exist.
  my $output_string;
  my $writer = XML::SAX::Writer->new(Output => \$output_string);
  my $filter2 = XML::SAX::Filter2->new(Handler => $writer);
  my $filter1 = XML::SAX::Filter1->new(Handler => $filter2);
  my $parser = XML::SAX::ParserFactory->parser(Handler => $filter1);

  $parser->parse_uri("foo.xml");

This is a lot easier with XML::SAX::Machines:

  use XML::SAX::Machines qw(Pipeline);

  my $output_string;
  my $parser = Pipeline(
    "XML::SAX::Filter1" => "XML::SAX::Filter2" => \$output_string
  );

  $parser->parse_uri("foo.xml");

One of the main benefits of XML::SAX::Machines is that the pipelines
are constructed in natural order, rather than the reverse order we saw
with manual pipeline construction. XML::SAX::Machines takes care of all
the internals of pipe construction, providing you at the end with just
a parser you can use (and you can re-use the same parser as many times
as you need to).

Just a final tip. If you ever get stuck and are confused about what is
being passed from one SAX filter or parser to the next, then
Devel::TraceSAX will come to your rescue. This perl debugger plugin
will allow you to dump the SAX stream of events as it goes by. Usage is
really very simple: just call your perl script that uses SAX as follows:

  $ perl -d:TraceSAX <scriptname>

And preferably pipe the output to a pager of some sort, such as more or
less. The output is extremely verbose, but should help clear some
issues up.

=head1 AUTHOR

Matt Sergeant, matt@sergeant.org

$Id: Intro.pod,v 1.3 2002/04/30 07:16:00 matt Exp $

=cut
