deprecated/buildtools/buildsystemtools/lib/XML/Parser.pod
changeset 662 60be34e1b006
parent 655 3f65fd25dfd4
654:7c11c3d8d025 662:60be34e1b006
       
     1 =head1 WARNING
       
     2 
       
     3 This manual page was copied from the XML::Parser distribution (version 2.27)
       
     4 written by Clark Cooper. You can find newer versions on CPAN.
       
     5 
       
     6 =head1 NAME
       
     7 
       
     8 XML::Parser - A perl module for parsing XML documents
       
     9 
       
    10 =head1 SYNOPSIS
       
    11 
       
    12   use XML::Parser;
       
    13   
       
    14   $p1 = new XML::Parser(Style => 'Debug');
       
    15   $p1->parsefile('REC-xml-19980210.xml');
       
    16   $p1->parse('<foo id="me">Hello World</foo>');
       
    17 
       
    18   # Alternative
       
    19   $p2 = new XML::Parser(Handlers => {Start => \&handle_start,
       
    20 				     End   => \&handle_end,
       
    21 				     Char  => \&handle_char});
       
    22   $p2->parse($socket);
       
    23 
       
    24   # Another alternative
       
    25   $p3 = new XML::Parser(ErrorContext => 2);
       
    26 
       
    27   $p3->setHandlers(Char    => \&text,
       
    28 		   Default => \&other);
       
    29 
       
    30   open(FOO, 'xmlgenerator |');
       
    31   $p3->parse(*FOO, ProtocolEncoding => 'ISO-8859-1');
       
    32   close(FOO);
       
    33 
       
    34   $p3->parsefile('junk.xml', ErrorContext => 3);
       
    35 
       
    36 =head1 DESCRIPTION
       
    37 
       
    38 This module provides ways to parse XML documents. It is built on top of
       
    39 L<XML::Parser::Expat>, which is a lower level interface to James Clark's
       
    40 expat library. Each call to one of the parsing methods creates a new
       
    41 instance of XML::Parser::Expat which is then used to parse the document.
       
    42 Expat options may be provided when the XML::Parser object is created.
       
    43 These options are then passed on to the Expat object on each parse call.
       
    44 They can also be given as extra arguments to the parse methods, in which
       
    45 case they override options given at XML::Parser creation time.
       
    46 
       
     47 The behavior of the parser is controlled either by the C<L</Style>> and/or
       
     48 C<L</Handlers>> options, or by the L</setHandlers> method. These all provide
       
    49 mechanisms for XML::Parser to set the handlers needed by XML::Parser::Expat.
       
     50 If neither C<Style> nor C<Handlers> is specified, then parsing just
       
     51 checks that the document is well-formed.
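
As a minimal sketch of that well-formedness check, the snippet below creates a parser with neither Style nor Handlers and relies on parse dying on a parse error, as described under METHODS. The helper name well_formed_error and the sample documents are made up for illustration.

```perl
use strict;
use warnings;
use XML::Parser;

# No Style and no Handlers: parse only checks well-formedness.
my $parser = XML::Parser->new(ErrorContext => 2);

# Hypothetical helper: returns an empty string for a well-formed
# document, otherwise the message that parse died with.
sub well_formed_error {
    my ($xml) = @_;
    my $ok = eval { $parser->parse($xml); 1 };
    return $ok ? '' : $@;
}

my $good_error = well_formed_error('<foo id="me">Hello World</foo>');
my $bad_error  = well_formed_error('<foo><bar></foo>');

print "first document is well-formed\n"    if $good_error eq '';
print "second document failed: $bad_error" if $bad_error ne '';
```

Wrapping the call in eval keeps a malformed document from terminating the calling program.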
       
    52 
       
    53 When underlying handlers get called, they receive as their first parameter
       
    54 the I<Expat> object, not the Parser object.
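
The following sketch illustrates that point. The handler name on_start and the sample document are assumptions. The handler records the class of its first argument and calls current_line, an XML::Parser::Expat method, on it.

```perl
use strict;
use warnings;
use XML::Parser;

my @seen;

# Hypothetical Start handler. Its first argument is the Expat object
# doing the parsing, so Expat methods such as current_line are available.
sub on_start {
    my ($expat, $element, %attrs) = @_;
    push @seen, {
        class   => ref $expat,
        element => $element,
        line    => $expat->current_line,
    };
}

my $parser = XML::Parser->new(Handlers => { Start => \&on_start });
$parser->parse(qq{<foo id="me">\n<bar/>\n</foo>});

printf "%s saw <%s> on line %d\n", @{$_}{qw(class element line)} for @seen;
```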
       
    55 
       
    56 =head1 METHODS
       
    57 
       
    58 =over 4
       
    59 
       
    60 =item new
       
    61 
       
    62 This is a class method, the constructor for XML::Parser. Options are passed
       
     63 as keyword-value pairs. Recognized options are:
       
    64 
       
    65 =over 4
       
    66 
       
    67 =item * Style
       
    68 
       
    69 This option provides an easy way to create a given style of parser. The
       
    70 built in styles are: L<"Debug">, L<"Subs">, L<"Tree">, L<"Objects">,
       
    71 and L<"Stream">.
       
    72 Custom styles can be provided by giving a full package name containing
       
    73 at least one '::'. This package should then have subs defined for each
       
    74 handler it wishes to have installed. See L<"STYLES"> below
       
    75 for a discussion of each built in style.
       
    76 
       
    77 =item * Handlers
       
    78 
       
    79 When provided, this option should be an anonymous hash containing as
       
    80 keys the type of handler and as values a sub reference to handle that
       
    81 type of event. All the handlers get passed as their 1st parameter the
       
    82 instance of expat that is parsing the document. Further details on
       
    83 handlers can be found in L<"HANDLERS">. Any handler set here
       
    84 overrides the corresponding handler set with the Style option.
       
    85 
       
    86 =item * Pkg
       
    87 
       
    88 Some styles will refer to subs defined in this package. If not provided,
       
    89 it defaults to the package which called the constructor.
       
    90 
       
    91 =item * ErrorContext
       
    92 
       
    93 This is an Expat option. When this option is defined, errors are reported
       
    94 in context. The value should be the number of lines to show on either side
       
    95 of the line in which the error occurred.
       
    96 
       
    97 =item * ProtocolEncoding
       
    98 
       
    99 This is an Expat option. This sets the protocol encoding name. It defaults
       
   100 to none. The built-in encodings are: C<UTF-8>, C<ISO-8859-1>, C<UTF-16>, and
       
   101 C<US-ASCII>. Other encodings may be used if they have encoding maps in one
       
   102 of the directories in the @Encoding_Path list. Check L<"ENCODINGS"> for
       
   103 more information on encoding maps. Setting the protocol encoding overrides
       
   104 any encoding in the XML declaration.
       
   105 
       
   106 =item * Namespaces
       
   107 
       
   108 This is an Expat option. If this is set to a true value, then namespace
       
   109 processing is done during the parse. See L<XML::Parser::Expat/"Namespaces">
       
   110 for further discussion of namespace processing.
       
   111 
       
   112 =item * NoExpand
       
   113 
       
   114 This is an Expat option. Normally, the parser will try to expand references
       
   115 to entities defined in the internal subset. If this option is set to a true
       
   116 value, and a default handler is also set, then the default handler will be
       
   117 called when an entity reference is seen in text. This has no effect if a
       
   118 default handler has not been registered, and it has no effect on the expansion
       
   119 of entity references inside attribute values.
       
   120 
       
   121 =item * Stream_Delimiter
       
   122 
       
   123 This is an Expat option. It takes a string value. When this string is found
       
   124 alone on a line while parsing from a stream, then the parse is ended as if it
       
    125 saw an end of file. The intended use is with a stream of XML documents in a
       
   126 MIME multipart format. The string should not contain a trailing newline.
       
   127 
       
   128 =item * ParseParamEnt
       
   129 
       
   130 This is an Expat option. Unless standalone is set to "yes" in the XML
       
   131 declaration, setting this to a true value allows the external DTD to be read,
       
   132 and parameter entities to be parsed and expanded.
       
   133 
       
   134 =item * Non-Expat-Options
       
   135 
       
   136 If provided, this should be an anonymous hash whose keys are options that
       
   137 shouldn't be passed to Expat. This should only be of concern to those
       
   138 subclassing XML::Parser.
       
   139 
       
   140 =back
       
   141 
       
    142 =item setHandlers(TYPE, HANDLER [, TYPE, HANDLER [...]])
       
   143 
       
   144 This method registers handlers for various parser events. It overrides any
       
    145 previous handlers registered through the Style or Handlers options or through
       
   146 earlier calls to setHandlers. By providing a false or undefined value as
       
   147 the handler, the existing handler can be unset.
       
   148 
       
   149 This method returns a list of type, handler pairs corresponding to the
       
   150 input. The handlers returned are the ones that were in effect prior to
       
   151 the call.
       
   152 
       
   153 See a description of the handler types in L<"HANDLERS">.
       
   154 
       
   155 =item parse(SOURCE [, OPT => OPT_VALUE [...]])
       
   156 
       
   157 The SOURCE parameter should either be a string containing the whole XML
       
   158 document, or it should be an open IO::Handle. Constructor options to
       
   159 XML::Parser::Expat given as keyword-value pairs may follow the SOURCE
       
   160 parameter. These override, for this call, any options or attributes passed
       
   161 through from the XML::Parser instance.
       
   162 
       
    163 The parse method dies if a parse error occurs. Otherwise it will return 1
       
   164 or whatever is returned from the B<Final> handler, if one is installed.
       
   165 In other words, what parse may return depends on the style.
       
   166 
       
   167 =item parsestring
       
   168 
       
   169 This is just an alias for parse for backwards compatibility.
       
   170 
       
   171 =item parsefile(FILE [, OPT => OPT_VALUE [...]])
       
   172 
       
   173 Open FILE for reading, then call parse with the open handle. The file
       
   174 is closed no matter how parse returns. Returns what parse returns.
       
   175 
       
   176 =item parse_start([ OPT => OPT_VALUE [...]])
       
   177 
       
   178 Create and return a new instance of XML::Parser::ExpatNB. Constructor
       
   179 options may be provided. If an init handler has been provided, it is
       
   180 called before returning the ExpatNB object. Documents are parsed by
       
   181 making incremental calls to the parse_more method of this object, which
       
   182 takes a string. A single call to the parse_done method of this object,
       
   183 which takes no arguments, indicates that the document is finished.
       
   184 
       
   185 If there is a final handler installed, it is executed by the parse_done
       
   186 method before returning and the parse_done method returns whatever is
       
   187 returned by the final handler.
       
   188 
       
   189 =back
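
The methods above can be sketched together as follows. The handlers, variable names and the way the document is split into pieces are illustrative assumptions. The sketch keeps the pairs returned by setHandlers so they can be restored later. It then feeds a document to the object returned by parse_start and lets the Final handler supply the return value of parse_done.

```perl
use strict;
use warnings;
use XML::Parser;

my (@names, @chunks);

my $parser = XML::Parser->new(Handlers => {
    Start => sub { my ($expat, $element) = @_; push @names, $element },
    Final => sub { scalar @names },    # becomes the return value
});

# setHandlers returns the (type, handler) pairs that were in effect
# before the call, so they can be kept and put back later.
my %previous = $parser->setHandlers(
    Char => sub { my ($expat, $string) = @_; push @chunks, $string },
);

# Incremental parse: feed the document in arbitrary pieces.
my $nb = $parser->parse_start;
$nb->parse_more($_) for '<foo id="me"><ba', 'r>Hel', 'lo</bar></fo', 'o>';
my $count = $nb->parse_done;    # value returned by the Final handler

# Put back whatever Char handler was installed before (none here).
$parser->setHandlers(%previous);

print "elements: @names (count $count), text: ", join('', @chunks), "\n";
```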
       
   190 
       
   191 =head1 HANDLERS
       
   192 
       
   193 Expat is an event based parser. As the parser recognizes parts of the
       
    194 document (say the start or end tag for an XML element), any handlers
       
    195 registered for that type of event are called with suitable parameters.
       
   196 All handlers receive an instance of XML::Parser::Expat as their first
       
   197 argument. See L<XML::Parser::Expat/"METHODS"> for a discussion of the
       
   198 methods that can be called on this object.
       
   199 
       
   200 =head2 Init		(Expat)
       
   201 
       
   202 This is called just before the parsing of the document starts.
       
   203 
       
   204 =head2 Final		(Expat)
       
   205 
       
   206 This is called just after parsing has finished, but only if no errors
       
   207 occurred during the parse. Parse returns what this returns.
       
   208 
       
   209 =head2 Start		(Expat, Element [, Attr, Val [,...]])
       
   210 
       
   211 This event is generated when an XML start tag is recognized. Element is the
       
   212 name of the XML element type that is opened with the start tag. The Attr &
       
   213 Val pairs are generated for each attribute in the start tag.
       
   214 
       
   215 =head2 End		(Expat, Element)
       
   216 
       
   217 This event is generated when an XML end tag is recognized. Note that
       
   218 an XML empty tag (<foo/>) generates both a start and an end event.
       
   219 
       
   220 =head2 Char		(Expat, String)
       
   221 
       
   222 This event is generated when non-markup is recognized. The non-markup
       
   223 sequence of characters is in String. A single non-markup sequence of
       
   224 characters may generate multiple calls to this handler. Whatever the
       
   225 encoding of the string in the original document, this is given to the
       
   226 handler in UTF-8.
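
Because one run of text may arrive in several Char calls, a handler usually has to append rather than assign. The sketch below, with made-up element names and variables, keeps one buffer per open element. It also shows an empty tag producing both a Start and an End event.

```perl
use strict;
use warnings;
use XML::Parser;

my @buffers;    # one text buffer per open element
my @titles;     # collected text of each <title> element

my $parser = XML::Parser->new(Handlers => {
    Start => sub { push @buffers, '' },
    # May be called several times for one run of text, so append.
    Char  => sub {
        my ($expat, $string) = @_;
        $buffers[-1] .= $string if @buffers;
    },
    End   => sub {
        my ($expat, $element) = @_;
        my $text = pop @buffers;
        push @titles, $text if $element eq 'title';
    },
});

$parser->parse('<doc><title>Tom &amp; Jerry</title><title/></doc>');

print "title: '$_'\n" for @titles;
```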
       
   227 
       
   228 =head2 Proc		(Expat, Target, Data)
       
   229 
       
   230 This event is generated when a processing instruction is recognized.
       
   231 
       
   232 =head2 Comment		(Expat, Data)
       
   233 
       
   234 This event is generated when a comment is recognized.
       
   235 
       
   236 =head2 CdataStart	(Expat)
       
   237 
       
   238 This is called at the start of a CDATA section.
       
   239 
       
   240 =head2 CdataEnd		(Expat)
       
   241 
       
   242 This is called at the end of a CDATA section.
       
   243 
       
   244 =head2 Default		(Expat, String)
       
   245 
       
   246 This is called for any characters that don't have a registered handler.
       
   247 This includes both characters that are part of markup for which no
       
   248 events are generated (markup declarations) and characters that
       
   249 could generate events, but for which no handler has been registered.
       
   250 
       
   251 Whatever the encoding in the original document, the string is returned to
       
   252 the handler in UTF-8.
       
   253 
       
   254 =head2 Unparsed		(Expat, Entity, Base, Sysid, Pubid, Notation)
       
   255 
       
   256 This is called for a declaration of an unparsed entity. Entity is the name
       
   257 of the entity. Base is the base to be used for resolving a relative URI.
       
   258 Sysid is the system id. Pubid is the public id. Notation is the notation
       
   259 name. Base and Pubid may be undefined.
       
   260 
       
   261 =head2 Notation		(Expat, Notation, Base, Sysid, Pubid)
       
   262 
       
   263 This is called for a declaration of notation. Notation is the notation name.
       
   264 Base is the base to be used for resolving a relative URI. Sysid is the system
       
   265 id. Pubid is the public id. Base, Sysid, and Pubid may all be undefined.
       
   266 
       
   267 =head2 ExternEnt	(Expat, Base, Sysid, Pubid)
       
   268 
       
   269 This is called when an external entity is referenced. Base is the base to be
       
   270 used for resolving a relative URI. Sysid is the system id. Pubid is the public
       
    271 id. Base and Pubid may be undefined.
       
   272 
       
   273 This handler should either return a string, which represents the contents of
       
   274 the external entity, or return an open filehandle that can be read to obtain
       
   275 the contents of the external entity, or return undef, which indicates the
       
   276 external entity couldn't be found and will generate a parse error.
       
   277 
       
   278 If an open filehandle is returned, it must be returned as either a glob
       
   279 (*FOO) or as a reference to a glob (e.g. an instance of IO::Handle). The
       
   280 parser will close the filehandle after using it.
       
   281 
       
   282 A default handler, XML::Parser::default_ext_ent_handler, is installed
       
   283 for this. It only handles the file URL method and it assumes "file:" if
       
   284 it isn't there. The expat base method can be used to set a basename for
       
   285 relative pathnames. If no basename is given, or if the basename is itself
       
   286 a relative name, then it is relative to the current working directory.
       
   287 
       
   288 =head2 Entity		(Expat, Name, Val, Sysid, Pubid, Ndata)
       
   289 
       
   290 This is called when an entity is declared. For internal entities, the Val
       
   291 parameter will contain the value and the remaining three parameters will be
       
   292 undefined. For external entities, the Val parameter will be undefined, the
       
   293 Sysid parameter will have the system id, the Pubid parameter will have the
       
    294 public id if it was provided (it will be undefined otherwise), and the Ndata
       
   295 parameter will contain the notation for unparsed entities. If this is a
       
   296 parameter entity declaration, then a '%' will be prefixed to the name.
       
   297 
       
   298 Note that this handler and the Unparsed handler above overlap. If both are
       
   299 set, then this handler will not be called for unparsed entities.
       
   300 
       
   301 =head2 Element		(Expat, Name, Model)
       
   302 
       
   303 The element handler is called when an element declaration is found. Name
       
   304 is the element name, and Model is the content model as a string.
       
   305 
       
   306 =head2 Attlist		(Expat, Elname, Attname, Type, Default, Fixed)
       
   307 
       
   308 This handler is called for each attribute in an ATTLIST declaration.
       
   309 So an ATTLIST declaration that has multiple attributes will generate multiple
       
   310 calls to this handler. The Elname parameter is the name of the element with
       
   311 which the attribute is being associated. The Attname parameter is the name
       
   312 of the attribute. Type is the attribute type, given as a string. Default is
       
   313 the default value, which will either be "#REQUIRED", "#IMPLIED" or a quoted
       
   314 string (i.e. the returned string will begin and end with a quote character).
       
   315 If Fixed is true, then this is a fixed attribute.
       
   316 
       
   317 =head2 Doctype		(Expat, Name, Sysid, Pubid, Internal)
       
   318 
       
   319 This handler is called for DOCTYPE declarations. Name is the document type
       
   320 name. Sysid is the system id of the document type, if it was provided,
       
   321 otherwise it's undefined. Pubid is the public id of the document type,
       
   322 which will be undefined if no public id was given. Internal is the internal
       
   323 subset, given as a string. If there was no internal subset, it will be
       
   324 undefined. Internal will contain all whitespace, comments, processing
       
   325 instructions, and declarations seen in the internal subset. The declarations
       
   326 will be there whether or not they have been processed by another handler
       
   327 (except for unparsed entities processed by the Unparsed handler). However,
       
   328 comments and processing instructions will not appear if they've been processed
       
   329 by their respective handlers.
       
   330 
       
   331 =head2 XMLDecl		(Expat, Version, Encoding, Standalone)
       
   332 
       
    333 This handler is called for XML declarations. Version is a string containing
       
    334 the version. Encoding is either undefined or contains an encoding string.
       
    335 Standalone will be true, false, or undefined if the standalone attribute
       
    336 is "yes", "no", or not given, respectively.
       
   337 
       
   338 =head1 STYLES
       
   339 
       
   340 =head2 Debug
       
   341 
       
   342 This just prints out the document in outline form. Nothing special is
       
   343 returned by parse.
       
   344 
       
   345 =head2 Subs
       
   346 
       
   347 Each time an element starts, a sub by that name in the package specified
       
   348 by the Pkg option is called with the same parameters that the Start
       
   349 handler gets called with.
       
   350 
       
   351 Each time an element ends, a sub with that name appended with an underscore
       
   352 ("_"), is called with the same parameters that the End handler gets called
       
   353 with.
       
   354 
       
   355 Nothing special is returned by parse.
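
A minimal sketch of the Subs style, assuming a package named MySubs that exists only for this illustration: a sub named foo receives the start of each foo element and a sub named foo_ receives its end.

```perl
use strict;
use warnings;
use XML::Parser;

package MySubs;    # hypothetical package name
our @log;

# Called when a <foo> element starts, with the Start handler's parameters.
sub foo {
    my ($expat, $element, %attrs) = @_;
    push @log, "start $element id=$attrs{id}";
}

# Called when a <foo> element ends, with the End handler's parameters.
sub foo_ {
    my ($expat, $element) = @_;
    push @log, "end $element";
}

package main;

my $parser = XML::Parser->new(Style => 'Subs', Pkg => 'MySubs');
$parser->parse('<foo id="me"><bar>text</bar></foo>');

print "$_\n" for @MySubs::log;
```

The bar element has no matching subs in MySubs, so nothing is recorded for it.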
       
   356 
       
   357 =head2 Tree
       
   358 
       
   359 Parse will return a parse tree for the document. Each node in the tree
       
   360 takes the form of a tag, content pair. Text nodes are represented with
       
   361 a pseudo-tag of "0" and the string that is their content. For elements,
       
   362 the content is an array reference. The first item in the array is a
       
   363 (possibly empty) hash reference containing attributes. The remainder of
       
   364 the array is a sequence of tag-content pairs representing the content
       
   365 of the element.
       
   366 
       
   367 So for example the result of parsing:
       
   368 
       
   369   <foo><head id="a">Hello <em>there</em></head><bar>Howdy<ref/></bar>do</foo>
       
   370 
       
   371 would be:
       
   372              Tag   Content
       
   373   ==================================================================
       
   374   [foo, [{}, head, [{id => "a"}, 0, "Hello ",  em, [{}, 0, "there"]],
       
   375 	      bar, [         {}, 0, "Howdy",  ref, [{}]],
       
   376 	        0, "do"
       
   377 	]
       
   378   ]
       
   379 
       
    380 The root element "foo" has 3 children: a "head" element, a "bar"
       
    381 element and the text "do". After the empty attribute hash, these are
       
    382 represented in its contents by 3 tag-content pairs.
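
To make the layout concrete, here is a small sketch that parses the same document and walks the result. The helper name text_of is hypothetical. It skips the attribute hash at the front of each content array and then consumes the tag-content pairs two at a time.

```perl
use strict;
use warnings;
use XML::Parser;

my $xml = '<foo><head id="a">Hello <em>there</em></head>'
        . '<bar>Howdy<ref/></bar>do</foo>';
my $tree = XML::Parser->new(Style => 'Tree')->parse($xml);

# Hypothetical helper: concatenate all text below one element's content.
sub text_of {
    my ($content) = @_;
    my ($attrs, @pairs) = @$content;    # attribute hash, then tag/content pairs
    my $text = '';
    while (@pairs) {
        my ($tag, $value) = splice @pairs, 0, 2;
        # Pseudo-tag "0" marks a text node; anything else is an element.
        $text .= $tag eq '0' ? $value : text_of($value);
    }
    return $text;
}

my ($root_tag, $root_content) = @$tree;
my $head_id = $root_content->[2][0]{id};    # attribute hash of the first child

print "root=$root_tag head id=$head_id text=", text_of($root_content), "\n";
```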
       
   383 
       
   384 =head2 Objects
       
   385 
       
   386 This is similar to the Tree style, except that a hash object is created for
       
   387 each element. The corresponding object will be in the class whose name
       
   388 is created by appending "::" and the element name to the package set with
       
   389 the Pkg option. Non-markup text will be in the ::Characters class. The
       
   390 contents of the corresponding object will be in an anonymous array that
       
   391 is the value of the Kids property for that object.
       
   392 
       
   393 =head2 Stream
       
   394 
       
   395 This style also uses the Pkg package. If none of the subs that this
       
   396 style looks for is there, then the effect of parsing with this style is
       
   397 to print a canonical copy of the document without comments or declarations.
       
   398 All the subs receive as their 1st parameter the Expat instance for the
       
   399 document they're parsing.
       
   400 
       
   401 It looks for the following routines:
       
   402 
       
   403 =over 4
       
   404 
       
   405 =item * StartDocument
       
   406 
       
    407 Called at the start of the parse.
       
   408 
       
   409 =item * StartTag
       
   410 
       
   411 Called for every start tag with a second parameter of the element type. The $_
       
   412 variable will contain a copy of the tag and the %_ variable will contain
       
   413 attribute values supplied for that element.
       
   414 
       
   415 =item * EndTag
       
   416 
       
   417 Called for every end tag with a second parameter of the element type. The $_
       
   418 variable will contain a copy of the end tag.
       
   419 
       
   420 =item * Text
       
   421 
       
   422 Called just before start or end tags with accumulated non-markup text in
       
   423 the $_ variable.
       
   424 
       
   425 =item * PI
       
   426 
       
   427 Called for processing instructions. The $_ variable will contain a copy of
       
   428 the PI and the target and data are sent as 2nd and 3rd parameters
       
   429 respectively.
       
   430 
       
   431 =item * EndDocument
       
   432 
       
    433 Called at the conclusion of the parse.
       
   434 
       
   435 =back
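
The routines above can be sketched as follows, using a package named MyStream that is assumed purely for illustration. Only StartTag, EndTag and Text are defined, and each records what it finds in $_ and %_.

```perl
use strict;
use warnings;
use XML::Parser;

package MyStream;    # hypothetical package name
our @events;

# $_ holds a copy of the tag, %_ the attribute values.
sub StartTag {
    my ($expat, $element) = @_;
    push @events, "start $element id=$_{id} as $_";
}

sub EndTag {
    my ($expat, $element) = @_;
    push @events, "end $element as $_";
}

# $_ holds the accumulated non-markup text.
sub Text { push @events, "text $_" }

package main;

my $parser = XML::Parser->new(Style => 'Stream', Pkg => 'MyStream');
$parser->parse('<foo id="me">Hello World</foo>');

print "$_\n" for @MyStream::events;
```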
       
   436 
       
   437 =head1 ENCODINGS
       
   438 
       
   439 XML documents may be encoded in character sets other than Unicode as
       
   440 long as they may be mapped into the Unicode character set. Expat has
       
   441 further restrictions on encodings. Read the xmlparse.h header file in
       
   442 the expat distribution to see details on these restrictions.
       
   443 
       
   444 Expat has built-in encodings for: C<UTF-8>, C<ISO-8859-1>, C<UTF-16>, and
       
   445 C<US-ASCII>. Encodings are set either through the XML declaration
       
   446 encoding attribute or through the ProtocolEncoding option to XML::Parser
       
   447 or XML::Parser::Expat.
       
   448 
       
   449 For encodings other than the built-ins, expat calls the function
       
   450 load_encoding in the Expat package with the encoding name. This function
       
   451 looks for a file in the path list @XML::Parser::Expat::Encoding_Path, that
       
   452 matches the lower-cased name with a '.enc' extension. The first one it
       
   453 finds, it loads.
       
   454 
       
   455 If you wish to build your own encoding maps, check out the XML::Encoding
       
   456 module from CPAN.
       
   457 
       
   458 =head1 AUTHORS
       
   459 
       
   460 Larry Wall <F<larry@wall.org>> wrote version 1.0.
       
   461 
       
   462 Clark Cooper <F<coopercc@netheaven.com>> picked up support, changed the API
       
   463 for this version (2.x), provided documentation,
       
   464 and added some standard package features.
       
   465 
       
   466 =cut