=head1 WARNING

This manual page was copied from the XML::Parser distribution (version 2.27)
written by Clark Cooper. You can find newer versions at CPAN.

=head1 NAME

XML::Parser - A perl module for parsing XML documents

=head1 SYNOPSIS

  use XML::Parser;

  $p1 = XML::Parser->new(Style => 'Debug');
  $p1->parsefile('REC-xml-19980210.xml');
  $p1->parse('<foo id="me">Hello World</foo>');

  # Alternative
  $p2 = XML::Parser->new(Handlers => {Start => \&handle_start,
                                      End   => \&handle_end,
                                      Char  => \&handle_char});
  $p2->parse($socket);

  # Another alternative
  $p3 = XML::Parser->new(ErrorContext => 2);

  $p3->setHandlers(Char    => \&text,
                   Default => \&other);

  open(FOO, 'xmlgenerator |') or die "Cannot run xmlgenerator: $!";
  $p3->parse(*FOO, ProtocolEncoding => 'ISO-8859-1');
  close(FOO);

  $p3->parsefile('junk.xml', ErrorContext => 3);

=head1 DESCRIPTION

This module provides ways to parse XML documents. It is built on top of
L<XML::Parser::Expat>, which is a lower level interface to James Clark's
expat library. Each call to one of the parsing methods creates a new
instance of XML::Parser::Expat which is then used to parse the document.
Expat options may be provided when the XML::Parser object is created.
These options are then passed on to the Expat object on each parse call.
They can also be given as extra arguments to the parse methods, in which
case they override options given at XML::Parser creation time.

The behavior of the parser is controlled either by the C<L</Style>> and/or
C<L</Handlers>> options, or by the L</setHandlers> method. These all provide
mechanisms for XML::Parser to set the handlers needed by XML::Parser::Expat.
If neither C<Style> nor C<Handlers> is specified, then parsing just
checks that the document is well-formed.

When underlying handlers get called, they receive as their first parameter
the I<Expat> object, not the Parser object.

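As a minimal sketch of the default behavior described above, a parser created
with neither Style nor Handlers only checks well-formedness. Because the parse
methods die on a parse error, the call is wrapped in C<eval>. The helper name
C<is_well_formed> is an illustrative assumption and not part of the module.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: returns 1 if $xml is well-formed, 0 otherwise.
sub is_well_formed {
    my ($xml) = @_;

    # No Style and no Handlers: the parse only checks well-formedness.
    # ErrorContext => 2 asks for two lines of context around any error.
    my $parser = XML::Parser->new(ErrorContext => 2);

    # parse dies on a parse error, so trap the exception with eval.
    my $ok = eval { $parser->parse($xml); 1 };
    warn "Not well-formed: $@" unless $ok;
    return $ok ? 1 : 0;
}

print is_well_formed('<foo id="me">Hello World</foo>') ? "ok\n" : "bad\n";
print is_well_formed('<foo><bar></foo>')               ? "ok\n" : "bad\n";
```

The second document has a mismatched end tag, so the helper reports it as bad
and the error text from C<$@> is sent to STDERR.
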
=head1 METHODS

=over 4

=item new

This is a class method, the constructor for XML::Parser. Options are passed
as keyword-value pairs. Recognized options are:

=over 4

=item * Style

This option provides an easy way to create a given style of parser. The
built-in styles are: L<"Debug">, L<"Subs">, L<"Tree">, L<"Objects">,
and L<"Stream">.
Custom styles can be provided by giving a full package name containing
at least one '::'. This package should then have subs defined for each
handler it wishes to have installed. See L<"STYLES"> below
for a discussion of each built-in style.

=item * Handlers

When provided, this option should be an anonymous hash containing as
keys the type of handler and as values a sub reference to handle that
type of event. All the handlers get passed as their first parameter the
instance of Expat that is parsing the document. Further details on
handlers can be found in L<"HANDLERS">. Any handler set here
overrides the corresponding handler set with the Style option.

=item * Pkg

Some styles will refer to subs defined in this package. If not provided,
it defaults to the package which called the constructor.

=item * ErrorContext

This is an Expat option. When this option is defined, errors are reported
in context. The value should be the number of lines to show on either side
of the line in which the error occurred.

=item * ProtocolEncoding

This is an Expat option. This sets the protocol encoding name. It defaults
to none. The built-in encodings are: C<UTF-8>, C<ISO-8859-1>, C<UTF-16>, and
C<US-ASCII>. Other encodings may be used if they have encoding maps in one
of the directories in the @XML::Parser::Expat::Encoding_Path list. Check
L<"ENCODINGS"> for more information on encoding maps. Setting the protocol
encoding overrides any encoding in the XML declaration.

=item * Namespaces

This is an Expat option. If this is set to a true value, then namespace
processing is done during the parse. See L<XML::Parser::Expat/"Namespaces">
for further discussion of namespace processing.

=item * NoExpand

This is an Expat option. Normally, the parser will try to expand references
to entities defined in the internal subset. If this option is set to a true
value, and a default handler is also set, then the default handler will be
called when an entity reference is seen in text. This has no effect if a
default handler has not been registered, and it has no effect on the expansion
of entity references inside attribute values.

=item * Stream_Delimiter

This is an Expat option. It takes a string value. When this string is found
alone on a line while parsing from a stream, the parse ends as if the parser
had seen an end of file. The intended use is with a stream of XML documents
in a MIME multipart format. The string should not contain a trailing newline.

=item * ParseParamEnt

This is an Expat option. Unless standalone is set to "yes" in the XML
declaration, setting this to a true value allows the external DTD to be read,
and parameter entities to be parsed and expanded.

=item * Non-Expat-Options

If provided, this should be an anonymous hash whose keys are options that
shouldn't be passed to Expat. This should only be of concern to those
subclassing XML::Parser.

=back

=item setHandlers(TYPE, HANDLER [, TYPE, HANDLER [...]])

This method registers handlers for various parser events. It overrides any
previous handlers registered through the Style or Handlers options or through
earlier calls to setHandlers. By providing a false or undefined value as
the handler, the existing handler can be unset.

This method returns a list of type, handler pairs corresponding to the
input. The handlers returned are the ones that were in effect prior to
the call.

See a description of the handler types in L<"HANDLERS">.

=item parse(SOURCE [, OPT => OPT_VALUE [...]])

The SOURCE parameter should either be a string containing the whole XML
document, or it should be an open IO::Handle. Constructor options to
XML::Parser::Expat given as keyword-value pairs may follow the SOURCE
parameter. These override, for this call, any options or attributes passed
through from the XML::Parser instance.

This method dies if a parse error occurs. Otherwise it returns 1,
or whatever is returned from the B<Final> handler, if one is installed.
In other words, what parse may return depends on the style.

=item parsestring

This is just an alias for parse for backwards compatibility.

=item parsefile(FILE [, OPT => OPT_VALUE [...]])

Open FILE for reading, then call parse with the open handle. The file
is closed no matter how parse returns. Returns what parse returns.

=item parse_start([ OPT => OPT_VALUE [...]])

Create and return a new instance of XML::Parser::ExpatNB. Constructor
options may be provided. If an Init handler has been provided, it is
called before returning the ExpatNB object. Documents are parsed by
making incremental calls to the parse_more method of this object, which
takes a string. A single call to the parse_done method of this object,
which takes no arguments, indicates that the document is finished.

If there is a Final handler installed, it is executed by the parse_done
method before returning and the parse_done method returns whatever is
returned by the Final handler.

=back

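The incremental interface described under parse_start can be sketched as
follows. The chunk boundaries and the handler bodies are illustrative
assumptions. The point is only that the document may be fed to parse_more in
arbitrary pieces and that parse_done returns what the Final handler returns.

```perl
use strict;
use warnings;
use XML::Parser;

my @names;
my $parser = XML::Parser->new(
    Handlers => {
        Start => sub { my ($expat, $element) = @_; push @names, $element },
        # parse_done returns whatever the Final handler returns.
        Final => sub { return scalar @names },
    },
);

# parse_start returns an XML::Parser::ExpatNB object.
my $nb = $parser->parse_start;

# Feed the document in pieces, for example as they arrive from a socket.
# The pieces need not end on tag boundaries.
for my $chunk ('<doc><it', 'em/><item', '/></doc>') {
    $nb->parse_more($chunk);
}

# Signal the end of the document and collect the Final handler's result.
my $count = $nb->parse_done;
print "Saw $count start tags: @names\n";
```
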
=head1 HANDLERS

Expat is an event-based parser. As the parser recognizes parts of the
document (say, the start or end tag of an XML element), any handlers
registered for that type of event are called with suitable parameters.
All handlers receive an instance of XML::Parser::Expat as their first
argument. See L<XML::Parser::Expat/"METHODS"> for a discussion of the
methods that can be called on this object.

=head2 Init (Expat)

This is called just before the parsing of the document starts.

=head2 Final (Expat)

This is called just after parsing has finished, but only if no errors
occurred during the parse. The parse method returns whatever this handler
returns.

=head2 Start (Expat, Element [, Attr, Val [,...]])

This event is generated when an XML start tag is recognized. Element is the
name of the XML element type that is opened with the start tag. The Attr &
Val pairs are generated for each attribute in the start tag.

=head2 End (Expat, Element)

This event is generated when an XML end tag is recognized. Note that
an XML empty tag (<foo/>) generates both a start and an end event.

=head2 Char (Expat, String)

This event is generated when non-markup is recognized. The non-markup
sequence of characters is in String. A single non-markup sequence of
characters may generate multiple calls to this handler. Whatever the
encoding of the string in the original document, this is given to the
handler in UTF-8.

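The Start, End and Char handlers above can be exercised with a short sketch
like the one below. The sub names, the C<@events> log and the placeholder
C<quiet_start> handler are assumptions made for illustration. Because a single
run of text may arrive in several Char calls, the text is buffered and flushed
when the next tag is seen. The sketch also uses the return value of
setHandlers to confirm which handler was replaced.

```perl
use strict;
use warnings;
use XML::Parser;

my @events;      # hypothetical log of what the handlers saw
my $text = '';   # buffer, because Char may be called several times per text run

sub flush_text {
    push @events, "text:$text" if length $text;
    $text = '';
}

sub handle_start {
    my ($expat, $element, %attr) = @_;   # $expat is the XML::Parser::Expat object
    flush_text();
    push @events, join ' ', "start:$element",
        map { "$_=$attr{$_}" } sort keys %attr;
}

sub handle_end {
    my ($expat, $element) = @_;
    flush_text();
    push @events, "end:$element";
}

sub handle_char {
    my ($expat, $string) = @_;
    $text .= $string;
}

sub quiet_start { }   # placeholder handler that is replaced below

my $parser = XML::Parser->new(Handlers => { Start => \&quiet_start });

# setHandlers returns type/handler pairs holding the handlers that were
# in effect before the call.
my %previous = $parser->setHandlers(
    Start => \&handle_start,
    End   => \&handle_end,
    Char  => \&handle_char,
);
print "Start handler replaced\n" if $previous{Start} == \&quiet_start;

$parser->parse('<foo id="me">Hello <b>World</b><br/></foo>');
print "$_\n" for @events;
```

Note that the empty tag C<< <br/> >> produces both a start and an end entry in
the log.
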
=head2 Proc (Expat, Target, Data)

This event is generated when a processing instruction is recognized.

=head2 Comment (Expat, Data)

This event is generated when a comment is recognized.

=head2 CdataStart (Expat)

This is called at the start of a CDATA section.

=head2 CdataEnd (Expat)

This is called at the end of a CDATA section.

=head2 Default (Expat, String)

This is called for any characters that don't have a registered handler.
This includes both characters that are part of markup for which no
events are generated (markup declarations) and characters that
could generate events, but for which no handler has been registered.

Whatever the encoding in the original document, the string is returned to
the handler in UTF-8.

=head2 Unparsed (Expat, Entity, Base, Sysid, Pubid, Notation)

This is called for a declaration of an unparsed entity. Entity is the name
of the entity. Base is the base to be used for resolving a relative URI.
Sysid is the system id. Pubid is the public id. Notation is the notation
name. Base and Pubid may be undefined.

=head2 Notation (Expat, Notation, Base, Sysid, Pubid)

This is called for a declaration of a notation. Notation is the notation name.
Base is the base to be used for resolving a relative URI. Sysid is the system
id. Pubid is the public id. Base, Sysid, and Pubid may all be undefined.

=head2 ExternEnt (Expat, Base, Sysid, Pubid)

This is called when an external entity is referenced. Base is the base to be
used for resolving a relative URI. Sysid is the system id. Pubid is the public
id. Base and Pubid may be undefined.

This handler should return one of the following: a string, which represents
the contents of the external entity; an open filehandle that can be read to
obtain the contents of the external entity; or undef, which indicates that the
external entity couldn't be found and generates a parse error.

If an open filehandle is returned, it must be returned as either a glob
(*FOO) or as a reference to a glob (e.g. an instance of IO::Handle). The
parser will close the filehandle after using it.

A default handler, XML::Parser::default_ext_ent_handler, is installed
for this. It only handles the file URL method and it assumes "file:" if
it isn't there. The expat base method can be used to set a basename for
relative pathnames. If no basename is given, or if the basename is itself
a relative name, then it is relative to the current working directory.

=head2 Entity (Expat, Name, Val, Sysid, Pubid, Ndata)

This is called when an entity is declared. For internal entities, the Val
parameter will contain the value and the remaining three parameters will be
undefined. For external entities, the Val parameter will be undefined, the
Sysid parameter will have the system id, the Pubid parameter will have the
public id if it was provided (it will be undefined otherwise), and the Ndata
parameter will contain the notation for unparsed entities. If this is a
parameter entity declaration, then a '%' will be prefixed to the name.

Note that this handler and the Unparsed handler above overlap. If both are
set, then this handler will not be called for unparsed entities.

=head2 Element (Expat, Name, Model)

The element handler is called when an element declaration is found. Name
is the element name, and Model is the content model as a string.

=head2 Attlist (Expat, Elname, Attname, Type, Default, Fixed)

This handler is called for each attribute in an ATTLIST declaration.
So an ATTLIST declaration that has multiple attributes will generate multiple
calls to this handler. The Elname parameter is the name of the element with
which the attribute is being associated. The Attname parameter is the name
of the attribute. Type is the attribute type, given as a string. Default is
the default value, which will either be "#REQUIRED", "#IMPLIED" or a quoted
string (i.e. the returned string will begin and end with a quote character).
If Fixed is true, then this is a fixed attribute.

=head2 Doctype (Expat, Name, Sysid, Pubid, Internal)

This handler is called for DOCTYPE declarations. Name is the document type
name. Sysid is the system id of the document type, if it was provided,
otherwise it's undefined. Pubid is the public id of the document type,
which will be undefined if no public id was given. Internal is the internal
subset, given as a string. If there was no internal subset, it will be
undefined. Internal will contain all whitespace, comments, processing
instructions, and declarations seen in the internal subset. The declarations
will be there whether or not they have been processed by another handler
(except for unparsed entities processed by the Unparsed handler). However,
comments and processing instructions will not appear if they've been processed
by their respective handlers.

=head2 XMLDecl (Expat, Version, Encoding, Standalone)

This handler is called for XML declarations. Version is a string containing
the version. Encoding is either undefined or contains an encoding string.
Standalone will be true, false, or undefined if the standalone attribute
is yes, no, or not given, respectively.

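As a rough sketch of the declaration handlers, the example below registers
XMLDecl and Attlist handlers and records their arguments. The variable names
and the sample document are assumptions for illustration. The exact form of
the quoted default string is not relied upon here, since only the documented
parameter order is being shown.

```perl
use strict;
use warnings;
use XML::Parser;

my (%decl, @attlist);

my $parser = XML::Parser->new(
    Handlers => {
        # Called once for the XML declaration at the top of the document.
        XMLDecl => sub {
            my ($expat, $version, $encoding, $standalone) = @_;
            %decl = (
                version    => $version,
                encoding   => $encoding,
                standalone => $standalone,
            );
        },
        # Called once per attribute, so twice for the ATTLIST below.
        Attlist => sub {
            my ($expat, $elname, $attname, $type, $default, $fixed) = @_;
            push @attlist, "$elname/$attname: $type $default"
                . ($fixed ? ' (fixed)' : '');
        },
    },
);

$parser->parse(<<'XML');
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE doc [
  <!ATTLIST doc id   CDATA #REQUIRED
                lang CDATA "en">
]>
<doc id="d1"/>
XML

print "version $decl{version}, encoding $decl{encoding}\n";
print "standalone\n" if $decl{standalone};
print "$_\n" for @attlist;
```
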
=head1 STYLES

=head2 Debug

This just prints out the document in outline form. Nothing special is
returned by parse.

=head2 Subs

Each time an element starts, a sub by that name in the package specified
by the Pkg option is called with the same parameters that the Start
handler gets called with.

Each time an element ends, a sub with that name appended with an underscore
("_"), is called with the same parameters that the End handler gets called
with.

Nothing special is returned by parse.

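A minimal sketch of the Subs style follows. The package name C<MySubs> and the
C<@log> array are hypothetical. Only the naming convention (a sub named after
the element, and the same name with a trailing underscore for the end tag)
comes from the description above.

```perl
use strict;
use warnings;
use XML::Parser;

package MySubs;    # hypothetical package name, passed via the Pkg option

our @log;

# Called when <item ...> starts, with the same arguments as a Start handler.
sub item {
    my ($expat, $element, %attr) = @_;
    push @log, "open $element id=$attr{id}";
}

# Called when the item element ends: the element name with "_" appended.
sub item_ {
    my ($expat, $element) = @_;
    push @log, "close $element";
}

package main;

# No sub named "list" exists in MySubs, so that element is simply skipped.
my $parser = XML::Parser->new(Style => 'Subs', Pkg => 'MySubs');
$parser->parse('<list><item id="1"/><item id="2"/></list>');
print "$_\n" for @MySubs::log;
```
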
=head2 Tree

Parse will return a parse tree for the document. Each node in the tree
takes the form of a tag, content pair. Text nodes are represented with
a pseudo-tag of "0" and the string that is their content. For elements,
the content is an array reference. The first item in the array is a
(possibly empty) hash reference containing attributes. The remainder of
the array is a sequence of tag-content pairs representing the content
of the element.

So for example the result of parsing:

  <foo><head id="a">Hello <em>there</em></head><bar>Howdy<ref/></bar>do</foo>

would be:

             Tag   Content
  ==================================================================
  [foo, [{}, head, [{id => "a"}, 0, "Hello ",  em, [{}, 0, "there"]],
             bar,  [         {}, 0, "Howdy",  ref, [{}]],
               0,  "do"
        ]
  ]

The root element "foo" has 3 children: a "head" element, a "bar"
element and the text "do". After the empty attribute hash, these are
represented in its contents by 3 tag-content pairs.

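The tag-content structure can be walked recursively. The sketch below, with
a hypothetical helper named C<dump_node>, prints the tree for the example
document as an indented outline. It relies only on the layout described above:
a pseudo-tag of "0" for text, and an array reference whose first item is the
attribute hash for elements.

```perl
use strict;
use warnings;
use XML::Parser;

# Recursively print one tag/content pair from the Tree style as an outline.
sub dump_node {
    my ($tag, $content, $depth) = @_;
    my $indent = '  ' x $depth;

    if ($tag eq '0') {                  # pseudo-tag "0" marks a text node
        print qq{$indent"$content"\n};
        return;
    }

    my ($attr, @kids) = @$content;      # first item is the attribute hash
    my $attrs = join '', map { qq{ $_="$attr->{$_}"} } sort keys %$attr;
    print "$indent<$tag$attrs>\n";

    # The rest of the array is a flat sequence of tag/content pairs.
    while (my ($kid_tag, $kid_content) = splice(@kids, 0, 2)) {
        dump_node($kid_tag, $kid_content, $depth + 1);
    }
}

my $parser = XML::Parser->new(Style => 'Tree');
my $tree   = $parser->parse(
    '<foo><head id="a">Hello <em>there</em></head><bar>Howdy<ref/></bar>do</foo>'
);
dump_node(@$tree, 0);
```
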
=head2 Objects

This is similar to the Tree style, except that a hash object is created for
each element. The corresponding object will be in the class whose name
is created by appending "::" and the element name to the package set with
the Pkg option. Non-markup text will be in the ::Characters class. The
contents of the corresponding object will be in an anonymous array that
is the value of the Kids property for that object.

=head2 Stream

This style also uses the Pkg package. If none of the subs that this
style looks for is there, then the effect of parsing with this style is
to print a canonical copy of the document without comments or declarations.
All the subs receive as their first parameter the Expat instance for the
document they're parsing.

It looks for the following routines:

=over 4

=item * StartDocument

Called at the start of the parse.

=item * StartTag

Called for every start tag with a second parameter of the element type. The $_
variable will contain a copy of the tag and the %_ variable will contain
attribute values supplied for that element.

=item * EndTag

Called for every end tag with a second parameter of the element type. The $_
variable will contain a copy of the end tag.

=item * Text

Called just before start or end tags with accumulated non-markup text in
the $_ variable.

=item * PI

Called for processing instructions. The $_ variable will contain a copy of
the PI and the target and data are sent as second and third parameters
respectively.

=item * EndDocument

Called at the conclusion of the parse.

=back

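The routines listed above can be sketched as follows. The package name
C<MyStream> and the C<@out> array are hypothetical. All three of StartTag,
EndTag and Text are defined here so that nothing falls through to the style's
default printing.

```perl
use strict;
use warnings;
use XML::Parser;

package MyStream;   # hypothetical package name, passed via the Pkg option

our @out;

sub StartTag {
    my ($expat, $element) = @_;
    # $_ holds a copy of the tag, %_ the attributes of this element.
    push @out, "start $element: $_" . (exists $_{id} ? " [id=$_{id}]" : '');
}

sub EndTag {
    my ($expat, $element) = @_;
    push @out, "end $element: $_";      # $_ holds a copy of the end tag
}

sub Text {
    # Accumulated non-markup text arrives in $_.
    push @out, "text: $_";
}

package main;

my $parser = XML::Parser->new(Style => 'Stream', Pkg => 'MyStream');
$parser->parse('<foo id="me">Hello <b>World</b></foo>');
print "$_\n" for @MyStream::out;
```
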
=head1 ENCODINGS

XML documents may be encoded in character sets other than Unicode as
long as they may be mapped into the Unicode character set. Expat has
further restrictions on encodings. Read the xmlparse.h header file in
the expat distribution to see details on these restrictions.

Expat has built-in encodings for: C<UTF-8>, C<ISO-8859-1>, C<UTF-16>, and
C<US-ASCII>. Encodings are set either through the XML declaration
encoding attribute or through the ProtocolEncoding option to XML::Parser
or XML::Parser::Expat.

For encodings other than the built-ins, expat calls the function
load_encoding in the Expat package with the encoding name. This function
looks for a file in the path list @XML::Parser::Expat::Encoding_Path, that
matches the lower-cased name with a '.enc' extension. The first one it
finds, it loads.

If you wish to build your own encoding maps, check out the XML::Encoding
module from CPAN.

=head1 AUTHORS

Larry Wall <F<larry@wall.org>> wrote version 1.0.

Clark Cooper <F<coopercc@netheaven.com>> picked up support, changed the API
for this version (2.x), provided documentation,
and added some standard package features.

=cut