diff -r 889504eac4fb -r 604ca70b6235 xml/xmlexpatparser/src/expat-1.95.5/doc_pub/reference.html --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/xml/xmlexpatparser/src/expat-1.95.5/doc_pub/reference.html Wed Sep 01 12:37:34 2010 +0100 @@ -0,0 +1,1770 @@ + + + + + + Expat XML Parser + + + + + +

Expat XML Parser

+ +

Expat is a library, written in C, for parsing XML documents. It's +the underlying XML parser for the open source Mozilla project, Perl's +XML::Parser, Python's xml.parsers.expat, and +other open-source XML parsers.

+ +

This library is the creation of James Clark, who's also given us +groff (an nroff look-alike), Jade (an implementation of ISO's DSSSL +stylesheet language for SGML), XP (a Java XML parser package), and XT (a +Java XSL engine). James was also the technical lead on the XML +Working Group at W3C that produced the XML specification.

+ +

This is free software, licensed under the MIT/X Consortium license. You may download it +from the Expat home page. +

+ +

The bulk of this document was originally commissioned as an article by +XML.com. They graciously allowed +Clark Cooper to retain copyright and to distribute it with Expat.

+ +

Overview
Building and Installing
Using Expat
Reference +
+

+ +

Overview

+ +

Expat is a stream-oriented parser. You register callback (or +handler) functions with the parser and then start feeding it the +document. As the parser recognizes parts of the document, it will +call the appropriate handler for that part (if you've registered one). +The document is fed to the parser in pieces, so you can start parsing +before you have the whole document. This also allows you to parse really +huge documents that won't fit into memory.

+ +

Expat can be intimidating due to the many kinds of handlers and +options you can set. But you only need to learn four functions in +order to do 90% of what you'll want to do with it:

+ +

XML_ParserCreate: Create a new parser object.
XML_SetElementHandler: Set handlers for start and end tags.
XML_SetCharacterDataHandler: Set handler for text.
XML_Parse: Pass a buffer full of document to the parser.

+ +

These functions and others are described in the reference part of this document. The reference +section also describes in detail the parameters passed to the +different types of handlers.

+ +

Let's look at a very simple example program that only uses 3 of the +above functions (it doesn't need to set a character handler.) The +program outline.c prints an +element outline, indenting child elements to distinguish them from the +parent element that contains them. The start handler does all the +work. It prints two indenting spaces for every level of ancestor +elements, then it prints the element and attribute +information. Finally it increments the global Depth +variable.

+ +

+int Depth;
+
+void
+start(void *data, const char *el, const char **attr) {
+  int i;
+
+  for (i = 0; i < Depth; i++)
+    printf("  ");
+
+  printf("%s", el);
+
+  for (i = 0; attr[i]; i += 2) {
+    printf(" %s='%s'", attr[i], attr[i + 1]);
+  }
+
+  printf("\n");
+  Depth++;
+}  /* End of start handler */
+

+ +

The end tag handler simply does the bookkeeping work of decrementing +Depth.

+void
+end(void *data, const char *el) {
+  Depth--;
+}  /* End of end handler */
+

+ +

After creating the parser, the main program just has the job of +shoveling the document to the parser so that it can do its work.

+ +

Building and Installing Expat

+ +

The Expat distribution comes as a compressed (with GNU gzip) tar +file. You may download the latest version from SourceForge. After +unpacking this, cd into the directory. Then follow either the Win32 +directions or Unix directions below.

+ +

Building under Win32

+ +

If you're using the GNU compiler under cygwin, follow the Unix +directions in the next section. Otherwise if you have Microsoft's +Developer Studio installed, then from Windows Explorer double-click on +"expat.dsp" in the lib directory and build and install in the usual +manner.

+ +

Alternatively, you may download the Win32 binary package that +contains the "expat.h" include file and a pre-built DLL.

+ +

Building under Unix (or GNU)

+ +

First you'll need to run the configure shell script in order to +configure the Makefiles and headers for your system.

+ +

If you're happy with all the defaults that configure picks for you, +and you have permission on your system to install into /usr/local, you +can install Expat with this sequence of commands:

+ +

+   ./configure
+   make
+   make install
+

+ +

There are some options that you can provide to this script, but the +only one we'll mention here is the --prefix option. You +can find out all the options available by running configure with just +the --help option.

+ +

By default, the configure script sets things up so that the library +gets installed in /usr/local/lib and the associated +header file in /usr/local/include. But if you were to +give the option, --prefix=/home/me/mystuff, then the +library and header would get installed in +/home/me/mystuff/lib and +/home/me/mystuff/include respectively.

+ +

Using Expat

+ +

Compiling and Linking Against Expat

+ +

Unless you installed Expat in a location not expected by your +compiler and linker, all you have to do to use Expat in your programs +is to include the Expat header (#include <expat.h>) +in your files that make calls to it and to tell the linker that it +needs to link against the Expat library. On Unix systems, this would +usually be done with the -lexpat argument. Otherwise, +you'll need to tell the compiler where to look for the Expat header +and the linker where to find the Expat library. You may also need to +take steps to tell the operating system where to find this library at +run time.

+ +

On a Unix-based system, here's what a Makefile might look like when +Expat is installed in a standard location:

+ +

+CC=cc
+LDFLAGS=
+LIBS= -lexpat
+xmlapp: xmlapp.o
+        $(CC) $(LDFLAGS) -o xmlapp xmlapp.o $(LIBS)
+

+ +

If you installed Expat in, say, /home/me/mystuff, then +the Makefile would look like this:

+ +

+CC=cc
+CFLAGS= -I/home/me/mystuff/include
+LDFLAGS=
+LIBS= -L/home/me/mystuff/lib -lexpat
+xmlapp: xmlapp.o
+        $(CC) $(LDFLAGS) -o xmlapp xmlapp.o $(LIBS)
+

+ +

You'd also have to set the environment variable +LD_LIBRARY_PATH to /home/me/mystuff/lib (or +to ${LD_LIBRARY_PATH}:/home/me/mystuff/lib if +LD_LIBRARY_PATH already has some directories in it) in order to run +your application.

+ +

Expat Basics

+ +

As we saw in the example in the overview, the first step in parsing +an XML document with Expat is to create a parser object. There are three functions in the Expat API for creating a +parser object. However, only two of these (XML_ParserCreate and XML_ParserCreateNS) can be used for +constructing a parser for a top-level document. The object returned +by these functions is an opaque pointer (i.e. "expat.h" declares it as +void *) to data with further internal structure. In order to free the +memory associated with this object you must call XML_ParserFree. Note that if you have +provided any user data that gets stored in the +parser, then your application is responsible for freeing it prior to +calling XML_ParserFree.

+ +

The objects returned by the parser creation functions are good for +parsing only one XML document or external parsed entity. If your +application needs to parse many XML documents, then it needs to create +a parser object for each one. The best way to deal with this is to +create a higher level object that contains all the default +initialization you want for your parser objects.

+ +

Walking through a document hierarchy with a stream oriented parser +will require a good stack mechanism in order to keep track of current +context. For instance, to answer the simple question, "What element +does this text belong to?" requires a stack, since the parser may have +descended into other elements that are children of the current one and +has encountered this text on the way out.

+ +

The things you're likely to want to keep on a stack are the +currently opened element and its attributes. You push this +information onto the stack in the start handler and you pop it off in +the end handler.
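The push-and-pop discipline described above can be sketched as follows. This is only an illustration, not code from the Expat distribution: the ElemStack type, the MAX_DEPTH and MAX_NAME limits, and the function names are assumptions made for the example. The two handlers use the parameter types of Expat's start and end handlers in a build where XML_Char is char, so the intent is that they could be registered with XML_SetElementHandler, with a pointer to the stack supplied as user data.

```c
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 64    /* arbitrary limit chosen for this sketch */
#define MAX_NAME  128   /* longer names are truncated in this sketch */

typedef struct {
  char names[MAX_DEPTH][MAX_NAME];  /* copies of the open element names */
  int  depth;                       /* number of currently open elements */
} ElemStack;

/* Start handler: push a copy of the element name.  The name is copied
   so that the sketch does not depend on how long the parser keeps the
   string alive. */
void
push_start(void *data, const char *el, const char **attr) {
  ElemStack *st = (ElemStack *) data;

  (void) attr;                      /* attributes are ignored here */
  if (st->depth < MAX_DEPTH) {
    strncpy(st->names[st->depth], el, MAX_NAME - 1);
    st->names[st->depth][MAX_NAME - 1] = '\0';
  }
  st->depth++;                      /* still counted when too deep to store */
}

/* End handler: pop the innermost element. */
void
pop_end(void *data, const char *el) {
  ElemStack *st = (ElemStack *) data;

  (void) el;
  if (st->depth > 0)
    st->depth--;
}

/* Answers "what element does this text belong to?".  Returns NULL when
   no element is open or when the nesting exceeds MAX_DEPTH. */
const char *
current_element(const ElemStack *st) {
  if (st->depth == 0 || st->depth > MAX_DEPTH)
    return NULL;
  return st->names[st->depth - 1];
}
```

A character data handler that receives the same ElemStack pointer could call current_element to find the element that contains the text it was given.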

+ +

For some tasks, it is sufficient to just keep information on what +the depth of the stack is (or would be if you had one.) The outline +program shown above presents one example. Another such task would be +skipping over a complete element. When you see the start tag for the +element you want to skip, you set a skip flag and record the depth at +which the element started. When the end tag handler encounters the +same depth, the skipped element has ended and the flag may be +cleared. If you follow the convention that the root element starts at +1, then you can use the same variable for skip flag and skip +depth.

+ +

+void
+init_info(Parseinfo *info) {
+  info->skip = 0;
+  info->depth = 1;
+  /* Other initializations here */
+}  /* End of init_info */
+
+void
+rawstart(void *data, const char *el, const char **attr) {
+  Parseinfo *inf = (Parseinfo *) data;
+
+  if (! inf->skip) {
+    if (should_skip(inf, el, attr)) {
+      inf->skip = inf->depth;
+    }
+    else
+      start(inf, el, attr);     /* This does rest of start handling */
+  }
+
+  inf->depth++;
+}  /* End of rawstart */
+
+void
+rawend(void *data, const char *el) {
+  Parseinfo *inf = (Parseinfo *) data;
+
+  inf->depth--;
+
+  if (! inf->skip)
+    end(inf, el);              /* This does rest of end handling */
+
+  if (inf->skip == inf->depth)
+    inf->skip = 0;
+}  /* End rawend */
+

+ +

Notice in the above example the difference in how depth is +manipulated in the start and end handlers. The end tag handler should +be the mirror image of the start tag handler. This is necessary to +properly model containment. Since, in the start tag handler, we +incremented depth after the main body of start tag code, then +in the end handler, we need to manipulate it before the main +body. If we'd decided to increment it first thing in the start +handler, then we'd have had to decrement it last thing in the end +handler.

+ +

Communicating between handlers

+ +

In order to be able to pass information between different handlers +without using globals, you'll need to define a data structure to hold +the shared variables. You can then tell Expat (with the XML_SetUserData function) to pass a +pointer to this structure to the handlers. This pointer is the first +argument received by most handlers.

+ +

XML Version

+ +

Expat is an XML 1.0 parser, and as such never complains based on +the value of the version pseudo-attribute in the XML +declaration, if present.

+ +

If an application needs to check the version number (to support +alternate processing), it should use the XML_SetXmlDeclHandler function to +set a handler that uses the information in the XML declaration to +determine what to do. This example shows how to check that only a +version number of "1.0" is accepted:

+ +

+static int wrong_version;
+static XML_Parser parser;
+
+static void
+xmldecl_handler(void            *userData,
+                const XML_Char  *version,
+                const XML_Char  *encoding,
+                int              standalone)
+{
+  static const XML_Char Version_1_0[] = {'1', '.', '0', 0};
+
+  int i;
+
+  for (i = 0; i < (sizeof(Version_1_0) / sizeof(Version_1_0[0])); ++i) {
+    if (version[i] != Version_1_0[i]) {
+      wrong_version = 1;
+      /* also clear all other handlers: */
+      XML_SetCharacterDataHandler(parser, NULL);
+      ...
+      return;
+    }
+  }
+  ...
+}
+

+ +

Namespace Processing

+ +

When the parser is created using the XML_ParserCreateNS function, Expat +performs namespace processing. Under namespace processing, Expat +consumes xmlns and xmlns:... attributes, +which declare namespaces for the scope of the element in which they +occur. This means that your start handler will not see these +attributes. Your application can still be informed of these +declarations by setting namespace declaration handlers with XML_SetNamespaceDeclHandler.

+ +

Element type and attribute names that belong to a given namespace +are passed to the appropriate handler in expanded form. By default +this expanded form is a concatenation of the namespace URI, the +separator character (which is the 2nd argument to XML_ParserCreateNS), and the local +name (i.e. the part after the colon). Names with undeclared prefixes +are passed through to the handlers unchanged, with the prefix and +colon still attached. Unprefixed attribute names are never expanded, +and unprefixed element names are only expanded when they are in the +scope of a default namespace.

+ +

However if XML_SetReturnNSTriplet has been called with a non-zero +do_nst parameter, then the expanded form for names with +an explicit prefix is a concatenation of: URI, separator, local name, +separator, prefix.
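Taking an expanded name apart again might look like the sketch below. It is a hedged illustration rather than part of the Expat API: the NameParts type and split_expanded_name function are names invented for the example, and it assumes a build where XML_Char is char and a non-zero separator that never occurs inside the namespace URI. The pieces are returned as pointer and length pairs into the original string, so nothing is copied.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
  const char *uri;     /* NULL when the name was not expanded */
  size_t      uri_len;
  const char *local;   /* local name, or the whole unexpanded name */
  size_t      local_len;
  const char *prefix;  /* NULL unless a prefix part is present */
  size_t      prefix_len;
} NameParts;

/* Split "URI<sep>local" or "URI<sep>local<sep>prefix" into its parts.
   A name without any separator is reported as a local name only. */
void
split_expanded_name(const char *name, char sep, NameParts *out) {
  const char *first = strchr(name, sep);
  const char *second;

  out->uri = NULL;
  out->uri_len = 0;
  out->prefix = NULL;
  out->prefix_len = 0;

  if (first == NULL) {              /* no separator: not expanded */
    out->local = name;
    out->local_len = strlen(name);
    return;
  }

  out->uri = name;
  out->uri_len = (size_t) (first - name);
  out->local = first + 1;

  second = strchr(out->local, sep);
  if (second == NULL) {             /* URI and local name only */
    out->local_len = strlen(out->local);
  } else {                          /* triplet form with a prefix */
    out->local_len = (size_t) (second - out->local);
    out->prefix = second + 1;
    out->prefix_len = strlen(out->prefix);
  }
}
```

Because the parts are not nul-terminated individually (only the last one ends at the string terminator), callers should compare them with strncmp and the reported length.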

+ +

You can set handlers for the start of a namespace declaration and +for the end of a scope of a declaration with the XML_SetNamespaceDeclHandler +function. The StartNamespaceDeclHandler is called prior to the start +tag handler and the EndNamespaceDeclHandler is called after the +corresponding end tag that ends the namespace's scope. The namespace +start handler gets passed the prefix and URI for the namespace. For a +default namespace declaration (xmlns='...'), the prefix will be null. +The URI will be null for the case where the default namespace is being +unset. The namespace end handler just gets the prefix for the closing +scope.

+ +

These handlers are called for each declaration. So if, for +instance, a start tag had three namespace declarations, then the +StartNamespaceDeclHandler would be called three times before the start +tag handler is called, once for each declaration.

+ +

Character Encodings

+ +

While XML is based on Unicode, and every XML processor is required +to recognize UTF-8 and UTF-16 (encodings of Unicode that use 1-byte and 2-byte code units, respectively), +other encodings may be declared in XML documents or entities. For the +main document, an XML declaration may contain an encoding +declaration:

+<?xml version="1.0" encoding="ISO-8859-2"?>
+

+ +

External parsed entities may begin with a text declaration, which +looks like an XML declaration with just an encoding declaration:

+<?xml encoding="Big5"?>
+

+ +

With Expat, you may also specify an encoding at the time of +creating a parser. This is useful when the encoding information may +come from a source outside the document itself (like a higher level +protocol.)

+ +

There are four built-in encodings +in Expat:

UTF-8
UTF-16
ISO-8859-1
US-ASCII

+ +

Anything else discovered in an encoding declaration or in the +protocol encoding specified in the parser constructor triggers a call +to the UnknownEncodingHandler. This handler gets passed +the encoding name and a pointer to an XML_Encoding data +structure. Your handler must fill in this structure and return 1 if it +knows how to deal with the encoding. Otherwise the handler should +return 0. The handler also gets passed a pointer to an optional +application data structure that you may indicate when you set the +handler.

+ +

Expat places the following restrictions on character encodings that it can +support by filling in the XML_Encoding structure:

Every ASCII character that can appear in a well-formed XML document +must be represented by a single byte, and that byte must correspond to +its ASCII encoding (except for the characters $@\^'{}~).
Characters must be encoded in 4 bytes or less.
All characters encoded must have Unicode scalar values less than or +equal to 65535 (0xFFFF). This does not apply to the built-in support +for UTF-16 and UTF-8.
No character may be encoded by more than one distinct sequence of +bytes.

+ +

XML_Encoding contains an array of integers that +correspond to the 1st byte of an encoding sequence. If the value in +the array for a byte is zero or positive, then the byte is a single +byte encoding that encodes the Unicode scalar value contained in the +array. A -1 in this array indicates a malformed byte. If the value is +-2, -3, or -4, then the byte is the beginning of a 2, 3, or 4 byte +sequence respectively. Multi-byte sequences are sent to the convert +function pointed at in the XML_Encoding structure. This +function should return the Unicode scalar value for the sequence or -1 +if the sequence is malformed.
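For a purely single-byte encoding, filling in the array might look like the sketch below, which uses ISO-8859-15 as the example. The function name fill_latin9_map is an assumption made for illustration, and the function works on a plain int[256] so that it stands alone; inside a real unknown encoding handler the same values would go into the map member of the XML_Encoding structure, and no convert function would be needed because no entry is negative.

```c
/* Fill a 256-entry table, laid out like the map array of XML_Encoding,
   for ISO-8859-15.  Every byte is a complete single-byte character, so
   each entry is simply the Unicode scalar value that the byte encodes. */
void
fill_latin9_map(int map[256]) {
  int i;

  /* ISO-8859-15 agrees with ISO-8859-1 (whose byte values equal the
     Unicode scalar values) everywhere except the eight bytes below. */
  for (i = 0; i < 256; i++)
    map[i] = i;

  map[0xA4] = 0x20AC;   /* EURO SIGN */
  map[0xA6] = 0x0160;   /* LATIN CAPITAL LETTER S WITH CARON */
  map[0xA8] = 0x0161;   /* LATIN SMALL LETTER S WITH CARON */
  map[0xB4] = 0x017D;   /* LATIN CAPITAL LETTER Z WITH CARON */
  map[0xB8] = 0x017E;   /* LATIN SMALL LETTER Z WITH CARON */
  map[0xBC] = 0x0152;   /* LATIN CAPITAL LIGATURE OE */
  map[0xBD] = 0x0153;   /* LATIN SMALL LIGATURE OE */
  map[0xBE] = 0x0178;   /* LATIN CAPITAL LETTER Y WITH DIAERESIS */
}
```

This table satisfies the restrictions listed earlier: ASCII characters keep their ASCII byte values, every value is below 0xFFFF, and no character has more than one encoding.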

+ +

One pitfall that novice Expat users are likely to fall into is that +although Expat may accept input in various encodings, the strings that +it passes to the handlers are always encoded in UTF-8 or UTF-16 +(depending on how Expat was compiled). Your application is responsible +for any translation of these strings into other encodings.

+ +

Handling External Entity References

+ +

Expat does not read or parse external entities directly. Note that +any external DTD is a special case of an external entity. If you've +set no ExternalEntityRefHandler, then external entity +references are silently ignored. Otherwise, Expat calls your handler with +the information needed to read and parse the external entity.

+ +

Your handler isn't actually responsible for parsing the entity, but +it is responsible for creating a subsidiary parser with XML_ExternalEntityParserCreate that will do the job. This +returns an instance of XML_Parser that has handlers and +other data structures initialized from the parent parser. You may then +use XML_Parse or XML_ParseBuffer calls against this +parser. Since external entities may refer to other external entities, +your handler should be prepared to be called recursively.

+ +

Parsing DTDs

+ +

In order to parse parameter entities, before starting the parse, +you must call XML_SetParamEntityParsing with one of the following +arguments:

XML_PARAM_ENTITY_PARSING_NEVER: Don't parse parameter entities or the external subset.
XML_PARAM_ENTITY_PARSING_UNLESS_STANDALONE: Parse parameter entities and the external subset unless +standalone was set to "yes" in the XML declaration.
XML_PARAM_ENTITY_PARSING_ALWAYS: Always parse parameter entities and the external subset.

+ +

In order to read an external DTD, you also have to set an external +entity reference handler as described above.

+ +

+ + +

Expat Reference

+ +

Parser Creation

+ +

+XML_Parser
+XML_ParserCreate(const XML_Char *encoding);
+

+Construct a new parser. If encoding is non-null, it specifies a +character encoding to use for the document. This overrides the document +encoding declaration. There are four built-in encodings: +

US-ASCII
UTF-8
UTF-16
ISO-8859-1

+Any other value will invoke a call to the UnknownEncodingHandler. +

+ +

+XML_Parser
+XML_ParserCreateNS(const XML_Char *encoding,
+                   XML_Char sep);
+

+Constructs a new parser that has namespace processing in effect. Namespace +expanded element names and attribute names are returned as a concatenation +of the namespace URI, sep, and the local part of the name. This +means that you should pick a character for sep that can't be +part of a legal URI.

+ +

+XML_Parser
+XML_ParserCreate_MM(const XML_Char *encoding,
+                    const XML_Memory_Handling_Suite *ms,
+		    const XML_Char *sep);
+

+typedef struct {
+  void *(*malloc_fcn)(size_t size);
+  void *(*realloc_fcn)(void *ptr, size_t size);
+  void (*free_fcn)(void *ptr);
+} XML_Memory_Handling_Suite;
+

Construct a new parser using the suite of memory handling functions +specified in ms. If ms is NULL, then use the +standard set of memory management functions. If sep is +non-NULL, then namespace processing is enabled in the created parser +and the character pointed at by sep is used as the separator between +the namespace URI and the local part of the name.

+ +

+XML_Parser
+XML_ExternalEntityParserCreate(XML_Parser p,
+                               const XML_Char *context,
+                               const XML_Char *encoding);
+

+Construct a new XML_Parser object for parsing an external +general entity. Context is the context argument passed in a call to an +ExternalEntityRefHandler. Other state information, such as handlers, +user data, and namespace processing, is inherited from the parser passed as +the 1st argument. So you shouldn't need to call any of the behavior +changing functions on this parser (unless you want it to act +differently than the parent parser). +

+ +

+void
+XML_ParserFree(XML_Parser p);
+

+Free memory used by the parser. Your application is responsible for +freeing any memory associated with user data. +

+ +

+XML_Bool
+XML_ParserReset(XML_Parser p);
+

+Clean up the memory structures maintained by the parser so that it may +be used again. After this has been called, the parser is +ready to start parsing a new document. This function may not be used +on a parser created using XML_ExternalEntityParserCreate; it will return XML_FALSE in that case. Returns +XML_TRUE on success. Your application is responsible for +dealing with any memory associated with user data. +

+ +

Parsing

+ +

To state the obvious: the three parsing functions XML_Parse, XML_ParseBuffer and XML_GetBuffer must not be +called from within a handler unless they operate on a separate parser +instance, that is, one that did not call the handler. For example, it +is OK to call the parsing functions from within an +XML_ExternalEntityRefHandler, if they apply to the parser +created by XML_ExternalEntityParserCreate.

+ +

+XML_Status
+XML_Parse(XML_Parser p,
+          const char *s,
+          int len,
+          int isFinal);
+

+enum XML_Status {
+  XML_STATUS_ERROR = 0,
+  XML_STATUS_OK = 1
+};
+

+Parse some more of the document. The string s is a buffer +containing part (or perhaps all) of the document. The number of bytes of s +that are part of the document is indicated by len. This means +that s doesn't have to be null terminated. It also means that +if len is larger than the number of bytes in the block of +memory that s points at, then a memory fault is likely. The +isFinal parameter informs the parser that this is the last +piece of the document. Frequently, the last piece is empty (i.e. +len is zero). +If a parse error occurred, it returns XML_STATUS_ERROR. +Otherwise it returns XML_STATUS_OK. +

+ +

+XML_Status
+XML_ParseBuffer(XML_Parser p,
+                int len,
+                int isFinal);
+

+This is just like XML_Parse, +except in this case Expat provides the buffer. By obtaining the +buffer from Expat with the XML_GetBuffer function, the application can avoid double +copying of the input. +

+ +

+void *
+XML_GetBuffer(XML_Parser p,
+              int len);
+

+Obtain a buffer of size len to read a piece of the document +into. A NULL value is returned if Expat can't allocate enough memory for +this buffer. This has to be called prior to every call to +XML_ParseBuffer. A +typical use would look like this: + +

+for (;;) {
+  int bytes_read;
+  void *buff = XML_GetBuffer(p, BUFF_SIZE);
+  if (buff == NULL) {
+    /* handle error */
+  }
+
+  bytes_read = read(docfd, buff, BUFF_SIZE);
+  if (bytes_read < 0) {
+    /* handle error */
+  }
+
+  if (! XML_ParseBuffer(p, bytes_read, bytes_read == 0)) {
+    /* handle parse error */
+  }
+
+  if (bytes_read == 0)
+    break;
+}
+

+ +

Handler Setting

+ +

Although handlers are typically set prior to parsing and left alone, an +application may choose to set or change the handler for a parsing event +while the parse is in progress. For instance, your application may choose +to ignore all text not descended from a para element. One +way it could do this is to set the character handler when a para start tag +is seen, and unset it for the corresponding end tag.

+ +

A handler may be unset by providing a NULL pointer to the +appropriate handler setter. None of the handler setting functions have +a return value.

+ +

Your handlers will be receiving strings in arrays of type +XML_Char. In the default build this type is defined in expat.h as char, +and the strings contain bytes encoding UTF-8. Note that you'll receive +them in this form independent of the original encoding of the +document.

+ +

+XML_SetStartElementHandler(XML_Parser p,
+                           XML_StartElementHandler start);
+

+typedef void
+(*XML_StartElementHandler)(void *userData,
+                           const XML_Char *name,
+                           const XML_Char **atts);
+

Set handler for start (and empty) tags. Attributes are passed to the start +handler as a pointer to a vector of char pointers. Each attribute seen in +a start (or empty) tag occupies 2 consecutive places in this vector: the +attribute name followed by the attribute value. These pairs are terminated +by a null pointer.

Note that an empty tag generates a call to both start and end handlers +(in that order).
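The name/value layout of the attribute vector described above lends itself to a small lookup helper. The sketch below is illustrative only: find_attr is a name invented for this example, not an Expat function, and it assumes a build where XML_Char is char so that strcmp can be used on the names.

```c
#include <stddef.h>
#include <string.h>

/* Return the value of the attribute called name, or NULL if the start
   tag did not carry it.  atts is the vector passed to a start handler:
   name, value, name, value, ..., terminated by a null pointer. */
const char *
find_attr(const char **atts, const char *name) {
  int i;

  for (i = 0; atts[i] != NULL; i += 2) {   /* step over name/value pairs */
    if (strcmp(atts[i], name) == 0)
      return atts[i + 1];                  /* the value follows the name */
  }
  return NULL;
}
```

A start handler could then write, for instance, find_attr(atts, "id") instead of repeating the loop for every attribute it is interested in.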

+ +

+XML_SetEndElementHandler(XML_Parser p,
+                         XML_EndElementHandler);
+

+typedef void
+(*XML_EndElementHandler)(void *userData,
+                         const XML_Char *name);
+

Set handler for end (and empty) tags. As noted above, an empty tag +generates a call to both start and end handlers.

+ +

+XML_SetElementHandler(XML_Parser p,
+                      XML_StartElementHandler start,
+                      XML_EndElementHandler end);
+

Set handlers for start and end tags with one call.

+ +

+XML_SetCharacterDataHandler(XML_Parser p,
+                            XML_CharacterDataHandler charhndl)
+

+typedef void
+(*XML_CharacterDataHandler)(void *userData,
+                            const XML_Char *s,
+                            int len);
+

Set a text handler. The string your handler receives +is NOT nul-terminated. You have to use the length argument +to deal with the end of the string. A single block of contiguous text +free of markup may still result in a sequence of calls to this handler. +In other words, if you're searching for a pattern in the text, it may +be split across calls to this handler.
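Since the text is not nul-terminated and may arrive in several pieces, one common approach is to append each piece to a growing buffer and only examine the buffer once the enclosing element ends. The sketch below shows the idea under stated assumptions: the TextBuf type and collect_text function are invented for this example, XML_Char is assumed to be char, and a pointer to the TextBuf is assumed to have been supplied as user data.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
  char  *buf;     /* accumulated text; nul-terminated once non-NULL */
  size_t len;     /* bytes stored, not counting the terminator */
  int    failed;  /* set to 1 if memory could not be allocated */
} TextBuf;

/* Character data handler: append len bytes of s to the buffer. */
void
collect_text(void *data, const char *s, int len) {
  TextBuf *tb = (TextBuf *) data;
  char *grown;

  if (tb->failed || len <= 0)
    return;

  /* Room for the old text, the new piece, and a terminator. */
  grown = (char *) realloc(tb->buf, tb->len + (size_t) len + 1);
  if (grown == NULL) {
    tb->failed = 1;       /* old buffer is left intact for the caller */
    return;
  }

  memcpy(grown + tb->len, s, (size_t) len);   /* never read past len */
  tb->len += (size_t) len;
  grown[tb->len] = '\0';
  tb->buf = grown;
}
```

An end element handler would typically consume tb->buf, free it, and reset the structure before the next run of text. Searching the accumulated buffer rather than each individual piece avoids missing a pattern that was split across calls.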

+ +

+XML_SetProcessingInstructionHandler(XML_Parser p,
+                                    XML_ProcessingInstructionHandler proc)
+

+typedef void
+(*XML_ProcessingInstructionHandler)(void *userData,
+                                    const XML_Char *target,
+                                    const XML_Char *data);
+
+

Set a handler for processing instructions. The target is the first word +in the processing instruction. The data is the rest of the characters in +it after skipping all whitespace after the initial word.

+ +

+XML_SetCommentHandler(XML_Parser p,
+                      XML_CommentHandler cmnt)
+

+typedef void
+(*XML_CommentHandler)(void *userData,
+                      const XML_Char *data);
+

Set a handler for comments. The data is all text inside the comment +delimiters.

+ +

+XML_SetStartCdataSectionHandler(XML_Parser p,
+                                XML_StartCdataSectionHandler start);
+

+typedef void
+(*XML_StartCdataSectionHandler)(void *userData);
+

Set a handler that gets called at the beginning of a CDATA section.

+ +

+XML_SetEndCdataSectionHandler(XML_Parser p,
+                              XML_EndCdataSectionHandler end);
+

+typedef void
+(*XML_EndCdataSectionHandler)(void *userData);
+

Set a handler that gets called at the end of a CDATA section.

+ +

+XML_SetCdataSectionHandler(XML_Parser p,
+                           XML_StartCdataSectionHandler start,
+                           XML_EndCdataSectionHandler end)
+

Sets both CDATA section handlers with one call.

+ +

+XML_SetDefaultHandler(XML_Parser p,
+                      XML_DefaultHandler hndl)
+

+typedef void
+(*XML_DefaultHandler)(void *userData,
+                      const XML_Char *s,
+                      int len);
+

+ +

Sets a handler for any characters in the document which wouldn't +otherwise be handled. This includes both data for which no handlers +can be set (like some kinds of DTD declarations) and data which could +be reported but which currently has no handler set. The characters +are passed exactly as they were present in the XML document except +that they will be encoded in UTF-8 or UTF-16. Line boundaries are not +normalized. Note that a byte order mark character is not passed to the +default handler. There are no guarantees about how characters are +divided between calls to the default handler: for example, a comment +might be split between multiple calls. Setting the handler with +this call has the side effect of turning off expansion of references +to internally defined general entities. Instead these references are +passed to the default handler.

+ +

See also XML_DefaultCurrent.

+ +

+XML_SetExternalEntityRefHandler(XML_Parser p,
+                                XML_ExternalEntityRefHandler hndl)
+

+typedef int
+(*XML_ExternalEntityRefHandler)(XML_Parser p,
+                                const XML_Char *context,
+                                const XML_Char *base,
+                                const XML_Char *systemId,
+                                const XML_Char *publicId);
+

Set an external entity reference handler. This handler is also +called for processing an external DTD subset if parameter entity parsing +is in effect. (See +XML_SetParamEntityParsing.)

+ + +

The base parameter is the base to use for relative system identifiers. +It is set by XML_SetBase and may be null. The +public id parameter is the public id given in the entity declaration and +may be null. The system id is the system identifier specified in the entity +declaration and is never null.

+ +

There are a couple of ways in which this handler differs from others. +First, this handler returns an integer. A non-zero value should be returned +for successful handling of the external entity reference. Returning a zero +indicates failure, and causes the calling parser to return +an XML_ERROR_EXTERNAL_ENTITY_HANDLING error.

+ +

Second, instead of having userData as its first argument, it receives the +parser that encountered the entity reference. This, along with the context +parameter, may be used as arguments to a call to +XML_ExternalEntityParserCreate. +Using the returned parser, the body of the external entity can be recursively +parsed.

+ +

Since this handler may be called recursively, it should not be saving +information into global or static variables.

+ +

+XML_SetSkippedEntityHandler(XML_Parser p,
+                            XML_SkippedEntityHandler handler)
+

+typedef void
+(*XML_SkippedEntityHandler)(void *userData,
+                            const XML_Char *entityName,
+                            int is_parameter_entity);
+

Set a skipped entity handler. This is called in two situations:

An entity reference is encountered for which no declaration + has been read and this is not an error.
An internal entity reference is read, but not expanded, because + XML_SetDefaultHandler + has been called.

The is_parameter_entity argument will be non-zero for +a parameter entity and zero for a general entity.

Note: skipped +parameter entities in declarations and skipped general entities in +attribute values cannot be reported, because the event would be out of +sync with the reporting of the declarations or attribute values.

+ +

+XML_SetUnknownEncodingHandler(XML_Parser p,
+                              XML_UnknownEncodingHandler enchandler,
+			      void *encodingHandlerData)
+

+typedef int
+(*XML_UnknownEncodingHandler)(void *encodingHandlerData,
+                              const XML_Char *name,
+                              XML_Encoding *info);
+
+typedef struct {
+  int map[256];
+  void *data;
+  int (*convert)(void *data, const char *s);
+  void (*release)(void *data);
+} XML_Encoding;
+

Set a handler to deal with encodings other than the +built in set. This should be done before +XML_Parse or XML_ParseBuffer has been called on the +given parser.

If the handler knows how to deal with an encoding with the given +name, it should fill in the info data structure and return +1. Otherwise it should return 0. The handler will be called at most +once per parsed (external) entity. The optional application data +pointer encodingHandlerData will be passed back to the +handler.

+ +

The map array contains information for every possible leading byte in a byte sequence. If the corresponding value is >= 0, then it's a single-byte sequence and the byte encodes that Unicode value. If the value is -1, then that byte is invalid as the initial byte in a sequence. If the value is -n, where n is an integer > 1, then n is the number of bytes in the sequence and the actual conversion is accomplished by a call to the function pointed at by convert. This function may return -1 if the sequence itself is invalid. The convert pointer may be NULL if there are only single-byte codes. The data parameter passed to the convert function is the data pointer from XML_Encoding. The string s is not NUL-terminated and points at the sequence of bytes to be converted.

+ +

The function pointed at by release is called by the parser when it is finished with the encoding. It may be NULL.
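The single-byte case of the map rules above can be sketched as follows. This is a minimal illustration, not code from Expat: the function name fill_single_byte_map and the high_half table are hypothetical, and the sketch assumes an encoding whose lower 128 bytes are plain ASCII. It works on a bare int[256] with the same layout as the map member of XML_Encoding, so it compiles without expat.h.

```c
#include <assert.h>
#include <stddef.h>

/* Fill a 256-entry table laid out like XML_Encoding.map for a single-byte
   encoding whose lower half is ASCII (an assumption of this sketch).
   high_half[i] is the Unicode code point for byte 0x80 + i, or any
   negative number if that byte is not used by the encoding. */
void
fill_single_byte_map(int map[256], const int high_half[128])
{
  int i;

  assert(map != NULL && high_half != NULL);

  for (i = 0; i < 128; i++)
    map[i] = i;                 /* ASCII bytes encode themselves */

  for (i = 0; i < 128; i++)     /* -1 marks an invalid leading byte */
    map[128 + i] = (high_half[i] >= 0) ? high_half[i] : -1;
}
```

An unknown encoding handler for such an encoding would presumably call this on info->map, set info->data, info->convert and info->release to NULL (no multi-byte sequences, nothing to release), and return 1.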

+ +

+XML_SetStartNamespaceDeclHandler(XML_Parser p,
+			         XML_StartNamespaceDeclHandler start);
+

+typedef void
+(*XML_StartNamespaceDeclHandler)(void *userData,
+                                 const XML_Char *prefix,
+                                 const XML_Char *uri);
+

Set a handler to be called when a namespace is declared. Namespace declarations occur inside start tags, but the namespace declaration start handler is called before the start tag handler for each namespace declared in that start tag.

+ +

+XML_SetEndNamespaceDeclHandler(XML_Parser p,
+			       XML_EndNamespaceDeclHandler end);
+

+typedef void
+(*XML_EndNamespaceDeclHandler)(void *userData,
+                               const XML_Char *prefix);
+

Set a handler to be called when leaving the scope of a namespace declaration. This will be called, for each namespace declaration, after the handler for the end tag of the element in which the namespace was declared.

+ +

+XML_SetNamespaceDeclHandler(XML_Parser p,
+                            XML_StartNamespaceDeclHandler start,
+                            XML_EndNamespaceDeclHandler end)
+

Sets both namespace declaration handlers with a single call.

+ +

+XML_SetXmlDeclHandler(XML_Parser p,
+		      XML_XmlDeclHandler xmldecl);
+

+typedef void
+(*XML_XmlDeclHandler) (void            *userData,
+                       const XML_Char  *version,
+                       const XML_Char  *encoding,
+                       int             standalone);
+

Sets a handler that is called for XML declarations and also for text declarations discovered in external entities. The way to distinguish is that the version parameter will be NULL for text declarations. The encoding parameter may be NULL for an XML declaration. The standalone argument will contain -1, 0, or 1 indicating respectively that there was no standalone parameter in the declaration, that it was given as no, or that it was given as yes.

+ +

+XML_SetStartDoctypeDeclHandler(XML_Parser p,
+			       XML_StartDoctypeDeclHandler start);
+

+typedef void
+(*XML_StartDoctypeDeclHandler)(void           *userData,
+                               const XML_Char *doctypeName,
+                               const XML_Char *sysid,
+                               const XML_Char *pubid,
+                               int            has_internal_subset);
+

Set a handler that is called at the start of a DOCTYPE declaration, before any external or internal subset is parsed. Both sysid and pubid may be NULL. The has_internal_subset argument will be non-zero if the DOCTYPE declaration has an internal subset.

+ +

+XML_SetEndDoctypeDeclHandler(XML_Parser p,
+			     XML_EndDoctypeDeclHandler end);
+

+typedef void
+(*XML_EndDoctypeDeclHandler)(void *userData);
+

Set a handler that is called at the end of a DOCTYPE declaration, after parsing any external subset.

+ +

+XML_SetDoctypeDeclHandler(XML_Parser p,
+			  XML_StartDoctypeDeclHandler start,
+			  XML_EndDoctypeDeclHandler end);
+

Set both doctype handlers with one call.

+ +

+XML_SetElementDeclHandler(XML_Parser p,
+			  XML_ElementDeclHandler eldecl);
+

+typedef void
+(*XML_ElementDeclHandler)(void *userData,
+                          const XML_Char *name,
+                          XML_Content *model);
+

+enum XML_Content_Type {
+  XML_CTYPE_EMPTY = 1,
+  XML_CTYPE_ANY,
+  XML_CTYPE_MIXED,
+  XML_CTYPE_NAME,
+  XML_CTYPE_CHOICE,
+  XML_CTYPE_SEQ
+};
+
+enum XML_Content_Quant {
+  XML_CQUANT_NONE,
+  XML_CQUANT_OPT,
+  XML_CQUANT_REP,
+  XML_CQUANT_PLUS
+};
+
+typedef struct XML_cp XML_Content;
+
+struct XML_cp {
+  enum XML_Content_Type		type;
+  enum XML_Content_Quant	quant;
+  const XML_Char *		name;
+  unsigned int			numchildren;
+  XML_Content *			children;
+};
+

Sets a handler for element declarations in a DTD. The handler gets called with the name of the element in the declaration and a pointer to a structure that contains the element model. It is the application's responsibility to free this data structure.

+ +

The model argument is the root of a tree of XML_Content nodes. If type equals XML_CTYPE_EMPTY or XML_CTYPE_ANY, then quant will be XML_CQUANT_NONE, and the other fields will be zero or NULL. If type is XML_CTYPE_MIXED, then quant will be XML_CQUANT_NONE or XML_CQUANT_REP and numchildren will contain the number of elements that are allowed to be mixed in and children points to an array of XML_Content structures that will all have type XML_CTYPE_NAME with no quantification. Only the root node can be type XML_CTYPE_EMPTY, XML_CTYPE_ANY, or XML_CTYPE_MIXED.

+ +

For type XML_CTYPE_NAME, the name field points to the name and the numchildren and children fields will be zero and NULL. The quant field will indicate any quantifiers placed on the name.

+ +

Types XML_CTYPE_CHOICE and XML_CTYPE_SEQ indicate a choice or sequence respectively. The numchildren field indicates how many nodes are in the choice or sequence and children points to the nodes.
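The tree layout described above lends itself to a short recursive walk. The sketch below counts the XML_CTYPE_NAME nodes in a model; the function name count_model_names is hypothetical. To keep the example compilable on its own it repeats the declarations from the listing above and assumes XML_Char is plain char; real code would include expat.h instead of these local copies.

```c
#include <stddef.h>

/* Local copies of the declarations shown in the listing above, so that
   the sketch stands alone.  XML_Char as char is an assumption. */
typedef char XML_Char;

enum XML_Content_Type {
  XML_CTYPE_EMPTY = 1, XML_CTYPE_ANY, XML_CTYPE_MIXED,
  XML_CTYPE_NAME, XML_CTYPE_CHOICE, XML_CTYPE_SEQ
};

enum XML_Content_Quant {
  XML_CQUANT_NONE, XML_CQUANT_OPT, XML_CQUANT_REP, XML_CQUANT_PLUS
};

typedef struct XML_cp XML_Content;

struct XML_cp {
  enum XML_Content_Type  type;
  enum XML_Content_Quant quant;
  const XML_Char        *name;
  unsigned int           numchildren;
  XML_Content           *children;
};

/* Count the XML_CTYPE_NAME nodes in the model rooted at node.
   children is treated as an array of numchildren nodes. */
unsigned int
count_model_names(const XML_Content *node)
{
  unsigned int i, total;

  if (node == NULL)
    return 0;

  total = (node->type == XML_CTYPE_NAME) ? 1 : 0;
  for (i = 0; i < node->numchildren; i++)
    total += count_model_names(&node->children[i]);
  return total;
}
```

For a declaration such as (a, (b | c)*), the root would be an XML_CTYPE_SEQ node with two children, and the walk visits three name nodes.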

+ +

+XML_SetAttlistDeclHandler(XML_Parser p,
+                          XML_AttlistDeclHandler attdecl);
+

+typedef void
+(*XML_AttlistDeclHandler) (void           *userData,
+                           const XML_Char *elname,
+                           const XML_Char *attname,
+                           const XML_Char *att_type,
+                           const XML_Char *dflt,
+                           int            isrequired);
+

Set a handler for attlist declarations in the DTD. This handler is called for each attribute, so a single attlist declaration with multiple attributes declared will generate multiple calls to this handler. The elname parameter returns the name of the element for which the attribute is being declared. The attribute name is in the attname parameter. The attribute type is in the att_type parameter. It is the string representing the type in the declaration with whitespace removed.

+ +

The dflt parameter holds the default value. It will be NULL in the case of "#IMPLIED" or "#REQUIRED" attributes. You can distinguish these two cases by checking the isrequired parameter, which will be true in the case of "#REQUIRED" attributes. Attributes which are "#FIXED" will also have a true isrequired, but they will have the non-NULL fixed value in the dflt parameter.
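The four combinations of dflt and isrequired can be folded into one small classifier. This is only a sketch: the enum attr_default_kind and the function classify_attr_default are hypothetical names, and XML_Char is assumed to be plain char.

```c
#include <stddef.h>

/* Hypothetical labels for the four kinds of attribute default. */
enum attr_default_kind {
  ATTR_IMPLIED,    /* dflt == NULL, isrequired false */
  ATTR_REQUIRED,   /* dflt == NULL, isrequired true  */
  ATTR_FIXED,      /* dflt != NULL, isrequired true  */
  ATTR_DEFAULTED   /* dflt != NULL, isrequired false */
};

/* Classify an attribute from the dflt and isrequired arguments that an
   attlist declaration handler receives. */
enum attr_default_kind
classify_attr_default(const char *dflt, int isrequired)
{
  if (dflt == NULL)
    return isrequired ? ATTR_REQUIRED : ATTR_IMPLIED;
  return isrequired ? ATTR_FIXED : ATTR_DEFAULTED;
}
```

An attlist declaration handler could call this once per invocation and switch on the result instead of testing the two arguments separately.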

+ +

+XML_SetEntityDeclHandler(XML_Parser p,
+			 XML_EntityDeclHandler handler);
+

+typedef void
+(*XML_EntityDeclHandler) (void           *userData,
+                          const XML_Char *entityName,
+                          int            is_parameter_entity,
+                          const XML_Char *value,
+                          int            value_length,
+                          const XML_Char *base,
+                          const XML_Char *systemId,
+                          const XML_Char *publicId,
+                          const XML_Char *notationName);
+

Sets a handler that will be called for all entity declarations. The is_parameter_entity argument will be non-zero in the case of parameter entities and zero otherwise.

+ +

For internal entities (<!ENTITY foo "bar">), value will be non-NULL and systemId, publicId, and notationName will all be NULL. The value string is not NUL-terminated; the length is provided in the value_length parameter. Do not use value_length to test for internal entities, since it is legal to have zero-length values. Instead check whether or not value is NULL.

The notationName argument will have a non-NULL value only for unparsed entity declarations.
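Because value is not NUL-terminated, an application that wants to keep it as an ordinary C string has to copy it using value_length. The helper below is a sketch under the assumption that XML_Char is plain char; copy_entity_value is a hypothetical name, not part of Expat.

```c
#include <stdlib.h>
#include <string.h>

/* Return a malloc'ed, NUL-terminated copy of an internal entity's value.
   Returns NULL if value is NULL (not an internal entity), if the length
   is negative, or if memory runs out.  The caller frees the result. */
char *
copy_entity_value(const char *value, int value_length)
{
  char *copy;

  if (value == NULL || value_length < 0)
    return NULL;

  copy = malloc((size_t)value_length + 1);
  if (copy == NULL)
    return NULL;

  memcpy(copy, value, (size_t)value_length);
  copy[value_length] = '\0';
  return copy;
}
```

Note that a zero-length value still yields a valid empty string, which matches the advice to test value rather than value_length.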

+ +

+XML_SetUnparsedEntityDeclHandler(XML_Parser p,
+                                 XML_UnparsedEntityDeclHandler h)
+

+typedef void
+(*XML_UnparsedEntityDeclHandler)(void *userData,
+                                 const XML_Char *entityName,
+                                 const XML_Char *base,
+                                 const XML_Char *systemId,
+                                 const XML_Char *publicId,
+                                 const XML_Char *notationName);
+

Set a handler that receives declarations of unparsed entities. These are entity declarations that have a notation (NDATA) field:

+ +

+<!ENTITY logo SYSTEM "images/logo.gif" NDATA gif>
+

This handler is obsolete and is provided for backwards compatibility. Use XML_SetEntityDeclHandler instead.

+ +

+XML_SetNotationDeclHandler(XML_Parser p,
+                           XML_NotationDeclHandler h)
+

+typedef void
+(*XML_NotationDeclHandler)(void *userData,
+                           const XML_Char *notationName,
+                           const XML_Char *base,
+                           const XML_Char *systemId,
+                           const XML_Char *publicId);
+

Set a handler that receives notation declarations.

+ +

+XML_SetNotStandaloneHandler(XML_Parser p,
+                            XML_NotStandaloneHandler h)
+

+typedef int 
+(*XML_NotStandaloneHandler)(void *userData);
+

Set a handler that is called if the document is not "standalone". This happens when there is an external subset or a reference to a parameter entity, but the document does not have standalone set to "yes" in an XML declaration. If this handler returns 0, then the parser will report an XML_ERROR_NOT_STANDALONE error.

+ +

Parse position and error reporting functions

+ +

These are the functions you'll want to call when the parse functions return 0 (i.e. a parse error has occurred), although the position reporting functions are useful outside of errors. The position reported is the byte position (in the original document or entity encoding) of the first of the sequence of characters that generated the current event (or the error that caused the parse functions to return 0).

+ +

The position reporting functions are accurate only outside of the DTD. In other words, they usually return bogus information when called from within a DTD declaration handler.

+ +

+enum XML_Error
+XML_GetErrorCode(XML_Parser p);
+

Return what type of error has occurred.

+ +

+const XML_LChar *
+XML_ErrorString(int code);
+

Return a string describing the error corresponding to code. The code should be one of the enums that can be returned from XML_GetErrorCode.

+ +

+long
+XML_GetCurrentByteIndex(XML_Parser p);
+

Return the byte offset of the position.

+ +

+int
+XML_GetCurrentLineNumber(XML_Parser p);
+

Return the line number of the position.

+ +

+int
+XML_GetCurrentColumnNumber(XML_Parser p);
+

Return the offset, from the beginning of the current line, of the position.

+ +

+int
+XML_GetCurrentByteCount(XML_Parser p);
+

Return the number of bytes in the current event. Returns 0 if the event is inside a reference to an internal entity and for the end-tag event for empty element tags (the latter can be used to distinguish empty-element tags from empty elements using separate start and end tags).

+ +

+const char *
+XML_GetInputContext(XML_Parser p,
+                    int *offset,
+                    int *size);
+

+ +

Returns the parser's input buffer, sets the integer pointed at by offset to the offset within this buffer of the current parse position, and sets the integer pointed at by size to the size of the returned buffer.

+ +

This should only be called from within a handler during an active parse and the returned buffer should only be referred to from within the handler that made the call. This input buffer contains the untranslated bytes of the input.

+ +

Only a limited amount of context is kept, so if the event triggering a call spans over a very large amount of input, the actual parse position may be before the beginning of the buffer.
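When showing context around an error, an application typically wants a few bytes on either side of the parse position, clamped to the buffer. The arithmetic can be sketched as below. The function context_window and its radius parameter are hypothetical, and the sketch assumes that an offset outside the range 0 to size means no usable context is available.

```c
#include <assert.h>
#include <stddef.h>

/* Given the offset and size reported by XML_GetInputContext, compute the
   window [*start, *start + *len) that covers up to radius bytes on each
   side of offset without leaving the buffer.  Returns 1 on success and 0
   if there is no usable window (empty buffer or offset out of range). */
int
context_window(int offset, int size, int radius, int *start, int *len)
{
  int lo, hi;

  assert(start != NULL && len != NULL);

  if (size <= 0 || offset < 0 || offset > size || radius < 0)
    return 0;

  lo = (offset > radius) ? offset - radius : 0;
  hi = (size - offset > radius) ? offset + radius : size;

  *start = lo;
  *len = hi - lo;
  return 1;
}
```

Inside a handler, the bytes to display would then be the len bytes beginning at the returned buffer plus start. They are raw input bytes, so they should be copied out before the handler returns.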

+ +

Miscellaneous functions

+ +

The functions in this section either obtain state information from the parser or can be used to dynamically set parser options.

+ +

+void
+XML_SetUserData(XML_Parser p,
+                void *userData);
+

This sets the user data pointer that gets passed to handlers. It overwrites any previous value for this pointer. Note that the application is responsible for freeing the memory associated with userData when it is finished with the parser. So if you call this when there's already a pointer there, and you haven't freed the memory associated with it, then you've probably just leaked memory.

+ +

+void *
+XML_GetUserData(XML_Parser p);
+

This returns the user data pointer that gets passed to handlers. It is actually implemented as a macro.

+ +

+void
+XML_UseParserAsHandlerArg(XML_Parser p);
+

After this is called, handlers receive the parser in the userData argument. The userData information can still be obtained using the XML_GetUserData function.

+ +

+int
+XML_SetBase(XML_Parser p,
+            const XML_Char *base);
+

Set the base to be used for resolving relative URIs in system identifiers. The return value is 0 if there's no memory to store base, otherwise it's non-zero.

+ +

+const XML_Char *
+XML_GetBase(XML_Parser p);
+

Return the base for resolving relative URIs.

+ +

+int
+XML_GetSpecifiedAttributeCount(XML_Parser p);
+

When attributes are reported to the start handler in the atts vector, attributes that were explicitly set in the element occur before any attributes that receive their value from default information in an ATTLIST declaration. This function returns the number of attributes that were explicitly set times two, thus giving the offset in the atts array passed to the start tag handler of the first attribute set due to defaults. It supplies information for the last call to a start handler. If called inside a start handler, then that means the current call.
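The offset arithmetic can be illustrated with a helper that counts how many attributes came from ATTLIST defaults. This is a sketch: count_defaulted_attrs is a hypothetical name, XML_Char is assumed to be plain char, and atts is assumed to be the usual array of name/value pairs terminated by a NULL name, as passed to a start element handler.

```c
#include <assert.h>
#include <stddef.h>

/* atts holds name/value pairs and ends with a NULL name.  specified is
   the value returned by XML_GetSpecifiedAttributeCount, i.e. twice the
   number of explicitly set attributes.  Returns the number of attributes
   supplied by defaults, or 0 if specified is out of range. */
int
count_defaulted_attrs(const char **atts, int specified)
{
  int total = 0;

  assert(atts != NULL);

  while (atts[total] != NULL)   /* step over one name/value pair */
    total += 2;

  if (specified < 0 || specified > total)
    return 0;
  return (total - specified) / 2;
}
```

Equivalently, a start handler can loop from index specified in steps of two to visit only the defaulted attributes.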

+ +

+int
+XML_GetIdAttributeIndex(XML_Parser p);
+

Returns the index of the ID attribute passed in the atts array in the last call to XML_StartElementHandler, or -1 if there is no ID attribute. If called inside a start handler, then that means the current call.

+ +

+int
+XML_SetEncoding(XML_Parser p,
+                const XML_Char *encoding);
+

Set the encoding to be used by the parser. It is equivalent to passing a non-null encoding argument to the parser creation functions. It must not be called after XML_Parse or XML_ParseBuffer has been called on the given parser.

+ +

+int
+XML_SetParamEntityParsing(XML_Parser p,
+                          enum XML_ParamEntityParsing code);
+

This enables parsing of parameter entities, including the external parameter entity that is the external DTD subset, according to code. The choices for code are:

XML_PARAM_ENTITY_PARSING_NEVER
XML_PARAM_ENTITY_PARSING_UNLESS_STANDALONE
XML_PARAM_ENTITY_PARSING_ALWAYS

+ +

+enum XML_Error
+XML_UseForeignDTD(XML_Parser parser, XML_Bool useDTD);
+

This function allows an application to provide an external subset for the document type declaration for documents which do not specify an external subset of their own. For documents which specify an external subset in their DOCTYPE declaration, the application-provided subset will be ignored. If the document does not contain a DOCTYPE declaration at all and useDTD is true, the application-provided subset will be parsed, but the startDoctypeDeclHandler and endDoctypeDeclHandler functions, if set, will not be called. The setting of parameter entity parsing, controlled using XML_SetParamEntityParsing, will be honored.

+ +

The application-provided external subset is read by calling the external entity reference handler set via XML_SetExternalEntityRefHandler with both publicId and systemId set to NULL.

+ +

If this function is called after parsing has begun, it returns XML_ERROR_CANT_CHANGE_FEATURE_ONCE_PARSING and ignores useDTD. If called when Expat has been compiled without DTD support, it returns XML_ERROR_FEATURE_REQUIRES_XML_DTD. Otherwise, it returns XML_ERROR_NONE.

+ +

+void
+XML_SetReturnNSTriplet(XML_Parser parser,
+                       int        do_nst);
+

This function only has an effect when using a parser created with XML_ParserCreateNS, i.e. when namespace processing is in effect. The do_nst argument sets whether or not prefixes are returned with names qualified with a namespace prefix. If this function is called with do_nst non-zero, then afterwards namespace qualified names (that is, qualified with a prefix as opposed to belonging to a default namespace) are returned as a triplet with the three parts separated by the namespace separator specified when the parser was created. The order of returned parts is URI, local name, and prefix.

If do_nst is zero, then namespaces are reported in the default manner, URI then local_name separated by the namespace separator.
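A name reported this way can be taken apart by splitting at the separator. The sketch below is an illustration only: split_ns_name is a hypothetical helper, XML_Char is assumed to be plain char, and the separator is assumed not to occur inside the URI itself. It splits in place, so a handler would first copy the const name it receives into a writable buffer.

```c
#include <assert.h>
#include <string.h>

/* Split name in place at sep.  On return *local always points at the
   local name; *uri and *prefix are NULL when that part is absent.
   Returns the number of parts found: 1 (no namespace), 2 (URI and local
   name) or 3 (URI, local name and prefix). */
int
split_ns_name(char *name, char sep, char **uri, char **local, char **prefix)
{
  char *first, *second;

  assert(name != NULL && uri != NULL && local != NULL && prefix != NULL);

  *uri = NULL;
  *prefix = NULL;
  *local = name;

  first = strchr(name, sep);
  if (first == NULL)
    return 1;                   /* plain name, no namespace */

  *first = '\0';
  *uri = name;
  *local = first + 1;

  second = strchr(*local, sep);
  if (second == NULL)
    return 2;                   /* default manner: URI then local name */

  *second = '\0';
  *prefix = second + 1;
  return 3;                     /* triplet: URI, local name, prefix */
}
```

Choosing a separator character that cannot appear in a URI or a name when creating the parser keeps this kind of splitting unambiguous.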

+ +

+void
+XML_DefaultCurrent(XML_Parser parser);
+

This can be called within a handler for a start element, end element, processing instruction or character data. It causes the corresponding markup to be passed to the default handler set by XML_SetDefaultHandler or XML_SetDefaultHandlerExpand. It does nothing if there is not a default handler.

+ +

+XML_LChar *
+XML_ExpatVersion();
+

Return the library version as a string (e.g. "expat_1.95.1").

+ +

+struct XML_Expat_Version
+XML_ExpatVersionInfo();
+

+typedef struct {
+  int major;
+  int minor;
+  int micro;
+} XML_Expat_Version;
+

Return the library version information as a structure. Some macros are also defined that support compile-time tests of the library version:

XML_MAJOR_VERSION
XML_MINOR_VERSION
XML_MICRO_VERSION

Testing these constants is currently the best way to determine if particular parts of the Expat API are available.
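The macros cover the compile-time case. For a runtime check against the structure returned by XML_ExpatVersionInfo, a comparison along the following lines may be useful. This is a sketch: expat_version_at_least is a hypothetical helper, and the struct is repeated locally (copied from the listing above) only so that the example compiles without expat.h.

```c
/* Local copy of the structure shown above; real code would get it
   from expat.h. */
typedef struct {
  int major;
  int minor;
  int micro;
} XML_Expat_Version;

/* Return non-zero if version v is at least major.minor.micro,
   comparing the three fields in order of significance. */
int
expat_version_at_least(XML_Expat_Version v, int major, int minor, int micro)
{
  if (v.major != major)
    return v.major > major;
  if (v.minor != minor)
    return v.minor > minor;
  return v.micro >= micro;
}
```

The same ordering applies to a preprocessor test on XML_MAJOR_VERSION, XML_MINOR_VERSION and XML_MICRO_VERSION when the decision has to be made at compile time.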

+ +

+const XML_Feature *
+XML_GetFeatureList();
+

+enum XML_FeatureEnum {
+  XML_FEATURE_END = 0,
+  XML_FEATURE_UNICODE,
+  XML_FEATURE_UNICODE_WCHAR_T,
+  XML_FEATURE_DTD,
+  XML_FEATURE_CONTEXT_BYTES,
+  XML_FEATURE_MIN_SIZE,
+  XML_FEATURE_SIZEOF_XML_CHAR,
+  XML_FEATURE_SIZEOF_XML_LCHAR
+};
+
+typedef struct {
+  enum XML_FeatureEnum  feature;
+  XML_LChar            *name;
+  long int              value;
+} XML_Feature;
+

Returns a list of "feature" records, providing details on how Expat was configured at compile time. Most applications should not need to worry about this, but this information is otherwise not available from Expat. This function allows code that does need to check these features to do so at runtime.

+ +

The return value is an array of XML_Feature, terminated by a record with a feature of XML_FEATURE_END and name of NULL, identifying the feature-test macros Expat was compiled with. Since an application that requires this kind of information needs to determine the type of character the name points to, records for the XML_FEATURE_SIZEOF_XML_CHAR and XML_FEATURE_SIZEOF_XML_LCHAR will be located at the beginning of the list, followed by XML_FEATURE_UNICODE and XML_FEATURE_UNICODE_WCHAR_T, if they are present at all.

+ +

Some features have an associated value. If there isn't an associated value, the value field is set to 0. At this time, the following features have been defined to have values:

+ +

XML_FEATURE_SIZEOF_XML_CHAR: The number of bytes occupied by one XML_Char character.
XML_FEATURE_SIZEOF_XML_LCHAR: The number of bytes occupied by one XML_LChar character.
XML_FEATURE_CONTEXT_BYTES: The maximum number of characters of context which can be reported by XML_GetInputContext.
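Looking up one feature in the returned array amounts to a linear scan up to the XML_FEATURE_END record. The sketch below shows that scan. The function find_feature is hypothetical, the type declarations are repeated from the listing above so that the example compiles on its own, and XML_LChar is assumed to be plain char.

```c
#include <assert.h>
#include <stddef.h>

/* Local copies of the declarations shown above; real code would include
   expat.h instead.  XML_LChar as char is an assumption. */
typedef char XML_LChar;

enum XML_FeatureEnum {
  XML_FEATURE_END = 0,
  XML_FEATURE_UNICODE,
  XML_FEATURE_UNICODE_WCHAR_T,
  XML_FEATURE_DTD,
  XML_FEATURE_CONTEXT_BYTES,
  XML_FEATURE_MIN_SIZE,
  XML_FEATURE_SIZEOF_XML_CHAR,
  XML_FEATURE_SIZEOF_XML_LCHAR
};

typedef struct {
  enum XML_FeatureEnum  feature;
  XML_LChar            *name;
  long int              value;
} XML_Feature;

/* Scan a feature list such as the one XML_GetFeatureList returns.
   Returns 1 and stores the feature's value in *value (if value is not
   NULL) when wanted is present, otherwise returns 0. */
int
find_feature(const XML_Feature *list, enum XML_FeatureEnum wanted,
             long *value)
{
  assert(list != NULL);

  for (; list->feature != XML_FEATURE_END; list++) {
    if (list->feature == wanted) {
      if (value != NULL)
        *value = list->value;
      return 1;
    }
  }
  return 0;
}
```

A return of 0 for XML_FEATURE_DTD, for example, would suggest that the library was built without DTD support.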

+ +