diff -r 000000000000 -r 1918ee327afb tests/auto/qxmlstream/XML-Test-Suite/xmlconf/sun/cxml.html
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tests/auto/qxmlstream/XML-Test-Suite/xmlconf/sun/cxml.html	Mon Jan 11 14:00:40 2010 +0000
@@ -0,0 +1,155 @@
+
+XML Canonical Forms
+

XML Canonical Forms

DRAFT 1 +

As with many sorts of structured information, an XML document carries +several categories of information that may be deemed "important" for +some task. Each canonical form is a standard way to represent +one such category of information. For testing XML, and potentially +for other purposes, three XML Canonical Forms have +been defined as of this writing:

First XML Canonical Form, defined by + James Clark, is also called Canonical XML. + +
Second XML Canonical Form, defined + by Sun, supports testing a larger subset of the XML 1.0 + processor requirements by exposing notation declarations. + +
Third XML Canonical Form, defined + by Sun, extends the second form to reflect information + which validating XML 1.0 processors are required to report. + +

+ +

For a document already in a given canonical form, recanonicalizing +to that same form will change nothing. Canonicalizing second or +third forms to the first canonical form discards all declarations. +Canonicalizing second or third forms to the other form has no effect. + +

The author is pleased to acknowledge help from +James Clark in defining the additional canonical forms. + + + +

First XML Canonical Form

+ + +

This description has been extracted from the version at + +http://www.jclark.com/xml/canonxml.html. + +

+Every well-formed XML document has a unique structurally equivalent +canonical XML document. Two structurally equivalent XML +documents have a byte-for-byte identical canonical XML document. +Canonicalizing an XML document requires only information that an XML +processor is required to make available to an application. +

+A canonical XML document conforms to the following grammar: +

+CanonXML    ::= Pi* element Pi*
+element     ::= Stag (Datachar | Pi | element)* Etag
+Stag        ::= '<'  Name Atts '>'
+Etag        ::= '</' Name '>'
+Pi          ::= '<?' Name ' ' (((Char - S) Char*)? - (Char* '?>' Char*)) '?>'
+Atts        ::= (' ' Name '=' '"' Datachar* '"')*
+Datachar    ::= '&amp;' | '&lt;' | '&gt;' | '&quot;'
+                 | '&#9;'| '&#10;'| '&#13;'
+                 | (Char - ('&' | '<' | '>' | '"' | #x9 | #xA | #xD))
+Name        ::= (see XML spec)
+Char        ::= (see XML spec)
+S           ::= (see XML spec)
+
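The Datachar production can be read as an escaping rule that applies to both character data and attribute values. A minimal Python sketch follows; the name `escape_datachar` is illustrative and not part of the specification.

```python
# Escapes required by the Datachar production; any other character is
# assumed to be a legal XML Char and is passed through unchanged.
_DATACHAR_ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "\t": "&#9;",
    "\n": "&#10;",
    "\r": "&#13;",
}


def escape_datachar(text: str) -> str:
    """Return text rewritten as a sequence of Datachar tokens."""
    return "".join(_DATACHAR_ESCAPES.get(ch, ch) for ch in text)
```

For example, `escape_datachar('a<b\n')` yields `a&lt;b&#10;`.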

+Attributes are in lexicographical order (in Unicode bit order). +

+A canonical XML document is encoded in UTF-8. +

+Ignorable white space is considered significant and is treated equivalently +to data. + + + +
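The rules above can be combined into a small canonicalizer. The following is a sketch only, built on Python's standard `xml.sax` module rather than on any tool named in this document. The names `FirstFormWriter` and `canonicalize_first_form` are hypothetical, and Python's code-point string ordering is assumed to be what "Unicode bit order" means. It relies on the parser for well-formedness checking and does not itself verify the Pi production.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Translation table for the Datachar escapes.
_ESCAPES = str.maketrans({
    "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;",
    "\t": "&#9;", "\n": "&#10;", "\r": "&#13;",
})


class FirstFormWriter(ContentHandler):
    """Collects first-form canonical XML from SAX events."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def startElement(self, name, attrs):
        self.parts.append("<" + name)
        # Attributes are emitted sorted by name (code-point order assumed).
        for key in sorted(attrs.getNames()):
            value = attrs.getValue(key).translate(_ESCAPES)
            self.parts.append(' %s="%s"' % (key, value))
        self.parts.append(">")

    def endElement(self, name):
        # Empty elements are always written as a start tag plus an end tag.
        self.parts.append("</" + name + ">")

    def characters(self, content):
        self.parts.append(content.translate(_ESCAPES))

    # Ignorable white space is significant in this form: treat it as data.
    ignorableWhitespace = characters

    def processingInstruction(self, target, data):
        self.parts.append("<?%s %s?>" % (target, data))


def canonicalize_first_form(xml_bytes: bytes) -> bytes:
    """Parse a document and return its first canonical form as UTF-8."""
    handler = FirstFormWriter()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.parts).encode("utf-8")
```

Running the function on its own output is a quick check that recanonicalizing a document already in this form changes nothing.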

Second XML Canonical Form

+ +

This form has been modified to ensure that literals are surrounded by single quotes. +

This canonical form is identical to the first form, with +one significant addition. All XML processors are required to +report the name and external identifiers of notations that +are declared and referred to in an XML document (section 4.7); +those reports are reflected in declarations in this form, +presented in lexicographic order. + +

Note that all public identifiers must be normalized before being +presented to applications (section 4.2.2). + +

System identifiers are normalized on output to be relative +to the input document, if that is possible, with the shortest +such relative URI. All other URIs must be absolute. Any +hash mark and fragment ID, if erroneously present on input, are +removed. Any non-ASCII characters in the URI must be escaped +as specified in the XML specification (section 4.2.2). + +
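One possible reading of these normalization rules is sketched below in Python. The function names are hypothetical, the input document's URI is assumed to be absolute, and `posixpath.relpath` is assumed to produce an acceptably short relative path; none of this is prescribed by the text above.

```python
import posixpath
from urllib.parse import urldefrag, urljoin, urlsplit


def escape_non_ascii(uri: str) -> str:
    """Replace each non-ASCII character with %HH escapes of its UTF-8 bytes."""
    return "".join(
        ch if ord(ch) < 128
        else "".join("%%%02X" % byte for byte in ch.encode("utf-8"))
        for ch in uri
    )


def normalize_system_id(system_id: str, document_uri: str) -> str:
    """Normalize a system identifier relative to the input document's URI."""
    # Resolve against the input document, then drop any '#fragment'.
    absolute, _fragment = urldefrag(urljoin(document_uri, system_id))
    target, base = urlsplit(absolute), urlsplit(document_uri)
    same_origin = (target.scheme, target.netloc) == (base.scheme, base.netloc)
    if (same_origin and not target.query
            and target.path.startswith("/") and base.path.startswith("/")):
        # Express the path relative to the directory holding the document.
        relative = posixpath.relpath(target.path, posixpath.dirname(base.path))
        return escape_non_ascii(relative)
    # Otherwise fall back to the absolute URI.
    return escape_non_ascii(absolute)
```

Identifiers on another host, or carrying a query string, are left absolute in this sketch because no relative form is attempted for them.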

+CanonXML2    ::= DTD2? CanonXML
+DTD2         ::= '<!DOCTYPE ' Name ' [' #xA Notations? ']>' #xA
+Notations    ::= ( '<!NOTATION ' Name ' '
+			(('PUBLIC ' PubidLiteral ' ' SystemLiteral)
+			|('PUBLIC ' PubidLiteral)
+			|('SYSTEM ' SystemLiteral))
+			'>' #xA )*
+PubidLiteral ::= "'" (PubidChar - "'")* "'"
+SystemLiteral ::= "'" [^']* "'"
+
+
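The DTD2 production can be rendered mechanically from a list of reported notations. The Python sketch below uses a hypothetical `Notation` record and `render_dtd2` function. It assumes a single space separates the notation name from its external identifier, as XML well-formedness requires, and that code-point ordering of names satisfies "lexicographic order". Whether to omit the DOCTYPE when nothing is declared is left to the caller.

```python
from typing import Iterable, NamedTuple, Optional


class Notation(NamedTuple):
    """A declared notation (hypothetical record type for this sketch)."""
    name: str
    public_id: Optional[str]
    system_id: Optional[str]


def _literal(value: str) -> str:
    # Literals are always single-quoted, so the value cannot contain a quote.
    if "'" in value:
        raise ValueError("literal contains a single quote: %r" % value)
    return "'" + value + "'"


def render_dtd2(root_name: str, notations: Iterable[Notation]) -> str:
    """Render the DOCTYPE block of the second canonical form."""
    lines = ["<!DOCTYPE %s [\n" % root_name]
    for n in sorted(notations, key=lambda item: item.name):
        if n.public_id is not None and n.system_id is not None:
            ext = "PUBLIC %s %s" % (_literal(n.public_id), _literal(n.system_id))
        elif n.public_id is not None:
            ext = "PUBLIC %s" % _literal(n.public_id)
        elif n.system_id is not None:
            ext = "SYSTEM %s" % _literal(n.system_id)
        else:
            raise ValueError("notation %s has no external identifier" % n.name)
        lines.append("<!NOTATION %s %s>\n" % (n.name, ext))
    lines.append("]>\n")
    return "".join(lines)
```

The result is meant to be prepended to the first-form output for the same document.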

+ +

The requirement of this canonical form differs slightly from that +of the XML specification itself in that all declared notations +must be listed, not just those which were referred to. +Should that change? SAX supports it easily. + + + +

Third XML Canonical Form

+ +

This canonical form is identical to the second form, with +two significant exceptions reflecting requirements placed on +validating XML processors:

They are required to report "white space appearing in + element content" (section 2.10). Ignorable whitespace is + not represented in this canonical form. + +
They must report the external identifiers and notation name + for unparsed entities appearing as attribute values (section 4.4.6). + Such entities are declared in this canonical form, in lexicographic + order. + +

+ +

This builds on the grammar productions included above. + +

+CanonXML3    ::= DTD3? CanonXML
+DTD3         ::= '<!DOCTYPE ' Name ' [' #xA Notations? Unparsed? ']>' #xA
+Unparsed    ::= ( '<!ENTITY ' Name ' '
+			(('PUBLIC ' PubidLiteral ' ' SystemLiteral)
+			|('SYSTEM ' SystemLiteral))
+			' NDATA ' Name
+			'>' #xA )*
+
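The Unparsed production can be rendered in the same spirit. In this Python sketch the `UnparsedEntity` record and `render_unparsed` function are hypothetical. White space is assumed after the entity name and before `NDATA`, since XML 1.0 requires it for the declaration to be well-formed. Names are sorted by code point.

```python
from typing import Iterable, NamedTuple, Optional


class UnparsedEntity(NamedTuple):
    """A declared unparsed entity (hypothetical record type for this sketch)."""
    name: str
    public_id: Optional[str]
    system_id: str
    notation: str


def render_unparsed(entities: Iterable[UnparsedEntity]) -> str:
    """Render the unparsed entity declarations of the third canonical form."""
    out = []
    for e in sorted(entities, key=lambda item: item.name):
        # Single-quoted literals cannot themselves contain a single quote.
        if "'" in e.system_id or (e.public_id is not None and "'" in e.public_id):
            raise ValueError("identifier of %s contains a single quote" % e.name)
        if e.public_id is not None:
            ext = "PUBLIC '%s' '%s'" % (e.public_id, e.system_id)
        else:
            ext = "SYSTEM '%s'" % e.system_id
        out.append("<!ENTITY %s %s NDATA %s>\n" % (e.name, ext, e.notation))
    return "".join(out)
```

In a DTD3 block these lines would follow the notation declarations and precede the closing `]>`.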

+ +

The requirement of this canonical form differs slightly from that +of the XML specification itself in that all declared unparsed entities +must be listed, not just those which were referred to. +Should that change? SAX supports it easily. + +
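To illustrate the remark about SAX, the sketch below collects every declared notation and unparsed entity, whether or not it is referred to. It uses the SAX binding in Python's standard library (`xml.sax` with its default expat-based parser), which is an assumption of this example and not the SAX implementation the author had in mind. The names `DeclarationCollector` and `collect_declarations` are hypothetical.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, DTDHandler


class DeclarationCollector(DTDHandler):
    """Records notation and unparsed entity declarations as they are reported."""

    def __init__(self):
        self.notations = {}  # name -> (publicId, systemId)
        self.unparsed = {}   # name -> (publicId, systemId, notation name)

    def notationDecl(self, name, publicId, systemId):
        self.notations[name] = (publicId, systemId)

    def unparsedEntityDecl(self, name, publicId, systemId, ndata):
        self.unparsed[name] = (publicId, systemId, ndata)


def collect_declarations(xml_bytes: bytes) -> DeclarationCollector:
    """Parse a document and return the declarations found in its DTD."""
    collector = DeclarationCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(ContentHandler())  # document content is ignored here
    parser.setDTDHandler(collector)
    parser.parse(io.BytesIO(xml_bytes))
    return collector
```

The collected dictionaries hold identifiers as the parser reports them, so system identifiers would still need the normalization described for the second form.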

+xml-feedback@java.sun.com +

+ + +