tests/auto/qxmlstream/XML-Test-Suite/xmlconf/sun/cxml.html
changeset 0 1918ee327afb
-1:000000000000 0:1918ee327afb
       
     1 <HTML>
       
     2 <TITLE>XML Canonical Forms</TITLE>
       
     3 <BODY>
       
     4 <H1>XML Canonical Forms</H1>
       
     5 <P><FONT COLOR=RED><b><em>DRAFT 1</em></b></FONT>
       
     6 <P> As with many sorts of structured information, an XML document
       
     7 carries several categories of information that may be deemed
       
     8 "important" for some task.  Canonical forms are standard ways to
       
     9 represent such categories of information.  For testing XML, and potentially
       
    10 for other purposes, three <em>XML Canonical Forms</em> have
       
    11 been defined as of this writing:  <UL>
       
    12 
       
    13     <LI> <a href=#cxml1>First XML Canonical Form</a>, defined by
       
    14     James Clark, is also called <em>Canonical XML</em>.
       
    15 
       
    16     <LI> <a href=#cxml2>Second XML Canonical Form</a>, defined
       
    17     by Sun, supports testing a larger subset of the XML 1.0
       
    18     processor requirements by exposing notation declarations.
       
    19 
       
    20     <LI> <a href=#cxml3>Third XML Canonical Form</a>, defined
       
    21     by Sun, extends the second form to reflect information
       
    22     which validating XML 1.0 processors are required to report.
       
    23 
       
    24     </UL>
       
    25 
       
    26 <P> For a document already in a given canonical form, recanonicalizing
       
    27 to that same form will change nothing.  Canonicalizing second or
       
    28 third forms to the first canonical form discards all declarations.
       
    29 Canonicalizing second or third forms to the other form has no effect.
       
    30 
       
    31 <P> <em>The author is pleased to acknowledge help from
       
    32 James Clark in defining the additional canonical forms.</em>
       
    33 
       
    34 
       
    35 <A NAME=cxml1> 
       
    36 <H2>First XML Canonical Form</H2>
       
    37 </A>
       
    38 
       
    39 <P> <em>This description has been extracted from the version at
       
    40 <a href=http://www.jclark.com/xml/canonxml.html>
       
    41 http://www.jclark.com/xml/canonxml.html</a>.</em>
       
    42 
       
    43 <P>
       
    44 Every well-formed XML document has a unique structurally equivalent
       
    45 canonical XML document.  Two structurally equivalent XML
       
    46 documents have a byte-for-byte identical canonical XML document.
       
    47 Canonicalizing an XML document requires only information that an XML
       
    48 processor is required to make available to an application.
       
    49 <P>
       
    50 A canonical XML document conforms to the following grammar:
       
    51 <PRE>
       
    52 CanonXML    ::= Pi* element Pi*
       
    53 element     ::= Stag (Datachar | Pi | element)* Etag
       
    54 Stag        ::= '&lt;'  Name Atts '&gt;'
       
    55 Etag        ::= '&lt;/' Name '&gt;'
       
    56 Pi          ::= '&lt;?' Name ' ' (((Char - S) Char*)? - (Char* '?&gt;' Char*)) '?&gt;'
       
    57 Atts        ::= (' ' Name '=' '"' Datachar* '"')*
       
    58 Datachar    ::= '&amp;amp;' | '&amp;lt;' | '&amp;gt;' | '&amp;quot;'
       
    59                  | '&amp;#9;'| '&amp;#10;'| '&amp;#13;'
       
    60                  | (Char - ('&amp;' | '&lt;' | '&gt;' | '"' | #x9 | #xA | #xD))
       
    61 Name        ::= (see XML spec)
       
    62 Char        ::= (see XML spec)
       
    63 S           ::= (see XML spec)
       
    64 </PRE>
       
    65 <P>
       
    66 Attributes are in lexicographical order (in Unicode bit order).
       
    67 <P>
       
    68 A canonical XML document is encoded in UTF-8.
       
    69 <P>
       
    70 Ignorable white space is considered significant and is treated equivalently
       
    71 to data.
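<P> The escaping and attribute-ordering rules above can be sketched in
Python as follows.  This is a minimal illustration, not a reference
implementation: the helper names <code>escape_datachars</code> and
<code>start_tag</code> are hypothetical, and "Unicode bit order" is
assumed here to mean ordering by code point, which is what Python's
<code>sorted</code> does for strings.

```python
# Hypothetical helpers sketching the Datachar and Stag productions.
_ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "\t": "&#9;",
    "\n": "&#10;",
    "\r": "&#13;",
}


def escape_datachars(text):
    """Escape character data or an attribute value per Datachar."""
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


def start_tag(name, attrs):
    """Serialize a start tag; attributes are sorted by name (assumed code point order)."""
    parts = [f' {key}="{escape_datachars(attrs[key])}"' for key in sorted(attrs)]
    return "<" + name + "".join(parts) + ">"


def to_bytes(canonical_text):
    """A canonical XML document is encoded in UTF-8."""
    return canonical_text.encode("utf-8")
```

<P> For example, <code>start_tag("doc", {"b": "2", "a": "x>y"})</code>
yields a tag whose <code>a</code> attribute precedes <code>b</code> and
whose <code>&gt;</code> is written as an entity reference.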
       
    72 
       
    73 
       
    74 <A NAME=cxml2> 
       
    75 <H2>Second XML Canonical Form</H2>
       
    76 </A>
       
    77 <P><FONT COLOR=RED><b><em>Modified to ensure that literals are surrounded by single quotes.</em></b></FONT>
       
    78 <P> This canonical form is identical to the first form, with
       
    79 one significant addition.  All XML processors are required to
       
    80 report the name and external identifiers of notations that
       
    81 are declared and referred to in an XML document (section 4.7);
       
    82 those reports are reflected in declarations in this form,
       
    83 presented in lexicographic order.
       
    84 
       
    85 <P> Note that all public identifiers must be normalized before being
       
    86 presented to applications (section 4.2.2).
       
    87 
       
    88 <P> System identifiers are normalized on output to be relative
       
    89 to the input document, if that is possible, with the shortest
       
    90 such relative URI.  All other URIs must be absolute.  Any
       
    91 hash mark and fragment ID, if erroneously present on input, are
       
    92 removed.  Any non-ASCII characters in the URI must be escaped
       
    93 as specified in the XML specification (section 4.2.2).
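<P> One possible reading of the system identifier rules, sketched in
Python.  The function names are hypothetical.  The sketch assumes that
both URIs are already absolute, that "relative to the input document"
applies only when scheme and authority match, and that
<code>posixpath.relpath</code> is an acceptable stand-in for "the
shortest such relative URI"; it does not handle every edge case (query
strings and trailing slashes, for instance).

```python
import posixpath
from urllib.parse import urldefrag, urlsplit


def escape_non_ascii(uri):
    """Escape each non-ASCII character as %HH over its UTF-8 bytes."""
    out = []
    for ch in uri:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


def normalize_system_id(system_id, document_uri):
    """Drop any fragment, relativize against the document if possible, then escape."""
    uri, _fragment = urldefrag(system_id)
    target, base = urlsplit(uri), urlsplit(document_uri)
    same_origin = (target.scheme, target.netloc) == (base.scheme, base.netloc)
    if (same_origin and not target.query
            and target.path.startswith("/") and base.path.startswith("/")):
        base_dir = posixpath.dirname(base.path) or "/"
        uri = posixpath.relpath(target.path, start=base_dir)
    # Otherwise the URI is left absolute, as the text requires.
    return escape_non_ascii(uri)
```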
       
    94 
       
    95 <PRE>
       
    96 CanonXML2    ::= DTD2? CanonXML
       
    97 DTD2         ::= '&lt;!DOCTYPE ' Name ' [' #xA Notations? ']>' #xA
       
    98 Notations    ::= ( '&lt;!NOTATION ' Name ' '
       
    99 			(('PUBLIC ' PubidLiteral ' ' SystemLiteral)
       
   100 			|('PUBLIC ' PubidLiteral)
       
   101 			|('SYSTEM ' SystemLiteral))
       
   102 			'>' #xA )*
       
   103 PubidLiteral ::= "'" PubidChar* "'"
       
   104 SystemLiteral ::= "'" [^']* "'"
       
   105 
       
   106 </PRE>
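<P> As an illustration of the productions above, the following Python
sketch emits the <code>DTD2</code> prefix from a set of reported
notations.  The names <code>notation_decl</code> and <code>dtd2</code>
are hypothetical, a single space is assumed between the notation name
and its external identifier, and identifiers are assumed to be already
normalized and free of single quotes.

```python
def notation_decl(name, public_id=None, system_id=None):
    """Format one notation declaration with single-quoted literals."""
    if public_id is None and system_id is None:
        raise ValueError("a notation needs a public or system identifier")
    if public_id is not None and system_id is not None:
        external_id = f"PUBLIC '{public_id}' '{system_id}'"
    elif public_id is not None:
        external_id = f"PUBLIC '{public_id}'"
    else:
        external_id = f"SYSTEM '{system_id}'"
    return f"<!NOTATION {name} {external_id}>\n"


def dtd2(root_name, notations):
    """Build the DTD2 prefix; notations maps name -> (public_id, system_id)."""
    body = "".join(
        notation_decl(name, *notations[name]) for name in sorted(notations)
    )
    return f"<!DOCTYPE {root_name} [\n{body}]>\n"
```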
       
   107 
       
   108 <P> The requirement of this canonical form differs slightly from that
       
   109 of the XML specification itself in that all declared notations
       
   110 must be listed, not just those which were referred to.
       
   111 <em>Should that change?  SAX supports it easily.</em>
       
   112 
       
   113 
       
   114 <A NAME=cxml3> 
       
   115 <H2>Third XML Canonical Form</H2>
       
   116 </A>
       
   117 <P> This canonical form is identical to the second form, with
       
   118 two significant exceptions reflecting requirements placed on
       
   119 validating XML processors:<UL>
       
   120 
       
   121     <LI> They are required to report "white space appearing in
       
   122     element content" (section 2.10).  Ignorable white space is
       
   123     not represented in this canonical form.
       
   124 
       
   125     <LI> They must report the external identifiers and notation name
       
   126     for unparsed entities appearing as attribute values (section 4.4.6).
       
   127     Such entities are declared in this canonical form, in lexicographic
       
   128     order.
       
   129 
       
   130     </UL>
       
   131 
       
   132 <P> This builds on the grammar productions included above.
       
   133 
       
   134 <PRE>
       
   135 CanonXML3    ::= DTD3? CanonXML
       
   136 DTD3         ::= '&lt;!DOCTYPE ' Name ' [' #xA Notations? Unparsed? ']>' #xA
       
   137 Unparsed     ::= ( '&lt;!ENTITY ' Name ' '
       
   138 			(('PUBLIC ' PubidLiteral ' ' SystemLiteral)
       
   139 			|('SYSTEM ' SystemLiteral))
       
   140 			' NDATA ' Name
       
   141 			'>' #xA )*
       
   142 </PRE>
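<P> The unparsed entity declarations can be sketched the same way.
The function names below are hypothetical.  The sketch assumes white
space separates the entity name, the external identifier and the
<code>NDATA</code> keyword, since XML 1.0 requires it for the
declaration to be well formed, and that the resulting block is placed
after the notation declarations inside the <code>DOCTYPE</code>.

```python
def unparsed_entity_decl(name, notation, system_id, public_id=None):
    """Format one unparsed entity declaration with single-quoted literals."""
    if public_id is not None:
        external_id = f"PUBLIC '{public_id}' '{system_id}'"
    else:
        external_id = f"SYSTEM '{system_id}'"
    return f"<!ENTITY {name} {external_id} NDATA {notation}>\n"


def unparsed_block(entities):
    """entities maps name -> (notation, system_id[, public_id]); sorted by name."""
    return "".join(
        unparsed_entity_decl(name, *entities[name]) for name in sorted(entities)
    )
```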
       
   143 
       
   144 <P> The requirement of this canonical form differs slightly from that
       
   145 of the XML specification itself in that all declared unparsed entities
       
   146 must be listed, not just those which were referred to.
       
   147 <em>Should that change?  SAX supports it easily.</em>
       
   148 
       
   149 <P>
       
   150 <ADDRESS>
       
   151 <A HREF="mailto:xml-feedback@java.sun.com">xml-feedback@java.sun.com</A>
       
   152 </ADDRESS>
       
   153 
       
   154 </BODY>
       
   155 </HTML>