<HTML>
<TITLE>XML Canonical Forms</TITLE>
<BODY>
<H1>XML Canonical Forms</H1>
<P><FONT COLOR=RED><b><em>DRAFT 1</em></b></FONT>
<P> As with many sorts of structured information, an XML document
carries several categories of information, any of which may be
deemed "important" for some task. Canonical forms are standard ways
to represent such classes of information. For testing XML, and potentially
for other purposes, three <em>XML Canonical Forms</em> have
been defined as of this writing: <UL>

<LI> <a href=#cxml1>First XML Canonical Form</a>, defined by
James Clark, is also called <em>Canonical XML</em>.

<LI> <a href=#cxml2>Second XML Canonical Form</a>, defined
by Sun, supports testing a larger subset of the XML 1.0
processor requirements by exposing notation declarations.

<LI> <a href=#cxml3>Third XML Canonical Form</a>, defined
by Sun, extends the second form to reflect information
which validating XML 1.0 processors are required to report.

</UL>

<P> For a document already in a given canonical form, recanonicalizing
to that same form changes nothing. Canonicalizing a document in the
second or third form to the first canonical form discards all declarations.
Canonicalizing a document in the second form to the third, or in the
third form to the second, has no effect.

<P> <em>The author is pleased to acknowledge help from
James Clark in defining the additional canonical forms.</em>


<A NAME=cxml1>
<H2>First XML Canonical Form</H2>
</A>

<P> <em>This description has been extracted from the version at
<a href=http://www.jclark.com/xml/canonxml.html>
http://www.jclark.com/xml/canonxml.html</a>.</em>

<P>
Every well-formed XML document has a unique structurally equivalent
canonical XML document. Two structurally equivalent XML
documents have a byte-for-byte identical canonical XML document.
Canonicalizing an XML document requires only information that an XML
processor is required to make available to an application.
<P>
A canonical XML document conforms to the following grammar:
<PRE>
CanonXML    ::= Pi* element Pi*
element     ::= Stag (Datachar | Pi | element)* Etag
Stag        ::= '<' Name Atts '>'
Etag        ::= '</' Name '>'
Pi          ::= '<?' Name ' ' (((Char - S) Char*)? - (Char* '?>' Char*)) '?>'
Atts        ::= (' ' Name '=' '"' Datachar* '"')*
Datachar    ::= '&amp;' | '&lt;' | '&gt;' | '&quot;'
                | '&#9;' | '&#10;' | '&#13;'
                | (Char - ('&' | '<' | '>' | '"' | #x9 | #xA | #xD))
Name        ::= (see XML spec)
Char        ::= (see XML spec)
S           ::= (see XML spec)
</PRE>
<P>
Attributes are in lexicographical order (in Unicode bit order).
<P>
A canonical XML document is encoded in UTF-8.
<P>
Ignorable white space is considered significant and is treated equivalently
to data.
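<P> The rules above can be sketched in code. What follows is a minimal,
non-authoritative Python sketch, assuming a non-validating SAX parser
(Python's xml.sax). The names FirstFormWriter, escape_datachar and
canonicalize_first_form are illustrative only and are not part of any
canonical form definition. Attribute names are sorted by Unicode code
point, which is assumed here to be what "Unicode bit order" means.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Characters that the Datachar production requires to be written as references.
_ESCAPES = {
    "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;",
    "\t": "&#9;", "\n": "&#10;", "\r": "&#13;",
}


def escape_datachar(text):
    """Escape character data or an attribute value per the Datachar production."""
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


class FirstFormWriter(ContentHandler):
    """Collects first-form output from SAX events (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def startElement(self, name, attrs):
        self.parts.append("<" + name)
        # Attributes are written in sorted order, always with double quotes.
        for key in sorted(attrs.getNames()):
            value = escape_datachar(attrs.getValue(key))
            self.parts.append(' %s="%s"' % (key, value))
        self.parts.append(">")

    def endElement(self, name):
        # Empty elements are written as a start tag followed by an end tag.
        self.parts.append("</%s>" % name)

    def characters(self, content):
        self.parts.append(escape_datachar(content))

    def ignorableWhitespace(self, whitespace):
        # The first form treats ignorable white space like any other data.
        self.parts.append(escape_datachar(whitespace))

    def processingInstruction(self, target, data):
        self.parts.append("<?%s %s?>" % (target, data))


def canonicalize_first_form(xml_bytes):
    """Return the first canonical form of a document as UTF-8 bytes."""
    handler = FirstFormWriter()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.parts).encode("utf-8")
```

<P> Comments, the XML declaration and the document type declaration are
never passed to the content handler, so they do not appear in the output.
Feeding the result back into canonicalize_first_form should return
the same bytes.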


<A NAME=cxml2>
<H2>Second XML Canonical Form</H2>
</A>
<P><FONT COLOR=RED><b><em>Modified to ensure that literals are surrounded by single quotes.</em></b></FONT>
<P> This canonical form is identical to the first form, with
one significant addition. All XML processors are required to
report the name and external identifiers of notations that
are declared and referred to in an XML document (section 4.7);
those reports are reflected in declarations in this form,
presented in lexicographic order.

<P> Note that all public identifiers must be normalized before being
presented to applications (section 4.2.2).

<P> System identifiers are normalized on output to be relative
to the input document, if that is possible, using the shortest
such relative URI. All other URIs must be absolute. Any
hash mark and fragment ID erroneously present on input are
removed. Any non-ASCII characters in the URI must be escaped
as specified in the XML specification (section 4.2.2).
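<P> One possible reading of these system identifier rules is sketched
below in Python. This is an illustration under stated assumptions, not a
definitive algorithm: the function names are hypothetical, "relative to
the input document" is taken to mean same scheme and host with no query
part, and the relative path produced by posixpath.relpath is assumed to
be an acceptable "shortest" form.

```python
import posixpath
from urllib.parse import urldefrag, urljoin, urlsplit


def escape_non_ascii(uri):
    """Percent-escape each non-ASCII character as its UTF-8 bytes."""
    out = []
    for ch in uri:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend("%%%02X" % byte for byte in ch.encode("utf-8"))
    return "".join(out)


def normalize_system_id(system_id, document_uri):
    """Normalize a system identifier relative to the input document's URI."""
    # Resolve against the document, then drop any fragment identifier.
    absolute, _fragment = urldefrag(urljoin(document_uri, system_id))
    doc = urlsplit(document_uri)
    target = urlsplit(absolute)
    same_origin = (target.scheme, target.netloc) == (doc.scheme, doc.netloc)
    if same_origin and target.path and not target.query:
        # Assumption: a path relative to the document's directory is wanted.
        base_dir = posixpath.dirname(doc.path) or "/"
        result = posixpath.relpath(target.path, base_dir)
    else:
        # Anything that cannot be made relative stays absolute.
        result = absolute
    return escape_non_ascii(result)
```

<P> For example, with the document at http://example.org/docs/index.xml,
the identifier http://example.org/docs/img/logo.gif#frag would come out
as img/logo.gif under these assumptions.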

<PRE>
CanonXML2     ::= DTD2? CanonXML
DTD2          ::= '<!DOCTYPE ' Name ' [' #xA Notations? ']>' #xA
Notations     ::= ( '<!NOTATION ' Name ' '
                    (('PUBLIC ' PubidLiteral ' ' SystemLiteral)
                    |('PUBLIC ' PubidLiteral)
                    |('SYSTEM ' SystemLiteral))
                    '>' #xA )*
PubidLiteral  ::= "'" PubidChar* "'"
SystemLiteral ::= "'" [^']* "'"

</PRE>
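<P> As a rough illustration of the declaration syntax, the Python sketch
below writes a second-form document type declaration. The function names
and the input shape (a dictionary mapping each notation name to a pair of
public and system identifier, either of which may be None) are assumptions
made for this example. Identifiers containing an apostrophe are rejected,
since the literals are always single-quoted.

```python
def quote_literal(value):
    """Wrap an identifier in single quotes, as this form requires."""
    if "'" in value:
        raise ValueError("literal contains an apostrophe: %r" % value)
    return "'" + value + "'"


def normalize_public_id(public_id):
    """Collapse runs of white space and trim both ends (XML 1.0, 4.2.2)."""
    return " ".join(public_id.split())


def external_id(public_id, system_id):
    """Format the PUBLIC or SYSTEM part of a notation declaration."""
    if public_id is not None:
        text = "PUBLIC " + quote_literal(normalize_public_id(public_id))
        if system_id is not None:
            text += " " + quote_literal(system_id)
        return text
    return "SYSTEM " + quote_literal(system_id)


def format_dtd2(root_name, notations):
    """Return the DOCTYPE text with notations sorted by name."""
    lines = ["<!DOCTYPE %s [\n" % root_name]
    for name in sorted(notations):
        public_id, system_id = notations[name]
        lines.append("<!NOTATION %s %s>\n" % (name, external_id(public_id, system_id)))
    lines.append("]>\n")
    return "".join(lines)
```

<P> The returned text would be written ahead of the first-form body of
the same document.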

<P> The requirement of this canonical form differs slightly from that
of the XML specification itself in that all declared notations
must be listed, not just those which were referred to.
<em>Should that change? SAX supports it easily.</em>


<A NAME=cxml3>
<H2>Third XML Canonical Form</H2>
</A>
<P> This canonical form is identical to the second form, with
two significant exceptions reflecting requirements placed on
validating XML processors:<UL>

<LI> They are required to report "white space appearing in
element content" (section 2.10). Such ignorable white space is
not represented in this canonical form.

<LI> They must report the external identifiers and notation name
for unparsed entities appearing as attribute values (section 4.4.6).
Such entities are declared in this canonical form, in lexicographic
order.

</UL>

<P> This builds on the grammar productions included above.

<PRE>
CanonXML3 ::= DTD3? CanonXML
DTD3      ::= '<!DOCTYPE ' Name ' [' #xA Notations? Unparsed? ']>' #xA
Unparsed  ::= ( '<!ENTITY ' Name ' '
                (('PUBLIC ' PubidLiteral ' ' SystemLiteral)
                |('SYSTEM ' SystemLiteral))
                ' NDATA ' Name
                '>' #xA )*
</PRE>
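<P> A small Python sketch of how unparsed entity declarations might be
written for this form follows. The function name and the input shape (a
dictionary mapping each entity name to a public identifier or None, a
system identifier, and a notation name) are assumptions for illustration.
A space is written before NDATA, as XML 1.0 well-formedness requires, and
identifiers are assumed to contain no apostrophes.

```python
def format_unparsed(entities):
    """Return unparsed entity declarations, sorted by entity name."""
    lines = []
    for name in sorted(entities):
        public_id, system_id, notation = entities[name]
        if public_id is not None:
            ext = "PUBLIC '%s' '%s'" % (public_id, system_id)
        else:
            ext = "SYSTEM '%s'" % system_id
        # Each declaration ends with a single line feed (#xA).
        lines.append("<!ENTITY %s %s NDATA %s>\n" % (name, ext, notation))
    return "".join(lines)
```

<P> In a complete third-form writer this text would follow the notation
declarations inside the document type declaration.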

<P> The requirement of this canonical form differs slightly from that
of the XML specification itself in that all declared unparsed entities
must be listed, not just those which were referred to.
<em>Should that change? SAX supports it easily.</em>
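<P> To illustrate how SAX exposes these declarations, here is a hedged
Python sketch using the DTDHandler callbacks of xml.sax. The class and
function names are hypothetical. It assumes the default expat-based
driver, which is expected to report every notation and unparsed entity
declared in the internal subset whether or not it is referred to; other
parsers may behave differently.

```python
import io
import xml.sax
from xml.sax.handler import DTDHandler


class DeclarationCollector(DTDHandler):
    """Records notation and unparsed entity declarations as they are reported."""

    def __init__(self):
        self.notations = {}  # name -> (public_id, system_id)
        self.unparsed = {}   # name -> (public_id, system_id, notation_name)

    def notationDecl(self, name, publicId, systemId):
        self.notations[name] = (publicId, systemId)

    def unparsedEntityDecl(self, name, publicId, systemId, ndata):
        self.unparsed[name] = (publicId, systemId, ndata)


def collect_declarations(xml_bytes):
    """Parse a document and return the collector holding its declarations."""
    collector = DeclarationCollector()
    parser = xml.sax.make_parser()
    parser.setDTDHandler(collector)
    parser.parse(io.BytesIO(xml_bytes))
    return collector
```

<P> The two dictionaries can then be sorted by name when the declarations
are written out.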

<P>
<ADDRESS>
<A HREF="mailto:xml-feedback@java.sun.com">xml-feedback@java.sun.com</A>
</ADDRESS>

</BODY>
</HTML>