|
1 <HTML> |
|
2 <TITLE>XML Canonical Forms</TITLE> |
|
3 <BODY> |
|
4 <H1>XML Canonical Forms</H1> |
|
5 <P><FONT COLOR=RED><b><em>DRAFT 1</em></b></FONT> |
|
<P> Structured information can be viewed in several ways, and
different tasks treat different categories of that information
as "important". A canonical form is a standard way to represent
one such category. For testing XML processors, and potentially
for other purposes, three <em>XML Canonical Forms</em> have
been defined as of this writing: <UL>
|
12 |
|
13 <LI> <a href=#cxml1>First XML Canonical Form</a>, defined by |
|
14 James Clark, is also called <em>Canonical XML</em>. |
|
15 |
|
16 <LI> <a href=#cxml2>Second XML Canonical Form</a>, defined |
|
17 by Sun, supports testing a larger subset of the XML 1.0 |
|
18 processor requirements by exposing notation declarations. |
|
19 |
|
20 <LI> <a href=#cxml3>Third XML Canonical Form</a>, defined |
|
21 by Sun, extends the second form to reflect information |
|
22 which validating XML 1.0 processors are required to report. |
|
23 |
|
24 </UL> |
|
25 |
|
<P> Recanonicalizing a document that is already in a given canonical
form to that same form changes nothing. Canonicalizing a document in
the second or third form to the first form discards all declarations.
Canonicalizing a document in the second form to the third form, or
the reverse, has no effect.
|
30 |
|
31 <P> <em>The author is pleased to acknowledge help from |
|
32 James Clark in defining the additional canonical forms.</em> |
|
33 |
|
34 |
|
35 <A NAME=cxml1> |
|
36 <H2>First XML Canonical Form</H2> |
|
37 </A> |
|
38 |
|
39 <P> <em>This description has been extracted from the version at |
|
40 <a href=http://www.jclark.com/xml/canonxml.html> |
|
41 http://www.jclark.com/xml/canonxml.html</a>.</em> |
|
42 |
|
43 <P> |
|
44 Every well-formed XML document has a unique structurally equivalent |
|
45 canonical XML document. Two structurally equivalent XML |
|
46 documents have a byte-for-byte identical canonical XML document. |
|
47 Canonicalizing an XML document requires only information that an XML |
|
48 processor is required to make available to an application. |
|
49 <P> |
|
50 A canonical XML document conforms to the following grammar: |
|
<PRE>
CanonXML    ::= Pi* element Pi*
element     ::= Stag (Datachar | Pi | element)* Etag
Stag        ::= '<'  Name Atts '>'
Etag        ::= '</' Name '>'
Pi          ::= '<?' Name ' ' (((Char - S) Char*)? - (Char* '?>' Char*)) '?>'
Atts        ::= (' ' Name '=' '"' Datachar* '"')*
Datachar    ::= '&amp;' | '&lt;' | '&gt;' | '&quot;'
                | '&#9;' | '&#10;' | '&#13;'
                | (Char - ('&' | '<' | '>' | '"' | #x9 | #xA | #xD))
Name        ::= (see XML spec)
Char        ::= (see XML spec)
S           ::= (see XML spec)
</PRE>
|
65 <P> |
|
66 Attributes are in lexicographical order (in Unicode bit order). |
|
67 <P> |
|
68 A canonical XML document is encoded in UTF-8. |
|
69 <P> |
|
70 Ignorable white space is considered significant and is treated equivalently |
|
71 to data. |
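<P> The rules above can be sketched with a SAX handler. This is a
minimal illustration, not a reference implementation: it assumes
Python's standard <code>xml.sax</code> module (a non-validating
parser), and the names <code>escape_datachar</code>,
<code>CanonicalWriter</code> and <code>canonicalize</code> are
hypothetical. Attribute names are sorted by code point, which is
taken here as the meaning of "Unicode bit order".

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Replacements required by the Datachar production.
_ESCAPES = {
    "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;",
    "\t": "&#9;", "\n": "&#10;", "\r": "&#13;",
}


def escape_datachar(text):
    """Escape text so that every character matches Datachar."""
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


class CanonicalWriter(ContentHandler):
    """Collects first-canonical-form output from SAX events."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def startElement(self, name, attrs):
        self.parts.append("<" + name)
        # Attributes are written in sorted (code point) order.
        for key in sorted(attrs.getNames()):
            value = escape_datachar(attrs.getValue(key))
            self.parts.append(' %s="%s"' % (key, value))
        self.parts.append(">")

    def endElement(self, name):
        self.parts.append("</" + name + ">")

    def characters(self, content):
        self.parts.append(escape_datachar(content))

    def ignorableWhitespace(self, whitespace):
        # In the first form, ignorable white space is treated as data.
        self.parts.append(escape_datachar(whitespace))

    def processingInstruction(self, target, data):
        # Always one space after the target, even if data is empty.
        self.parts.append("<?%s %s?>" % (target, data))


def canonicalize(xml_bytes):
    """Return the first canonical form of a document as UTF-8 bytes."""
    handler = CanonicalWriter()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.parts).encode("utf-8")
```

<P> Comments, the XML declaration and the document type declaration
never reach the handler, so they are absent from the output, as the
grammar requires. Running <code>canonicalize</code> on its own output
should return the same bytes.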
|
72 |
|
73 |
|
74 <A NAME=cxml2> |
|
75 <H2>Second XML Canonical Form</H2> |
|
76 </A> |
|
77 <P><FONT COLOR=RED><b><em>Modified to ensure that literals are surrounded by single quotes.</em></b></FONT> |
|
78 <P> This canonical form is identical to the first form, with |
|
79 one significant addition. All XML processors are required to |
|
80 report the name and external identifiers of notations that |
|
81 are declared and referred to in an XML document (section 4.7); |
|
82 those reports are reflected in declarations in this form, |
|
83 presented in lexicographic order. |
|
84 |
|
85 <P> Note that all public identifiers must be normalized before being |
|
86 presented to applications (section 4.2.2). |
|
87 |
|
88 <P> System identifiers are normalized on output to be relative |
|
89 to the input document, if that is possible, with the shortest |
|
90 such relative URI. All other URIs must be absolute. Any |
|
91 hash mark and fragment ID, if erroneously present on input, are |
|
92 removed. Any non-ASCII characters in the URI must be escaped |
|
93 as specified in the XML specification (section 4.2.2). |
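<P> The system identifier rules above might be sketched as follows.
This is an illustration under stated assumptions: the function name
<code>normalize_system_id</code> is hypothetical, a URI is treated as
relativizable only when it shares the scheme and authority of the
input document, and <code>posixpath.relpath</code> is used to build
the relative form, which is not guaranteed to be the shortest one in
every case.

```python
import posixpath
from urllib.parse import urldefrag, urljoin, urlsplit


def _escape_non_ascii(uri):
    """Replace each non-ASCII character with %HH for its UTF-8 bytes."""
    out = []
    for ch in uri:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend("%%%02X" % byte for byte in ch.encode("utf-8"))
    return "".join(out)


def normalize_system_id(system_id, base_uri):
    """Normalize a system identifier relative to the input document URI."""
    # Resolve against the input document, then drop any fragment.
    absolute, _fragment = urldefrag(urljoin(base_uri, system_id))
    target, base = urlsplit(absolute), urlsplit(base_uri)

    same_origin = (target.scheme, target.netloc) == (base.scheme, base.netloc)
    if same_origin and target.path:
        # Relative to the directory that contains the input document.
        base_dir = posixpath.dirname(base.path) or "/"
        result = posixpath.relpath(target.path, base_dir)
        if target.query:
            result += "?" + target.query
    else:
        # Cannot be made relative: keep the absolute URI.
        result = absolute
    return _escape_non_ascii(result)
```

<P> For example, with an input document at
<code>http://example.com/docs/main.xml</code> (a made-up address),
the identifier <code>http://example.com/docs/img/logo.gif#frag</code>
would be written as <code>img/logo.gif</code>.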
|
94 |
|
<PRE>
CanonXML2     ::= DTD2? CanonXML
DTD2          ::= '<!DOCTYPE ' Name ' [' #xA Notations? ']>' #xA
Notations     ::= ( '<!NOTATION ' Name ' '
                    (('PUBLIC ' PubidLiteral ' ' SystemLiteral)
                    |('PUBLIC ' PubidLiteral)
                    |('SYSTEM ' SystemLiteral))
                    '>' #xA )*
PubidLiteral  ::= "'" (PubidChar - "'")* "'"
SystemLiteral ::= "'" [^']* "'"
PubidChar     ::= (see XML spec)
</PRE>
|
107 |
|
108 <P> The requirement of this canonical form differs slightly from that |
|
109 of the XML specification itself in that all declared notations |
|
110 must be listed, not just those which were referred to. |
|
111 <em>Should that change? SAX supports it easily.</em> |
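<P> As an illustration of the second form's declaration block, the
sketch below formats a set of notation reports. The function names
and the dictionary layout (notation name mapped to a public
identifier and a system identifier, either of which may be
<code>None</code>) are assumptions made for this example. Identifiers
containing a single quote are rejected, since they cannot be written
as single-quoted literals.

```python
def normalize_public_id(public_id):
    """Collapse white space runs to one space and trim both ends."""
    return " ".join(public_id.split())


def quote_literal(value):
    """Surround a literal with single quotes."""
    if "'" in value:
        raise ValueError("literal cannot contain a single quote: %r" % value)
    return "'" + value + "'"


def external_id(public_id, system_id):
    """Format the PUBLIC or SYSTEM part of a declaration."""
    if public_id is not None:
        text = "PUBLIC " + quote_literal(normalize_public_id(public_id))
        if system_id is not None:
            text += " " + quote_literal(system_id)
        return text
    return "SYSTEM " + quote_literal(system_id)


def format_dtd2(root_name, notations):
    """Return the DTD2 text: notations sorted by name, one per line."""
    lines = ["<!DOCTYPE %s [" % root_name]
    for name in sorted(notations):
        public_id, system_id = notations[name]
        lines.append("<!NOTATION %s %s>" % (name, external_id(public_id, system_id)))
    lines.append("]>")
    # Every line, including the last, ends with #xA.
    return "\n".join(lines) + "\n"
```

<P> The result is meant to be written immediately before the first
canonical form of the document body.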
|
112 |
|
113 |
|
114 <A NAME=cxml3> |
|
115 <H2>Third XML Canonical Form</H2> |
|
116 </A> |
|
117 <P> This canonical form is identical to the second form, with |
|
118 two significant exceptions reflecting requirements placed on |
|
119 validating XML processors:<UL> |
|
120 |
|
<LI> They are required to report "white space appearing in
element content" (section 2.10). Such ignorable white space is
not represented in this canonical form.
|
124 |
|
125 <LI> They must report the external identifiers and notation name |
|
126 for unparsed entities appearing as attribute values (section 4.4.6). |
|
127 Such entities are declared in this canonical form, in lexicographic |
|
128 order. |
|
129 |
|
130 </UL> |
|
131 |
|
132 <P> This builds on the grammar productions included above. |
|
133 |
|
<PRE>
CanonXML3     ::= DTD3? CanonXML
DTD3          ::= '<!DOCTYPE ' Name ' [' #xA Notations? Unparsed? ']>' #xA
Unparsed      ::= ( '<!ENTITY ' Name ' '
                    (('PUBLIC ' PubidLiteral ' ' SystemLiteral)
                    |('SYSTEM ' SystemLiteral))
                    ' NDATA ' Name
                    '>' #xA )*
</PRE>
|
143 |
|
144 <P> The requirement of this canonical form differs slightly from that |
|
145 of the XML specification itself in that all declared unparsed entities |
|
146 must be listed, not just those which were referred to. |
|
147 <em>Should that change? SAX supports it easily.</em> |
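<P> Since SAX reports every declared notation and unparsed entity,
the declaration block of the third form can be assembled from those
reports. The sketch below assumes Python's standard
<code>xml.sax</code> module; the class and function names are
hypothetical, and system identifiers are written as reported, without
the URI normalization described for the second form.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, DTDHandler


class DeclarationCollector(DTDHandler):
    """Records notation and unparsed entity declarations."""

    def __init__(self):
        self.notations = {}   # name -> (public_id, system_id)
        self.unparsed = {}    # name -> (public_id, system_id, notation)

    def notationDecl(self, name, publicId, systemId):
        self.notations[name] = (publicId, systemId)

    def unparsedEntityDecl(self, name, publicId, systemId, ndata):
        self.unparsed[name] = (publicId, systemId, ndata)


def collect_declarations(xml_bytes):
    """Parse a document and return the declarations SAX reported."""
    collector = DeclarationCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(ContentHandler())  # element events are ignored
    parser.setDTDHandler(collector)
    parser.parse(io.BytesIO(xml_bytes))
    return collector


def _external_id(public_id, system_id):
    """Format the PUBLIC or SYSTEM part with single-quoted literals."""
    if public_id is not None:
        text = "PUBLIC '%s'" % " ".join(public_id.split())
        if system_id is not None:
            text += " '%s'" % system_id
        return text
    return "SYSTEM '%s'" % system_id


def format_dtd3(root_name, collector):
    """Return the DTD3 text: sorted notations, then sorted entities."""
    lines = ["<!DOCTYPE %s [" % root_name]
    for name in sorted(collector.notations):
        public_id, system_id = collector.notations[name]
        lines.append("<!NOTATION %s %s>" % (name, _external_id(public_id, system_id)))
    for name in sorted(collector.unparsed):
        public_id, system_id, notation = collector.unparsed[name]
        lines.append("<!ENTITY %s %s NDATA %s>"
                     % (name, _external_id(public_id, system_id), notation))
    lines.append("]>")
    return "\n".join(lines) + "\n"
```

<P> Because the collector keeps every declaration it is handed, the
output lists all declared unparsed entities, not only those that are
referred to.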
|
148 |
|
149 <P> |
|
150 <ADDRESS> |
|
151 <A HREF="mailto:xml-feedback@java.sun.com">xml-feedback@java.sun.com</A> |
|
152 </ADDRESS> |
|
153 |
|
154 </BODY> |
|
155 </HTML> |