|
1 <html> |
|
2 |
|
3 |
|
4 |
|
5 <head> |
|
6 |
|
7 <meta NAME="GENERATOR" CONTENT="Microsoft FrontPage 4.0"> |
|
8 |
|
9 <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8"> |
|
10 |
|
11 <link REL="stylesheet" HREF="http://www.unicode.org/unicode.css" TYPE="text/css"> |
|
12 |
|
13 <title>UnicodeData File Format</title> |
|
14 |
|
15 </head> |
|
16 |
|
17 |
|
18 |
|
19 <body> |
|
20 |
|
21 |
|
22 |
|
23 <h1>UnicodeData File Format<br> |
|
24 Version 3.0.0</h1> |
|
25 |
|
26 |
|
27 |
|
28 <table BORDER="1" CELLSPACING="2" CELLPADDING="0" HEIGHT="87" WIDTH="100%"> |
|
29 |
|
30 <tr> |
|
31 |
|
32 <td VALIGN="TOP" width="144">Revision</td> |
|
33 |
|
34 <td VALIGN="TOP">3.0.0</td> |
|
35 |
|
36 </tr> |
|
37 |
|
38 <tr> |
|
39 |
|
40 <td VALIGN="TOP" width="144">Authors</td> |
|
41 |
|
42 <td VALIGN="TOP">Mark Davis and Ken Whistler</td> |
|
43 |
|
44 </tr> |
|
45 |
|
46 <tr> |
|
47 |
|
48 <td VALIGN="TOP" width="144">Date</td> |
|
49 |
|
50 <td VALIGN="TOP">1999-09-12</td> |
|
51 |
|
52 </tr> |
|
53 |
|
54 <tr> |
|
55 |
|
56 <td VALIGN="TOP" width="144">This Version</td> |
|
57 |
|
58 <td VALIGN="TOP"><a href="ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html">ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html</a></td> |
|
59 |
|
60 </tr> |
|
61 |
|
62 <tr> |
|
63 |
|
64 <td VALIGN="TOP" width="144">Previous Version</td> |
|
65 |
|
66 <td VALIGN="TOP">n/a</td> |
|
67 |
|
68 </tr> |
|
69 |
|
70 <tr> |
|
71 |
|
72 <td VALIGN="TOP" width="144">Latest Version</td> |
|
73 |
|
74 <td VALIGN="TOP"><a href="ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html">ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html</a></td> |
|
75 |
|
76 </tr> |
|
77 |
|
78 </table> |
|
79 |
|
80 |
|
81 |
|
82 <p align="center">Copyright © 1995-1999 Unicode, Inc. All Rights reserved.<br> |
|
83 |
|
84 <i>For more information, including Disclaimer and Limitations, see <a HREF="UnicodeCharacterDatabase-3.0.0.html">UnicodeCharacterDatabase-3.0.0.html</a> </i></p> |
|
85 |
|
86 |
|
87 |
|
88 <p>This document describes the format of the UnicodeData.txt file, which is one of the |
|
89 |
|
90 files in the Unicode Character Database. The document is divided into the following |
|
91 |
|
92 sections: |
|
93 |
|
94 |
|
95 |
|
96 <ul> |
|
97 |
|
98 <li><a HREF="#Field Formats">Field Formats</a> <ul> |
|
99 |
|
100 <li><a HREF="#General Category">General Category</a> </li> |
|
101 |
|
102 <li><a HREF="#Bidirectional Category">Bidirectional Category</a> </li> |
|
103 |
|
104 <li><a HREF="#Character Decomposition">Character Decomposition Mapping</a> </li> |
|
105 |
|
106 <li><a HREF="#Canonical Combining Classes">Canonical Combining Classes</a> </li> |
|
107 |
|
108 <li><a HREF="#Decompositions and Normalization">Decompositions and Normalization</a> </li> |
|
109 |
|
110 <li><a HREF="#Case Mappings">Case Mappings</a> </li> |
|
111 |
|
112 </ul> |
|
113 |
|
114 </li> |
|
115 |
|
116 <li><a HREF="#Property Invariants">Property Invariants</a> </li> |
|
117 |
|
118 <li><a HREF="#Modification History">Modification History</a> </li> |
|
119 |
|
120 </ul> |
|
121 |
|
122 |
|
123 |
|
124 <p><b>Warning: </b>the information in this file does not completely describe the use and |
|
125 |
|
126 interpretation of Unicode character properties and behavior. It must be used in |
|
127 |
|
128 conjunction with the data in the other files in the Unicode Character Database, and relies |
|
129 |
|
130 on the notation and definitions supplied in <i><a href="http://www.unicode.org/unicode/standard/versions/Unicode3.0.html"> The Unicode |
|
131 Standard</a></i>. All chapter references |
|
132 |
|
133 are to Version 3.0 of the standard.</p> |
|
134 |
|
135 |
|
136 |
|
137 <h2><a NAME="Field Formats"></a>Field Formats</h2> |
|
138 |
|
139 |
|
140 |
|
141 <p>The file consists of lines containing fields terminated by semicolons. Each line |
|
142 |
|
143 represents the data for one encoded character in the Unicode Standard. Every encoded |
|
144 |
|
145 character has a data entry, with the exception of certain special ranges, as detailed |
|
146 |
|
147 below. |
|
148 |
|
149 |
|
150 |
|
151 <ul> |
|
152 |
|
153 <li>There are seven special ranges of characters that are represented only by their start and |
|
154 |
|
155 end characters, since the properties in the file are uniform, except for code values |
|
156 |
|
157 (which are all sequential and assigned). </li> |
|
158 |
|
159 <li>The names of CJK ideograph characters and the names and decompositions of Hangul |
|
160 |
|
161 syllable characters are algorithmically derivable. (See the Unicode Standard and <a |
|
162 |
|
163 HREF="http://www.unicode.org/unicode/reports/tr15/">Unicode Technical Report #15</a> for |
|
164 |
|
165 more information). </li> |
|
166 |
|
167 <li>Surrogate code values and private use characters have no names. </li> |
|
168 |
|
169 <li>The Private Use characters outside of the BMP (U+F0000..U+FFFFD, U+100000..U+10FFFD) are |
|
170 |
|
171 not listed. These correspond to surrogate pairs where the first surrogate is in the High |
|
172 |
|
173 Surrogate Private Use section. </li> |
|
174 |
|
175 </ul> |
|
176 |
|
177 |
|
178 |
|
179 <p>The exact ranges represented by start and end characters are: |
|
180 |
|
181 |
|
182 |
|
183 <ul> |
|
184 |
|
185 <li>CJK Ideographs Extension A (U+3400 - U+4DB5) </li> |
|
186 |
|
187 <li>CJK Ideographs (U+4E00 - U+9FA5) </li> |
|
188 |
|
189 <li>Hangul Syllables (U+AC00 - U+D7A3) </li> |
|
190 |
|
191 <li>Non-Private Use High Surrogates (U+D800 - U+DB7F) </li> |
|
192 |
|
193 <li>Private Use High Surrogates (U+DB80 - U+DBFF) </li> |
|
194 |
|
195 <li>Low Surrogates (U+DC00 - U+DFFF) </li> |
|
196 |
|
197 <li>The Private Use Area (U+E000 - U+F8FF) </li> |
|
198 |
|
199 </ul> |
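<p>The range list above can be turned into a simple lookup. The following is a minimal Python sketch, not part of the database itself; the names <code>SPECIAL_RANGES</code> and <code>special_range</code> are hypothetical, and the bounds are copied from the list above.</p>

```python
# Inclusive code point bounds, copied from the list of special ranges above.
# The table and function names are illustrative assumptions.
SPECIAL_RANGES = [
    (0x3400, 0x4DB5, "CJK Ideographs Extension A"),
    (0x4E00, 0x9FA5, "CJK Ideographs"),
    (0xAC00, 0xD7A3, "Hangul Syllables"),
    (0xD800, 0xDB7F, "Non-Private Use High Surrogates"),
    (0xDB80, 0xDBFF, "Private Use High Surrogates"),
    (0xDC00, 0xDFFF, "Low Surrogates"),
    (0xE000, 0xF8FF, "The Private Use Area"),
]


def special_range(code_point):
    """Return the name of the special range containing code_point, or None."""
    for first, last, name in SPECIAL_RANGES:
        if first <= code_point <= last:
            return name
    return None
```

<p>A parser would typically consult such a table when a code value has no individual entry in the file, since characters inside these ranges share the properties given for the start and end entries.</p>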
|
200 |
|
201 |
|
202 |
|
203 <p>The following table describes the format and meaning of each field in a data entry in |
|
204 |
|
205 the UnicodeData file. Fields which contain normative information are so indicated.</p> |
|
206 |
|
207 |
|
208 |
|
209 <table BORDER="1" CELLSPACING="2" CELLPADDING="2"> |
|
210 |
|
211 <tr> |
|
212 |
|
213 <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Field</th> |
|
214 |
|
215 <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Name</th> |
|
216 |
|
217 <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Status</th> |
|
218 |
|
219 <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Explanation</th> |
|
220 |
|
221 </tr> |
|
222 |
|
223 <tr> |
|
224 |
|
225 <th VALIGN="top">0</th> |
|
226 |
|
227 <td VALIGN="top">Code value</td> |
|
228 |
|
229 <td VALIGN="top">normative</td> |
|
230 |
|
231 <td VALIGN="top">Code value in 4-digit hexadecimal format.</td> |
|
232 |
|
233 </tr> |
|
234 |
|
235 <tr> |
|
236 |
|
237 <th VALIGN="top">1</th> |
|
238 |
|
239 <td VALIGN="top">Character name</td> |
|
240 |
|
241 <td VALIGN="top">normative</td> |
|
242 |
|
243 <td VALIGN="top">These names match exactly the names published in Chapter 14 of the |
|
244 |
|
245 Unicode Standard, Version 3.0.</td> |
|
246 |
|
247 </tr> |
|
248 |
|
249 <tr> |
|
250 |
|
251 <th VALIGN="top">2</th> |
|
252 |
|
253 <td VALIGN="top"><a HREF="#General Category">General Category</a> </td> |
|
254 |
|
255 <td VALIGN="top">normative / informative<br> |
|
256 |
|
257 (see below)</td> |
|
258 |
|
259 <td VALIGN="top">This is a useful breakdown into various "character types" which |
|
260 |
|
261 can be used as a default categorization in implementations. See below for a brief |
|
262 |
|
263 explanation.</td> |
|
264 |
|
265 </tr> |
|
266 |
|
267 <tr> |
|
268 |
|
269 <th VALIGN="top">3</th> |
|
270 |
|
271 <td VALIGN="top"><a HREF="#Canonical Combining Classes">Canonical Combining Classes</a> </td> |
|
272 |
|
273 <td VALIGN="top">normative</td> |
|
274 |
|
275 <td VALIGN="top">The classes used for the Canonical Ordering Algorithm in the Unicode |
|
276 |
|
277 Standard. These classes are also printed in Chapter 4 of the Unicode Standard.</td> |
|
278 |
|
279 </tr> |
|
280 |
|
281 <tr> |
|
282 |
|
283 <th VALIGN="top">4</th> |
|
284 |
|
285 <td VALIGN="top"><a HREF="#Bidirectional Category">Bidirectional Category</a> </td> |
|
286 |
|
287 <td VALIGN="top">normative</td> |
|
288 |
|
289 <td VALIGN="top">See the list below for an explanation of the abbreviations used in this |
|
290 |
|
291 field. These are the categories required by the Bidirectional Behavior Algorithm in the |
|
292 |
|
293 Unicode Standard. These categories are summarized in Chapter 3 of the Unicode Standard.</td> |
|
294 |
|
295 </tr> |
|
296 |
|
297 <tr> |
|
298 |
|
299 <th VALIGN="top">5</th> |
|
300 |
|
301 <td VALIGN="top"><a HREF="#Character Decomposition">Character Decomposition |
|
302 Mapping</a></td> |
|
303 |
|
304 <td VALIGN="top">normative</td> |
|
305 |
|
306 <td VALIGN="top">In the Unicode Standard, not all of the mappings are full (maximal) |
|
307 |
|
308 decompositions. Recursive application of look-up for decompositions will, in all cases, |
|
309 |
|
310 lead to a maximal decomposition. The decomposition mappings match exactly the |
|
311 |
|
312 decomposition mappings published with the character names in the Unicode Standard.</td> |
|
313 |
|
314 </tr> |
|
315 |
|
316 <tr> |
|
317 |
|
318 <th VALIGN="top">6</th> |
|
319 |
|
320 <td VALIGN="top">Decimal digit value</td> |
|
321 |
|
322 <td VALIGN="top">normative</td> |
|
323 |
|
324 <td VALIGN="top">This is a numeric field. If the character has the decimal digit property, |
|
325 |
|
326 as specified in Chapter 4 of the Unicode Standard, the value of that digit is represented |
|
327 |
|
328 with an integer value in this field.</td> |
|
329 |
|
330 </tr> |
|
331 |
|
332 <tr> |
|
333 |
|
334 <th VALIGN="top">7</th> |
|
335 |
|
336 <td VALIGN="top">Digit value</td> |
|
337 |
|
338 <td VALIGN="top">normative</td> |
|
339 |
|
340 <td VALIGN="top">This is a numeric field. If the character represents a digit, not |
|
341 |
|
342 necessarily a decimal digit, the value is here. This covers digits which do not form |
|
343 |
|
344 decimal radix forms, such as the compatibility superscript digits.</td> |
|
345 |
|
346 </tr> |
|
347 |
|
348 <tr> |
|
349 |
|
350 <th VALIGN="top">8</th> |
|
351 |
|
352 <td VALIGN="top">Numeric value</td> |
|
353 |
|
354 <td VALIGN="top">normative</td> |
|
355 |
|
356 <td VALIGN="top">This is a numeric field. If the character has the numeric property, as |
|
357 |
|
358 specified in Chapter 4 of the Unicode Standard, the value of that character is represented |
|
359 |
|
360 with an integer or rational number in this field. This includes fractions, for example |
|
361 |
|
362 "1/5" for U+2155 VULGAR FRACTION ONE FIFTH. Also included are numerical values |
|
363 |
|
364 for compatibility characters such as circled numbers.</td> |
|
365 |
|
366 </tr> |
|
367 |
|
368 <tr> |
|
369 |
|
370 <th VALIGN="top">9</th> |
|
371 |
|
372 <td VALIGN="top">Mirrored</td> |
|
373 |
|
374 <td VALIGN="top">normative</td> |
|
375 |
|
376 <td VALIGN="top">If the character has been identified as a "mirrored" character |
|
377 |
|
378 in bidirectional text, this field has the value "Y"; otherwise "N". |
|
379 |
|
380 The list of mirrored characters is also printed in Chapter 4 of the Unicode Standard.</td> |
|
381 |
|
382 </tr> |
|
383 |
|
384 <tr> |
|
385 |
|
386 <th VALIGN="top">10</th> |
|
387 |
|
388 <td VALIGN="top">Unicode 1.0 Name</td> |
|
389 |
|
390 <td VALIGN="top">informative</td> |
|
391 |
|
392 <td VALIGN="top">This is the old name as published in Unicode 1.0. This name is only |
|
393 |
|
394 provided when it is significantly different from the Unicode 3.0 name for the character.</td> |
|
395 |
|
396 </tr> |
|
397 |
|
398 <tr> |
|
399 |
|
400 <th VALIGN="top">11</th> |
|
401 |
|
402 <td VALIGN="top">10646 comment field</td> |
|
403 |
|
404 <td VALIGN="top">informative</td> |
|
405 |
|
406 <td VALIGN="top">This is the ISO 10646 comment field. It is in parentheses in the 10646 |
|
407 |
|
408 names list.</td> |
|
409 |
|
410 </tr> |
|
411 |
|
412 <tr> |
|
413 |
|
414 <th VALIGN="top">12</th> |
|
415 |
|
416 <td VALIGN="top"><a HREF="#Case Mappings">Uppercase Mapping</a></td> |
|
417 |
|
418 <td VALIGN="top">informative</td> |
|
419 |
|
420 <td VALIGN="top">Upper case equivalent mapping. If a character is part of an alphabet with |
|
421 |
|
422 case distinctions, and has an upper case equivalent, then the upper case equivalent is in |
|
423 |
|
424 this field. See the explanation below on case distinctions. These mappings are always |
|
425 |
|
426 one-to-one, not one-to-many or many-to-one. This field is informative.</td> |
|
427 |
|
428 </tr> |
|
429 |
|
430 <tr> |
|
431 |
|
432 <th VALIGN="top">13</th> |
|
433 |
|
434 <td VALIGN="top"><a HREF="#Case Mappings">Lowercase Mapping</a></td> |
|
435 |
|
436 <td VALIGN="top">informative</td> |
|
437 |
|
438 <td VALIGN="top">Similar to Uppercase mapping</td> |
|
439 |
|
440 </tr> |
|
441 |
|
442 <tr> |
|
443 |
|
444 <th VALIGN="top">14</th> |
|
445 |
|
446 <td VALIGN="top"><a HREF="#Case Mappings">Titlecase Mapping</a></td> |
|
447 |
|
448 <td VALIGN="top">informative</td> |
|
449 |
|
450 <td VALIGN="top">Similar to Uppercase mapping</td> |
|
451 |
|
452 </tr> |
|
453 |
|
454 </table> |
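<p>The field layout in the table above can be read with a few lines of code. The following Python sketch is one possible reading, not a reference parser: it assumes each line holds exactly 15 fields separated by semicolons, and the key names in <code>FIELD_NAMES</code> as well as the function <code>parse_line</code> are hypothetical. The sample line is written in the layout described above and should be treated as illustrative.</p>

```python
from fractions import Fraction

# Hypothetical key names for fields 0 through 14, in table order.
FIELD_NAMES = [
    "code_value", "name", "general_category", "combining_class",
    "bidi_category", "decomposition", "decimal_digit", "digit",
    "numeric", "mirrored", "unicode_1_name", "iso_comment",
    "uppercase", "lowercase", "titlecase",
]


def parse_line(line):
    """Split one data line into a dict keyed by FIELD_NAMES."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != len(FIELD_NAMES):
        raise ValueError("expected 15 fields, got %d" % len(fields))
    record = dict(zip(FIELD_NAMES, fields))
    record["code_value"] = int(record["code_value"], 16)
    record["combining_class"] = int(record["combining_class"])
    # The numeric field may hold an integer or a rational such as "1/5".
    record["numeric"] = Fraction(record["numeric"]) if record["numeric"] else None
    record["mirrored"] = record["mirrored"] == "Y"
    return record


# Illustrative sample line in the layout described above.
SAMPLE = ("2155;VULGAR FRACTION ONE FIFTH;No;0;ON;"
          "<fraction> 0031 2044 0035;;;1/5;N;FRACTION ONE FIFTH;;;;")
record = parse_line(SAMPLE)
```

<p>Using <code>Fraction</code> keeps rational values exact instead of rounding them to floating point. The remaining fields are left as strings, since an empty string is meaningful there (no mapping, no old name).</p>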
|
455 |
|
456 |
|
457 |
|
458 <h3><a NAME="General Category"></a>General Category</h3> |
|
459 |
|
460 |
|
461 |
|
462 <p>The values in this field are abbreviations for the following. Some of the values are |
|
463 |
|
464 normative, and some are informative. For more information, see the Unicode Standard.</p> |
|
465 |
|
466 |
|
467 |
|
468 <p><b>Note:</b> the standard does not assign information to control characters (except for |
|
469 |
|
470 certain cases in the Bidirectional Algorithm). Implementations will generally also assign |
|
471 |
|
472 categories to certain control characters, notably CR and LF, according to platform |
|
473 |
|
474 conventions.</p> |
|
475 |
|
476 |
|
477 |
|
478 <h4>Normative Categories</h4> |
|
479 |
|
480 |
|
481 |
|
482 <table BORDER="0" CELLSPACING="2" CELLPADDING="0"> |
|
483 |
|
484 <tr> |
|
485 |
|
486 <th><p ALIGN="LEFT">Abbr.</th> |
|
487 |
|
488 <th><p ALIGN="LEFT">Description</th> |
|
489 |
|
490 </tr> |
|
491 |
|
492 <tr> |
|
493 |
|
494 <td ALIGN="CENTER">Lu</td> |
|
495 |
|
496 <td>Letter, Uppercase</td> |
|
497 |
|
498 </tr> |
|
499 |
|
500 <tr> |
|
501 |
|
502 <td ALIGN="CENTER">Ll</td> |
|
503 |
|
504 <td>Letter, Lowercase</td> |
|
505 |
|
506 </tr> |
|
507 |
|
508 <tr> |
|
509 |
|
510 <td ALIGN="CENTER">Lt</td> |
|
511 |
|
512 <td>Letter, Titlecase</td> |
|
513 |
|
514 </tr> |
|
515 |
|
516 <tr> |
|
517 |
|
518 <td ALIGN="CENTER">Mn</td> |
|
519 |
|
520 <td>Mark, Non-Spacing</td> |
|
521 |
|
522 </tr> |
|
523 |
|
524 <tr> |
|
525 |
|
526 <td ALIGN="CENTER">Mc</td> |
|
527 |
|
528 <td>Mark, Spacing Combining</td> |
|
529 |
|
530 </tr> |
|
531 |
|
532 <tr> |
|
533 |
|
534 <td ALIGN="CENTER">Me</td> |
|
535 |
|
536 <td>Mark, Enclosing</td> |
|
537 |
|
538 </tr> |
|
539 |
|
540 <tr> |
|
541 |
|
542 <td ALIGN="CENTER">Nd</td> |
|
543 |
|
544 <td>Number, Decimal Digit</td> |
|
545 |
|
546 </tr> |
|
547 |
|
548 <tr> |
|
549 |
|
550 <td ALIGN="CENTER">Nl</td> |
|
551 |
|
552 <td>Number, Letter</td> |
|
553 |
|
554 </tr> |
|
555 |
|
556 <tr> |
|
557 |
|
558 <td ALIGN="CENTER">No</td> |
|
559 |
|
560 <td>Number, Other</td> |
|
561 |
|
562 </tr> |
|
563 |
|
564 <tr> |
|
565 |
|
566 <td ALIGN="CENTER">Zs</td> |
|
567 |
|
568 <td>Separator, Space</td> |
|
569 |
|
570 </tr> |
|
571 |
|
572 <tr> |
|
573 |
|
574 <td ALIGN="CENTER">Zl</td> |
|
575 |
|
576 <td>Separator, Line</td> |
|
577 |
|
578 </tr> |
|
579 |
|
580 <tr> |
|
581 |
|
582 <td ALIGN="CENTER">Zp</td> |
|
583 |
|
584 <td>Separator, Paragraph</td> |
|
585 |
|
586 </tr> |
|
587 |
|
588 <tr> |
|
589 |
|
590 <td ALIGN="CENTER">Cc</td> |
|
591 |
|
592 <td>Other, Control</td> |
|
593 |
|
594 </tr> |
|
595 |
|
596 <tr> |
|
597 |
|
598 <td ALIGN="CENTER">Cf</td> |
|
599 |
|
600 <td>Other, Format</td> |
|
601 |
|
602 </tr> |
|
603 |
|
604 <tr> |
|
605 |
|
606 <td ALIGN="CENTER">Cs</td> |
|
607 |
|
608 <td>Other, Surrogate</td> |
|
609 |
|
610 </tr> |
|
611 |
|
612 <tr> |
|
613 |
|
614 <td ALIGN="CENTER">Co</td> |
|
615 |
|
616 <td>Other, Private Use</td> |
|
617 |
|
618 </tr> |
|
619 |
|
620 <tr> |
|
621 |
|
622 <td ALIGN="CENTER">Cn</td> |
|
623 |
|
624 <td>Other, Not Assigned (no characters in the file have this property)</td> |
|
625 |
|
626 </tr> |
|
627 |
|
628 </table> |
|
629 |
|
630 |
|
631 |
|
632 <h4>Informative Categories</h4> |
|
633 |
|
634 |
|
635 |
|
636 <table BORDER="0" CELLSPACING="2" CELLPADDING="0"> |
|
637 |
|
638 <tr> |
|
639 |
|
640 <th><p ALIGN="LEFT">Abbr.</th> |
|
641 |
|
642 <th><p ALIGN="LEFT">Description</th> |
|
643 |
|
644 </tr> |
|
645 |
|
646 <tr> |
|
647 |
|
648 <td ALIGN="CENTER">Lm</td> |
|
649 |
|
650 <td>Letter, Modifier</td> |
|
651 |
|
652 </tr> |
|
653 |
|
654 <tr> |
|
655 |
|
656 <td ALIGN="CENTER">Lo</td> |
|
657 |
|
658 <td>Letter, Other</td> |
|
659 |
|
660 </tr> |
|
661 |
|
662 <tr> |
|
663 |
|
664 <td ALIGN="CENTER">Pc</td> |
|
665 |
|
666 <td>Punctuation, Connector</td> |
|
667 |
|
668 </tr> |
|
669 |
|
670 <tr> |
|
671 |
|
672 <td ALIGN="CENTER">Pd</td> |
|
673 |
|
674 <td>Punctuation, Dash</td> |
|
675 |
|
676 </tr> |
|
677 |
|
678 <tr> |
|
679 |
|
680 <td ALIGN="CENTER">Ps</td> |
|
681 |
|
682 <td>Punctuation, Open</td> |
|
683 |
|
684 </tr> |
|
685 |
|
686 <tr> |
|
687 |
|
688 <td ALIGN="CENTER">Pe</td> |
|
689 |
|
690 <td>Punctuation, Close</td> |
|
691 |
|
692 </tr> |
|
693 |
|
694 <tr> |
|
695 |
|
696 <td ALIGN="CENTER">Pi</td> |
|
697 |
|
698 <td>Punctuation, Initial quote (may behave like Ps or Pe depending on usage)</td> |
|
699 |
|
700 </tr> |
|
701 |
|
702 <tr> |
|
703 |
|
704 <td ALIGN="CENTER">Pf</td> |
|
705 |
|
706 <td>Punctuation, Final quote (may behave like Ps or Pe depending on usage)</td> |
|
707 |
|
708 </tr> |
|
709 |
|
710 <tr> |
|
711 |
|
712 <td ALIGN="CENTER">Po</td> |
|
713 |
|
714 <td>Punctuation, Other</td> |
|
715 |
|
716 </tr> |
|
717 |
|
718 <tr> |
|
719 |
|
720 <td ALIGN="CENTER">Sm</td> |
|
721 |
|
722 <td>Symbol, Math</td> |
|
723 |
|
724 </tr> |
|
725 |
|
726 <tr> |
|
727 |
|
728 <td ALIGN="CENTER">Sc</td> |
|
729 |
|
730 <td>Symbol, Currency</td> |
|
731 |
|
732 </tr> |
|
733 |
|
734 <tr> |
|
735 |
|
736 <td ALIGN="CENTER">Sk</td> |
|
737 |
|
738 <td>Symbol, Modifier</td> |
|
739 |
|
740 </tr> |
|
741 |
|
742 <tr> |
|
743 |
|
744 <td ALIGN="CENTER">So</td> |
|
745 |
|
746 <td>Symbol, Other</td> |
|
747 |
|
748 </tr> |
|
749 |
|
750 </table> |
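<p>The split between normative and informative General Category values can be captured in two sets. This is a small Python sketch built from the two tables above; the names <code>NORMATIVE</code>, <code>INFORMATIVE</code>, <code>MAJOR_CLASS</code> and <code>describe_category</code> are hypothetical.</p>

```python
# Abbreviations copied from the "Normative Categories" table above.
NORMATIVE = {"Lu", "Ll", "Lt", "Mn", "Mc", "Me", "Nd", "Nl", "No",
             "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn"}
# Abbreviations copied from the "Informative Categories" table above.
INFORMATIVE = {"Lm", "Lo", "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
               "Sm", "Sc", "Sk", "So"}
# The first letter of each abbreviation names the major class.
MAJOR_CLASS = {"L": "Letter", "M": "Mark", "N": "Number", "Z": "Separator",
               "C": "Other", "P": "Punctuation", "S": "Symbol"}


def describe_category(abbr):
    """Return (major class, status) for a General Category abbreviation."""
    if abbr in NORMATIVE:
        status = "normative"
    elif abbr in INFORMATIVE:
        status = "informative"
    else:
        raise ValueError("unknown General Category: %r" % abbr)
    return MAJOR_CLASS[abbr[0]], status
```

<p>Note that letters appear in both tables: Lu, Ll and Lt are normative, while Lm and Lo are informative, so the status cannot be derived from the first letter alone.</p>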
|
751 |
|
752 |
|
753 |
|
754 <h3><a NAME="Bidirectional Category"></a>Bidirectional Category</h3> |
|
755 |
|
756 |
|
757 |
|
758 <p>Please refer to Chapter 3 for an explanation of the algorithm for Bidirectional |
|
759 |
|
760 Behavior and an explanation of the significance of these categories. An up-to-date version |
|
761 |
|
762 can be found on <a HREF="http://www.unicode.org/unicode/reports/tr9/">Unicode Technical |
|
763 |
|
764 Report #9: The Bidirectional Algorithm</a>. These values are normative.</p> |
|
765 |
|
766 |
|
767 |
|
768 <table BORDER="0" CELLPADDING="2"> |
|
769 |
|
770 <tr> |
|
771 |
|
772 <th VALIGN="TOP" ALIGN="LEFT"><p ALIGN="LEFT">Type</th> |
|
773 |
|
774 <th VALIGN="TOP" ALIGN="LEFT"><p ALIGN="LEFT">Description</th> |
|
775 |
|
776 </tr> |
|
777 |
|
778 <tr> |
|
779 |
|
780 <td VALIGN="TOP"><b>L</b></td> |
|
781 |
|
782 <td VALIGN="TOP">Left-to-Right</td> |
|
783 |
|
784 </tr> |
|
785 |
|
786 <tr> |
|
787 |
|
788 <td VALIGN="TOP"><b>LRE</b></td> |
|
789 |
|
790 <td VALIGN="TOP">Left-to-Right Embedding</td> |
|
791 |
|
792 </tr> |
|
793 |
|
794 <tr> |
|
795 |
|
796 <td VALIGN="TOP"><b>LRO</b></td> |
|
797 |
|
798 <td VALIGN="TOP">Left-to-Right Override</td> |
|
799 |
|
800 </tr> |
|
801 |
|
802 <tr> |
|
803 |
|
804 <td VALIGN="TOP"><b>R</b></td> |
|
805 |
|
806 <td VALIGN="TOP">Right-to-Left</td> |
|
807 |
|
808 </tr> |
|
809 |
|
810 <tr> |
|
811 |
|
812 <td VALIGN="TOP"><b>AL</b></td> |
|
813 |
|
814 <td VALIGN="TOP">Right-to-Left Arabic</td> |
|
815 |
|
816 </tr> |
|
817 |
|
818 <tr> |
|
819 |
|
820 <td VALIGN="TOP"><b>RLE</b></td> |
|
821 |
|
822 <td VALIGN="TOP">Right-to-Left Embedding</td> |
|
823 |
|
824 </tr> |
|
825 |
|
826 <tr> |
|
827 |
|
828 <td VALIGN="TOP"><b>RLO</b></td> |
|
829 |
|
830 <td VALIGN="TOP">Right-to-Left Override</td> |
|
831 |
|
832 </tr> |
|
833 |
|
834 <tr> |
|
835 |
|
836 <td VALIGN="TOP"><b>PDF</b></td> |
|
837 |
|
838 <td VALIGN="TOP">Pop Directional Format</td> |
|
839 |
|
840 </tr> |
|
841 |
|
842 <tr> |
|
843 |
|
844 <td VALIGN="TOP"><b>EN</b></td> |
|
845 |
|
846 <td VALIGN="TOP">European Number</td> |
|
847 |
|
848 </tr> |
|
849 |
|
850 <tr> |
|
851 |
|
852 <td VALIGN="TOP"><b>ES</b></td> |
|
853 |
|
854 <td VALIGN="TOP">European Number Separator</td> |
|
855 |
|
856 </tr> |
|
857 |
|
858 <tr> |
|
859 |
|
860 <td VALIGN="TOP"><b>ET</b></td> |
|
861 |
|
862 <td VALIGN="TOP">European Number Terminator</td> |
|
863 |
|
864 </tr> |
|
865 |
|
866 <tr> |
|
867 |
|
868 <td VALIGN="TOP"><b>AN</b></td> |
|
869 |
|
870 <td VALIGN="TOP">Arabic Number</td> |
|
871 |
|
872 </tr> |
|
873 |
|
874 <tr> |
|
875 |
|
876 <td VALIGN="TOP"><b>CS</b></td> |
|
877 |
|
878 <td VALIGN="TOP">Common Number Separator</td> |
|
879 |
|
880 </tr> |
|
881 |
|
882 <tr> |
|
883 |
|
884 <td VALIGN="TOP"><b>NSM</b></td> |
|
885 |
|
886 <td VALIGN="TOP">Non-Spacing Mark</td> |
|
887 |
|
888 </tr> |
|
889 |
|
890 <tr> |
|
891 |
|
892 <td VALIGN="TOP"><b>BN</b></td> |
|
893 |
|
894 <td VALIGN="TOP">Boundary Neutral</td> |
|
895 |
|
896 </tr> |
|
897 |
|
898 <tr> |
|
899 |
|
900 <td VALIGN="TOP"><b>B</b></td> |
|
901 |
|
902 <td VALIGN="TOP">Paragraph Separator</td> |
|
903 |
|
904 </tr> |
|
905 |
|
906 <tr> |
|
907 |
|
908 <td VALIGN="TOP"><b>S</b></td> |
|
909 |
|
910 <td VALIGN="TOP">Segment Separator</td> |
|
911 |
|
912 </tr> |
|
913 |
|
914 <tr> |
|
915 |
|
916 <td VALIGN="TOP"><b>WS</b></td> |
|
917 |
|
918 <td VALIGN="TOP">Whitespace</td> |
|
919 |
|
920 </tr> |
|
921 |
|
922 <tr> |
|
923 |
|
924 <td VALIGN="TOP"><b>ON</b></td> |
|
925 |
|
926 <td VALIGN="TOP">Other Neutrals</td> |
|
927 |
|
928 </tr> |
|
929 |
|
930 </table> |
|
931 |
|
932 |
|
933 |
|
934 <h3><a NAME="Character Decomposition"></a>Character Decomposition Mapping</h3> |
|
935 |
|
936 |
|
937 |
|
938 <p>The decomposition is a normative property of a character. The tags supplied with |
|
939 |
|
940 certain decomposition mappings generally indicate formatting information. Where no such |
|
941 |
|
942 tag is given, the mapping is designated as canonical. Conversely, the presence of a |
|
943 |
|
944 formatting tag also indicates that the mapping is a compatibility mapping and not a |
|
945 |
|
946 canonical mapping. In the absence of other formatting information in a compatibility |
|
947 |
|
948 mapping, the tag is used to distinguish it from canonical mappings.</p> |
|
949 |
|
950 |
|
951 |
|
952 <p>In some instances a canonical mapping or a compatibility mapping may consist of a |
|
953 |
|
954 single character. For a canonical mapping, this indicates that the character is a |
|
955 |
|
956 canonical equivalent of another single character. For a compatibility mapping, this |
|
957 |
|
958 indicates that the character is a compatibility equivalent of another single character. |
|
959 |
|
960 The compatibility formatting tags used are:</p> |
|
961 |
|
962 |
|
963 |
|
964 <table BORDER="0" CELLSPACING="2" CELLPADDING="0"> |
|
965 |
|
966 <tr> |
|
967 |
|
968 <th>Tag</th> |
|
969 |
|
970 <th><p ALIGN="LEFT">Description</th> |
|
971 |
|
972 </tr> |
|
973 |
|
974 <tr> |
|
975 |
|
976 <td ALIGN="CENTER">&lt;font&gt; </td> |
|
977 |
|
978 <td>A font variant (e.g. a blackletter form).</td> |
|
979 |
|
980 </tr> |
|
981 |
|
982 <tr> |
|
983 |
|
984 <td ALIGN="CENTER">&lt;noBreak&gt; </td> |
|
985 |
|
986 <td>A no-break version of a space or hyphen.</td> |
|
987 |
|
988 </tr> |
|
989 |
|
990 <tr> |
|
991 |
|
992 <td ALIGN="CENTER">&lt;initial&gt; </td> |
|
993 |
|
994 <td>An initial presentation form (Arabic).</td> |
|
995 |
|
996 </tr> |
|
997 |
|
998 <tr> |
|
999 |
|
1000 <td ALIGN="CENTER">&lt;medial&gt; </td> |
|
1001 |
|
1002 <td>A medial presentation form (Arabic).</td> |
|
1003 |
|
1004 </tr> |
|
1005 |
|
1006 <tr> |
|
1007 |
|
1008 <td ALIGN="CENTER">&lt;final&gt; </td> |
|
1009 |
|
1010 <td>A final presentation form (Arabic).</td> |
|
1011 |
|
1012 </tr> |
|
1013 |
|
1014 <tr> |
|
1015 |
|
1016 <td ALIGN="CENTER">&lt;isolated&gt; </td> |
|
1017 |
|
1018 <td>An isolated presentation form (Arabic).</td> |
|
1019 |
|
1020 </tr> |
|
1021 |
|
1022 <tr> |
|
1023 |
|
1024 <td ALIGN="CENTER">&lt;circle&gt; </td> |
|
1025 |
|
1026 <td>An encircled form.</td> |
|
1027 |
|
1028 </tr> |
|
1029 |
|
1030 <tr> |
|
1031 |
|
1032 <td ALIGN="CENTER">&lt;super&gt; </td> |
|
1033 |
|
1034 <td>A superscript form.</td> |
|
1035 |
|
1036 </tr> |
|
1037 |
|
1038 <tr> |
|
1039 |
|
1040 <td ALIGN="CENTER">&lt;sub&gt; </td> |
|
1041 |
|
1042 <td>A subscript form.</td> |
|
1043 |
|
1044 </tr> |
|
1045 |
|
1046 <tr> |
|
1047 |
|
1048 <td ALIGN="CENTER">&lt;vertical&gt; </td> |
|
1049 |
|
1050 <td>A vertical layout presentation form.</td> |
|
1051 |
|
1052 </tr> |
|
1053 |
|
1054 <tr> |
|
1055 |
|
1056 <td ALIGN="CENTER">&lt;wide&gt; </td> |
|
1057 |
|
1058 <td>A wide (or zenkaku) compatibility character.</td> |
|
1059 |
|
1060 </tr> |
|
1061 |
|
1062 <tr> |
|
1063 |
|
1064 <td ALIGN="CENTER">&lt;narrow&gt; </td> |
|
1065 |
|
1066 <td>A narrow (or hankaku) compatibility character.</td> |
|
1067 |
|
1068 </tr> |
|
1069 |
|
1070 <tr> |
|
1071 |
|
1072 <td ALIGN="CENTER">&lt;small&gt; </td> |
|
1073 |
|
1074 <td>A small variant form (CNS compatibility).</td> |
|
1075 |
|
1076 </tr> |
|
1077 |
|
1078 <tr> |
|
1079 |
|
1080 <td ALIGN="CENTER">&lt;square&gt; </td> |
|
1081 |
|
1082 <td>A CJK squared font variant.</td> |
|
1083 |
|
1084 </tr> |
|
1085 |
|
1086 <tr> |
|
1087 |
|
1088 <td ALIGN="CENTER">&lt;fraction&gt; </td> |
|
1089 |
|
1090 <td>A vulgar fraction form.</td> |
|
1091 |
|
1092 </tr> |
|
1093 |
|
1094 <tr> |
|
1095 |
|
1096 <td ALIGN="CENTER">&lt;compat&gt; </td> |
|
1097 |
|
1098 <td>Otherwise unspecified compatibility character.</td> |
|
1099 |
|
1100 </tr> |
|
1101 |
|
1102 </table> |
|
1103 |
|
1104 |
|
1105 |
|
1106 <p><b>Reminder: </b>There is a difference between decomposition and decomposition mapping. |
|
1107 |
|
1108 The decomposition mappings are defined in the UnicodeData file, while the decomposition (also |
|
1109 |
|
1110 termed "full decomposition") is defined in Chapter 3 to use those mappings |
|
1111 <i> |
|
1112 |
|
1113 recursively.</i> |
|
1114 |
|
1115 |
|
1116 |
|
1117 <ul> |
|
1118 |
|
1119 <li>The canonical decomposition is formed by recursively applying the canonical mappings, |
|
1120 |
|
1121 then applying the canonical reordering algorithm. </li> |
|
1122 |
|
1123 <li>The compatibility decomposition is formed by recursively applying the canonical <em>and</em> |
|
1124 |
|
1125 compatibility mappings, then applying the canonical reordering algorithm. </li> |
|
1126 |
|
1127 </ul> |
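<p>The recursive step described in the two points above might be sketched as follows in Python. This is an illustration under stated assumptions, not the normative algorithm: the function names <code>parse_mapping</code> and <code>decompose</code> are hypothetical, <code>MAPPINGS</code> is a tiny illustrative table keyed by code value, Hangul syllables (whose decompositions are derived algorithmically) are not handled, and canonical reordering is deliberately left out.</p>

```python
def parse_mapping(field):
    """Split a decomposition field into (tag, code points).

    The tag is None when no formatting tag is present, which marks the
    mapping as canonical.
    """
    parts = field.split()
    tag = None
    if parts and parts[0].startswith("<"):
        tag = parts[0].strip("<>")
        parts = parts[1:]
    return tag, [int(p, 16) for p in parts]


def decompose(code_point, mappings, compatibility=False):
    """Apply decomposition mappings recursively (no reordering)."""
    tag, targets = parse_mapping(mappings.get(code_point, ""))
    # Stop when there is no mapping, or when the mapping is a
    # compatibility mapping and only canonical ones were requested.
    if not targets or (tag is not None and not compatibility):
        return [code_point]
    result = []
    for target in targets:
        result.extend(decompose(target, mappings, compatibility))
    return result


# Illustrative decomposition fields, keyed by code value.
MAPPINGS = {
    0x01D5: "00DC 0304",                  # canonical, not yet maximal
    0x00DC: "0055 0308",                  # canonical
    0x2155: "<fraction> 0031 2044 0035",  # compatibility
}
```

<p>With this table, <code>decompose(0x01D5, MAPPINGS)</code> follows two canonical mappings in turn, while U+2155 is only expanded when <code>compatibility=True</code> is passed.</p>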
|
1128 |
|
1129 |
|
1130 |
|
1131 <h3><a NAME="Canonical Combining Classes"></a>Canonical Combining Classes</h3> |
|
1132 |
|
1133 |
|
1134 |
|
1135 <table BORDER="0" CELLSPACING="2" CELLPADDING="0"> |
|
1136 |
|
1137 <tr> |
|
1138 |
|
1139 <th><p ALIGN="LEFT">Value</th> |
|
1140 |
|
1141 <th><p ALIGN="LEFT">Description</th> |
|
1142 |
|
1143 </tr> |
|
1144 |
|
1145 <tr> |
|
1146 |
|
1147 <td ALIGN="RIGHT">0:</td> |
|
1148 |
|
1149 <td>Spacing, split, enclosing, reordrant, and Tibetan subjoined</td> |
|
1150 |
|
1151 </tr> |
|
1152 |
|
1153 <tr> |
|
1154 |
|
1155 <td ALIGN="RIGHT">1:</td> |
|
1156 |
|
1157 <td>Overlays and interior</td> |
|
1158 |
|
1159 </tr> |
|
1160 |
|
1161 <tr> |
|
1162 |
|
1163 <td ALIGN="RIGHT">7:</td> |
|
1164 |
|
1165 <td>Nuktas</td> |
|
1166 |
|
1167 </tr> |
|
1168 |
|
1169 <tr> |
|
1170 |
|
1171 <td ALIGN="RIGHT">8:</td> |
|
1172 |
|
1173 <td>Hiragana/Katakana voicing marks</td> |
|
1174 |
|
1175 </tr> |
|
1176 |
|
1177 <tr> |
|
1178 |
|
1179 <td ALIGN="RIGHT">9:</td> |
|
1180 |
|
1181 <td>Viramas</td> |
|
1182 |
|
1183 </tr> |
|
1184 |
|
1185 <tr> |
|
1186 |
|
1187 <td ALIGN="RIGHT">10:</td> |
|
1188 |
|
1189 <td>Start of fixed position classes</td> |
|
1190 |
|
1191 </tr> |
|
1192 |
|
1193 <tr> |
|
1194 |
|
1195 <td ALIGN="RIGHT">199:</td> |
|
1196 |
|
1197 <td>End of fixed position classes</td> |
|
1198 |
|
1199 </tr> |
|
1200 |
|
1201 <tr> |
|
1202 |
|
1203 <td ALIGN="RIGHT">200:</td> |
|
1204 |
|
1205 <td>Below left attached</td> |
|
1206 |
|
1207 </tr> |
|
1208 |
|
1209 <tr> |
|
1210 |
|
1211 <td ALIGN="RIGHT">202:</td> |
|
1212 |
|
1213 <td>Below attached</td> |
|
1214 |
|
1215 </tr> |
|
1216 |
|
1217 <tr> |
|
1218 |
|
1219 <td ALIGN="RIGHT">204:</td> |
|
1220 |
|
1221 <td>Below right attached</td> |
|
1222 |
|
1223 </tr> |
|
1224 |
|
1225 <tr> |
|
1226 |
|
1227 <td ALIGN="RIGHT">208:</td> |
|
1228 |
|
1229 <td>Left attached (reordrant around single base character)</td> |
|
1230 |
|
1231 </tr> |
|
1232 |
|
1233 <tr> |
|
1234 |
|
1235 <td ALIGN="RIGHT">210:</td> |
|
1236 |
|
1237 <td>Right attached</td> |
|
1238 |
|
1239 </tr> |
|
1240 |
|
1241 <tr> |
|
1242 |
|
1243 <td ALIGN="RIGHT">212:</td> |
|
1244 |
|
1245 <td>Above left attached</td> |
|
1246 |
|
1247 </tr> |
|
1248 |
|
1249 <tr> |
|
1250 |
|
1251 <td ALIGN="RIGHT">214:</td> |
|
1252 |
|
1253 <td>Above attached</td> |
|
1254 |
|
1255 </tr> |
|
1256 |
|
1257 <tr> |
|
1258 |
|
1259 <td ALIGN="RIGHT">216:</td> |
|
1260 |
|
1261 <td>Above right attached</td> |
|
1262 |
|
1263 </tr> |
|
1264 |
|
1265 <tr> |
|
1266 |
|
1267 <td ALIGN="RIGHT">218:</td> |
|
1268 |
|
1269 <td>Below left</td> |
|
1270 |
|
1271 </tr> |
|
1272 |
|
1273 <tr> |
|
1274 |
|
1275 <td ALIGN="RIGHT">220:</td> |
|
1276 |
|
1277 <td>Below</td> |
|
1278 |
|
1279 </tr> |
|
1280 |
|
1281 <tr> |
|
1282 |
|
1283 <td ALIGN="RIGHT">222:</td> |
|
1284 |
|
1285 <td>Below right</td> |
|
1286 |
|
1287 </tr> |
|
1288 |
|
1289 <tr> |
|
1290 |
|
1291 <td ALIGN="RIGHT">224:</td> |
|
1292 |
|
1293 <td>Left (reordrant around single base character)</td> |
|
1294 |
|
1295 </tr> |
|
1296 |
|
1297 <tr> |
|
1298 |
|
1299 <td ALIGN="RIGHT">226:</td> |
|
1300 |
|
1301 <td>Right</td> |
|
1302 |
|
1303 </tr> |
|
1304 |
|
1305 <tr> |
|
1306 |
|
1307 <td ALIGN="RIGHT">228:</td> |
|
1308 |
|
1309 <td>Above left</td> |
|
1310 |
|
1311 </tr> |
|
1312 |
|
1313 <tr> |
|
1314 |
|
1315 <td ALIGN="RIGHT">230:</td> |
|
1316 |
|
1317 <td>Above</td> |
|
1318 |
|
1319 </tr> |
|
1320 |
|
1321 <tr> |
|
1322 |
|
1323 <td ALIGN="RIGHT">232:</td> |
|
1324 |
|
1325 <td>Above right</td> |
|
1326 |
|
1327 </tr> |
|
1328 |
|
1329 <tr> |
|
1330 |
|
1331 <td ALIGN="RIGHT">233:</td> |
|
1332 |
|
1333 <td>Double below</td> |
|
1334 |
|
1335 </tr> |
|
1336 |
|
1337 <tr> |
|
1338 |
|
1339 <td ALIGN="RIGHT">234:</td> |
|
1340 |
|
1341 <td>Double above</td> |
|
1342 |
|
1343 </tr> |
|
1344 |
|
1345 <tr> |
|
1346 |
|
1347 <td ALIGN="RIGHT">240:</td> |
|
1348 |
|
1349 <td>Below (iota subscript)</td> |
|
1350 |
|
1351 </tr> |
|
1352 |
|
1353 </table> |
|
1354 |
|
1355 |
|
1356 |
|
1357 <p><strong>Note: </strong>some of the combining classes in this list do not currently have |
|
1358 |
|
1359 members but are specified here for completeness.</p> |
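<p>The combining classes exist to drive canonical ordering, which is specified in the Unicode Standard rather than in this file. As a rough Python sketch of the idea, assuming that ordering amounts to a stable sort of each run of characters with non-zero combining class: the names <code>canonical_reorder</code>, <code>SAMPLE_CLASSES</code> and <code>sample_class</code> are hypothetical, and the two class values shown are illustrative samples.</p>

```python
def canonical_reorder(code_points, combining_class):
    """Sort each run of non-zero combining classes by class, stably.

    combining_class is any callable mapping a code point to its
    canonical combining class (field 3 of the data entry).
    """
    result = list(code_points)
    start = 0
    while start < len(result):
        if combining_class(result[start]) == 0:
            start += 1
            continue
        end = start
        while end < len(result) and combining_class(result[end]) != 0:
            end += 1
        # sorted() is stable, so marks of equal class keep their order.
        result[start:end] = sorted(result[start:end], key=combining_class)
        start = end
    return result


# Illustrative classes: an "Above" mark (230) and a "Below" mark (220).
SAMPLE_CLASSES = {0x0301: 230, 0x0323: 220}


def sample_class(code_point):
    """Look up a combining class, defaulting to 0 (no reordering)."""
    return SAMPLE_CLASSES.get(code_point, 0)
```

<p>Characters of class 0 act as barriers: marks are never moved across them, which is why the sort is applied run by run rather than to the whole string.</p>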
|
1360 |
|
1361 |
|
1362 |
|
1363 <h3><a NAME="Decompositions and Normalization"></a>Decompositions and Normalization</h3> |
|
1364 |
|
1365 |
|
1366 |
|
1367 <p>Decomposition is specified in Chapter 3. <a href="http://www.unicode.org/unicode/reports/tr15/"><i>Unicode Technical Report #15: |
|
1368 |
|
1369 Normalization Forms</i></a> specifies the interaction between decomposition and normalization. The |
|
1370 |
|
1371 most up-to-date version is found on <a HREF="http://www.unicode.org/unicode/reports/tr15/">http://www.unicode.org/unicode/reports/tr15/</a>. |
|
1372 |
|
1373 That report specifies how the decompositions defined in UnicodeData.txt are used to derive |
|
1374 |
|
1375 normalized forms of Unicode text.</p> |
|
1376 |
|
1377 |
|
1378 |
|
1379 <p>Note that as of the 2.1.9 update of the Unicode Character Database, the decompositions |
|
1380 |
|
1381 in the UnicodeData.txt file can be used to recursively derive the full decomposition in |
|
1382 |
|
1383 canonical order, without the need to separately apply canonical reordering. However, |
|
1384 |
|
1385 canonical reordering of combining character sequences must still be applied in |
|
1386 |
|
1387 decomposition when normalizing source text which contains any combining marks.</p> |
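<p>As a hedged illustration of the recursive derivation described above, the following Python sketch expands one character through a small, hand-copied excerpt of canonical mappings. The table <tt>CANONICAL</tt> and the function name <tt>full_decomposition</tt> are assumptions made for this example; a real implementation would load every untagged decomposition from UnicodeData.txt.</p>

```python
# Hypothetical excerpt of canonical decomposition mappings (the decomposition
# field of UnicodeData.txt, entries without a <tag>), keyed by code point.
CANONICAL = {
    0x1EAC: (0x1EA0, 0x0302),  # A WITH CIRCUMFLEX AND DOT BELOW -> A WITH DOT BELOW + CIRCUMFLEX
    0x1EA0: (0x0041, 0x0323),  # A WITH DOT BELOW -> A + COMBINING DOT BELOW
}


def full_decomposition(cp):
    """Recursively expand a code point using the mappings in CANONICAL."""
    mapping = CANONICAL.get(cp)
    if mapping is None:
        return [cp]  # no canonical mapping: the character stands for itself
    result = []
    for part in mapping:
        result.extend(full_decomposition(part))
    return result


print([f"{cp:04X}" for cp in full_decomposition(0x1EAC)])
```

<p>In this sketch the dot below (class 220) comes out ahead of the circumflex (class 230) without any sorting step. Source text that already contains combining marks would still need canonical reordering.</p>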
|
1388 |
|
1389 |
|
1390 |
|
1391 <h3><a NAME="Case Mappings"></a>Case Mappings</h3> |
|
1392 |
|
1393 |
|
1394 |
|
1395 <p>The case mapping is an informative, default mapping. Case itself, on the other hand, |
|
1396 |
|
1397 has normative status. Thus, for example, 0041 LATIN CAPITAL LETTER A is normatively |
|
1398 |
|
1399 uppercase, but its lowercase mapping to 0061 LATIN SMALL LETTER A is informative. The |
|
1400 |
|
1401 reason for this is that case can be considered to be an inherent property of a particular |
|
1402 |
|
1403 character (and is usually, but not always, derivable from the presence of the terms |
|
1404 |
|
1405 "CAPITAL" or "SMALL" in the character name), but case mappings between |
|
1406 |
|
1407 characters are occasionally influenced by local conventions. For example, certain |
|
1408 |
|
1409 languages, such as Turkish, German, French, or Greek, may have small deviations from the |
|
1410 |
|
1411 default mappings listed in UnicodeData.</p> |
|
1412 |
|
1413 |
|
1414 |
|
1415 <p>In addition to uppercase and lowercase, because of the inclusion of certain composite |
|
1416 |
|
1417 characters for compatibility, such as 01F1 LATIN CAPITAL LETTER DZ, there is a third case, |
|
1418 |
|
1419 called <i>titlecase</i>, which is used where the first letter of a word is to be |
|
1420 |
|
1421 capitalized (e.g. UPPERCASE, Titlecase, lowercase). An example of such a titlecase letter |
|
1422 |
|
1423 is 01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z.</p> |
|
1424 |
|
1425 |
|
1426 |
|
1427 <p>The uppercase, titlecase and lowercase fields are only included for characters that |
|
1428 |
|
1429 have a single corresponding character of that type. Composite characters (such as |
|
1430 |
|
1431 "339D SQUARE CM") that do not have a single corresponding character of that type |
|
1432 |
|
1433 can be cased by decomposition.</p> |
|
1434 |
|
1435 |
|
1436 |
|
1437 <p>For compatibility with existing parsers, UnicodeData only contains case mappings for |
|
1438 |
|
1439 characters where they are one-to-one mappings; it also omits information about |
|
1440 |
|
1441 context-sensitive case mappings. Information about these special cases can be found in a |
|
1442 |
|
1443 separate data file, SpecialCasing.txt, |
|
1444 |
|
1445 which has been added starting with the 2.1.8 update to the Unicode data files. |
|
1446 |
|
1447 SpecialCasing.txt contains additional informative case mappings that are either not |
|
1448 |
|
1449 one-to-one or which are context-sensitive.</p> |
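<p>The one-to-one case fields can be read with a few lines of code. The sketch below is illustrative only: the function name <tt>parse_case_fields</tt> and the returned dictionary keys are assumptions, and it presumes the semicolon-separated layout of 15 fields with the uppercase, lowercase, and titlecase mappings in the last three positions.</p>

```python
def parse_case_fields(line):
    """Extract the simple case mappings from one UnicodeData.txt record."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")

    def code_point(text):
        # An empty field means no single-character mapping is listed.
        return int(text, 16) if text else None

    return {
        "code": int(fields[0], 16),
        "upper": code_point(fields[12]),
        "lower": code_point(fields[13]),
        "title": code_point(fields[14]),
    }


# Record for 0041 LATIN CAPITAL LETTER A: only the lowercase field is filled.
SAMPLE = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
print(parse_case_fields(SAMPLE))
```

<p>An empty field is returned as <tt>None</tt>. A caller that needs mappings which are not one-to-one, or which are context-sensitive, would have to consult SpecialCasing.txt separately.</p>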
|
1450 |
|
1451 |
|
1452 |
|
1453 <h2><a NAME="Property Invariants"></a>Property Invariants</h2> |
|
1454 |
|
1455 |
|
1456 |
|
1457 <p>Values in UnicodeData.txt are subject to correction as errors are found; however, some |
|
1458 |
|
1459 characteristics of the categories themselves can be considered invariants. Applications |
|
1460 |
|
1461 may wish to take these invariants into account when choosing how to implement character |
|
1462 |
|
1463 properties. The following is a partial list of known invariants for the Unicode Character |
|
1464 |
|
1465 Database.</p> |
|
1466 |
|
1467 |
|
1468 |
|
1469 <h4>Database Fields</h4> |
|
1470 |
|
1471 |
|
1472 |
|
1473 <ul> |
|
1474 |
|
1475 <li>The number of fields in UnicodeData.txt is fixed. </li> |
|
1476 |
|
1477 <li>The order of the fields is also fixed. <ul> |
|
1478 |
|
1479 <li>Any additional information about character properties to be added in the future will |
|
1480 |
|
1481 appear in separate data tables, rather than being added on to the existing table or by |
|
1482 |
|
1483 subdivision or reinterpretation of existing fields. </li> |
|
1484 |
|
1485 </ul> |
|
1486 |
|
1487 </li> |
|
1488 |
|
1489 </ul> |
|
1490 |
|
1491 |
|
1492 |
|
1493 <h4>General Category</h4> |
|
1494 |
|
1495 |
|
1496 |
|
1497 <ul> |
|
1498 |
|
1499 <li>There will never be more than 32 General Category values. <ul> |
|
1500 |
|
1501 <li>It is very unlikely that the Unicode Technical Committee will subdivide the General |
|
1502 |
|
1503 Category partition any further, since that can cause implementations to misbehave. Because |
|
1504 |
|
1505 the General Category is limited to 32 values, 5 bits can be used to represent the |
|
1506 |
|
1507 information, and a 32-bit integer can be used as a bitmask to represent arbitrary sets of |
|
1508 |
|
1509 categories. </li> |
|
1510 |
|
1511 </ul> |
|
1512 |
|
1513 </li> |
|
1514 |
|
1515 </ul> |
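<p>A minimal sketch of the bitmask idea follows. The bit assigned to each category is an arbitrary choice made for this example, not something defined by the database, and the helper names <tt>mask</tt> and <tt>in_set</tt> are hypothetical.</p>

```python
# General Category abbreviations; the position in this list picks the bit.
# The ordering is an assumption of this sketch, not part of the data file.
CATEGORIES = [
    "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl",
    "No", "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc",
    "Sk", "So", "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn",
]
assert len(CATEGORIES) <= 32  # every category fits in a 32-bit mask

BIT = {name: 1 << index for index, name in enumerate(CATEGORIES)}


def mask(*names):
    """Combine several categories into one integer bitmask."""
    value = 0
    for name in names:
        value |= BIT[name]
    return value


def in_set(category, category_mask):
    """Test membership of one category in a mask with a single AND."""
    return bool(BIT[category] & category_mask)


CASED = mask("Lu", "Ll", "Lt")
print(in_set("Lt", CASED), in_set("Mn", CASED))
```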
|
1516 |
|
1517 |
|
1518 |
|
1519 <h4>Combining Classes</h4> |
|
1520 |
|
1521 |
|
1522 |
|
1523 <ul> |
|
1524 |
|
1525 <li>Combining classes are limited to the values 0 to 255. <ul> |
|
1526 |
|
1527 <li>In practice, there are far fewer than 256 values used. Implementations may take |
|
1528 |
|
1529 advantage of this fact for compression, since only the ordering of the non-zero values |
|
1530 |
|
1531 matters for the Canonical Reordering Algorithm. It is possible for up to 256 values to be |
|
1532 |
|
1533 used in the future; however, UTC decisions in the future may restrict the number of values |
|
1534 |
|
1535 to 128, since this has implementation advantages. [Signed bytes can be used without |
|
1536 |
|
1537 widening to ints in Java, for example.] </li> |
|
1538 |
|
1539 </ul> |
|
1540 |
|
1541 </li> |
|
1542 |
|
1543 <li>All characters other than those of General Category M* have the combining class 0. <ul> |
|
1544 |
|
1545 <li>Currently, all characters other than those of General Category Mn have the value 0. |
|
1546 |
|
1547 However, some characters of General Category Me or Mc may be given non-zero values in the |
|
1548 |
|
1549 future. </li> |
|
1550 |
|
1551 <li>The precise values above the value 0 are not invariant--only the relative ordering is |
|
1552 |
|
1553 considered normative. For example, it is not guaranteed in future versions that the class |
|
1554 |
|
1555 of U+05B4 will be precisely 14. </li> |
|
1556 |
|
1557 </ul> |
|
1558 |
|
1559 </li> |
|
1560 |
|
1561 </ul> |
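<p>The following sketch suggests why only the relative ordering of the non-zero classes matters. It reorders a string once with the class values reported by Python's <tt>unicodedata</tt> module (which reflects a later version of the database than this document) and once with compressed rank values. The function names are assumptions of the example.</p>

```python
import unicodedata


def canonical_reorder(text, ccc=unicodedata.combining):
    """Stable-sort each run of non-zero-class characters by combining class."""
    out, run = [], []
    for ch in text:
        if ccc(ch) == 0:
            out.extend(sorted(run, key=ccc))  # sorted() is stable
            out.append(ch)
            run = []
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)


def compressed_ccc(chars):
    """Build a class function that maps the classes in use to small ranks."""
    used = sorted({unicodedata.combining(c) for c in chars} - {0})
    rank = {value: index + 1 for index, value in enumerate(used)}
    rank[0] = 0  # class 0 must stay 0 so that starters still end a run
    return lambda ch: rank[unicodedata.combining(ch)]


TEXT = "a\u0301\u0323"  # a + COMBINING ACUTE (230) + COMBINING DOT BELOW (220)
print(canonical_reorder(TEXT) == canonical_reorder(TEXT, compressed_ccc(TEXT)))
```

<p>Both calls place the dot below ahead of the acute, because the ranks 1 and 2 preserve the order of 220 and 230.</p>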
|
1562 |
|
1563 |
|
1564 |
|
1565 <h4>Case</h4> |
|
1566 |
|
1567 |
|
1568 |
|
1569 <ul> |
|
1570 |
|
1571 <li>Characters of type Lu, Lt, or Ll are called <i>cased</i>. All characters with an Upper, |
|
1572 |
|
1573 Lower, or Titlecase mapping are cased characters. <ul> |
|
1574 |
|
1575 <li>However, characters with the General Categories of Lu, Ll, or Lt may not always have |
|
1576 |
|
1577 case mappings, and case mappings may vary by locale. (See |
|
1578 |
|
1579 ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt). </li> |
|
1580 |
|
1581 </ul> |
|
1582 |
|
1583 </li> |
|
1584 |
|
1585 </ul> |
|
1586 |
|
1587 |
|
1588 |
|
1589 <h4>Canonical Decomposition</h4> |
|
1590 |
|
1591 |
|
1592 |
|
1593 <ul> |
|
1594 |
|
1595 <li>Canonical mappings are always in canonical order. </li> |
|
1596 |
|
1597 <li>In a canonical mapping to a pair of characters, only the first of the pair may decompose further. </li> |
|
1598 |
|
1599 <li>Canonical decompositions are "transparent" to other character data: <ul> |
|
1600 |
|
1601 <li><tt>BIDI(a) = BIDI(principal(canonicalDecomposition(a)))</tt> </li> |
|
1602 |
|
1603 <li><tt>Category(a) = Category(principal(canonicalDecomposition(a)))</tt> </li> |
|
1604 |
|
1605 <li><tt>CombiningClass(a) = CombiningClass(principal(canonicalDecomposition(a)))</tt><br> |
|
1606 |
|
1607 where principal(a) is the first character not of type Mn, or the first character if all |
|
1608 |
|
1609 characters are of type Mn. </li> |
|
1610 |
|
1611 </ul> |
|
1612 |
|
1613 </li> |
|
1614 |
|
1615 <li>However, because there are sometimes missing case pairs, and because of some legacy |
|
1616 |
|
1617 characters, it is only generally true that: <ul> |
|
1618 |
|
1619 <li><tt>upper(canonicalDecomposition(a)) = canonicalDecomposition(upper(a))</tt> </li> |
|
1620 |
|
1621 <li><tt>lower(canonicalDecomposition(a)) = canonicalDecomposition(lower(a))</tt> </li> |
|
1622 |
|
1623 <li><tt>title(canonicalDecomposition(a)) = canonicalDecomposition(title(a))</tt> </li> |
|
1624 |
|
1625 </ul> |
|
1626 |
|
1627 </li> |
|
1628 |
|
1629 </ul> |
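<p>The transparency invariants above can be spot-checked for individual characters. The sketch below uses Python's <tt>unicodedata</tt> module, which carries a later version of the database than the one described here, so it is an approximation rather than a test of this exact release. The helper names are hypothetical, and <tt>canonical_decomposition</tt> applies only the single mapping listed for the character.</p>

```python
import unicodedata


def canonical_decomposition(ch):
    """Return the listed canonical mapping of ch, or ch itself if none."""
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        return ch  # no mapping, or a tagged compatibility mapping
    return "".join(chr(int(code, 16)) for code in mapping.split())


def principal(chars):
    """First character not of type Mn, or the first one if all are Mn."""
    for ch in chars:
        if unicodedata.category(ch) != "Mn":
            return ch
    return chars[0]


def is_transparent(ch):
    """Check the three invariants for a single character."""
    p = principal(canonical_decomposition(ch))
    return (
        unicodedata.bidirectional(ch) == unicodedata.bidirectional(p)
        and unicodedata.category(ch) == unicodedata.category(p)
        and unicodedata.combining(ch) == unicodedata.combining(p)
    )


print(is_transparent("\u00C9"))  # 00C9 LATIN CAPITAL LETTER E WITH ACUTE
```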
|
1630 |
|
1631 |
|
1632 |
|
1633 <h2><a NAME="Modification History"></a>Modification History</h2> |
|
1634 |
|
1635 |
|
1636 |
|
1637 <p>This section provides a summary of the changes between update versions of the Unicode |
|
1638 |
|
1639 Standard.</p> |
|
1640 |
|
1641 |
|
1642 |
|
1643 <h3><a href="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 3.0.0"> Unicode 3.0.0</a></h3> |
|
1644 |
|
1645 |
|
1646 |
|
1647 <p>Modifications made for Version 3.0.0 of UnicodeData.txt include many new characters and |
|
1648 |
|
1649 a number of property changes. These are summarized in Appendix D of <em>The Unicode |
|
1650 |
|
1651 Standard, Version 3.0.</em></p> |
|
1652 |
|
1653 |
|
1654 |
|
1655 <h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.9">Unicode 2.1.9</a> </h3> |
|
1656 |
|
1657 |
|
1658 |
|
1659 <p>Modifications made for Version 2.1.9 of UnicodeData.txt include: |
|
1660 |
|
1661 |
|
1662 |
|
1663 <ul> |
|
1664 |
|
1665 <li>Corrected combining class for U+05AE HEBREW ACCENT ZINOR. </li> |
|
1666 |
|
1667 <li>Corrected combining class for U+20E1 COMBINING LEFT RIGHT ARROW ABOVE. </li> |
|
1668 |
|
1669 <li>Corrected combining class for U+0F35 and U+0F37 to 220. </li> |
|
1670 |
|
1671 <li>Corrected combining class for U+0F71 to 129. </li> |
|
1672 |
|
1673 <li>Added a decomposition for U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR. </li> |
|
1674 |
|
1675 <li>Added decompositions for several Greek symbol letters: U+03D0..U+03D2, U+03D5, |
|
1676 |
|
1677 U+03D6, U+03F0..U+03F2. </li> |
|
1678 |
|
1679 <li>Removed decompositions from the conjoining jamo block: U+1100..U+11F8. </li> |
|
1680 |
|
1681 <li>Changes to decomposition mappings for some Tibetan vowels for consistency in |
|
1682 |
|
1683 normalization. (U+0F71, U+0F73, U+0F77, U+0F79, U+0F81) </li> |
|
1684 |
|
1685 <li>Updated the decomposition mappings for several Vietnamese characters with two diacritics |
|
1686 |
|
1687 (U+1EAC, U+1EAD, U+1EB6, U+1EB7, U+1EC6, U+1EC7, U+1ED8, U+1ED9), so that the recursive |
|
1688 |
|
1689 decomposition can be generated directly in canonically reordered form (not a normative |
|
1690 |
|
1691 change). </li> |
|
1692 |
|
1693 <li>Updated the decomposition mappings for several Arabic compatibility characters involving |
|
1694 |
|
1695 shadda (U+FC5E..U+FC62, U+FCF2..U+FCF4), and two Latin characters (U+1E1C, U+1E1D), so |
|
1696 |
|
1697 that the decompositions are generated directly in canonically reordered form (not a |
|
1698 |
|
1699 normative change). </li> |
|
1700 |
|
1701 <li>Changed BIDI category for: U+00A0 NO-BREAK SPACE, U+2007 FIGURE SPACE, U+2028 LINE |
|
1702 |
|
1703 SEPARATOR. </li> |
|
1704 |
|
1705 <li>Changed BIDI category for extenders of General Category Lm: U+3005, U+3021..U+3035, |
|
1706 |
|
1707 U+FF9E, U+FF9F. </li> |
|
1708 |
|
1709 <li>Changed General Category and BIDI category for the Greek numeral signs: U+0374, U+0375. </li> |
|
1710 |
|
1711 <li>Corrected General Category for U+FFE8 HALFWIDTH FORMS LIGHT VERTICAL. </li> |
|
1712 |
|
1713 <li>Added Unicode 1.0 names for many Tibetan characters (informative). </li> |
|
1714 |
|
1715 </ul> |
|
1716 |
|
1717 |
|
1718 |
|
1719 <h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.8">Unicode 2.1.8</a> </h3> |
|
1720 |
|
1721 |
|
1722 |
|
1723 <p>Modifications made for Version 2.1.8 of UnicodeData.txt include: |
|
1724 |
|
1725 |
|
1726 |
|
1727 <ul> |
|
1728 |
|
1729 <li>Added combining class 240 for U+0345 COMBINING GREEK YPOGEGRAMMENI so that |
|
1730 |
|
1731 decompositions involving iota subscript are derivable directly in canonically reordered |
|
1732 |
|
1733 form; this also has a bearing on simplification of casing of polytonic Greek. </li> |
|
1734 |
|
1735 <li>Changes in decompositions related to Greek tonos. These result from the clarification |
|
1736 |
|
1737 that monotonic Greek "tonos" should be equated with U+0301 COMBINING ACUTE, |
|
1738 |
|
1739 rather than with U+030D COMBINING VERTICAL LINE ABOVE. (All Greek characters in the Greek |
|
1740 |
|
1741 block involving "tonos"; some Greek characters in the polytonic Greek in the |
|
1742 |
|
1743 1FXX block.) </li> |
|
1744 |
|
1745 <li>Changed decompositions involving dialytika tonos. (U+0390, U+03B0) </li> |
|
1746 |
|
1747 <li>Changed ternary decompositions to binary. (U+0CCB, U+FB2C, U+FB2D) These changes |
|
1748 |
|
1749 simplify normalization. </li> |
|
1750 |
|
1751 <li>Removed canonical decomposition for Latin Candrabindu. (U+0310) </li> |
|
1752 |
|
1753 <li>Corrected error in canonical decomposition for U+1FF4. </li> |
|
1754 |
|
1755 <li>Added compatibility decompositions to clarify collation tables. (U+2100, U+2101, U+2105, |
|
1756 |
|
1757 U+2106, U+1E9A) </li> |
|
1758 |
|
1759 <li>A series of general category changes to assist the convergence of the Unicode definition |
|
1760 |
|
1761 of identifier with ISO TR 10176: <ul> |
|
1762 |
|
1763 <li>So > Lo: U+0950, U+0AD0, U+0F00, U+0F88..U+0F8B </li> |
|
1764 |
|
1765 <li>Po > Lo: U+0E2F, U+0EAF, U+3006 </li> |
|
1766 |
|
1767 <li>Lm > Sk: U+309B, U+309C </li> |
|
1768 |
|
1769 <li>Po > Pc: U+30FB, U+FF65 </li> |
|
1770 |
|
1771 <li>Ps/Pe > Mn: U+0F3E, U+0F3F </li> |
|
1772 |
|
1773 </ul> |
|
1774 |
|
1775 </li> |
|
1776 |
|
1777 <li>A series of bidi property changes for consistency. <ul> |
|
1778 |
|
1779 <li>L > ET: U+09F2, U+09F3 </li> |
|
1780 |
|
1781 <li>ON > L: U+3007 </li> |
|
1782 |
|
1783 <li>L > ON: U+0F3A..U+0F3D, U+037E, U+0387 </li> |
|
1784 |
|
1785 </ul> |
|
1786 |
|
1787 </li> |
|
1788 |
|
1789 <li>Added case mapping: U+01A6 <-> U+0280 </li> |
|
1790 |
|
1791 <li>Updated symmetric swapping value for guillemets: U+00AB, U+00BB, U+2039, U+203A. </li> |
|
1792 |
|
1793 <li>Changes to combining class values. Most Indic fixed position class non-spacing marks |
|
1794 |
|
1795 were changed to combining class 0. This fixes some inconsistencies in how canonical |
|
1796 |
|
1797 reordering would apply to Indic scripts, including Tibetan. Indic interacting top/bottom |
|
1798 |
|
1799 fixed position classes were merged into single (non-zero) classes as part of this change. |
|
1800 |
|
1801 Tibetan subjoined consonants are changed from combining class 6 to combining class 0. Thai |
|
1802 |
|
1803 pinthu (U+0E3A) moved to combining class 9. Moved two Devanagari stress marks into generic |
|
1804 |
|
1805 above and below combining classes (U+0951, U+0952). </li> |
|
1806 |
|
1807 <li>Corrected placement of semicolon near symmetric swapping field. (U+FA0E, etc., scattered |
|
1808 |
|
1809 positions to U+FA29) </li> |
|
1810 |
|
1811 </ul> |
|
1812 |
|
1813 |
|
1814 |
|
1815 <h3>Version 2.1.7</h3> |
|
1816 |
|
1817 |
|
1818 |
|
1819 <p><i>This version was for internal change tracking only, and never publicly released.</i></p> |
|
1820 |
|
1821 |
|
1822 |
|
1823 <h3>Version 2.1.6</h3> |
|
1824 |
|
1825 |
|
1826 |
|
1827 <p><i>This version was for internal change tracking only, and never publicly released.</i></p> |
|
1828 |
|
1829 |
|
1830 |
|
1831 <h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.5">Unicode 2.1.5</a> </h3> |
|
1832 |
|
1833 |
|
1834 |
|
1835 <p>Modifications made for Version 2.1.5 of UnicodeData.txt include: |
|
1836 |
|
1837 |
|
1838 |
|
1839 <ul> |
|
1840 |
|
1841 <li>Changed decomposition for U+FF9E and U+FF9F so that correct collation weighting will |
|
1842 |
|
1843 automatically result from the canonical equivalences. </li> |
|
1844 |
|
1845 <li>Removed canonical decompositions for U+04D4, U+04D5, U+04D8, U+04D9, U+04E0, U+04E1, |
|
1846 |
|
1847 U+04E8, U+04E9 (the implication being that no canonical equivalence is claimed between |
|
1848 |
|
1849 these 8 characters and similar Latin letters), and updated 4 canonical decompositions for |
|
1850 |
|
1851 U+04DB, U+04DC, U+04EA, U+04EB to reflect the implied difference in the base character. </li> |
|
1852 |
|
1853 <li>Added Pi and Pf categories and assigned the relevant quotation marks to those |
|
1854 |
|
1855 categories, based on the Unicode Technical Corrigendum on Quotation Characters. </li> |
|
1856 |
|
1857 <li>Updated many bidi properties, following the advice of the ad hoc committee on bidi, |
|
1858 |
|
1859 and to make the bidi properties of compatibility characters more consistent. </li> |
|
1860 |
|
1861 <li>Changed category of several Tibetan characters: U+0F3E, U+0F3F, U+0F88..U+0F8B to make |
|
1862 |
|
1863 them non-combining, reflecting the combined opinion of Tibetan experts. </li> |
|
1864 |
|
1865 <li>Added case mapping for U+03F2. </li> |
|
1866 |
|
1867 <li>Corrected case mapping for U+0275. </li> |
|
1868 |
|
1869 <li>Added titlecase mappings for U+03D0, U+03D1, U+03D5, U+03D6, U+03F0..U+03F2. </li> |
|
1870 |
|
1871 <li>Corrected compatibility label for U+2121. </li> |
|
1872 |
|
1873 <li>Added specific entries for all the CJK compatibility ideographs, U+F900..U+FA2D, so the |
|
1874 |
|
1875 canonical decomposition for each (the URO character it is equivalent to) can be carried in |
|
1876 |
|
1877 the database. </li> |
|
1878 |
|
1879 </ul> |
|
1880 |
|
1881 |
|
1882 |
|
1883 <h3>Version 2.1.4</h3> |
|
1884 |
|
1885 |
|
1886 |
|
1887 <p><i>This version was for internal change tracking only, and never publicly released.</i></p> |
|
1888 |
|
1889 |
|
1890 |
|
1891 <h3>Version 2.1.3</h3> |
|
1892 |
|
1893 |
|
1894 |
|
1895 <p><i>This version was for internal change tracking only, and never publicly released.</i></p> |
|
1896 |
|
1897 |
|
1898 |
|
1899 <h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.2">Unicode 2.1.2</a> </h3> |
|
1900 |
|
1901 |
|
1902 |
|
1903 <p>Modifications made in updating UnicodeData.txt to Version 2.1.2 for the Unicode |
|
1904 |
|
1905 Standard, Version 2.1 (from Version 2.0) include: |
|
1906 |
|
1907 |
|
1908 |
|
1909 <ul> |
|
1910 |
|
1911 <li>Added two characters (U+20AC and U+FFFC). </li> |
|
1912 |
|
1913 <li>Amended bidi properties for U+0026, U+002E, U+0040, U+2007. </li> |
|
1914 |
|
1915 <li>Corrected case mappings for U+018E, U+019F, U+01DD, U+0258, U+0275, U+03C2, U+1E9B. </li> |
|
1916 |
|
1917 <li>Changed combining order class for U+0F71. </li> |
|
1918 |
|
1919 <li>Corrected canonical decompositions for U+0F73, U+1FBE. </li> |
|
1920 |
|
1921 <li>Changed decomposition for U+FB1F from compatibility to canonical. </li> |
|
1922 |
|
1923 <li>Added compatibility decompositions for U+FBE8, U+FBE9, U+FBF9..U+FBFB. </li> |
|
1924 |
|
1925 <li>Corrected compatibility decompositions for U+2469, U+246A, U+3358. </li> |
|
1926 |
|
1927 </ul> |
|
1928 |
|
1929 |
|
1930 |
|
1931 <h3>Version 2.1.1</h3> |
|
1932 |
|
1933 |
|
1934 |
|
1935 <p><i>This version was for internal change tracking only, and never publicly released.</i></p> |
|
1936 |
|
1937 |
|
1938 |
|
1939 <h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.0.0">Unicode 2.0.0</a> </h3> |
|
1940 |
|
1941 |
|
1942 |
|
1943 <p>The modifications made in updating UnicodeData.txt for the Unicode |
|
1944 |
|
1945 Standard, Version 2.0 include: |
|
1946 |
|
1947 |
|
1948 |
|
1949 <ul> |
|
1950 |
|
1951 <li>Fixed decompositions with TONOS to use correct NSM: 030D. </li> |
|
1952 |
|
1953 <li>Removed old Hangul Syllables; mappings to new characters are in a separate table. </li> |
|
1954 |
|
1955 <li>Marked compatibility decompositions with additional tags. </li> |
|
1956 |
|
1957 <li>Changed old tag names for clarity. </li> |
|
1958 |
|
1959 <li>Revision of decompositions to use first-level decomposition, instead of maximal |
|
1960 |
|
1961 decomposition. </li> |
|
1962 |
|
1963 <li>Correction of all known errors in decompositions from earlier versions. </li> |
|
1964 |
|
1965 <li>Added control code names (as old Unicode names). </li> |
|
1966 |
|
1967 <li>Added Hangul Jamo decompositions. </li> |
|
1968 |
|
1969 <li>Added Number category to match properties list in book. </li> |
|
1970 |
|
1971 <li>Fixed categories of Koranic Arabic marks. </li> |
|
1972 |
|
1973 <li>Fixed categories of precomposed characters to match decomposition where possible. </li> |
|
1974 |
|
1975 <li>Added Hebrew cantillation marks and the Tibetan script. </li> |
|
1976 |
|
1977 <li>Added place holders for ranges such as CJK Ideographic Area and the Private Use Area. </li> |
|
1978 |
|
1979 <li>Added categories Me, Sk, Pc, Nl, Cs, Cf, and rectified a number of mistakes in the |
|
1980 |
|
1981 database. </li> |
|
1982 |
|
1983 </ul> |
|
1984 |
|
1985 </body> |
|
1986 |
|
1987 </html> |
|
1988 |