toolsandutils/e32tools/readtype/unicodedata.html
changeset 0 83f4b4db085c
       
     1 <html>
       
     2 
       
     3 
       
     4 
       
     5 <head>
       
     6 
       
     7 <meta NAME="GENERATOR" CONTENT="Microsoft FrontPage 4.0">
       
     8 
       
     9 <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
       
    10 
       
    11 <link REL="stylesheet" HREF="http://www.unicode.org/unicode.css" TYPE="text/css">
       
    12 
       
    13 <title>UnicodeData File Format</title>
       
    14 
       
    15 </head>
       
    16 
       
    17 
       
    18 
       
    19 <body>
       
    20 
       
    21 
       
    22 
       
    23 <h1>UnicodeData File Format<br> 
       
    24 Version 3.0.0</h1>
       
    25 
       
    26 
       
    27 
       
    28 <table BORDER="1" CELLSPACING="2" CELLPADDING="0" HEIGHT="87" WIDTH="100%">
       
    29 
       
    30   <tr>
       
    31 
       
    32     <td VALIGN="TOP" width="144">Revision</td>
       
    33 
       
    34     <td VALIGN="TOP">3.0.0</td>
       
    35 
       
    36   </tr>
       
    37 
       
    38   <tr>
       
    39 
       
    40     <td VALIGN="TOP" width="144">Authors</td>
       
    41 
       
    42     <td VALIGN="TOP">Mark Davis and Ken Whistler</td>
       
    43 
       
    44   </tr>
       
    45 
       
    46   <tr>
       
    47 
       
    48     <td VALIGN="TOP" width="144">Date</td>
       
    49 
       
    50     <td VALIGN="TOP">1999-09-12</td>
       
    51 
       
    52   </tr>
       
    53 
       
    54   <tr>
       
    55 
       
    56     <td VALIGN="TOP" width="144">This Version</td>
       
    57 
       
    58     <td VALIGN="TOP"><a href="ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html">ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html</a></td>
       
    59 
       
    60   </tr>
       
    61 
       
    62   <tr>
       
    63 
       
    64     <td VALIGN="TOP" width="144">Previous Version</td>
       
    65 
       
    66     <td VALIGN="TOP">n/a</td>
       
    67 
       
    68   </tr>
       
    69 
       
    70   <tr>
       
    71 
       
    72     <td VALIGN="TOP" width="144">Latest Version</td>
       
    73 
       
    74     <td VALIGN="TOP"><a href="ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html">ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html</a></td>
       
    75 
       
    76   </tr>
       
    77 
       
    78 </table>
       
    79 
       
    80 
       
    81 
       
    82 <p align="center">Copyright © 1995-1999 Unicode, Inc. All Rights reserved.<br>    
       
    83     
       
     84 <i>For more information, including Disclaimer and Limitations, see <a HREF="UnicodeCharacterDatabase-3.0.0.html">UnicodeCharacterDatabase-3.0.0.html</a> </i></p>   
       
    85    
       
    86    
       
    87    
       
    88 <p>This document describes the format of the UnicodeData.txt file, which is one of the    
       
    89    
       
    90 files in the Unicode Character Database. The document is divided into the following    
       
    91    
       
    92 sections:    
       
    93    
       
    94    
       
    95    
       
    96 <ul>   
       
    97    
       
    98   <li><a HREF="#Field Formats">Field Formats</a> <ul>   
       
    99    
       
   100       <li><a HREF="#General Category">General Category</a> </li>   
       
   101    
       
   102       <li><a HREF="#Bidirectional Category">Bidirectional Category</a> </li>   
       
   103    
       
   104       <li><a HREF="#Character Decomposition">Character Decomposition Mapping</a> </li>  
       
   105   
       
   106       <li><a HREF="#Canonical Combining Classes">Canonical Combining Classes</a> </li>  
       
   107   
       
   108       <li><a HREF="#Decompositions and Normalization">Decompositions and Normalization</a> </li>  
       
   109   
       
   110       <li><a HREF="#Case Mappings">Case Mappings</a> </li>  
       
   111   
       
   112     </ul>  
       
   113   
       
   114   </li>  
       
   115   
       
   116   <li><a HREF="#Property Invariants">Property Invariants</a> </li>  
       
   117   
       
   118   <li><a HREF="#Modification History">Modification History</a> </li>  
       
   119   
       
   120 </ul>  
       
   121   
       
   122   
       
   123   
       
   124 <p><b>Warning: </b>the information in this file does not completely describe the use and   
       
   125   
       
   126 interpretation of Unicode character properties and behavior. It must be used in   
       
   127   
       
   128 conjunction with the data in the other files in the Unicode Character Database, and relies   
       
   129   
       
   130 on the notation and definitions supplied in <i><a href="http://www.unicode.org/unicode/standard/versions/Unicode3.0.html"> The Unicode 
       
   131 Standard</a></i>. All chapter references   
       
   132   
       
   133 are to Version 3.0 of the standard.</p>  
       
   134   
       
   135   
       
   136   
       
   137 <h2><a NAME="Field Formats"></a>Field Formats</h2>    
       
   138     
       
   139     
       
   140     
       
   141 <p>The file consists of lines containing fields terminated by semicolons. Each line     
       
   142     
       
   143 represents the data for one encoded character in the Unicode Standard. Every encoded     
       
   144     
       
   145 character has a data entry, with the exception of certain special ranges, as detailed     
       
   146     
       
   147 below.     
       
   148     
       
   149     
       
   150     
       
   151 <ul>    
       
   152     
       
    153   <li>There are seven special ranges of characters that are represented only by their start and     
       
   154     
       
   155     end characters, since the properties in the file are uniform, except for code values     
       
   156     
       
   157     (which are all sequential and assigned). </li>    
       
   158     
       
   159   <li>The names of CJK ideograph characters and the names and decompositions of Hangul     
       
   160     
       
   161     syllable characters are algorithmically derivable. (See the Unicode Standard and <a    
       
   162     
       
   163     HREF="http://www.unicode.org/unicode/reports/tr15/">Unicode Technical Report #15</a> for     
       
   164     
       
   165     more information). </li>    
       
   166     
       
   167   <li>Surrogate code values and private use characters have no names. </li>    
       
   168     
       
    169   <li>The Private Use characters outside of the BMP (U+F0000..U+FFFFD, U+100000..U+10FFFD) are     
       
   170     
       
   171     not listed. These correspond to surrogate pairs where the first surrogate is in the High     
       
   172     
       
   173     Surrogate Private Use section. </li>    
       
   174     
       
   175 </ul>    
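<p>The relationship between the unlisted Private Use characters outside the BMP and the High
Surrogate Private Use section can be checked with a little arithmetic. The following is a
minimal Python sketch, not part of the Unicode Character Database; the function names
<code>to_surrogate_pair</code> and <code>is_private_use_high_surrogate</code> are hypothetical
and chosen only for illustration. It assumes the standard UTF-16 surrogate computation.</p>

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a code point outside the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits select the low surrogate
    return high, low


def is_private_use_high_surrogate(code_unit):
    """True if the code unit lies in the Private Use High Surrogates range."""
    return 0xDB80 <= code_unit <= 0xDBFF


# The first and last unlisted Private Use characters outside the BMP.
first_high, _ = to_surrogate_pair(0xF0000)
last_high, _ = to_surrogate_pair(0x10FFFD)
print(hex(first_high), hex(last_high))
```

<p>Under these assumptions, U+F0000 maps to the high surrogate U+DB80 and U+10FFFD maps to
U+DBFF, the two ends of the Private Use High Surrogates range.</p>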
       
   176     
       
   177     
       
   178     
       
   179 <p>The exact ranges represented by start and end characters are:     
       
   180     
       
   181     
       
   182     
       
   183 <ul>    
       
   184     
       
   185   <li>CJK Ideographs Extension A (U+3400 - U+4DB5) </li>    
       
   186     
       
   187   <li>CJK Ideographs (U+4E00 - U+9FA5) </li>    
       
   188     
       
   189   <li>Hangul Syllables (U+AC00 - U+D7A3) </li>    
       
   190     
       
   191   <li>Non-Private Use High Surrogates (U+D800 - U+DB7F) </li>    
       
   192     
       
   193   <li>Private Use High Surrogates (U+DB80 - U+DBFF) </li>    
       
   194     
       
   195   <li>Low Surrogates (U+DC00 - U+DFFF) </li>    
       
   196     
       
   197   <li>The Private Use Area (U+E000 - U+F8FF) </li>    
       
   198     
       
   199 </ul>    
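<p>Because these ranges have no individual entries, a consumer of the file typically needs a
range lookup alongside the per-character data. The following Python sketch shows one way to do
that using the ranges listed above; the names <code>SPECIAL_RANGES</code> and
<code>special_range_for</code> are hypothetical and are not defined by the Unicode Character
Database.</p>

```python
# (first, last, label) for each range represented only by start and end characters.
SPECIAL_RANGES = [
    (0x3400, 0x4DB5, "CJK Ideographs Extension A"),
    (0x4E00, 0x9FA5, "CJK Ideographs"),
    (0xAC00, 0xD7A3, "Hangul Syllables"),
    (0xD800, 0xDB7F, "Non-Private Use High Surrogates"),
    (0xDB80, 0xDBFF, "Private Use High Surrogates"),
    (0xDC00, 0xDFFF, "Low Surrogates"),
    (0xE000, 0xF8FF, "The Private Use Area"),
]


def special_range_for(code_value):
    """Return the label of the special range containing code_value, or None."""
    for first, last, label in SPECIAL_RANGES:
        if first <= code_value <= last:
            return label
    return None


print(special_range_for(0x4E00))   # a code value inside the CJK Ideographs range
print(special_range_for(0x0041))   # a code value with its own entry in the file
```

<p>A linear scan is adequate for so few ranges; a larger table would more likely use a sorted
list and binary search.</p>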
       
   200     
       
   201     
       
   202     
       
   203 <p>The following table describes the format and meaning of each field in a data entry in     
       
   204     
       
   205 the UnicodeData file. Fields which contain normative information are so indicated.</p>    
       
   206     
       
   207     
       
   208     
       
   209 <table BORDER="1" CELLSPACING="2" CELLPADDING="2">    
       
   210     
       
   211   <tr>    
       
   212     
       
   213     <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Field</th>    
       
   214     
       
   215     <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Name</th>    
       
   216     
       
   217     <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Status</th>    
       
   218     
       
   219     <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Explanation</th>    
       
   220     
       
   221   </tr>    
       
   222     
       
   223   <tr>    
       
   224     
       
   225     <th VALIGN="top">0</th>    
       
   226     
       
   227     <td VALIGN="top">Code value</td>    
       
   228     
       
   229     <td VALIGN="top">normative</td>    
       
   230     
       
   231     <td VALIGN="top">Code value in 4-digit hexadecimal format.</td>    
       
   232     
       
   233   </tr>    
       
   234     
       
   235   <tr>    
       
   236     
       
   237     <th VALIGN="top">1</th>    
       
   238     
       
   239     <td VALIGN="top">Character name</td>    
       
   240     
       
   241     <td VALIGN="top">normative</td>    
       
   242     
       
   243     <td VALIGN="top">These names match exactly the names published in Chapter 14 of the     
       
   244     
       
   245     Unicode Standard, Version 3.0.</td>    
       
   246     
       
   247   </tr>    
       
   248     
       
   249   <tr>    
       
   250     
       
   251     <th VALIGN="top">2</th>    
       
   252     
       
   253     <td VALIGN="top"><a HREF="#General Category">General Category</a> </td>    
       
   254     
       
   255     <td VALIGN="top">normative / informative<br>    
       
   256     
       
   257     (see below)</td>    
       
   258     
       
   259     <td VALIGN="top">This is a useful breakdown into various &quot;character types&quot; which     
       
   260     
       
   261     can be used as a default categorization in implementations. See below for a brief     
       
   262     
       
   263     explanation.</td>    
       
   264     
       
   265   </tr>    
       
   266     
       
   267   <tr>    
       
   268     
       
   269     <th VALIGN="top">3</th>    
       
   270     
       
   271     <td VALIGN="top"><a HREF="#Canonical Combining Classes">Canonical Combining Classes</a> </td>    
       
   272     
       
   273     <td VALIGN="top">normative</td>    
       
   274     
       
   275     <td VALIGN="top">The classes used for the Canonical Ordering Algorithm in the Unicode     
       
   276     
       
   277     Standard. These classes are also printed in Chapter 4 of the Unicode Standard.</td>    
       
   278     
       
   279   </tr>    
       
   280     
       
   281   <tr>    
       
   282     
       
   283     <th VALIGN="top">4</th>    
       
   284     
       
   285     <td VALIGN="top"><a HREF="#Bidirectional Category">Bidirectional Category</a> </td>    
       
   286     
       
   287     <td VALIGN="top">normative</td>    
       
   288     
       
   289     <td VALIGN="top">See the list below for an explanation of the abbreviations used in this     
       
   290     
       
   291     field. These are the categories required by the Bidirectional Behavior Algorithm in the     
       
   292     
       
   293     Unicode Standard. These categories are summarized in Chapter 3 of the Unicode Standard.</td>    
       
   294     
       
   295   </tr>    
       
   296     
       
   297   <tr>    
       
   298     
       
   299     <th VALIGN="top">5</th>    
       
   300     
       
   301     <td VALIGN="top"><a HREF="#Character Decomposition">Character Decomposition  
       
   302       Mapping</a></td>   
       
   303    
       
   304     <td VALIGN="top">normative</td>   
       
   305    
       
   306     <td VALIGN="top">In the Unicode Standard, not all of the mappings are full (maximal)    
       
   307    
       
   308     decompositions. Recursive application of look-up for decompositions will, in all cases,    
       
   309    
       
   310     lead to a maximal decomposition. The decomposition mappings match exactly the    
       
   311    
       
   312     decomposition mappings published with the character names in the Unicode Standard.</td>   
       
   313    
       
   314   </tr>   
       
   315    
       
   316   <tr>   
       
   317    
       
   318     <th VALIGN="top">6</th>   
       
   319    
       
   320     <td VALIGN="top">Decimal digit value</td>   
       
   321    
       
   322     <td VALIGN="top">normative</td>   
       
   323    
       
   324     <td VALIGN="top">This is a numeric field. If the character has the decimal digit property,    
       
   325    
       
   326     as specified in Chapter 4 of the Unicode Standard, the value of that digit is represented    
       
   327    
       
    328     with an integer value in this field.</td>   
       
   329    
       
   330   </tr>   
       
   331    
       
   332   <tr>   
       
   333    
       
   334     <th VALIGN="top">7</th>   
       
   335    
       
   336     <td VALIGN="top">Digit value</td>   
       
   337    
       
   338     <td VALIGN="top">normative</td>   
       
   339    
       
   340     <td VALIGN="top">This is a numeric field. If the character represents a digit, not    
       
   341    
       
   342     necessarily a decimal digit, the value is here. This covers digits which do not form    
       
   343    
       
    344     decimal radix forms, such as the compatibility superscript digits.</td>   
       
   345    
       
   346   </tr>   
       
   347    
       
   348   <tr>   
       
   349    
       
   350     <th VALIGN="top">8</th>   
       
   351    
       
   352     <td VALIGN="top">Numeric value</td>   
       
   353    
       
   354     <td VALIGN="top">normative</td>   
       
   355    
       
   356     <td VALIGN="top">This is a numeric field. If the character has the numeric property, as    
       
   357    
       
   358     specified in Chapter 4 of the Unicode Standard, the value of that character is represented    
       
   359    
       
   360     with an integer or rational number in this field. This includes fractions as, e.g.,    
       
   361    
       
    362     &quot;1/5&quot; for U+2155 VULGAR FRACTION ONE FIFTH. Also included are numerical values    
       
   363    
       
   364     for compatibility characters such as circled numbers.</td>   
       
   365    
       
   366   </tr>   
       
   367    
       
   368   <tr>   
       
   369    
       
    370     <th VALIGN="top">9</th>   
       
   371    
       
   372     <td VALIGN="top">Mirrored</td>   
       
   373    
       
   374     <td VALIGN="top">normative</td>   
       
   375    
       
   376     <td VALIGN="top">If the character has been identified as a &quot;mirrored&quot; character    
       
   377    
       
   378     in bidirectional text, this field has the value &quot;Y&quot;; otherwise &quot;N&quot;.    
       
   379    
       
   380     The list of mirrored characters is also printed in Chapter 4 of the Unicode Standard.</td>   
       
   381    
       
   382   </tr>   
       
   383    
       
   384   <tr>   
       
   385    
       
   386     <th VALIGN="top">10</th>   
       
   387    
       
   388     <td VALIGN="top">Unicode 1.0 Name</td>   
       
   389    
       
   390     <td VALIGN="top">informative</td>   
       
   391    
       
   392     <td VALIGN="top">This is the old name as published in Unicode 1.0. This name is only    
       
   393    
       
   394     provided when it is significantly different from the Unicode 3.0 name for the character.</td>   
       
   395    
       
   396   </tr>   
       
   397    
       
   398   <tr>   
       
   399    
       
   400     <th VALIGN="top">11</th>   
       
   401    
       
   402     <td VALIGN="top">10646 comment field</td>   
       
   403    
       
   404     <td VALIGN="top">informative</td>   
       
   405    
       
    406     <td VALIGN="top">This is the ISO 10646 comment field. It is in parentheses in the 10646    
       
   407    
       
   408     names list.</td>   
       
   409    
       
   410   </tr>   
       
   411    
       
   412   <tr>   
       
   413    
       
   414     <th VALIGN="top">12</th>   
       
   415    
       
   416     <td VALIGN="top"><a HREF="#Case Mappings">Uppercase Mapping</a></td>   
       
   417    
       
   418     <td VALIGN="top">informative</td>   
       
   419    
       
   420     <td VALIGN="top">Upper case equivalent mapping. If a character is part of an alphabet with    
       
   421    
       
   422     case distinctions, and has an upper case equivalent, then the upper case equivalent is in    
       
   423    
       
   424     this field. See the explanation below on case distinctions. These mappings are always    
       
   425    
       
   426     one-to-one, not one-to-many or many-to-one. This field is informative.</td>   
       
   427    
       
   428   </tr>   
       
   429    
       
   430   <tr>   
       
   431    
       
   432     <th VALIGN="top">13</th>   
       
   433    
       
   434     <td VALIGN="top"><a HREF="#Case Mappings">Lowercase Mapping</a></td>   
       
   435    
       
   436     <td VALIGN="top">informative</td>   
       
   437    
       
    438     <td VALIGN="top">Similar to Uppercase mapping.</td>    
       
   439     
       
   440   </tr>    
       
   441     
       
   442   <tr>    
       
   443     
       
   444     <th VALIGN="top">14</th>    
       
   445     
       
   446     <td VALIGN="top"><a HREF="#Case Mappings">Titlecase Mapping</a></td>   
       
   447    
       
   448     <td VALIGN="top">informative</td>   
       
   449    
       
    450     <td VALIGN="top">Similar to Uppercase mapping.</td>    
       
   451     
       
   452   </tr>    
       
   453     
       
   454 </table>    
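<p>The field layout in the table above can be read with a simple semicolon split. The
following Python sketch is illustrative only: the key names in <code>FIELD_NAMES</code> and the
function <code>parse_entry</code> are hypothetical, and the sketch assumes that each line
carries exactly the fifteen fields numbered 0 to 14, that the combining class is written as a
decimal integer, and that the case mappings are hexadecimal code values.</p>

```python
from fractions import Fraction

# Hypothetical key names, one per field number 0..14 in the table above.
FIELD_NAMES = [
    "code_value", "name", "general_category", "combining_class",
    "bidi_category", "decomposition", "decimal_digit", "digit",
    "numeric", "mirrored", "unicode_1_name", "iso_comment",
    "uppercase", "lowercase", "titlecase",
]


def parse_entry(line):
    """Parse one UnicodeData line into a dict keyed by FIELD_NAMES."""
    fields = line.rstrip("\r\n").split(";")
    if len(fields) != len(FIELD_NAMES):
        raise ValueError("expected %d fields, got %d" % (len(FIELD_NAMES), len(fields)))
    entry = dict(zip(FIELD_NAMES, fields))
    entry["code_value"] = int(entry["code_value"], 16)
    entry["combining_class"] = int(entry["combining_class"])
    # The numeric field may hold an integer or a rational such as "1/5".
    entry["numeric"] = Fraction(entry["numeric"]) if entry["numeric"] else None
    entry["mirrored"] = entry["mirrored"] == "Y"
    # Case mappings are one-to-one, so each is a single code value or empty.
    for key in ("uppercase", "lowercase", "titlecase"):
        entry[key] = int(entry[key], 16) if entry[key] else None
    return entry


# Sample line written out for illustration.
sample = "2155;VULGAR FRACTION ONE FIFTH;No;0;ON;<fraction> 0031 2044 0035;;;1/5;N;FRACTION ONE FIFTH;;;;"
print(parse_entry(sample)["numeric"])
```

<p>Using <code>Fraction</code> keeps values such as 1/5 exact rather than rounding them to a
floating-point approximation. The decomposition and digit fields are left as strings here.</p>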
       
   455     
       
   456     
       
   457     
       
   458 <h3><a NAME="General Category"></a>General Category</h3>    
       
   459     
       
   460     
       
   461     
       
   462 <p>The values in this field are abbreviations for the following. Some of the values are     
       
   463     
       
   464 normative, and some are informative. For more information, see the Unicode Standard.</p>    
       
   465     
       
   466     
       
   467     
       
   468 <p><b>Note:</b> the standard does not assign information to control characters (except for     
       
   469     
       
   470 certain cases in the Bidirectional Algorithm). Implementations will generally also assign     
       
   471     
       
   472 categories to certain control characters, notably CR and LF, according to platform     
       
   473     
       
   474 conventions.</p>    
       
   475     
       
   476     
       
   477     
       
   478 <h4>Normative Categories</h4>    
       
   479     
       
   480     
       
   481     
       
   482 <table BORDER="0" CELLSPACING="2" CELLPADDING="0">    
       
   483     
       
   484   <tr>    
       
   485     
       
   486     <th><p ALIGN="LEFT">Abbr.</th>    
       
   487     
       
   488     <th><p ALIGN="LEFT">Description</th>    
       
   489     
       
   490   </tr>    
       
   491     
       
   492   <tr>    
       
   493     
       
   494     <td ALIGN="CENTER">Lu</td>    
       
   495     
       
   496     <td>Letter, Uppercase</td>    
       
   497     
       
   498   </tr>    
       
   499     
       
   500   <tr>    
       
   501     
       
   502     <td ALIGN="CENTER">Ll</td>    
       
   503     
       
   504     <td>Letter, Lowercase</td>    
       
   505     
       
   506   </tr>    
       
   507     
       
   508   <tr>    
       
   509     
       
   510     <td ALIGN="CENTER">Lt</td>    
       
   511     
       
   512     <td>Letter, Titlecase</td>    
       
   513     
       
   514   </tr>    
       
   515     
       
   516   <tr>    
       
   517     
       
   518     <td ALIGN="CENTER">Mn</td>    
       
   519     
       
   520     <td>Mark, Non-Spacing</td>    
       
   521     
       
   522   </tr>    
       
   523     
       
   524   <tr>    
       
   525     
       
   526     <td ALIGN="CENTER">Mc</td>    
       
   527     
       
   528     <td>Mark, Spacing Combining</td>    
       
   529     
       
   530   </tr>    
       
   531     
       
   532   <tr>    
       
   533     
       
   534     <td ALIGN="CENTER">Me</td>    
       
   535     
       
   536     <td>Mark, Enclosing</td>    
       
   537     
       
   538   </tr>    
       
   539     
       
   540   <tr>    
       
   541     
       
   542     <td ALIGN="CENTER">Nd</td>    
       
   543     
       
   544     <td>Number, Decimal Digit</td>    
       
   545     
       
   546   </tr>    
       
   547     
       
   548   <tr>    
       
   549     
       
   550     <td ALIGN="CENTER">Nl</td>    
       
   551     
       
   552     <td>Number, Letter</td>    
       
   553     
       
   554   </tr>    
       
   555     
       
   556   <tr>    
       
   557     
       
   558     <td ALIGN="CENTER">No</td>    
       
   559     
       
   560     <td>Number, Other</td>    
       
   561     
       
   562   </tr>    
       
   563     
       
   564   <tr>    
       
   565     
       
   566     <td ALIGN="CENTER">Zs</td>    
       
   567     
       
   568     <td>Separator, Space</td>    
       
   569     
       
   570   </tr>    
       
   571     
       
   572   <tr>    
       
   573     
       
   574     <td ALIGN="CENTER">Zl</td>    
       
   575     
       
   576     <td>Separator, Line</td>    
       
   577     
       
   578   </tr>    
       
   579     
       
   580   <tr>    
       
   581     
       
   582     <td ALIGN="CENTER">Zp</td>    
       
   583     
       
   584     <td>Separator, Paragraph</td>    
       
   585     
       
   586   </tr>    
       
   587     
       
   588   <tr>    
       
   589     
       
   590     <td ALIGN="CENTER">Cc</td>    
       
   591     
       
   592     <td>Other, Control</td>    
       
   593     
       
   594   </tr>    
       
   595     
       
   596   <tr>    
       
   597     
       
   598     <td ALIGN="CENTER">Cf</td>    
       
   599     
       
   600     <td>Other, Format</td>    
       
   601     
       
   602   </tr>    
       
   603     
       
   604   <tr>    
       
   605     
       
   606     <td ALIGN="CENTER">Cs</td>    
       
   607     
       
   608     <td>Other, Surrogate</td>    
       
   609     
       
   610   </tr>    
       
   611     
       
   612   <tr>    
       
   613     
       
   614     <td ALIGN="CENTER">Co</td>    
       
   615     
       
   616     <td>Other, Private Use</td>    
       
   617     
       
   618   </tr>    
       
   619     
       
   620   <tr>    
       
   621     
       
   622     <td ALIGN="CENTER">Cn</td>    
       
   623     
       
   624     <td>Other, Not Assigned (no characters in the file have this property)</td>    
       
   625     
       
   626   </tr>    
       
   627     
       
   628 </table>    
       
   629     
       
   630     
       
   631     
       
   632 <h4>Informative Categories</h4>    
       
   633     
       
   634     
       
   635     
       
   636 <table BORDER="0" CELLSPACING="2" CELLPADDING="0">    
       
   637     
       
   638   <tr>    
       
   639     
       
   640     <th><p ALIGN="LEFT">Abbr.</th>    
       
   641     
       
   642     <th><p ALIGN="LEFT">Description</th>    
       
   643     
       
   644   </tr>    
       
   645     
       
   646   <tr>    
       
   647     
       
   648     <td ALIGN="CENTER">Lm</td>    
       
   649     
       
   650     <td>Letter, Modifier</td>    
       
   651     
       
   652   </tr>    
       
   653     
       
   654   <tr>    
       
   655     
       
   656     <td ALIGN="CENTER">Lo</td>    
       
   657     
       
   658     <td>Letter, Other</td>    
       
   659     
       
   660   </tr>    
       
   661     
       
   662   <tr>    
       
   663     
       
   664     <td ALIGN="CENTER">Pc</td>    
       
   665     
       
   666     <td>Punctuation, Connector</td>    
       
   667     
       
   668   </tr>    
       
   669     
       
   670   <tr>    
       
   671     
       
   672     <td ALIGN="CENTER">Pd</td>    
       
   673     
       
   674     <td>Punctuation, Dash</td>    
       
   675     
       
   676   </tr>    
       
   677     
       
   678   <tr>    
       
   679     
       
   680     <td ALIGN="CENTER">Ps</td>    
       
   681     
       
   682     <td>Punctuation, Open</td>    
       
   683     
       
   684   </tr>    
       
   685     
       
   686   <tr>    
       
   687     
       
   688     <td ALIGN="CENTER">Pe</td>    
       
   689     
       
   690     <td>Punctuation, Close</td>    
       
   691     
       
   692   </tr>    
       
   693     
       
   694   <tr>    
       
   695     
       
   696     <td ALIGN="CENTER">Pi</td>    
       
   697     
       
   698     <td>Punctuation, Initial quote (may behave like Ps or Pe depending on usage)</td>    
       
   699     
       
   700   </tr>    
       
   701     
       
   702   <tr>    
       
   703     
       
   704     <td ALIGN="CENTER">Pf</td>    
       
   705     
       
   706     <td>Punctuation, Final quote (may behave like Ps or Pe depending on usage)</td>    
       
   707     
       
   708   </tr>    
       
   709     
       
   710   <tr>    
       
   711     
       
   712     <td ALIGN="CENTER">Po</td>    
       
   713     
       
   714     <td>Punctuation, Other</td>    
       
   715     
       
   716   </tr>    
       
   717     
       
   718   <tr>    
       
   719     
       
   720     <td ALIGN="CENTER">Sm</td>    
       
   721     
       
   722     <td>Symbol, Math</td>    
       
   723     
       
   724   </tr>    
       
   725     
       
   726   <tr>    
       
   727     
       
   728     <td ALIGN="CENTER">Sc</td>    
       
   729     
       
   730     <td>Symbol, Currency</td>    
       
   731     
       
   732   </tr>    
       
   733     
       
   734   <tr>    
       
   735     
       
   736     <td ALIGN="CENTER">Sk</td>    
       
   737     
       
   738     <td>Symbol, Modifier</td>    
       
   739     
       
   740   </tr>    
       
   741     
       
   742   <tr>    
       
   743     
       
   744     <td ALIGN="CENTER">So</td>    
       
   745     
       
   746     <td>Symbol, Other</td>    
       
   747     
       
   748   </tr>    
       
   749     
       
   750 </table>    
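<p>Since the General Category field mixes normative and informative values, an implementation
may want to record which is which. The following Python sketch encodes the two tables above as
sets; the names <code>NORMATIVE_CATEGORIES</code>, <code>INFORMATIVE_CATEGORIES</code>, and
<code>category_status</code> are hypothetical and serve only as an illustration.</p>

```python
# Abbreviations taken from the Normative Categories table.
NORMATIVE_CATEGORIES = {
    "Lu", "Ll", "Lt", "Mn", "Mc", "Me", "Nd", "Nl", "No",
    "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn",
}

# Abbreviations taken from the Informative Categories table.
INFORMATIVE_CATEGORIES = {
    "Lm", "Lo", "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
    "Sm", "Sc", "Sk", "So",
}


def category_status(abbr):
    """Return 'normative' or 'informative' for a General Category abbreviation."""
    if abbr in NORMATIVE_CATEGORIES:
        return "normative"
    if abbr in INFORMATIVE_CATEGORIES:
        return "informative"
    raise ValueError("unknown General Category abbreviation: %r" % abbr)


print(category_status("Lu"), category_status("Sm"))
```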
       
   751     
       
   752     
       
   753     
       
   754 <h3><a NAME="Bidirectional Category"></a>Bidirectional Category</h3>    
       
   755     
       
   756     
       
   757     
       
   758 <p>Please refer to Chapter 3 for an explanation of the algorithm for Bidirectional     
       
   759     
       
   760 Behavior and an explanation of the significance of these categories. An up-to-date version     
       
   761     
       
   762 can be found on <a HREF="http://www.unicode.org/unicode/reports/tr9/">Unicode Technical     
       
   763     
       
   764 Report #9: The Bidirectional Algorithm</a>. These values are normative.</p>    
       
   765     
       
   766     
       
   767     
       
   768 <table BORDER="0" CELLPADDING="2">    
       
   769     
       
   770   <tr>    
       
   771     
       
   772     <th VALIGN="TOP" ALIGN="LEFT"><p ALIGN="LEFT">Type</th>    
       
   773     
       
   774     <th VALIGN="TOP" ALIGN="LEFT"><p ALIGN="LEFT">Description</th>    
       
   775     
       
   776   </tr>    
       
   777     
       
   778   <tr>    
       
   779     
       
   780     <td VALIGN="TOP"><b>L</b></td>    
       
   781     
       
   782     <td VALIGN="TOP">Left-to-Right</td>    
       
   783     
       
   784   </tr>    
       
   785     
       
   786   <tr>    
       
   787     
       
   788     <td VALIGN="TOP"><b>LRE</b></td>    
       
   789     
       
   790     <td VALIGN="TOP">Left-to-Right Embedding</td>    
       
   791     
       
   792   </tr>    
       
   793     
       
   794   <tr>    
       
   795     
       
   796     <td VALIGN="TOP"><b>LRO</b></td>    
       
   797     
       
   798     <td VALIGN="TOP">Left-to-Right Override</td>    
       
   799     
       
   800   </tr>    
       
   801     
       
   802   <tr>    
       
   803     
       
   804     <td VALIGN="TOP"><b>R</b></td>    
       
   805     
       
   806     <td VALIGN="TOP">Right-to-Left</td>    
       
   807     
       
   808   </tr>    
       
   809     
       
   810   <tr>    
       
   811     
       
   812     <td VALIGN="TOP"><b>AL</b></td>    
       
   813     
       
   814     <td VALIGN="TOP">Right-to-Left Arabic</td>    
       
   815     
       
   816   </tr>    
       
   817     
       
   818   <tr>    
       
   819     
       
   820     <td VALIGN="TOP"><b>RLE</b></td>    
       
   821     
       
   822     <td VALIGN="TOP">Right-to-Left Embedding</td>    
       
   823     
       
   824   </tr>    
       
   825     
       
   826   <tr>    
       
   827     
       
   828     <td VALIGN="TOP"><b>RLO</b></td>    
       
   829     
       
   830     <td VALIGN="TOP">Right-to-Left Override</td>    
       
   831     
       
   832   </tr>    
       
   833     
       
   834   <tr>    
       
   835     
       
   836     <td VALIGN="TOP"><b>PDF</b></td>    
       
   837     
       
   838     <td VALIGN="TOP">Pop Directional Format</td>    
       
   839     
       
   840   </tr>    
       
   841     
       
   842   <tr>    
       
   843     
       
   844     <td VALIGN="TOP"><b>EN</b></td>    
       
   845     
       
   846     <td VALIGN="TOP">European Number</td>    
       
   847     
       
   848   </tr>    
       
   849     
       
   850   <tr>    
       
   851     
       
   852     <td VALIGN="TOP"><b>ES</b></td>    
       
   853     
       
   854     <td VALIGN="TOP">European Number Separator</td>    
       
   855     
       
   856   </tr>    
       
   857     
       
   858   <tr>    
       
   859     
       
   860     <td VALIGN="TOP"><b>ET</b></td>    
       
   861     
       
   862     <td VALIGN="TOP">European Number Terminator</td>    
       
   863     
       
   864   </tr>    
       
   865     
       
   866   <tr>    
       
   867     
       
   868     <td VALIGN="TOP"><b>AN</b></td>    
       
   869     
       
   870     <td VALIGN="TOP">Arabic Number</td>    
       
   871     
       
   872   </tr>    
       
   873     
       
   874   <tr>    
       
   875     
       
   876     <td VALIGN="TOP"><b>CS</b></td>    
       
   877     
       
   878     <td VALIGN="TOP">Common Number Separator</td>    
       
   879     
       
   880   </tr>    
       
   881     
       
   882   <tr>    
       
   883     
       
   884     <td VALIGN="TOP"><b>NSM</b></td>    
       
   885     
       
   886     <td VALIGN="TOP">Non-Spacing Mark</td>    
       
   887     
       
   888   </tr>    
       
   889     
       
   890   <tr>    
       
   891     
       
   892     <td VALIGN="TOP"><b>BN</b></td>    
       
   893     
       
   894     <td VALIGN="TOP">Boundary Neutral</td>    
       
   895     
       
   896   </tr>    
       
   897     
       
   898   <tr>    
       
   899     
       
   900     <td VALIGN="TOP"><b>B</b></td>    
       
   901     
       
   902     <td VALIGN="TOP">Paragraph Separator</td>    
       
   903     
       
   904   </tr>    
       
   905     
       
   906   <tr>    
       
   907     
       
   908     <td VALIGN="TOP"><b>S</b></td>    
       
   909     
       
   910     <td VALIGN="TOP">Segment Separator</td>    
       
   911     
       
   912   </tr>    
       
   913     
       
   914   <tr>    
       
   915     
       
   916     <td VALIGN="TOP"><b>WS</b></td>    
       
   917     
       
   918     <td VALIGN="TOP">Whitespace</td>    
       
   919     
       
   920   </tr>    
       
   921     
       
   922   <tr>    
       
   923     
       
   924     <td VALIGN="TOP"><b>ON</b></td>    
       
   925     
       
   926     <td VALIGN="TOP">Other Neutrals</td>    
       
   927     
       
   928   </tr>    
       
   929     
       
   930 </table>    
       
   931     
       
   932     
       
   933     
       
   934 <h3><a NAME="Character Decomposition"></a>Character Decomposition Mapping</h3>   
       
   935    
       
   936    
       
   937    
       
   938 <p>The decomposition is a normative property of a character. The tags supplied with    
       
   939    
       
   940 certain decomposition mappings generally indicate formatting information. Where no such    
       
   941    
       
   942 tag is given, the mapping is designated as canonical. Conversely, the presence of a    
       
   943    
       
   944 formatting tag also indicates that the mapping is a compatibility mapping and not a    
       
   945    
       
   946 canonical mapping. In the absence of other formatting information in a compatibility    
       
   947    
       
   948 mapping, the tag is used to distinguish it from canonical mappings.</p>   
       
   949    
       
   950    
       
   951    
       
   952 <p>In some instances a canonical mapping or a compatibility mapping may consist of a    
       
   953    
       
   954 single character. For a canonical mapping, this indicates that the character is a    
       
   955    
       
   956 canonical equivalent of another single character. For a compatibility mapping, this    
       
   957    
       
   958 indicates that the character is a compatibility equivalent of another single character.    
       
   959    
       
   960 The compatibility formatting tags used are:</p>   
       
   961    
       
   962    
       
   963    
       
   964 <table BORDER="0" CELLSPACING="2" CELLPADDING="0">   
       
   965    
       
   966   <tr>   
       
   967    
       
   968     <th>Tag</th>   
       
   969    
       
   970     <th><p ALIGN="LEFT">Description</th>   
       
   971    
       
   972   </tr>   
       
   973    
       
   974   <tr>   
       
   975    
       
   976     <td ALIGN="CENTER">&lt;font&gt;&nbsp;&nbsp;</td>   
       
   977    
       
   978     <td>A font variant (e.g. a blackletter form).</td>    
       
   979     
       
   980   </tr>    
       
   981     
       
   982   <tr>    
       
   983     
       
   984     <td ALIGN="CENTER">&lt;noBreak&gt;&nbsp;&nbsp;</td>    
       
   985     
       
   986     <td>A no-break version of a space or hyphen.</td>    
       
   987     
       
   988   </tr>    
       
   989     
       
   990   <tr>    
       
   991     
       
   992     <td ALIGN="CENTER">&lt;initial&gt;&nbsp;&nbsp;</td>    
       
   993     
       
   994     <td>An initial presentation form (Arabic).</td>    
       
   995     
       
   996   </tr>    
       
   997     
       
   998   <tr>    
       
   999     
       
  1000     <td ALIGN="CENTER">&lt;medial&gt;&nbsp;&nbsp;</td>    
       
  1001     
       
  1002     <td>A medial presentation form (Arabic).</td>    
       
  1003     
       
  1004   </tr>    
       
  1005     
       
  1006   <tr>    
       
  1007     
       
  1008     <td ALIGN="CENTER">&lt;final&gt;&nbsp;&nbsp;</td>    
       
  1009     
       
  1010     <td>A final presentation form (Arabic).</td>    
       
  1011     
       
  1012   </tr>    
       
  1013     
       
  1014   <tr>    
       
  1015     
       
  1016     <td ALIGN="CENTER">&lt;isolated&gt;&nbsp;&nbsp;</td>    
       
  1017     
       
  1018     <td>An isolated presentation form (Arabic).</td>    
       
  1019     
       
  1020   </tr>    
       
  1021     
       
  1022   <tr>    
       
  1023     
       
  1024     <td ALIGN="CENTER">&lt;circle&gt;&nbsp;&nbsp;</td>    
       
  1025     
       
  1026     <td>An encircled form.</td>    
       
  1027     
       
  1028   </tr>    
       
  1029     
       
  1030   <tr>    
       
  1031     
       
  1032     <td ALIGN="CENTER">&lt;super&gt;&nbsp;&nbsp;</td>    
       
  1033     
       
  1034     <td>A superscript form.</td>    
       
  1035     
       
  1036   </tr>    
       
  1037     
       
  1038   <tr>    
       
  1039     
       
  1040     <td ALIGN="CENTER">&lt;sub&gt;&nbsp;&nbsp;</td>    
       
  1041     
       
  1042     <td>A subscript form.</td>    
       
  1043     
       
  1044   </tr>    
       
  1045     
       
  1046   <tr>    
       
  1047     
       
  1048     <td ALIGN="CENTER">&lt;vertical&gt;&nbsp;&nbsp;</td>    
       
  1049     
       
  1050     <td>A vertical layout presentation form.</td>    
       
  1051     
       
  1052   </tr>    
       
  1053     
       
  1054   <tr>    
       
  1055     
       
  1056     <td ALIGN="CENTER">&lt;wide&gt;&nbsp;&nbsp;</td>    
       
  1057     
       
  1058     <td>A wide (or zenkaku) compatibility character.</td>    
       
  1059     
       
  1060   </tr>    
       
  1061     
       
  1062   <tr>    
       
  1063     
       
  1064     <td ALIGN="CENTER">&lt;narrow&gt;&nbsp;&nbsp;</td>    
       
  1065     
       
  1066     <td>A narrow (or hankaku) compatibility character.</td>    
       
  1067     
       
  1068   </tr>    
       
  1069     
       
  1070   <tr>    
       
  1071     
       
  1072     <td ALIGN="CENTER">&lt;small&gt;&nbsp;&nbsp;</td>    
       
  1073     
       
  1074     <td>A small variant form (CNS compatibility).</td>    
       
  1075     
       
  1076   </tr>    
       
  1077     
       
  1078   <tr>    
       
  1079     
       
  1080     <td ALIGN="CENTER">&lt;square&gt;&nbsp;&nbsp;</td>    
       
  1081     
       
  1082     <td>A CJK squared font variant.</td>    
       
  1083     
       
  1084   </tr>    
       
  1085     
       
  1086   <tr>    
       
  1087     
       
  1088     <td ALIGN="CENTER">&lt;fraction&gt;&nbsp;&nbsp;</td>    
       
  1089     
       
  1090     <td>A vulgar fraction form.</td>    
       
  1091     
       
  1092   </tr>    
       
  1093     
       
  1094   <tr>    
       
  1095     
       
  1096     <td ALIGN="CENTER">&lt;compat&gt;&nbsp;&nbsp;</td>    
       
  1097     
       
  1098     <td>Otherwise unspecified compatibility character.</td>    
       
  1099     
       
  1100   </tr>    
       
  1101     
       
  1102 </table>    
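<p>As a rough illustration of the rule above (a tag means a compatibility mapping, no tag means a canonical mapping), the decomposition field could be split as in the following sketch. The function name <tt>parse_decomposition</tt> and the returned tuple layout are assumptions made for this example only; they are not part of UnicodeData.</p>

```python
def parse_decomposition(field):
    """Split a decomposition mapping field into (tag, code_points).

    tag is None for a canonical mapping; otherwise it is the
    compatibility formatting tag without its angle brackets.
    (Hypothetical helper, for illustration only.)
    """
    parts = field.split()
    tag = None
    # A leading <tag> marks the mapping as a compatibility mapping.
    if parts and parts[0].startswith("<") and parts[0].endswith(">"):
        tag = parts[0][1:-1]
        parts = parts[1:]
    # The remaining items are hexadecimal code points.
    return tag, [int(p, 16) for p in parts]


print(parse_decomposition("0041 0300"))      # canonical mapping
print(parse_decomposition("<super> 0032"))   # compatibility mapping
print(parse_decomposition(""))               # no decomposition mapping
```

<p>An empty field is treated here as "no mapping", which keeps the caller's check to a single test on the returned list.</p>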
       
  1103     
       
  1104     
       
  1105     
       
  1106 <p><b>Reminder: </b>There is a difference between decomposition and decomposition mapping.     
       
  1107     
       
  1108 The decomposition mappings are defined in UnicodeData, while the decomposition (also
       
  1109     
       
  1110 termed &quot;full decomposition&quot;) is defined in Chapter 3 to use those mappings  
       
  1111 <i>    
       
  1112    
       
  1113 recursively.</i>    
       
  1114    
       
  1115    
       
  1116    
       
  1117 <ul>   
       
  1118    
       
  1119   <li>The canonical decomposition is formed by recursively applying the canonical mappings,    
       
  1120    
       
  1121     then applying the canonical reordering algorithm. </li>   
       
  1122    
       
  1123   <li>The compatibility decomposition is formed by recursively applying the canonical <em>and</em>    
       
  1124    
       
  1125     compatibility mappings, then applying the canonical reordering algorithm. </li>   
       
  1126    
       
  1127 </ul>   
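<p>The two bullets above can be sketched as follows. This is a minimal illustration, not a reference implementation: the tables <tt>MAPPINGS</tt> and <tt>COMBINING_CLASS</tt> are a tiny hand-copied excerpt, and all function names are assumptions introduced for the example. A real implementation would load the full tables from UnicodeData.txt.</p>

```python
# Hypothetical excerpt: code point -> (tag, mapping); tag None means canonical.
MAPPINGS = {
    0x00B2: ("super", [0x0032]),        # SUPERSCRIPT TWO
    0x00C5: (None, [0x0041, 0x030A]),   # LATIN CAPITAL LETTER A WITH RING ABOVE
    0x1EA0: (None, [0x0041, 0x0323]),   # LATIN CAPITAL LETTER A WITH DOT BELOW
    0x1EAC: (None, [0x1EA0, 0x0302]),   # A WITH CIRCUMFLEX AND DOT BELOW
    0x212B: (None, [0x00C5]),           # ANGSTROM SIGN
}
# Hypothetical excerpt of canonical combining classes (default is 0).
COMBINING_CLASS = {0x0302: 230, 0x030A: 230, 0x0323: 220}


def combining_class(cp):
    return COMBINING_CLASS.get(cp, 0)


def expand(cp, compat):
    """Recursively apply the mappings to one code point."""
    entry = MAPPINGS.get(cp)
    if entry is None:
        return [cp]
    tag, mapping = entry
    if tag is not None and not compat:
        return [cp]  # compatibility mappings are skipped in canonical mode
    result = []
    for c in mapping:
        result.extend(expand(c, compat))
    return result


def canonical_reorder(cps):
    """Stable-sort each run of non-zero combining classes by class value."""
    out = list(cps)
    i = 0
    while i < len(out):
        if combining_class(out[i]) == 0:
            i += 1
            continue
        j = i
        while j < len(out) and combining_class(out[j]) != 0:
            j += 1
        out[i:j] = sorted(out[i:j], key=combining_class)  # sorted() is stable
        i = j
    return out


def full_decomposition(cps, compat=False):
    expanded = []
    for cp in cps:
        expanded.extend(expand(cp, compat))
    return canonical_reorder(expanded)


print([hex(c) for c in full_decomposition([0x1EAC])])
print([hex(c) for c in full_decomposition([0x00B2], compat=True)])
```

<p>Passing <tt>compat=True</tt> applies both kinds of mapping, as the second bullet describes; the default applies canonical mappings only.</p>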
       
  1128    
       
  1129    
       
  1130    
       
  1131 <h3><a NAME="Canonical Combining Classes"></a>Canonical Combining Classes</h3>    
       
  1132     
       
  1133     
       
  1134     
       
  1135 <table BORDER="0" CELLSPACING="2" CELLPADDING="0">    
       
  1136     
       
  1137   <tr>    
       
  1138     
       
  1139     <th><p ALIGN="LEFT">Value</th>    
       
  1140     
       
  1141     <th><p ALIGN="LEFT">Description</th>    
       
  1142     
       
  1143   </tr>    
       
  1144     
       
  1145   <tr>    
       
  1146     
       
  1147     <td ALIGN="RIGHT">0:</td>    
       
  1148     
       
  1149     <td>Spacing, split, enclosing, reordrant, and Tibetan subjoined</td>    
       
  1150     
       
  1151   </tr>    
       
  1152     
       
  1153   <tr>    
       
  1154     
       
  1155     <td ALIGN="RIGHT">1:</td>    
       
  1156     
       
  1157     <td>Overlays and interior</td>    
       
  1158     
       
  1159   </tr>    
       
  1160     
       
  1161   <tr>    
       
  1162     
       
  1163     <td ALIGN="RIGHT">7:</td>    
       
  1164     
       
  1165     <td>Nuktas</td>    
       
  1166     
       
  1167   </tr>    
       
  1168     
       
  1169   <tr>    
       
  1170     
       
  1171     <td ALIGN="RIGHT">8:</td>    
       
  1172     
       
  1173     <td>Hiragana/Katakana voicing marks</td>    
       
  1174     
       
  1175   </tr>    
       
  1176     
       
  1177   <tr>    
       
  1178     
       
  1179     <td ALIGN="RIGHT">9:</td>    
       
  1180     
       
  1181     <td>Viramas</td>    
       
  1182     
       
  1183   </tr>    
       
  1184     
       
  1185   <tr>    
       
  1186     
       
  1187     <td ALIGN="RIGHT">10:</td>    
       
  1188     
       
  1189     <td>Start of fixed position classes</td>    
       
  1190     
       
  1191   </tr>    
       
  1192     
       
  1193   <tr>    
       
  1194     
       
  1195     <td ALIGN="RIGHT">199:</td>    
       
  1196     
       
  1197     <td>End of fixed position classes</td>    
       
  1198     
       
  1199   </tr>    
       
  1200     
       
  1201   <tr>    
       
  1202     
       
  1203     <td ALIGN="RIGHT">200:</td>    
       
  1204     
       
  1205     <td>Below left attached</td>    
       
  1206     
       
  1207   </tr>    
       
  1208     
       
  1209   <tr>    
       
  1210     
       
  1211     <td ALIGN="RIGHT">202:</td>    
       
  1212     
       
  1213     <td>Below attached</td>    
       
  1214     
       
  1215   </tr>    
       
  1216     
       
  1217   <tr>    
       
  1218     
       
  1219     <td ALIGN="RIGHT">204:</td>    
       
  1220     
       
  1221     <td>Below right attached</td>    
       
  1222     
       
  1223   </tr>    
       
  1224     
       
  1225   <tr>    
       
  1226     
       
  1227     <td ALIGN="RIGHT">208:</td>    
       
  1228     
       
  1229     <td>Left attached (reordrant around single base character)</td>    
       
  1230     
       
  1231   </tr>    
       
  1232     
       
  1233   <tr>    
       
  1234     
       
  1235     <td ALIGN="RIGHT">210:</td>    
       
  1236     
       
  1237     <td>Right attached</td>    
       
  1238     
       
  1239   </tr>    
       
  1240     
       
  1241   <tr>    
       
  1242     
       
  1243     <td ALIGN="RIGHT">212:</td>    
       
  1244     
       
  1245     <td>Above left attached</td>    
       
  1246     
       
  1247   </tr>    
       
  1248     
       
  1249   <tr>    
       
  1250     
       
  1251     <td ALIGN="RIGHT">214:</td>    
       
  1252     
       
  1253     <td>Above attached</td>    
       
  1254     
       
  1255   </tr>    
       
  1256     
       
  1257   <tr>    
       
  1258     
       
  1259     <td ALIGN="RIGHT">216:</td>    
       
  1260     
       
  1261     <td>Above right attached</td>    
       
  1262     
       
  1263   </tr>    
       
  1264     
       
  1265   <tr>    
       
  1266     
       
  1267     <td ALIGN="RIGHT">218:</td>    
       
  1268     
       
  1269     <td>Below left</td>    
       
  1270     
       
  1271   </tr>    
       
  1272     
       
  1273   <tr>    
       
  1274     
       
  1275     <td ALIGN="RIGHT">220:</td>    
       
  1276     
       
  1277     <td>Below</td>    
       
  1278     
       
  1279   </tr>    
       
  1280     
       
  1281   <tr>    
       
  1282     
       
  1283     <td ALIGN="RIGHT">222:</td>    
       
  1284     
       
  1285     <td>Below right</td>    
       
  1286     
       
  1287   </tr>    
       
  1288     
       
  1289   <tr>    
       
  1290     
       
  1291     <td ALIGN="RIGHT">224:</td>    
       
  1292     
       
  1293     <td>Left (reordrant around single base character)</td>    
       
  1294     
       
  1295   </tr>    
       
  1296     
       
  1297   <tr>    
       
  1298     
       
  1299     <td ALIGN="RIGHT">226:</td>    
       
  1300     
       
  1301     <td>Right</td>    
       
  1302     
       
  1303   </tr>    
       
  1304     
       
  1305   <tr>    
       
  1306     
       
  1307     <td ALIGN="RIGHT">228:</td>    
       
  1308     
       
  1309     <td>Above left</td>    
       
  1310     
       
  1311   </tr>    
       
  1312     
       
  1313   <tr>    
       
  1314     
       
  1315     <td ALIGN="RIGHT">230:</td>    
       
  1316     
       
  1317     <td>Above</td>    
       
  1318     
       
  1319   </tr>    
       
  1320     
       
  1321   <tr>    
       
  1322     
       
  1323     <td ALIGN="RIGHT">232:</td>    
       
  1324     
       
  1325     <td>Above right</td>    
       
  1326     
       
  1327   </tr>    
       
  1328     
       
  1329   <tr>    
       
  1330     
       
  1331     <td ALIGN="RIGHT">233:</td>    
       
  1332     
       
  1333     <td>Double below</td>    
       
  1334     
       
  1335   </tr>    
       
  1336     
       
  1337   <tr>    
       
  1338     
       
  1339     <td ALIGN="RIGHT">234:</td>    
       
  1340     
       
  1341     <td>Double above</td>    
       
  1342     
       
  1343   </tr>    
       
  1344     
       
  1345   <tr>    
       
  1346     
       
  1347     <td ALIGN="RIGHT">240:</td>    
       
  1348     
       
  1349     <td>Below (iota subscript)</td>    
       
  1350     
       
  1351   </tr>    
       
  1352     
       
  1353 </table>    
       
  1354     
       
  1355     
       
  1356     
       
  1357 <p><strong>Note: </strong>Some of the combining classes in this list do not currently have
       
  1358     
       
  1359 members but are specified here for completeness.</p>    
       
  1360     
       
  1361     
       
  1362     
       
  1363 <h3><a NAME="Decompositions and Normalization"></a>Decompositions and Normalization</h3>    
       
  1364     
       
  1365     
       
  1366     
       
  1367 <p>Decomposition is specified in Chapter 3. <a href="http://www.unicode.org/unicode/reports/tr15/"><i>Unicode Technical Report #15:     
       
  1368     
       
  1369 Normalization Forms</i></a> specifies the interaction between decomposition and normalization. The     
       
  1370     
       
  1371 most up-to-date version is found on <a HREF="http://www.unicode.org/unicode/reports/tr15/">http://www.unicode.org/unicode/reports/tr15/</a>.     
       
  1372     
       
  1373 That report specifies how the decompositions defined in UnicodeData.txt are used to derive     
       
  1374     
       
  1375 normalized forms of Unicode text.</p>    
       
  1376     
       
  1377     
       
  1378     
       
  1379 <p>Note that as of the 2.1.9 update of the Unicode Character Database, the decompositions     
       
  1380     
       
  1381 in the UnicodeData.txt file can be used to recursively derive the full decomposition in     
       
  1382     
       
  1383 canonical order, without the need to separately apply canonical reordering. However,     
       
  1384     
       
  1385 canonical reordering of combining character sequences must still be applied in     
       
  1386     
       
  1387 decomposition when normalizing source text which contains any combining marks.</p>    
       
  1388     
       
  1389     
       
  1390     
       
  1391 <h3><a NAME="Case Mappings"></a>Case Mappings</h3>    
       
  1392     
       
  1393     
       
  1394     
       
  1395 <p>The case mapping is an informative, default mapping. Case itself, on the other hand,     
       
  1396     
       
  1397 has normative status. Thus, for example, 0041 LATIN CAPITAL LETTER A is normatively     
       
  1398     
       
  1399 uppercase, but its lowercase mapping to 0061 LATIN SMALL LETTER A is informative. The
       
  1400     
       
  1401 reason for this is that case can be considered to be an inherent property of a particular     
       
  1402     
       
  1403 character (and is usually, but not always, derivable from the presence of the terms     
       
  1404     
       
  1405 &quot;CAPITAL&quot; or &quot;SMALL&quot; in the character name), but case mappings between     
       
  1406     
       
  1407 characters are occasionally influenced by local conventions. For example, certain     
       
  1408     
       
  1409 languages, such as Turkish, German, French, or Greek, may have small deviations from the
       
  1410     
       
  1411 default mappings listed in UnicodeData.</p>    
       
  1412     
       
  1413     
       
  1414     
       
  1415 <p>In addition to uppercase and lowercase, because of the inclusion of certain composite     
       
  1416     
       
  1417 characters for compatibility, such as 01F1 LATIN CAPITAL LETTER DZ, there is a third case,     
       
  1418     
       
  1419 called <i>titlecase</i>, which is used where the first letter of a word is to be     
       
  1420     
       
  1421 capitalized (e.g. UPPERCASE, Titlecase, lowercase). An example of such a titlecase letter     
       
  1422     
       
  1423 is 01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z.</p>    
       
  1424     
       
  1425     
       
  1426     
       
  1427 <p>The uppercase, titlecase and lowercase fields are only included for characters that     
       
  1428     
       
  1429 have a single corresponding character of that type. Composite characters (such as     
       
  1430     
       
  1431 &quot;339D SQUARE CM&quot;) that do not have a single corresponding character of that type     
       
  1432     
       
  1433 can be cased by decomposition.</p>    
       
  1434     
       
  1435     
       
  1436     
       
  1437 <p>For compatibility with existing parsers, UnicodeData only contains case mappings for     
       
  1438     
       
  1439 characters where they are one-to-one mappings; it also omits information about     
       
  1440     
       
  1441 context-sensitive case mappings. Information about these special cases can be found in a     
       
  1442     
       
  1443 separate data file, SpecialCasing.txt,     
       
  1444     
       
  1445 which has been added starting with the 2.1.8 update to the Unicode data files.     
       
  1446     
       
  1447 SpecialCasing.txt contains additional informative case mappings that are either not     
       
  1448     
       
  1449 one-to-one or which are context-sensitive.</p>    
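<p>A minimal sketch of reading the one-to-one case mappings from a UnicodeData.txt record is shown below. It assumes the usual fifteen semicolon-separated fields, with the uppercase, lowercase and titlecase mappings in the last three positions; the function name <tt>case_mappings</tt> and the dictionary keys are hypothetical. Mappings that are not one-to-one or that are context-sensitive are not covered, since they live in SpecialCasing.txt.</p>

```python
def case_mappings(line):
    """Return the simple case mappings of one UnicodeData.txt record.

    Missing mappings are reported as None. (Hypothetical helper.)
    """
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError("expected 15 fields, got %d" % len(fields))

    def code_point(text):
        return int(text, 16) if text else None

    return {
        "code": int(fields[0], 16),
        "upper": code_point(fields[12]),
        "lower": code_point(fields[13]),
        "title": code_point(fields[14]),
    }


print(case_mappings("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"))
print(case_mappings("0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041"))
```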
       
  1450     
       
  1451     
       
  1452     
       
  1453 <h2><a NAME="Property Invariants"></a>Property Invariants</h2>    
       
  1454     
       
  1455     
       
  1456     
       
  1457 <p>Values in UnicodeData.txt are subject to correction as errors are found; however, some     
       
  1458     
       
  1459 characteristics of the categories themselves can be considered invariants. Applications     
       
  1460     
       
  1461 may wish to take these invariants into account when choosing how to implement character     
       
  1462     
       
  1463 properties. The following is a partial list of known invariants for the Unicode Character     
       
  1464     
       
  1465 Database.</p>    
       
  1466     
       
  1467     
       
  1468     
       
  1469 <h4>Database Fields</h4>    
       
  1470     
       
  1471     
       
  1472     
       
  1473 <ul>    
       
  1474     
       
  1475   <li>The number of fields in UnicodeData.txt is fixed. </li>    
       
  1476     
       
  1477   <li>The order of the fields is also fixed. <ul>    
       
  1478     
       
  1479       <li>Any additional information about character properties to be added in the future will     
       
  1480     
       
  1481         appear in separate data tables, rather than being added on to the existing table or by     
       
  1482     
       
  1483         subdivision or reinterpretation of existing fields. </li>    
       
  1484     
       
  1485     </ul>    
       
  1486     
       
  1487   </li>    
       
  1488     
       
  1489 </ul>    
       
  1490     
       
  1491     
       
  1492     
       
  1493 <h4>General Category</h4>    
       
  1494     
       
  1495     
       
  1496     
       
  1497 <ul>    
       
  1498     
       
  1499   <li>There will never be more than 32 General Category values. <ul>    
       
  1500     
       
  1501       <li>It is very unlikely that the Unicode Technical Committee will subdivide the General     
       
  1502     
       
  1503         Category partition any further, since that can cause implementations to misbehave. Because     
       
  1504     
       
  1505         the General Category is limited to 32 values, 5 bits can be used to represent the     
       
  1506     
       
  1507         information, and a 32-bit integer can be used as a bitmask to represent arbitrary sets of     
       
  1508     
       
  1509         categories. </li>    
       
  1510     
       
  1511     </ul>    
       
  1512     
       
  1513   </li>    
       
  1514     
       
  1515 </ul>    
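<p>The bitmask representation mentioned above might look like the following sketch. The order of the category list, and therefore the bit assigned to each category, is an arbitrary choice made for this example; only the limit of 32 values comes from the invariant.</p>

```python
# General Category abbreviations; the bit positions are an assumption.
CATEGORIES = [
    "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl",
    "No", "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn", "Pc",
    "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc", "Sk", "So",
]
assert len(CATEGORIES) <= 32  # every category fits in one 32-bit word

BIT = {name: 1 << index for index, name in enumerate(CATEGORIES)}


def category_set(*names):
    """Build a bitmask representing an arbitrary set of categories."""
    mask = 0
    for name in names:
        mask |= BIT[name]
    return mask


def in_set(category, mask):
    """Test membership with a single AND operation."""
    return bool(BIT[category] & mask)


LETTERS = category_set("Lu", "Ll", "Lt", "Lm", "Lo")
print(in_set("Lt", LETTERS))
print(in_set("Nd", LETTERS))
```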
       
  1516     
       
  1517     
       
  1518     
       
  1519 <h4>Combining Classes</h4>    
       
  1520     
       
  1521     
       
  1522     
       
  1523 <ul>    
       
  1524     
       
  1525   <li>Combining classes are limited to the values 0 to 255. <ul>    
       
  1526     
       
  1527       <li>In practice, there are far fewer than 256 values used. Implementations may take     
       
  1528     
       
  1529         advantage of this fact for compression, since only the ordering of the non-zero values     
       
  1530     
       
  1531         matters for the Canonical Reordering Algorithm. It is possible for up to 256 values to be     
       
  1532     
       
  1533         used in the future; however, UTC decisions in the future may restrict the number of values     
       
  1534     
       
  1535         to 128, since this has implementation advantages. [Signed bytes can be used without     
       
  1536     
       
  1537         widening to ints in Java, for example.] </li>    
       
  1538     
       
  1539     </ul>    
       
  1540     
       
  1541   </li>    
       
  1542     
       
  1543   <li>All characters other than those of General Category M* have the combining class 0. <ul>    
       
  1544     
       
  1545       <li>Currently, all characters other than those of General Category Mn have the value 0.     
       
  1546     
       
  1547         However, some characters of General Category Me or Mc may be given non-zero values in the     
       
  1548     
       
  1549         future. </li>    
       
  1550     
       
  1551       <li>The precise values above the value 0 are not invariant--only the relative ordering is     
       
  1552     
       
  1553         considered normative. For example, it is not guaranteed in future versions that the class     
       
  1554     
       
  1555         of U+05B4 will be precisely 14. </li>    
       
  1556     
       
  1557     </ul>    
       
  1558     
       
  1559   </li>    
       
  1560     
       
  1561 </ul>    
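<p>Because only the relative ordering of the non-zero values matters, one possible compression is to replace each value by its rank among the values actually in use. The sketch below assumes this rank-based scheme; the function name <tt>compress_classes</tt> is hypothetical and the input list is illustrative.</p>

```python
def compress_classes(classes_in_use):
    """Map each combining class to a small rank that preserves ordering.

    Class 0 always maps to 0; non-zero classes map to 1, 2, 3, ...
    in increasing order of their original value.
    """
    distinct = sorted({c for c in classes_in_use if c != 0})
    table = {0: 0}
    for rank, value in enumerate(distinct, start=1):
        table[value] = rank
    return table


table = compress_classes([0, 230, 220, 9, 230])
print(table)
```

<p>Sorting by the compressed value gives the same result as sorting by the original value, so the Canonical Reordering Algorithm is unaffected.</p>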
       
  1562     
       
  1563     
       
  1564     
       
  1565 <h4>Case</h4>    
       
  1566     
       
  1567     
       
  1568     
       
  1569 <ul>    
       
  1570     
       
  1571   <li>Characters of type Lu, Lt, or Ll are called <i>cased</i>. All characters with an Upper,     
       
  1572     
       
  1573     Lower, or Titlecase mapping are cased characters. <ul>    
       
  1574     
       
  1575       <li>However, characters with the General Categories of Lu, Ll, or Lt may not always have     
       
  1576     
       
  1577         case mappings, and case mappings may vary by locale. (See     
       
  1578     
       
  1579         ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt). </li>    
       
  1580     
       
  1581     </ul>    
       
  1582     
       
  1583   </li>    
       
  1584     
       
  1585 </ul>    
       
  1586     
       
  1587     
       
  1588     
       
  1589 <h4>Canonical Decomposition</h4>    
       
  1590     
       
  1591     
       
  1592     
       
  1593 <ul>    
       
  1594     
       
  1595   <li>Canonical mappings are always in canonical order. </li>    
       
  1596     
       
  1597   <li>In a canonical mapping to a pair of characters, only the first of the pair can decompose further. </li>
       
  1598     
       
  1599   <li>Canonical decompositions are &quot;transparent&quot; to other character data: <ul>    
       
  1600     
       
  1601       <li><tt>BIDI(a) = BIDI(principal(canonicalDecomposition(a)))</tt> </li>
       
  1602     
       
  1603       <li><tt>Category(a) = Category(principal(canonicalDecomposition(a)))</tt> </li>
       
  1604     
       
  1605       <li><tt>CombiningClass(a) = CombiningClass(principal(canonicalDecomposition(a)))</tt><br>
       
  1606     
       
  1607         where principal(a) is the first character not of type Mn, or the first character if all     
       
  1608     
       
  1609         characters are of type Mn. </li>    
       
  1610     
       
  1611     </ul>    
       
  1612     
       
  1613   </li>    
       
  1614     
       
  1615   <li>However, because there are sometimes missing case pairs, and because of some legacy     
       
  1616     
       
  1617     characters, it is only generally true that: <ul>    
       
  1618     
       
  1619       <li><tt>upper(canonicalDecomposition(a)) = canonicalDecomposition(upper(a))</tt> </li>    
       
  1620     
       
  1621       <li><tt>lower(canonicalDecomposition(a)) = canonicalDecomposition(lower(a))</tt> </li>    
       
  1622     
       
  1623       <li><tt>title(canonicalDecomposition(a)) = canonicalDecomposition(title(a))</tt> </li>    
       
  1624     
       
  1625     </ul>    
       
  1626     
       
  1627   </li>    
       
  1628     
       
  1629 </ul>    
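<p>The transparency invariants can be spot-checked with Python's standard <tt>unicodedata</tt> module, as in the sketch below. Note the assumptions: that module reflects whichever Unicode version ships with the interpreter, not necessarily 3.0.0, and NFD normalization is used here as a stand-in for <tt>canonicalDecomposition</tt>. The helper names are hypothetical.</p>

```python
import unicodedata


def principal(decomposed):
    """First character not of type Mn, or the first character if all are Mn."""
    for ch in decomposed:
        if unicodedata.category(ch) != "Mn":
            return ch
    return decomposed[0]


def is_transparent(ch):
    """Check the three transparency invariants for a single character."""
    p = principal(unicodedata.normalize("NFD", ch))
    return (
        unicodedata.bidirectional(ch) == unicodedata.bidirectional(p)
        and unicodedata.category(ch) == unicodedata.category(p)
        and unicodedata.combining(ch) == unicodedata.combining(p)
    )


for sample in ("\u00e9", "\u1eac", "\u0344"):
    print("U+%04X" % ord(sample), is_transparent(sample))
```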
       
  1630     
       
  1631     
       
  1632     
       
  1633 <h2><a NAME="Modification History"></a>Modification History</h2>    
       
  1634     
       
  1635     
       
  1636     
       
  1637 <p>This section provides a summary of the changes between update versions of the Unicode     
       
  1638     
       
  1639 Standard.</p>    
       
  1640     
       
  1641     
       
  1642     
       
  1643 <h3><a href="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 3.0.0"> Unicode 3.0.0</a></h3>    
       
  1644     
       
  1645     
       
  1646     
       
  1647 <p>Modifications made for Version 3.0.0 of UnicodeData.txt include many new characters and     
       
  1648     
       
  1649 a number of property changes. These are summarized in Appendix D of <em>The Unicode
       
  1650     
       
  1651 Standard, Version 3.0.</em></p>    
       
  1652     
       
  1653     
       
  1654     
       
  1655 <h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.9">Unicode 2.1.9</a> </h3>    
       
  1656     
       
  1657     
       
  1658     
       
  1659 <p>Modifications made for Version 2.1.9 of UnicodeData.txt include:     
       
  1660     
       
  1661     
       
  1662     
       
  1663 <ul>    
       
  1664     
       
  1665   <li>Corrected combining class for U+05AE HEBREW ACCENT ZINOR. </li>    
       
  1666     
       
  1667   <li>Corrected combining class for U+20E1 COMBINING LEFT RIGHT ARROW ABOVE. </li>
       
  1668     
       
  1669   <li>Corrected combining class for U+0F35 and U+0F37 to 220. </li>    
       
  1670     
       
  1671   <li>Corrected combining class for U+0F71 to 129. </li>    
       
  1672     
       
  1673   <li>Added a decomposition for U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR. </li>    
       
  1674     
       
  1675   <li>Added decompositions for several Greek symbol letters: U+03D0..U+03D2, U+03D5,
       
  1676     
       
  1677     U+03D6, U+03F0..U+03F2. </li>    
       
  1678     
       
  1679   <li>Removed decompositions from the conjoining jamo block: U+1100..U+11F8. </li>
       
  1680     
       
  1681   <li>Changes to decomposition mappings for some Tibetan vowels for consistency in     
       
  1682     
       
  1683     normalization. (U+0F71, U+0F73, U+0F77, U+0F79, U+0F81) </li>    
       
  1684     
       
  1685   <li>Updated the decomposition mappings for several Vietnamese characters with two diacritics     
       
  1686     
       
  1687     (U+1EAC, U+1EAD, U+1EB6, U+1EB7, U+1EC6, U+1EC7, U+1ED8, U+1ED9), so that the recursive     
       
  1688     
       
  1689     decomposition can be generated directly in canonically reordered form (not a normative     
       
  1690     
       
  1691     change). </li>    
       
  1692     
       
  1693   <li>Updated the decomposition mappings for several Arabic compatibility characters involving     
       
  1694     
       
  1695     shadda (U+FC5E..U+FC62, U+FCF2..U+FCF4), and two Latin characters (U+1E1C, U+1E1D), so     
       
  1696     
       
  1697     that the decompositions are generated directly in canonically reordered form (not a     
       
  1698     
       
  1699     normative change). </li>    
       
  1700     
       
  1701   <li>Changed BIDI category for: U+00A0 NO-BREAK SPACE, U+2007 FIGURE SPACE, U+2028 LINE     
       
  1702     
       
  1703     SEPARATOR. </li>    
       
  1704     
       
  1705   <li>Changed BIDI category for extenders of General Category Lm: U+3005, U+3021..U+3035,     
       
  1706     
       
  1707     U+FF9E, U+FF9F. </li>    
       
  1708     
       
  1709   <li>Changed General Category and BIDI category for the Greek numeral signs: U+0374, U+0375. </li>    
       
  1710     
       
  1711   <li>Corrected General Category for U+FFE8 HALFWIDTH FORMS LIGHT VERTICAL. </li>    
       
  1712     
       
  1713   <li>Added Unicode 1.0 names for many Tibetan characters (informative). </li>    
       
  1714     
       
  1715 </ul>    
       
  1716     
       
  1717     
       
  1718     
       
  1719 <h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.8">Unicode 2.1.8</a> </h3>    
       
  1720     
       
  1721     
       
  1722     
       
  1723 <p>Modifications made for Version 2.1.8 of UnicodeData.txt include:     
       
  1724     
       
  1725     
       
  1726     
       
  1727 <ul>    
       
  1728     
       
  1729   <li>Added combining class 240 for U+0345 COMBINING GREEK YPOGEGRAMMENI so that     
       
  1730     
       
  1731     decompositions involving iota subscript are derivable directly in canonically reordered     
       
  1732     
       
  1733     form; this also has a bearing on simplification of casing of polytonic Greek. </li>    
       
  1734     
       
  1735   <li>Changes in decompositions related to Greek tonos. These result from the clarification     
       
  1736     
       
  1737     that monotonic Greek &quot;tonos&quot; should be equated with U+0301 COMBINING ACUTE,     
       
  1738     
       
  1739     rather than with U+030D COMBINING VERTICAL LINE ABOVE. (All Greek characters in the Greek     
       
  1740     
       
  1741     block involving &quot;tonos&quot;; some polytonic Greek characters in the
       
  1742     
       
  1743     1FXX block.) </li>    
       
  1744     
       
  1745   <li>Changed decompositions involving dialytika tonos. (U+0390, U+03B0) </li>    
       
  1746     
       
  1747   <li>Changed ternary decompositions to binary. (U+0CCB, U+FB2C, U+FB2D) These changes     
       
  1748     
       
  1749     simplify normalization. </li>    
       
  1750     
       
  1751   <li>Removed canonical decomposition for Latin Candrabindu. (U+0310) </li>    
       
  1752     
       
  1753   <li>Corrected error in canonical decomposition for U+1FF4. </li>    
       
  1754     
       
  1755   <li>Added compatibility decompositions to clarify collation tables. (U+2100, U+2101, U+2105,     
       
  1756     
       
  1757     U+2106, U+1E9A) </li>    
       
  1758     
       
  1759   <li>A series of general category changes to assist the convergence of the Unicode definition
       
  1760     
       
  1761     of identifier with ISO TR 10176: <ul>    
       
  1762     
       
  1763       <li>So &gt; Lo: U+0950, U+0AD0, U+0F00, U+0F88..U+0F8B </li>    
       
  1764     
       
  1765       <li>Po &gt; Lo: U+0E2F, U+0EAF, U+3006 </li>    
       
  1766     
       
  1767       <li>Lm &gt; Sk: U+309B, U+309C </li>    
       
  1768     
       
  1769       <li>Po &gt; Pc: U+30FB, U+FF65 </li>    
       
  1770     
       
  1771       <li>Ps/Pe &gt; Mn: U+0F3E, U+0F3F </li>    
       
  1772     
       
  1773     </ul>    
       
  1774     
       
  1775   </li>    
       
  1776     
       
  1777   <li>A series of bidi property changes for consistency. <ul>    
       
  1778     
       
  1779       <li>L &gt; ET: U+09F2, U+09F3 </li>    
       
  1780     
       
  1781       <li>ON &gt; L: U+3007 </li>    
       
  1782     
       
  1783       <li>L &gt; ON: U+0F3A..U+0F3D, U+037E, U+0387 </li>    
       
  1784     
       
  1785     </ul>    
       
  1786     
       
  1787   </li>    
       
  1788     
       
  1789   <li>Added case mapping: U+01A6 &lt;-&gt; U+0280 </li>
       
  1790     
       
  1791   <li>Updated symmetric swapping value for guillemets: U+00AB, U+00BB, U+2039, U+203A. </li>    
       
  1792     
       
  1793   <li>Changes to combining class values. Most Indic fixed position class non-spacing marks     
       
  1794     
       
  1795     were changed to combining class 0. This fixes some inconsistencies in how canonical     
       
  1796     
       
  1797     reordering would apply to Indic scripts, including Tibetan. Indic interacting top/bottom     
       
  1798     
       
  1799     fixed position classes were merged into single (non-zero) classes as part of this change.     
       
  1800     
       
  1801     Tibetan subjoined consonants were changed from combining class 6 to combining class 0. Thai
       
  1802     
       
  1803     pinthu (U+0E3A) moved to combining class 9. Moved two Devanagari stress marks into generic     
       
  1804     
       
  1805     above and below combining classes (U+0951, U+0952). </li>    
       
  1806     
       
  1807   <li>Corrected placement of semicolon near symmetric swapping field. (U+FA0E, etc., scattered     
       
  1808     
       
  1809     positions to U+FA29) </li>    
       
  1810     
       
  1811 </ul>    
       
  1812     
       
  1813     
       
  1814     
       
  1815 <h3>Version 2.1.7</h3>    
       
  1816     
       
  1817     
       
  1818     
       
  1819 <p><i>This version was for internal change tracking only, and never publicly released.</i></p>    
       
  1820     
       
  1821     
       
  1822     
       
  1823 <h3>Version 2.1.6</h3>    
       
  1824     
       
  1825     
       
  1826     
       
  1827 <p><i>This version was for internal change tracking only, and never publicly released.</i></p>    
       
  1828     
       
  1829     
       
  1830     
       
  1831 <h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.5">Unicode 2.1.5</a> </h3>    
       
  1832     
       
  1833     
       
  1834     
       
  1835 <p>Modifications made for Version 2.1.5 of UnicodeData.txt include:     
       
  1836     
       
  1837     
       
  1838     
       
  1839 <ul>    
       
  1840     
       
  1841   <li>Changed decomposition for U+FF9E and U+FF9F so that correct collation weighting will     
       
  1842     
       
  1843     automatically result from the canonical equivalences. </li>    
       
  1844     
       
  1845   <li>Removed canonical decompositions for U+04D4, U+04D5, U+04D8, U+04D9, U+04E0, U+04E1,     
       
  1846     
       
  1847     U+04E8, U+04E9 (the implication being that no canonical equivalence is claimed between     
       
  1848     
       
  1849     these 8 characters and similar Latin letters), and updated 4 canonical decompositions for     
       
  1850     
       
  1851     U+04DB, U+04DC, U+04EA, U+04EB to reflect the implied difference in the base character. </li>    
       
  1852     
       
  1853   <li>Added Pi and Pf categories and assigned the relevant quotation marks to those
       
  1854     
       
  1855     categories, based on the Unicode Technical Corrigendum on Quotation Characters. </li>    
       
  1856     
       
  1857   <li>Updated many bidi properties, following the advice of the ad hoc committee on bidi,
       
  1858     
       
  1859     and to make the bidi properties of compatibility characters more consistent. </li>    
       
  1860     
       
  1861   <li>Changed category of several Tibetan characters: U+0F3E, U+0F3F, U+0F88..U+0F8B to make     
       
  1862     
       
  1863     them non-combining, reflecting the combined opinion of Tibetan experts. </li>    
       
  1864     
       
  1865   <li>Added case mapping for U+03F2. </li>    
       
  1866     
       
  1867   <li>Corrected case mapping for U+0275. </li>    
       
  1868     
       
  1869   <li>Added titlecase mappings for U+03D0, U+03D1, U+03D5, U+03D6, U+03F0..U+03F2. </li>
       
  1870     
       
  1871   <li>Corrected compatibility label for U+2121. </li>    
       
  1872     
       
  1873   <li>Added specific entries for all the CJK compatibility ideographs, U+F900..U+FA2D, so the
       
  1874     
       
  1875     canonical decomposition for each (the URO character it is equivalent to) can be carried in     
       
  1876     
       
  1877     the database. </li>    
       
  1878     
       
  1879 </ul>    
       
  1880     
       
  1881     
       
  1882     
       
  1883 <h3>Version 2.1.4</h3>    
       
  1884     
       
  1885     
       
  1886     
       
  1887 <p><i>This version was for internal change tracking only, and never publicly released.</i></p>    
       
  1888     
       
  1889     
       
  1890     
       
  1891 <h3>Version 2.1.3</h3>    
       
  1892     
       
  1893     
       
  1894     
       
  1895 <p><i>This version was for internal change tracking only, and never publicly released.</i></p>    
       
  1896     
       
  1897     
       
  1898     
       
  1899 <h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.2">Unicode 2.1.2</a> </h3>    
       
  1900     
       
  1901     
       
  1902     
       
  1903 <p>Modifications made in updating UnicodeData.txt to Version 2.1.2 for the Unicode     
       
  1904     
       
  1905 Standard, Version 2.1 (from Version 2.0) include:     
       
  1906     
       
  1907     
       
  1908     
       
  1909 <ul>    
       
  1910     
       
  1911   <li>Added two characters (U+20AC and U+FFFC). </li>    
       
  1912     
       
  1913   <li>Amended bidi properties for U+0026, U+002E, U+0040, U+2007. </li>    
       
  1914     
       
  1915   <li>Corrected case mappings for U+018E, U+019F, U+01DD, U+0258, U+0275, U+03C2, U+1E9B. </li>    
       
  1916     
       
  1917   <li>Changed combining order class for U+0F71. </li>    
       
  1918     
       
  1919   <li>Corrected canonical decompositions for U+0F73, U+1FBE. </li>    
       
  1920     
       
  1921   <li>Changed decomposition for U+FB1F from compatibility to canonical. </li>    
       
  1922     
       
  1923   <li>Added compatibility decompositions for U+FBE8, U+FBE9, U+FBF9..U+FBFB. </li>    
       
  1924     
       
  1925   <li>Corrected compatibility decompositions for U+2469, U+246A, U+3358. </li>    
       
  1926     
       
  1927 </ul>    
       
  1928     
       
  1929     
       
  1930     
       
  1931 <h3>Version 2.1.1</h3>    
       
  1932     
       
  1933     
       
  1934     
       
  1935 <p><i>This version was for internal change tracking only, and never publicly released.</i></p>    
       
  1936     
       
  1937     
       
  1938     
       
  1939 <h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.0.0">Unicode 2.0.0</a> </h3>    
       
  1940     
       
  1941     
       
  1942     
       
  1943 <p>The modifications made in updating UnicodeData.txt for the Unicode     
       
  1944     
       
  1945 Standard, Version 2.0 include:     
       
  1946     
       
  1947     
       
  1948     
       
  1949 <ul>    
       
  1950     
       
  1951   <li>Fixed decompositions with TONOS to use correct NSM: 030D. </li>    
       
  1952     
       
  1953   <li>Removed old Hangul Syllables; mappings to new characters are in a separate table. </li>
       
  1954     
       
  1955   <li>Marked compatibility decompositions with additional tags. </li>    
       
  1956     
       
  1957   <li>Changed old tag names for clarity. </li>    
       
  1958     
       
  1959   <li>Revision of decompositions to use first-level decomposition, instead of maximal     
       
  1960     
       
  1961     decomposition. </li>    
       
  1962     
       
  1963   <li>Correction of all known errors in decompositions from earlier versions. </li>    
       
  1964     
       
  1965   <li>Added control code names (as old Unicode names). </li>    
       
  1966     
       
  1967   <li>Added Hangul Jamo decompositions. </li>    
       
  1968     
       
  1969   <li>Added Number category to match properties list in book. </li>    
       
  1970     
       
  1971   <li>Fixed categories of Koranic Arabic marks. </li>    
       
  1972     
       
  1973   <li>Fixed categories of precomposed characters to match decomposition where possible. </li>    
       
  1974     
       
  1975   <li>Added Hebrew cantillation marks and the Tibetan script. </li>    
       
  1976     
       
  1977   <li>Added place holders for ranges such as CJK Ideographic Area and the Private Use Area. </li>    
       
  1978     
       
  1979   <li>Added categories Me, Sk, Pc, Nl, Cs, Cf, and rectified a number of mistakes in the     
       
  1980     
       
  1981     database. </li>    
       
  1982     
       
  1983 </ul>    
       
  1984     
       
  1985 </body>    
       
  1986     
       
  1987 </html>    
       
  1988