diff -r 000000000000 -r 83f4b4db085c toolsandutils/e32tools/readtype/unicodedata.html --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/toolsandutils/e32tools/readtype/unicodedata.html Tue Feb 02 01:39:43 2010 +0200 @@ -0,0 +1,1988 @@ + + + + +
Revision | 3.0.0 |
Authors | Mark Davis and Ken Whistler |
Date | 1999-09-12 |
This Version | ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html |
Previous Version | n/a |
Latest Version | ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html |
Copyright © 1995-1999 Unicode, Inc. All Rights reserved.
For more information, including Disclaimer and Limitations, see UnicodeCharacterDatabase-3.0.0.html
This document describes the format of the UnicodeData.txt file, which is one of the files in the Unicode Character Database. The document is divided into the following sections:
Warning: the information in this file does not completely describe the use and interpretation of Unicode character properties and behavior. It must be used in conjunction with the data in the other files in the Unicode Character Database, and relies on the notation and definitions supplied in The Unicode Standard. All chapter references are to Version 3.0 of the standard.
The file consists of lines containing fields terminated by semicolons. Each line represents the data for one encoded character in the Unicode Standard. Every encoded character has a data entry, with the exception of certain special ranges, as detailed below.
The exact ranges represented by start and end characters are:
The following table describes the format and meaning of each field in a data entry in the UnicodeData file. Fields which contain normative information are so indicated.
Field | Name | Status | Explanation |
---|---|---|---|
0 | Code value | normative | Code value in 4-digit hexadecimal format. |
1 | Character name | normative | These names match exactly the names published in Chapter 14 of the Unicode Standard, Version 3.0. |
2 | General Category | normative / informative (see below) | This is a useful breakdown into various "character types" which can be used as a default categorization in implementations. See below for a brief explanation. |
3 | Canonical Combining Classes | normative | The classes used for the Canonical Ordering Algorithm in the Unicode Standard. These classes are also printed in Chapter 4 of the Unicode Standard. |
4 | Bidirectional Category | normative | See the list below for an explanation of the abbreviations used in this field. These are the categories required by the Bidirectional Behavior Algorithm in the Unicode Standard. These categories are summarized in Chapter 3 of the Unicode Standard. |
5 | Character Decomposition Mapping | normative | In the Unicode Standard, not all of the mappings are full (maximal) decompositions. Recursive application of look-up for decompositions will, in all cases, lead to a maximal decomposition. The decomposition mappings match exactly the decomposition mappings published with the character names in the Unicode Standard. |
6 | Decimal digit value | normative | This is a numeric field. If the character has the decimal digit property, as specified in Chapter 4 of the Unicode Standard, the value of that digit is represented with an integer value in this field. |
7 | Digit value | normative | This is a numeric field. If the character represents a digit, not necessarily a decimal digit, the value is here. This covers digits which do not form decimal radix forms, such as the compatibility superscript digits. |
8 | Numeric value | normative | This is a numeric field. If the character has the numeric property, as specified in Chapter 4 of the Unicode Standard, the value of that character is represented with an integer or rational number in this field. This includes fractions, e.g., "1/5" for U+2155 VULGAR FRACTION ONE FIFTH. Also included are numerical values for compatibility characters such as circled numbers. |
9 | Mirrored | normative | If the character has been identified as a "mirrored" character in bidirectional text, this field has the value "Y"; otherwise "N". The list of mirrored characters is also printed in Chapter 4 of the Unicode Standard. |
10 | Unicode 1.0 Name | informative | This is the old name as published in Unicode 1.0. This name is only provided when it is significantly different from the Unicode 3.0 name for the character. |
11 | 10646 comment field | informative | This is the ISO 10646 comment field. It is in parentheses in the 10646 names list. |
12 | Uppercase Mapping | informative | Upper case equivalent mapping. If a character is part of an alphabet with case distinctions, and has an upper case equivalent, then the upper case equivalent is in this field. See the explanation below on case distinctions. These mappings are always one-to-one, not one-to-many or many-to-one. This field is informative. |
13 | Lowercase Mapping | informative | Similar to Uppercase mapping. |
14 | Titlecase Mapping | informative | Similar to Uppercase mapping. |
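As an illustration of the field layout above, the following Python sketch splits one data entry into its fifteen fields. It is a minimal reading of the table, not a reference parser: the `Record` type, the `parse_line` helper, and the choice to store the numeric value as a `Fraction` are assumptions made for this example, and the two sample lines are quoted from memory of UnicodeData.txt rather than taken from this document.

```python
from fractions import Fraction
from typing import NamedTuple, Optional


class Record(NamedTuple):
    """One UnicodeData.txt entry; attribute names are illustrative only."""
    code: int                    # field 0
    name: str                    # field 1
    general_category: str        # field 2
    combining_class: int         # field 3
    bidi_category: str           # field 4
    decomposition: str           # field 5, kept as raw text
    decimal_digit: str           # field 6, kept as raw text
    digit: str                   # field 7, kept as raw text
    numeric: Optional[Fraction]  # field 8, integer or rational
    mirrored: bool               # field 9, "Y" or "N"
    unicode_1_name: str          # field 10
    iso_comment: str             # field 11
    uppercase: Optional[int]     # field 12
    lowercase: Optional[int]     # field 13
    titlecase: Optional[int]     # field 14


def _opt_hex(text: str) -> Optional[int]:
    """Convert a hexadecimal code value, or return None for an empty field."""
    return int(text, 16) if text else None


def parse_line(line: str) -> Record:
    """Split a semicolon-separated data entry into a Record."""
    f = line.rstrip("\n").split(";")
    if len(f) != 15:
        raise ValueError(f"expected 15 fields, got {len(f)}")
    return Record(
        code=int(f[0], 16),
        name=f[1],
        general_category=f[2],
        combining_class=int(f[3]),
        bidi_category=f[4],
        decomposition=f[5],
        decimal_digit=f[6],
        digit=f[7],
        numeric=Fraction(f[8]) if f[8] else None,  # handles "1/5" and "7"
        mirrored=(f[9] == "Y"),
        unicode_1_name=f[10],
        iso_comment=f[11],
        uppercase=_opt_hex(f[12]),
        lowercase=_opt_hex(f[13]),
        titlecase=_opt_hex(f[14]),
    )


# Sample entries (assumed to match the published data file).
ONE_FIFTH = parse_line(
    "2155;VULGAR FRACTION ONE FIFTH;No;0;ON;<fraction> 0031 2044 0035;;;1/5;N;FRACTION ONE FIFTH;;;;"
)
CAPITAL_A = parse_line("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
```

Keeping the decomposition and digit fields as raw text defers their interpretation to the caller, which seems the safer default for a first pass over the file.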
The values in this field are abbreviations for the following. Some of the values are normative, and some are informative. For more information, see the Unicode Standard.
Note: the standard does not assign information to control characters (except for certain cases in the Bidirectional Algorithm). Implementations will generally also assign categories to certain control characters, notably CR and LF, according to platform conventions.
Abbr. | Description |
---|---|
Lu | Letter, Uppercase |
Ll | Letter, Lowercase |
Lt | Letter, Titlecase |
Mn | Mark, Non-Spacing |
Mc | Mark, Spacing Combining |
Me | Mark, Enclosing |
Nd | Number, Decimal Digit |
Nl | Number, Letter |
No | Number, Other |
Zs | Separator, Space |
Zl | Separator, Line |
Zp | Separator, Paragraph |
Cc | Other, Control |
Cf | Other, Format |
Cs | Other, Surrogate |
Co | Other, Private Use |
Cn | Other, Not Assigned (no characters in the file have this property) |
Abbr. | Description |
---|---|
Lm | Letter, Modifier |
Lo | Letter, Other |
Pc | Punctuation, Connector |
Pd | Punctuation, Dash |
Ps | Punctuation, Open |
Pe | Punctuation, Close |
Pi | Punctuation, Initial quote (may behave like Ps or Pe depending on usage) |
Pf | Punctuation, Final quote (may behave like Ps or Pe depending on usage) |
Po | Punctuation, Other |
Sm | Symbol, Math |
Sc | Symbol, Currency |
Sk | Symbol, Modifier |
So | Symbol, Other |
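The two-letter abbreviations above can be inspected with Python's standard `unicodedata` module, as in the sketch below. Note the assumptions: the module ships a later version of the Unicode Character Database than 3.0.0, so individual assignments may differ from the file described here, and the `MAJOR_CLASS` table and `describe` helper are names invented for this example.

```python
import unicodedata

# First letter of each abbreviation, as used in the tables above.
MAJOR_CLASS = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "Z": "Separator",
    "C": "Other",
    "P": "Punctuation",
    "S": "Symbol",
}


def describe(ch: str) -> tuple:
    """Return (abbreviation, major class) for a single character."""
    cat = unicodedata.category(ch)
    return cat, MAJOR_CLASS[cat[0]]


# A few characters whose categories have been stable across versions.
SAMPLES = {ch: describe(ch) for ch in ("A", "5", " ", "\u0301", "$", "\n")}
```

The line feed character reports `Cc` here, which matches the note above that control characters receive only a general category from the standard; any further classification of CR and LF is left to the platform.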
Please refer to Chapter 3 for an explanation of the algorithm for Bidirectional Behavior and an explanation of the significance of these categories. An up-to-date version can be found in Unicode Technical Report #9: The Bidirectional Algorithm. These values are normative.
Type | Description |
---|---|
L | Left-to-Right |
LRE | Left-to-Right Embedding |
LRO | Left-to-Right Override |
R | Right-to-Left |
AL | Right-to-Left Arabic |
RLE | Right-to-Left Embedding |
RLO | Right-to-Left Override |
PDF | Pop Directional Format |
EN | European Number |
ES | European Number Separator |
ET | European Number Terminator |
AN | Arabic Number |
CS | Common Number Separator |
NSM | Non-Spacing Mark |
BN | Boundary Neutral |
B | Paragraph Separator |
S | Segment Separator |
WS | Whitespace |
ON | Other Neutrals |
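The bidirectional types listed above can be looked up with `unicodedata.bidirectional` from the Python standard library, as sketched below. Two caveats apply: Python bundles a later Unicode version than 3.0.0, which defines additional types not in this table, and the grouping into strong, explicit, weak, and neutral is a convenience for this example (loosely following Technical Report #9), not something specified in this document. `BIDI_GROUPS` and `bidi_info` are hypothetical names.

```python
import unicodedata

# Convenience grouping of the types in the table above (an assumption
# of this sketch, not part of the data file).
BIDI_GROUPS = {
    "strong": {"L", "R", "AL"},
    "explicit": {"LRE", "LRO", "RLE", "RLO", "PDF"},
    "weak": {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"},
    "neutral": {"B", "S", "WS", "ON"},
}


def bidi_info(ch: str) -> tuple:
    """Return (bidirectional type, group) for a single character."""
    code = unicodedata.bidirectional(ch)
    for group, members in BIDI_GROUPS.items():
        if code in members:
            return code, group
    # Types added after Unicode 3.0, or unassigned characters.
    return code, "unknown"
```

For instance, `bidi_info("\u05d0")` reports the Hebrew letter alef as `R`, while `bidi_info("\u0627")` reports the Arabic letter alef as `AL`.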
The decomposition is a normative property of a character. The tags supplied with certain decomposition mappings generally indicate formatting information. Where no such tag is given, the mapping is designated as canonical. Conversely, the presence of a formatting tag also indicates that the mapping is a compatibility mapping and not a canonical mapping. In the absence of other formatting information in a compatibility mapping, the tag is used to distinguish it from canonical mappings.
In some instances a canonical mapping or a compatibility mapping may consist of a single character. For a canonical mapping, this indicates that the character is a canonical equivalent of another single character. For a compatibility mapping, this indicates that the character is a compatibility equivalent of another single character. The compatibility formatting tags used are:
Tag | Description |
---|---|
<font> | A font variant (e.g. a blackletter form). |
<noBreak> | A no-break version of a space or hyphen. |
<initial> | An initial presentation form (Arabic). |
<medial> | A medial presentation form (Arabic). |
<final> | A final presentation form (Arabic). |
<isolated> | An isolated presentation form (Arabic). |
<circle> | An encircled form. |
<super> | A superscript form. |
<sub> | A subscript form. |
<vertical> | A vertical layout presentation form. |
<wide> | A wide (or zenkaku) compatibility character. |
<narrow> | A narrow (or hankaku) compatibility character. |
<small> | A small variant form (CNS compatibility). |
<square> | A CJK squared font variant. |
<fraction> | A vulgar fraction form. |
<compat> | Otherwise unspecified compatibility character. |
Reminder: There is a difference between decomposition and decomposition mapping. The decomposition mappings are defined in the UnicodeData, while the decomposition (also termed "full decomposition") is defined in Chapter 3 to use those mappings recursively.
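The difference between a decomposition mapping and a full decomposition can be sketched in Python as follows. The `MAPPINGS` table is a three-entry excerpt written out by hand for this example (assumed to match field 5 of the corresponding entries), and `parse_mapping` and `full_decomposition` are hypothetical helper names. The sketch only applies the mappings recursively; it does not perform canonical reordering.

```python
from typing import List, Optional, Tuple

# Hand-written excerpt of field 5 for three characters (illustrative only).
MAPPINGS = {
    0x1E69: "1E63 0307",     # s with dot below and dot above
    0x1E63: "0073 0323",     # s with dot below
    0x00B2: "<super> 0032",  # superscript two (compatibility mapping)
}


def parse_mapping(field: str) -> Tuple[Optional[str], List[int]]:
    """Split a mapping field into (tag or None, list of code values)."""
    parts = field.split()
    tag = None
    if parts and parts[0].startswith("<"):
        tag = parts[0][1:-1]  # strip the angle brackets
        parts = parts[1:]
    return tag, [int(p, 16) for p in parts]


def full_decomposition(cp: int, compat: bool = False) -> List[int]:
    """Apply the mappings recursively until no further mapping applies.

    Tagged (compatibility) mappings are followed only when compat is True.
    """
    field = MAPPINGS.get(cp, "")
    if not field:
        return [cp]
    tag, targets = parse_mapping(field)
    if tag is not None and not compat:
        return [cp]
    result: List[int] = []
    for target in targets:
        result.extend(full_decomposition(target, compat))
    return result
```

Here the mapping of 1E69 is the two-character sequence 1E63 0307, whereas its full decomposition is 0073 0323 0307, obtained by expanding 1E63 in turn.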
Value | Description |
---|---|
0: | Spacing, split, enclosing, reordrant, and Tibetan subjoined |
1: | Overlays and interior |
7: | Nuktas |
8: | Hiragana/Katakana voicing marks |
9: | Viramas |
10: | Start of fixed position classes |
199: | End of fixed position classes |
200: | Below left attached |
202: | Below attached |
204: | Below right attached |
208: | Left attached (reordrant around single base character) |
210: | Right attached |
212: | Above left attached |
214: | Above attached |
216: | Above right attached |
218: | Below left |
220: | Below |
222: | Below right |
224: | Left (reordrant around single base character) |
226: | Right |
228: | Above left |
230: | Above |
232: | Above right |
233: | Double below |
234: | Double above |
240: | Below (iota subscript) |
Note: some of the combining classes in this list do not currently have members but are specified here for completeness.
Decomposition is specified in Chapter 3. Unicode Technical Report #15: Normalization Forms specifies the interaction between decomposition and normalization. The most up-to-date version is found at http://www.unicode.org/unicode/reports/tr15/. That report specifies how the decompositions defined in UnicodeData.txt are used to derive normalized forms of Unicode text.
Note that as of the 2.1.9 update of the Unicode Character Database, the decompositions in the UnicodeData.txt file can be used to recursively derive the full decomposition in canonical order, without the need to separately apply canonical reordering. However, canonical reordering of combining character sequences must still be applied in decomposition when normalizing source text which contains any combining marks.
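Canonical reordering of combining marks in source text can be sketched as a stable sort of each run of characters with a non-zero combining class. The Python example below is an illustration under that reading of the Canonical Ordering Algorithm, not a normalization implementation: it uses `unicodedata.combining` from the standard library (which reflects a later Unicode version than 3.0.0), and `canonical_reorder` is a name chosen for this example.

```python
import unicodedata


def canonical_reorder(text: str) -> str:
    """Sort each run of combining marks by combining class.

    Characters with class 0 act as boundaries and are never moved.
    Python's sorted() is stable, so marks with equal classes keep
    their original relative order.
    """
    out = []
    run = []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


# Acute (class 230) typed before dot below (class 220): the dot below
# moves in front of the acute.
REORDERED = canonical_reorder("a\u0301\u0323")
```

Two marks with the same class, such as acute followed by grave (both 230), are left in their original order, because swapping them would change how the sequence is rendered.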
The case mapping is an informative, default mapping. Case itself, on the other hand, has normative status. Thus, for example, 0041 LATIN CAPITAL LETTER A is normatively uppercase, but its lowercase mapping to 0061 LATIN SMALL LETTER A is informative. The reason for this is that case can be considered to be an inherent property of a particular character (and is usually, but not always, derivable from the presence of the terms "CAPITAL" or "SMALL" in the character name), but case mappings between characters are occasionally influenced by local conventions. For example, certain languages, such as Turkish, German, French, or Greek, may have small deviations from the default mappings listed in UnicodeData.
In addition to uppercase and lowercase, because of the inclusion of certain composite characters for compatibility, such as 01F1 LATIN CAPITAL LETTER DZ, there is a third case, called titlecase, which is used where the first letter of a word is to be capitalized (e.g. UPPERCASE, Titlecase, lowercase). An example of such a titlecase letter is 01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z.
The uppercase, titlecase and lowercase fields are only included for characters that have a single corresponding character of that type. Composite characters (such as "339D SQUARE CM") that do not have a single corresponding character of that type can be cased by decomposition.
For compatibility with existing parsers, UnicodeData only contains case mappings for characters where they are one-to-one mappings; it also omits information about context-sensitive case mappings. Information about these special cases can be found in a separate data file, SpecialCasing.txt, which has been added starting with the 2.1.8 update to the Unicode data files. SpecialCasing.txt contains additional informative case mappings that are either not one-to-one or which are context-sensitive.
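The one-to-one case mappings in fields 12 to 14 can be loaded into simple lookup tables, as in the Python sketch below. The two sample entries are written out from memory of the data file and should be treated as illustrative; `load_case_maps` and `simple_map` are hypothetical helpers. Mappings from SpecialCasing.txt (one-to-many or context-sensitive) are deliberately out of scope here.

```python
from typing import Dict, Iterable, Tuple

# Two sample entries (assumed to match the published data file).
SAMPLE_LINES = [
    "01F1;LATIN CAPITAL LETTER DZ;Lu;0;L;<compat> 0044 005A;;;;N;;;;01F3;01F2",
    "01F3;LATIN SMALL LETTER DZ;Ll;0;L;<compat> 0064 007A;;;;N;;;01F1;;01F2",
]

CaseMap = Dict[int, int]


def load_case_maps(lines: Iterable[str]) -> Tuple[CaseMap, CaseMap, CaseMap]:
    """Build (upper, lower, title) tables from fields 12, 13 and 14."""
    upper: CaseMap = {}
    lower: CaseMap = {}
    title: CaseMap = {}
    for line in lines:
        f = line.split(";")
        cp = int(f[0], 16)
        for table, value in ((upper, f[12]), (lower, f[13]), (title, f[14])):
            if value:  # an empty field means no single-character mapping
                table[cp] = int(value, 16)
    return upper, lower, title


def simple_map(table: CaseMap, text: str) -> str:
    """Apply a one-to-one mapping; unmapped characters pass through."""
    return "".join(chr(table.get(ord(c), ord(c))) for c in text)


UPPER, LOWER, TITLE = load_case_maps(SAMPLE_LINES)
```

With these tables, `simple_map(TITLE, "\u01f3")` yields 01F2, the titlecase letter mentioned above, while `simple_map(UPPER, "\u01f3")` yields 01F1.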
Values in UnicodeData.txt are subject to correction as errors are found; however, some characteristics of the categories themselves can be considered invariants. Applications may wish to take these invariants into account when choosing how to implement character properties. The following is a partial list of known invariants for the Unicode Character Database.
This section provides a summary of the changes between update versions of the Unicode Standard.
Modifications made for Version 3.0.0 of UnicodeData.txt include many new characters and a number of property changes. These are summarized in Appendix D of The Unicode Standard, Version 3.0.
Modifications made for Version 2.1.9 of UnicodeData.txt include:
Modifications made for Version 2.1.8 of UnicodeData.txt include:
This version was for internal change tracking only, and never publicly released.
This version was for internal change tracking only, and never publicly released.
Modifications made for Version 2.1.5 of UnicodeData.txt include:
This version was for internal change tracking only, and never publicly released.
This version was for internal change tracking only, and never publicly released.
Modifications made in updating UnicodeData.txt to Version 2.1.2 for the Unicode Standard, Version 2.1 (from Version 2.0) include:
This version was for internal change tracking only, and never publicly released.
The modifications made in updating UnicodeData.txt for the Unicode Standard, Version 2.0 include: