7 Nokia Corporation - initial contribution. |
7 Nokia Corporation - initial contribution. |
8 Contributors: |
8 Contributors: |
9 --> |
9 --> |
10 <!DOCTYPE concept |
10 <!DOCTYPE concept |
11 PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
11 PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
12 <concept xml:lang="en" id="GUID-06EDE5E8-04EA-5A74-ADE2-E5B8C49AB292"><title>Character Conversion (Charconv) Framework Concepts</title><prolog><metadata><keywords/></metadata></prolog><conbody><p>This section describes the terminology used often in character conversions, such as BMP and Charconv converters. </p> <section><title>Character sets </title> <p>Textual data in electronic devices is stored in terms of a character set. A character set is a group of characters, each of which is encoded as a different number. The appearance of each character is not a property of the character set, but rather of the font. So a character may be rendered using many different glyphs, but will always have the same numeric value within its character set. Other properties which can also be included in a character set’s definition are the direction of writing, and the way in which sets of characters are combined. </p> </section> <section><title>Unicode, UCS and UCS-2</title> <p>Character sets, and the ways of encoding them, have proliferated with the increasing acceptance of computers and communicators throughout the world. This has led to an international standard character set, which encompasses all commonly used character sets, including Eastern ideograms, in a single character set, Unicode, defined by the Unicode Consortium (http://www.unicode.org). </p> <p>UCS is the name for Unicode Character Set. Unicode characters are generally encoded using one 16-bit value but written to files in two bytes. This is referred to as UCS-2 encoding formats. There are also other Unicode encoding formats such as UTF-16 and UTF-8 for different purposes. For the full definition of these formats, see The Unicode Standard published by the Unicode Consortium. </p> </section> <section><title>BMP</title> <p>Unicode points between U+0000 to U+FFFF are called Basic Multilingual Plane (BMP). BMP covers almost all characters in different languages. Code points outside the BMP must be encoded using a "surrogate pair", which consists of two 16-bit values. Symbian platform currently does not support scripts with characters mapped to code points above U+FFFF. Code points above U+FFFF are also known as supplementary characters. </p> </section> <section><title> UTF-16</title> <p>UTF-16 is one of the Unicode encoding formats. It supports characters within and outside BMP using a number of 16-bit characters. </p> <p>In the text-processing subsystem, Symbian platform uses UTF-16 Unicode format. This means that any input to the text-processing subsystem must be in UTF-16. Different character converters can be used to convert text from other encoding formats to UTF-16. </p> </section> <section><title>Transformation formats </title> <p>The UCS-2 format of the Unicode character set encodes each character as 2 bytes (16 bits total). However it does not specify which of the bytes is most significant. The byte order, or endian-ness, is left up to the discretion of a particular operating system. </p> <p>While this is not important within a system, it does mean that text encoded as UCS-2 cannot easily be shared between systems using a different endian-ness. To overcome this problem the Unicode Consortium has defined two transformation formats for sharing Unicode text. The transformation formats explicitly specify byte order, and cannot be misinterpreted by computers using a different byte order. </p> <p>The two transformation formats, UTF-7 and UTF-8, are described below. For the full definition of these formats, see The Unicode Standard published by the Unicode Consortium. </p> <p> <b>UTF-7</b> </p> <p>UTF-7 allows Unicode characters to be encoded and transmitted as 8-bit bytes, of which only 7 bits are used. UTF-7 divides the set of Unicode characters into three subsets, which are encoded and transmitted differently. </p> <ul><li id="GUID-8E1A1C8B-8234-57C3-93D4-5A0A4E8C1374"><p>Set D, is the set of characters which are encoded as a single byte. It includes lower and upper case A to Z, the numeric digits, and nine other characters. </p> </li> <li id="GUID-3E19560B-4087-575E-A091-64FCFD24C811"><p>Set O includes the characters <b>! " # $ % & * ; < = > @ [ ] ^ _ </b> <b>{</b> <b> | </b> <b>}</b>. These characters can be encoded as a single byte, or with the modified <keyword>base |
12 <concept id="GUID-06EDE5E8-04EA-5A74-ADE2-E5B8C49AB292" xml:lang="en"><title>Character |
13 64</keyword> encoding used for set B characters. When encoded as a single byte, set O characters can be misinterpreted by some applications — encoding as modified base 64 overcomes this problem. </p> </li> <li id="GUID-01E822AE-71CC-5F0B-BC60-53F914600A5E"><p>Set B comprises the remaining characters, which are encoded as an <keyword>escape byte</keyword> followed by 2 or 3 bytes. The encoding format is a modified form of base 64 encoding. </p> </li> </ul> <p> <b>UTF-8</b> </p> <p>UTF-8 encodes and transmits Unicode characters as a string of 8-bit bytes. All the ASCII characters 0 to 127 are encoded without change; the most significant bit being set to zero is a signal that they have not been changed. Unicode characters U0080 to U07FF are encoded in two bytes, the remaining Unicode characters — except for the surrogates — are encoded in three bytes. The Unicode surrogate characters are supported by the Character Conversion API, but are not currently supported by all Symbian platform components. </p> <p>A variant of UTF-8 used internally by Java differs from standard UTF-8 in two ways. First, the specific case of the NULL character (0x0000) is encoded in the two-byte format, and second, only the one-, two- and three-byte formats are used, not the four-byte format which is normally used for Unicode surrogate-pairs. An argument to <codeph>ConvertFromUnicodeToUtf8</codeph> controls whether the UTF-8 generated by this is the Java variant. Support for this was removed in v6.0. </p> </section> <section id="GUID-5E593C5A-882B-5B11-AD6E-CFD10EA6700B"><title>Charconv converter</title> <p>Each converter implements a conversion between a single foreign character encoding and UTF-16, and is identified by a Unique Identifier (UID). Symbian platform provides the following two types of converter: </p> <ul><li id="GUID-8A1AFA7F-E330-5309-9ED7-26A4A46411CB"><p>built into the Framework component and used by most languages </p> </li> <li id="GUID-E7A3FD30-37E0-5B01-B2BE-7DE313045D9F"><p>implemented as Ecom plug-ins in the Plug-ins component and used by certain languages. </p> </li> </ul> </section> </conbody></concept> |
13 Conversion (Charconv) Framework Concepts</title><prolog><metadata><keywords/></metadata></prolog><conbody> |
|
14 <p>This section describes the terminology used often in character conversions, |
|
15 such as BMP and Charconv converters. </p> |
|
16 <section id="GUID-F964FD3C-D80B-4DBB-A99D-71CC60C362FC"><title>Character sets </title> <p>Textual data in electronic devices |
|
17 is stored in terms of a character set. A character set is a group of characters, |
|
18 each of which is encoded as a different number. The appearance of each character |
|
19 is not a property of the character set, but rather of the font. So a character |
|
20 may be rendered using many different glyphs, but will always have the same |
|
21 numeric value within its character set. Other properties which can also be |
|
22 included in a character set’s definition are the direction of writing, and |
|
23 the way in which sets of characters are combined. </p> </section> |
|
24 <section id="GUID-58021C48-1A3D-41C8-8B82-16C0481BFDCB"><title>Unicode, UCS and UCS-2</title> <p>Character sets, and the |
|
25 ways of encoding them, have proliferated with the increasing acceptance of |
|
26 computers and communicators throughout the world. This has led to an international |
|
27 standard character set, which encompasses all commonly used character sets, |
|
28 including Eastern ideograms, in a single character set, Unicode, defined by |
|
29 the Unicode Consortium (http://www.unicode.org). </p> <p>UCS is the name for |
|
30 Unicode Character Set. Unicode characters are generally encoded using one |
|
31 16-bit value but written to files in two bytes. This is referred to as UCS-2 |
|
32 encoding formats. There are also other Unicode encoding formats such as UTF-16 |
|
33 and UTF-8 for different purposes. For the full definition of these formats, |
|
34 see The Unicode Standard published by the Unicode Consortium. </p> </section> |
|
35 <section id="GUID-24F61FEA-C3FE-4CBB-BDA2-4FF741288B63"><title>BMP</title> <p>Unicode points between U+0000 to U+FFFF are |
|
36 called Basic Multilingual Plane (BMP). BMP covers almost all characters in |
|
37 different languages. Code points outside the BMP must be encoded using a "surrogate |
|
38 pair", which consists of two 16-bit values. The Symbian platform |
|
39 currently does not support scripts with characters mapped to code points above |
|
40 U+FFFF. Code points above U+FFFF are also known as supplementary characters. </p> </section> |
|
41 <section id="GUID-21DF5FEF-2446-4D23-8139-869A0CD7B514"><title> UTF-16</title> <p>UTF-16 is one of the Unicode encoding formats. |
|
42 It supports characters within and outside BMP using a number of 16-bit characters. </p> <p>In |
|
43 the text-processing subsystem, the Symbian platform uses UTF-16 Unicode format. |
|
44 This means that any input to the text-processing subsystem must be in UTF-16. |
|
45 Different character converters can be used to convert text from other encoding |
|
46 formats to UTF-16. </p> </section> |
|
47 <section id="GUID-786FEE95-D7A5-4E41-AB41-C8D54BFB8C54"><title>Transformation formats </title> <p>The UCS-2 format of the |
|
48 Unicode character set encodes each character as 2 bytes (16 bits total). However |
|
49 it does not specify which of the bytes is most significant. The byte order, |
|
50 or endian-ness, is left up to the discretion of a particular operating system. </p> <p>While |
|
51 this is not important within a system, it does mean that text encoded as UCS-2 |
|
52 cannot easily be shared between systems using a different endian-ness. To |
|
53 overcome this problem the Unicode Consortium has defined two transformation |
|
54 formats for sharing Unicode text. The transformation formats explicitly specify |
|
55 byte order, and cannot be misinterpreted by computers using a different byte |
|
56 order. </p> <p>The two transformation formats, UTF-7 and UTF-8, are described |
|
57 below. For the full definition of these formats, see The Unicode Standard |
|
58 published by the Unicode Consortium. </p> <p> <b>UTF-7</b> </p> <p>UTF-7 |
|
59 allows Unicode characters to be encoded and transmitted as 8-bit bytes, of |
|
60 which only 7 bits are used. UTF-7 divides the set of Unicode characters into |
|
61 three subsets, which are encoded and transmitted differently. </p> <ul> |
|
62 <li id="GUID-8E1A1C8B-8234-57C3-93D4-5A0A4E8C1374"><p>Set D, is the set of |
|
63 characters which are encoded as a single byte. It includes lower and upper |
|
64 case A to Z, the numeric digits, and nine other characters. </p> </li> |
|
65 <li id="GUID-3E19560B-4087-575E-A091-64FCFD24C811"><p>Set O includes the characters <b>! |
|
66 " # $ % & * ; < = > @ [ ] ^ _ </b> <b>{</b> <b> | </b> <b>}</b>. These |
|
67 characters can be encoded as a single byte, or with the modified <keyword>base |
|
68 64</keyword> encoding used for set B characters. When |
|
69 encoded as a single byte, set O characters can be misinterpreted by some applications — |
|
70 encoding as modified base 64 overcomes this problem. </p> </li> |
|
71 <li id="GUID-01E822AE-71CC-5F0B-BC60-53F914600A5E"><p>Set B comprises the |
|
72 remaining characters, which are encoded as an <keyword>escape byte</keyword> followed |
|
73 by 2 or 3 bytes. The encoding format is a modified form of base 64 encoding. </p> </li> |
|
74 </ul> <p> <b>UTF-8</b> </p> <p>UTF-8 encodes and transmits Unicode characters |
|
75 as a string of 8-bit bytes. All the ASCII characters 0 to 127 are encoded |
|
76 without change; the most significant bit being set to zero is a signal that |
|
77 they have not been changed. Unicode characters U0080 to U07FF are encoded |
|
78 in two bytes, the remaining Unicode characters — except for the surrogates — |
|
79 are encoded in three bytes. The Unicode surrogate characters are supported |
|
80 by the Character Conversion API, but are not currently supported by all Symbian |
|
81 platform components. </p> <p>A variant of UTF-8 used internally by Java differs |
|
82 from standard UTF-8 in two ways. First, the specific case of the NULL character |
|
83 (0x0000) is encoded in the two-byte format, and second, only the one-, two- |
|
84 and three-byte formats are used, not the four-byte format which is normally |
|
85 used for Unicode surrogate-pairs. An argument to <codeph>ConvertFromUnicodeToUtf8</codeph> controls |
|
86 whether the UTF-8 generated by this is the Java variant. Support for this |
|
87 was removed in v6.0. </p> </section> |
|
88 <section id="GUID-5E593C5A-882B-5B11-AD6E-CFD10EA6700B"><title>Charconv converter</title> <p>Each |
|
89 converter implements a conversion between a single foreign character encoding |
|
90 and UTF-16, and is identified by a Unique Identifier (UID). The Symbian platform |
|
91 provides the following two types of converter: </p> <ul> |
|
92 <li id="GUID-8A1AFA7F-E330-5309-9ED7-26A4A46411CB"><p>built into the Framework |
|
93 component and used by most languages </p> </li> |
|
94 <li id="GUID-E7A3FD30-37E0-5B01-B2BE-7DE313045D9F"><p>implemented as Ecom |
|
95 plug-ins in the Plug-ins component and used by certain languages. </p> </li> |
|
96 </ul> </section> |
|
97 </conbody></concept> |