Symbian3/PDK/Source/GUID-06EDE5E8-04EA-5A74-ADE2-E5B8C49AB292.dita
changeset 5 f345bda72bc4
parent 3 46218c8b8afa
child 14 578be2adaf3e
4:4816d766a08a 5:f345bda72bc4
     7     Nokia Corporation - initial contribution.
     8 Contributors: 
     9 -->
    10 <!DOCTYPE concept
    11   PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
    12 <concept id="GUID-06EDE5E8-04EA-5A74-ADE2-E5B8C49AB292" xml:lang="en"><title>Character
    13 Conversion (Charconv) Framework Concepts</title><prolog><metadata><keywords/></metadata></prolog><conbody>
       
    14 <p>This section describes terminology commonly used in character conversion,
       
    15 such as BMP and Charconv converters. </p>
       
    16 <section id="GUID-F964FD3C-D80B-4DBB-A99D-71CC60C362FC"><title>Character sets </title> <p>Textual data in electronic devices
       
    17 is stored in terms of a character set. A character set is a group of characters,
       
    18 each of which is encoded as a different number. The appearance of each character
       
    19 is not a property of the character set, but rather of the font. So a character
       
    20 may be rendered using many different glyphs, but will always have the same
       
    21 numeric value within its character set. Other properties which can also be
       
    22 included in a character set’s definition are the direction of writing, and
       
    23 the way in which sets of characters are combined. </p> </section>
       
    24 <section id="GUID-58021C48-1A3D-41C8-8B82-16C0481BFDCB"><title>Unicode, UCS and UCS-2</title> <p>Character sets, and the
       
    25 ways of encoding them, have proliferated with the increasing acceptance of
       
    26 computers and communicators throughout the world. This has led to an international
       
    27 standard character set, which encompasses all commonly used character sets,
       
    28 including Eastern ideograms, in a single character set, Unicode, defined by
       
    29 the Unicode Consortium (http://www.unicode.org). </p> <p>UCS stands for
    30 Universal Character Set, the ISO/IEC 10646 counterpart of Unicode. Unicode characters are generally encoded using one
    31 16-bit value and written to files as two bytes. This is referred to as the UCS-2
    32 encoding format. There are also other Unicode encoding formats such as UTF-16
       
    33 and UTF-8 for different purposes. For the full definition of these formats,
       
    34 see The Unicode Standard published by the Unicode Consortium. </p> </section>
       
    35 <section id="GUID-24F61FEA-C3FE-4CBB-BDA2-4FF741288B63"><title>BMP</title> <p>Unicode code points from U+0000 to U+FFFF make up
    36 the Basic Multilingual Plane (BMP). The BMP covers almost all characters in common use in
    37 modern languages. In UTF-16, code points outside the BMP must be encoded using a "surrogate
       
    38 pair", which consists of two 16-bit values. The Symbian platform
       
    39 currently does not support scripts with characters mapped to code points above
       
    40 U+FFFF. Code points above U+FFFF are also known as supplementary characters. </p> </section>
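The surrogate pair arithmetic described above can be sketched in a few lines. This is an illustrative Python sketch only, not Symbian code: the function name `to_surrogate_pair` is hypothetical, and the constants 0xD800 and 0xDC00 are the starting values of the high and low surrogate ranges given in The Unicode Standard.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (above U+FFFF) into two 16-bit values.

    Hypothetical helper for illustration; not part of any Symbian API.
    """
    if code_point not in range(0x10000, 0x110000):
        raise ValueError("not a supplementary code point")
    # Subtracting 0x10000 leaves a 20-bit value; the upper 10 bits go into
    # the high surrogate and the lower 10 bits into the low surrogate.
    upper_bits, lower_bits = divmod(code_point - 0x10000, 0x400)
    return 0xD800 + upper_bits, 0xDC00 + lower_bits


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))
```

A component that handles only BMP characters sees such a pair as two separate 16-bit values, which is why supplementary characters need explicit support.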
       
    41 <section id="GUID-21DF5FEF-2446-4D23-8139-869A0CD7B514"><title> UTF-16</title> <p>UTF-16 is one of the Unicode encoding formats.
       
    42 It represents each character within the BMP as one 16-bit value and each character outside the BMP as two 16-bit values. </p> <p>In
       
    43 the text-processing subsystem, the Symbian platform uses UTF-16 Unicode format.
       
    44 This means that any input to the text-processing subsystem must be in UTF-16.
       
    45 Different character converters can be used to convert text from other encoding
       
    46 formats to UTF-16. </p> </section>
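As a rough illustration of what converting a foreign encoding to UTF-16 involves, the following Python sketch decodes bytes in a named encoding and lists the resulting 16-bit values. It uses Python's built-in codecs, not the Charconv API; the function name `to_utf16_units` is hypothetical, and little-endian byte order is an arbitrary choice made only to split the encoded bytes back into 16-bit values.

```python
def to_utf16_units(data, source_encoding):
    """Convert bytes in a foreign encoding to a list of UTF-16 16-bit values.

    Hypothetical helper built on Python's codecs; not the Charconv API.
    """
    text = data.decode(source_encoding)
    raw = text.encode("utf-16-le")  # two bytes per 16-bit value, no byte order mark
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


# One ISO-8859-1 byte becomes one 16-bit value.
print([hex(unit) for unit in to_utf16_units(bytes([0xE9]), "iso-8859-1")])
# A character outside the BMP becomes two 16-bit values.
print([hex(unit) for unit in to_utf16_units("\U0001F600".encode("utf-8"), "utf-8")])
```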
       
    47 <section id="GUID-786FEE95-D7A5-4E41-AB41-C8D54BFB8C54"><title>Transformation formats </title> <p>The UCS-2 format of the
       
    48 Unicode character set encodes each character as 2 bytes (16 bits total). However,
    49 it does not specify whether the most significant byte is stored first or last. The byte order,
    50 or endian-ness, is left up to the discretion of a particular operating system. </p> <p>While
       
    51 this is not important within a system, it does mean that text encoded as UCS-2
       
    52 cannot easily be shared between systems using a different endian-ness. To
       
    53 overcome this problem the Unicode Consortium has defined two transformation
       
    54 formats for sharing Unicode text. The transformation formats are defined as sequences
    55 of single bytes, so they cannot be misinterpreted by computers using a different byte
    56 order. </p> <p>The two transformation formats, UTF-7 and UTF-8, are described
       
    57 below. For the full definition of these formats, see The Unicode Standard
       
    58 published by the Unicode Consortium. </p> <p> <b>UTF-7</b>  </p> <p>UTF-7
       
    59 allows Unicode characters to be encoded and transmitted as 8-bit bytes, of
       
    60 which only 7 bits are used. UTF-7 divides the set of Unicode characters into
       
    61 three subsets, which are encoded and transmitted differently. </p> <ul>
       
    62 <li id="GUID-8E1A1C8B-8234-57C3-93D4-5A0A4E8C1374"><p>Set D is the set of
       
    63 characters which are encoded as a single byte. It includes lower and upper
       
    64 case A to Z, the numeric digits, and nine other characters. </p> </li>
       
    65 <li id="GUID-3E19560B-4087-575E-A091-64FCFD24C811"><p>Set O includes the characters <b>!
       
    66 " # $ % &amp; * ; &lt; = &gt; @ [ ] ^ _ </b> <b>{</b> <b> | </b> <b>}</b>. These
       
    67 characters can be encoded as a single byte, or with the modified <keyword>base
       
    68                 64</keyword>  encoding used for set B characters. When
       
    69 encoded as a single byte, set O characters can be misinterpreted by some applications —
       
    70 encoding as modified base 64 overcomes this problem. </p> </li>
       
    71 <li id="GUID-01E822AE-71CC-5F0B-BC60-53F914600A5E"><p>Set B comprises the
       
    72 remaining characters, which are encoded as an <keyword>escape byte</keyword> followed
       
    73 by 2 or 3 bytes. The encoding format is a modified form of base 64 encoding. </p> </li>
       
    74 </ul> <p> <b>UTF-8</b>  </p> <p>UTF-8 encodes and transmits Unicode characters
       
    75 as a string of 8-bit bytes. All the ASCII characters 0 to 127 are encoded
       
    76 without change; the most significant bit being set to zero is a signal that
       
    77 they have not been changed. Unicode characters U+0080 to U+07FF are encoded
    78 in two bytes, and the remaining BMP characters (U+0800 to U+FFFF, except for the surrogates)
    79 are encoded in three bytes. The Unicode surrogate characters are supported
       
    80 by the Character Conversion API, but are not currently supported by all Symbian
       
    81 platform components. </p> <p>A variant of UTF-8 used internally by Java differs
       
    82 from standard UTF-8 in two ways. First, the specific case of the NULL character
       
    83 (0x0000) is encoded in the two-byte format, and second, only the one-, two-
       
    84 and three-byte formats are used, not the four-byte format which is normally
       
    85 used for Unicode surrogate pairs. An argument to <codeph>ConvertFromUnicodeToUtf8</codeph> controls
    86 whether the function generates the Java variant of UTF-8. Support for this variant
       
    87 was removed in v6.0. </p> </section>
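The UTF-8 byte-length rules for BMP characters, and the Java variant's treatment of the NULL character, can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the function `utf8_encode_bmp` and its `java_variant` parameter are hypothetical and do not reflect the actual signature of `ConvertFromUnicodeToUtf8`. The lead-byte prefixes 0xC0 and 0xE0 and the continuation-byte prefix 0x80 come from the standard UTF-8 definition.

```python
def utf8_encode_bmp(code_point, java_variant=False):
    """Encode one BMP code point (surrogates excluded) as UTF-8 bytes.

    Hypothetical helper for illustration; java_variant is an assumed flag,
    not the real argument of ConvertFromUnicodeToUtf8.
    """
    if code_point not in range(0x10000) or code_point in range(0xD800, 0xE000):
        raise ValueError("expected a BMP code point that is not a surrogate")
    if code_point == 0 and java_variant:
        # The Java variant encodes NULL in the two-byte format.
        return bytes([0xC0, 0x80])
    if code_point in range(0x80):
        # ASCII: one byte, most significant bit zero.
        return bytes([code_point])
    if code_point in range(0x800):
        # U+0080 to U+07FF: two bytes, 5 + 6 payload bits.
        lead, tail = divmod(code_point, 0x40)
        return bytes([0xC0 + lead, 0x80 + tail])
    # U+0800 to U+FFFF: three bytes, 4 + 6 + 6 payload bits.
    lead, rest = divmod(code_point, 0x1000)
    middle, tail = divmod(rest, 0x40)
    return bytes([0xE0 + lead, 0x80 + middle, 0x80 + tail])


print(utf8_encode_bmp(0x20AC).hex())
print(utf8_encode_bmp(0, java_variant=True).hex())
```

Encoding NULL as two bytes means the output never contains a zero byte, which is the practical difference between the two variants for BMP text.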
       
    88 <section id="GUID-5E593C5A-882B-5B11-AD6E-CFD10EA6700B"><title>Charconv converter</title> <p>Each
       
    89 converter implements a conversion between a single foreign character encoding
       
    90 and UTF-16, and is identified by a Unique Identifier (UID). The Symbian platform
       
    91 provides the following two types of converter: </p> <ul>
       
    92 <li id="GUID-8A1AFA7F-E330-5309-9ED7-26A4A46411CB"><p>built into the Framework
       
    93 component and used by most languages </p> </li>
       
    94 <li id="GUID-E7A3FD30-37E0-5B01-B2BE-7DE313045D9F"><p>implemented as ECom
       
    95 plug-ins in the Plug-ins component and used by certain languages. </p> </li>
       
    96 </ul> </section>
       
    97 </conbody></concept>