<?xml version="1.0" encoding="utf-8"?>
<!-- Copyright (c) 2007-2010 Nokia Corporation and/or its subsidiary(-ies) All rights reserved. -->
<!-- This component and the accompanying materials are made available under the terms of the License
"Eclipse Public License v1.0" which accompanies this distribution,
and is available at the URL "http://www.eclipse.org/legal/epl-v10.html". -->
<!-- Initial Contributors:
    Nokia Corporation - initial contribution.
Contributors:
-->
<!DOCTYPE concept
  PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="GUID-06EDE5E8-04EA-5A74-ADE2-E5B8C49AB292" xml:lang="en"><title>Character
Conversion (Charconv) Framework Concepts</title><prolog><metadata><keywords/></metadata></prolog><conbody>
<p>This section describes terminology that is commonly used in character conversion,
such as BMP and Charconv converters.</p>
<section id="GUID-F964FD3C-D80B-4DBB-A99D-71CC60C362FC"><title>Character sets</title> <p>Textual data in electronic devices
is stored in terms of a character set. A character set is a group of characters,
each of which is encoded as a different number. The appearance of each character
is not a property of the character set, but rather of the font. A character
may therefore be rendered using many different glyphs, but it always has the same
numeric value within its character set. Other properties that can be
included in a character set’s definition are the direction of writing and
the way in which sets of characters are combined.</p> </section>
<section id="GUID-58021C48-1A3D-41C8-8B82-16C0481BFDCB"><title>Unicode, UCS and UCS-2</title> <p>Character sets, and the
ways of encoding them, have proliferated with the increasing acceptance of
computers and communicators throughout the world. This has led to an international
standard character set, Unicode, which encompasses all commonly used character sets,
including Eastern ideograms, in a single character set. Unicode is defined by
the Unicode Consortium (http://www.unicode.org).</p> <p>UCS stands for
Universal Character Set, the character set defined by ISO/IEC 10646, which is kept
aligned with Unicode. In the UCS-2 encoding format, each character is encoded as a single
16-bit value, which is written to files as two bytes. There are also other Unicode
encoding formats, such as UTF-16 and UTF-8, for different purposes. For the full definition of these formats,
see The Unicode Standard published by the Unicode Consortium.</p> </section>
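<example><title>Example: one 16-bit value written as two bytes</title> <p>The following
is a minimal sketch, not part of the Symbian platform API, of how a UCS-2 character is
held as one 16-bit value and written out as two bytes. The function name
<codeph>ucs2_bytes</codeph> and its <codeph>byte_order</codeph> parameter are assumptions
made for this illustration; the sketch uses only the Python standard library.</p>
<codeblock><![CDATA[
```python
def ucs2_bytes(ch, byte_order):
    """Write one BMP character as a single 16-bit value (two bytes).

    byte_order is "big" or "little"; the name and signature are hypothetical.
    """
    code_point = ord(ch)
    if code_point > 0xFFFF:
        # UCS-2 has no representation for code points above U+FFFF.
        raise ValueError("UCS-2 cannot represent code points above U+FFFF")
    return code_point.to_bytes(2, byte_order)


# EURO SIGN is U+20AC: the same 16-bit value, written in each byte order.
big = ucs2_bytes("\u20ac", "big")
little = ucs2_bytes("\u20ac", "little")
```
]]></codeblock>
<p>The two results hold the same two bytes in opposite order, which is why a reader of
UCS-2 data needs to know the byte order that the writer used.</p> </example>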
<section id="GUID-24F61FEA-C3FE-4CBB-BDA2-4FF741288B63"><title>BMP</title> <p>The Unicode code points from U+0000 to U+FFFF
make up the Basic Multilingual Plane (BMP). The BMP covers almost all the characters
used in different languages. Code points above U+FFFF lie outside the BMP and are also known as
supplementary characters. In UTF-16 they must be encoded using a "surrogate
pair", which consists of two 16-bit values. The Symbian platform
currently does not support scripts with characters mapped to code points above
U+FFFF.</p> </section>
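<example><title>Example: computing a surrogate pair</title> <p>The following is a
minimal sketch of the standard UTF-16 arithmetic that turns a supplementary code point
into a surrogate pair. The function name <codeph>to_surrogate_pair</codeph> is an
assumption made for this illustration and is not a Symbian platform function.</p>
<codeblock><![CDATA[
```python
def to_surrogate_pair(code_point):
    """Return the (high, low) 16-bit surrogate values for a supplementary code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


pair = to_surrogate_pair(0x1F600)        # U+1F600 lies outside the BMP
```
]]></codeblock>
<p>Here <codeph>pair</codeph> is <codeph>(0xD83D, 0xDE00)</codeph>. Each value is an
ordinary 16-bit quantity, so the pair fits in any storage designed for 16-bit
values.</p> </example>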
<section id="GUID-21DF5FEF-2446-4D23-8139-869A0CD7B514"><title>UTF-16</title> <p>UTF-16 is one of the Unicode encoding formats.
It represents each character as one or two 16-bit values: one for a character within the BMP
and two for a character outside it.</p> <p>The
text-processing subsystem of the Symbian platform uses the UTF-16 format.
This means that any input to the text-processing subsystem must be in UTF-16.
Different character converters can be used to convert text from other encoding
formats to UTF-16.</p> </section>
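<example><title>Example: converting foreign text to UTF-16</title> <p>The following is
a minimal sketch of the kind of conversion described above, from a foreign encoding to
a sequence of 16-bit UTF-16 values. It relies on the codecs built into Python rather
than on the Charconv converters, and the function name
<codeph>to_utf16_values</codeph> is an assumption made for this illustration.</p>
<codeblock><![CDATA[
```python
def to_utf16_values(data, source_encoding):
    """Convert bytes in a foreign encoding to a list of 16-bit UTF-16 values."""
    text = data.decode(source_encoding)      # foreign bytes -> abstract characters
    utf16 = text.encode("utf-16-be")         # characters -> UTF-16, big-endian bytes
    # Regroup the byte stream into 16-bit values, two bytes at a time.
    return [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]


latin1_values = to_utf16_values(b"caf\xe9", "iso-8859-1")        # BMP characters only
outside_bmp = to_utf16_values(b"\xf0\x9f\x98\x80", "utf-8")      # U+1F600
```
]]></codeblock>
<p>Each of the four ISO 8859-1 characters becomes one 16-bit value, while the single
character outside the BMP becomes two.</p> </example>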
<section id="GUID-786FEE95-D7A5-4E41-AB41-C8D54BFB8C54"><title>Transformation formats</title> <p>The UCS-2 format of the
Unicode character set encodes each character as 2 bytes (16 bits in total). However,
it does not specify which of the bytes is the most significant. The byte order,
or endianness, is left to the discretion of a particular operating system.</p> <p>While
this is not important within a system, it does mean that text encoded as UCS-2
cannot easily be shared between systems that use different byte orders. To
overcome this problem, the Unicode Consortium has defined two transformation
formats for sharing Unicode text. The transformation formats define text as a fixed
sequence of bytes, so it cannot be misinterpreted by computers using a different byte
order.</p> <p>The two transformation formats, UTF-7 and UTF-8, are described
below. For the full definition of these formats, see The Unicode Standard
published by the Unicode Consortium.</p> <p> <b>UTF-7</b> </p> <p>UTF-7
allows Unicode characters to be encoded and transmitted as 8-bit bytes, of
which only 7 bits are used. UTF-7 divides the set of Unicode characters into
three subsets, which are encoded and transmitted differently.</p> <ul>
<li id="GUID-8E1A1C8B-8234-57C3-93D4-5A0A4E8C1374"><p>Set D is the set of
characters that are encoded as a single byte. It includes lower and upper
case A to Z, the numeric digits, and nine other characters.</p> </li>
<li id="GUID-3E19560B-4087-575E-A091-64FCFD24C811"><p>Set O includes the characters <b>!
" # $ % &amp; * ; &lt; = &gt; @ [ ] ^ _ </b> <b>{</b> <b> | </b> <b>}</b>. These
characters can be encoded as a single byte, or with the modified <keyword>base
64</keyword> encoding used for set B characters. When
encoded as a single byte, set O characters can be misinterpreted by some applications;
encoding them as modified base 64 overcomes this problem.</p> </li>
<li id="GUID-01E822AE-71CC-5F0B-BC60-53F914600A5E"><p>Set B comprises the
remaining characters, which are encoded as an <keyword>escape byte</keyword> followed
by 2 or 3 bytes. The encoding format is a modified form of base 64 encoding.</p> </li>
</ul> <p> <b>UTF-8</b> </p> <p>UTF-8 encodes and transmits Unicode characters
as a string of 8-bit bytes. The ASCII characters 0 to 127 are encoded
without change; the most significant bit being set to zero signals that
they have not been changed. Unicode characters U+0080 to U+07FF are encoded
in two bytes, and the remaining BMP characters, except for the surrogates,
are encoded in three bytes. The Unicode surrogate characters are supported
by the Character Conversion API, but are not currently supported by all Symbian
platform components.</p> <p>A variant of UTF-8 used internally by Java differs
from standard UTF-8 in two ways. First, the NULL character
(0x0000) is encoded in the two-byte format. Second, only the one-, two-
and three-byte formats are used, not the four-byte format that is normally
used for characters represented by Unicode surrogate pairs. An argument to <codeph>ConvertFromUnicodeToUtf8</codeph> controls
whether the UTF-8 that it generates is the Java variant. Support for this
was removed in v6.0.</p> </section>
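<example><title>Example: UTF-7, UTF-8 and the Java variant</title> <p>The following is
a minimal sketch of the byte sequences described in this section. It uses the UTF-7 and
UTF-8 codecs built into Python, not the Character Conversion API. The function names
<codeph>utf8_length</codeph>, <codeph>utf16_values</codeph> and
<codeph>java_variant_utf8</codeph> are assumptions made for this illustration, and the
last one only approximates the Java variant using the two differences listed
above.</p>
<codeblock><![CDATA[
```python
def utf8_length(ch):
    """Number of bytes that standard UTF-8 uses for one character."""
    return len(ch.encode("utf-8"))


def utf16_values(text):
    """Split text into 16-bit UTF-16 values; a surrogate pair stays as two values."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def java_variant_utf8(text):
    """Encode text using the two differences from standard UTF-8 described above."""
    out = bytearray()
    for value in utf16_values(text):
        if value == 0x0000:
            out += b"\xc0\x80"  # NULL is written in the two-byte format
        else:
            # "surrogatepass" encodes each surrogate on its own in three bytes,
            # so the four-byte format is never produced.
            out += chr(value).encode("utf-8", "surrogatepass")
    return bytes(out)


lengths = [utf8_length(ch) for ch in ("A", "\u00e9", "\u20ac")]  # ASCII, U+00E9, U+20AC
utf7_set_d = "Az09".encode("utf-7")    # set D characters stay as single bytes
utf7_set_b = "\u20ac".encode("utf-7")  # a set B character becomes an escaped base 64 run
java_null = java_variant_utf8("A\x00")
java_pair = java_variant_utf8("\U0001F600")
```
]]></codeblock>
<p>The three characters take one, two and three bytes in UTF-8. The character U+1F600
takes four bytes in standard UTF-8 but six in the variant, because each half of its
surrogate pair is written in the three-byte format.</p> </example>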
<section id="GUID-5E593C5A-882B-5B11-AD6E-CFD10EA6700B"><title>Charconv converter</title> <p>Each
converter implements a conversion between a single foreign character encoding
and UTF-16, and is identified by a Unique Identifier (UID). The Symbian platform
provides the following two types of converter:</p> <ul>
<li id="GUID-8A1AFA7F-E330-5309-9ED7-26A4A46411CB"><p>Converters built into the Framework
component, which are used by most languages.</p> </li>
<li id="GUID-E7A3FD30-37E0-5B01-B2BE-7DE313045D9F"><p>Converters implemented as Ecom
plug-ins in the Plug-ins component, which are used by certain languages.</p> </li>
</ul> </section>
</conbody></concept>