<?xml version="1.0" encoding="utf-8"?>
<!-- Copyright (c) 2007-2010 Nokia Corporation and/or its subsidiary(-ies) All rights reserved. -->
<!-- This component and the accompanying materials are made available under the terms of the License
"Eclipse Public License v1.0" which accompanies this distribution,
and is available at the URL "http://www.eclipse.org/legal/epl-v10.html". -->
<!-- Initial Contributors:
Nokia Corporation - initial contribution.
Contributors:
-->
<!DOCTYPE concept
PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="GUID-06EDE5E8-04EA-5A74-ADE2-E5B8C49AB292" xml:lang="en"><title>Character
Conversion (Charconv) Framework Concepts</title><prolog><metadata><keywords/></metadata></prolog><conbody>
<p>This section describes terms that are often used in character conversion,
such as BMP and Charconv converter. </p>
<section id="GUID-F964FD3C-D80B-4DBB-A99D-71CC60C362FC"><title>Character sets </title> <p>Textual data in electronic devices
is stored in terms of a character set. A character set is a group of characters,
each of which is encoded as a different number. The appearance of each character
is not a property of the character set, but of the font. A character
can therefore be rendered using many different glyphs, but it always has the same
numeric value within its character set. The definition of a character set can also
include other properties, such as the direction of writing and
the way in which sets of characters are combined. </p> </section>
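<example><title>Example: one character, different numbers</title> <p>The following is a minimal
sketch, not Symbian platform code, of the idea that a character has a fixed numeric value within
a given character set. It assumes Python's built-in <codeph>iso-8859-1</codeph> and <codeph>cp437</codeph> codecs
as stand-ins for two character sets. The helper name <codeph>code_in</codeph> is hypothetical. </p>
<codeblock xml:space="preserve"><![CDATA[
```python
# Assumption: Python codecs stand in for character sets; this is not a Charconv API.
CHARACTER = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE


def code_in(charset: str, character: str) -> int:
    """Return the numeric value of one character in a single-byte character set."""
    encoded = character.encode(charset)
    if len(encoded) != 1:
        raise ValueError(f"{charset} does not encode {character!r} as a single byte")
    return encoded[0]


# The glyph may look the same on screen, but the number depends on the character set.
latin1_code = code_in("iso-8859-1", CHARACTER)
cp437_code = code_in("cp437", CHARACTER)
print(hex(latin1_code), hex(cp437_code))
```
]]></codeblock>
<p>The two values differ, which is why text must always be interpreted using the character set
it was written in. </p> </example>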
<section id="GUID-58021C48-1A3D-41C8-8B82-16C0481BFDCB"><title>Unicode, UCS and UCS-2</title> <p>Character sets, and the
ways of encoding them, have proliferated with the increasing acceptance of
computers and communicators throughout the world. This has led to Unicode, an international
standard character set that brings together all commonly used character sets,
including Eastern ideograms. Unicode is defined by
the Unicode Consortium (http://www.unicode.org). </p> <p>UCS stands for
Universal Character Set, the character set defined by ISO/IEC 10646, which shares its
characters and code points with Unicode. In the UCS-2 encoding format, each character is encoded as one
16-bit value, which is stored in two bytes. Other Unicode encoding formats, such as UTF-16
and UTF-8, exist for different purposes. For the full definition of these formats,
see The Unicode Standard published by the Unicode Consortium. </p> </section>
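<example><title>Example: a character as one 16-bit value</title> <p>As a rough illustration
of the UCS-2 description above, the sketch below stores one character as a single 16-bit value in
two bytes. It is plain Python, not Symbian platform code. The function name <codeph>ucs2_bytes</codeph> and
its <codeph>big_endian</codeph> parameter are hypothetical. </p>
<codeblock xml:space="preserve"><![CDATA[
```python
def ucs2_bytes(character: str, big_endian: bool = True) -> bytes:
    """Encode one character as a single 16-bit value stored in two bytes."""
    value = ord(character)
    if value > 0xFFFF:
        raise ValueError("UCS-2 cannot represent code points above U+FFFF")
    if 0xD800 <= value <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    # The 16-bit value is the same in both cases; only the order of the two bytes differs.
    return value.to_bytes(2, "big" if big_endian else "little")


euro_ucs2 = ucs2_bytes("\u20ac")           # EURO SIGN as two bytes
euro_utf8 = "\u20ac".encode("utf-8")       # the same character in UTF-8
print(euro_ucs2.hex(), euro_utf8.hex())
```
]]></codeblock>
<p>The same character occupies two bytes in UCS-2 and three bytes in UTF-8, which shows how
one character set can have several encoding formats. </p> </example>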
<section id="GUID-24F61FEA-C3FE-4CBB-BDA2-4FF741288B63"><title>BMP</title> <p>Unicode code points from U+0000 to U+FFFF make up
the Basic Multilingual Plane (BMP). The BMP covers almost all characters used in
different languages. Code points above U+FFFF are known as supplementary characters.
In UTF-16, they must be encoded using a "surrogate
pair", which consists of two 16-bit values. The Symbian platform
currently does not support scripts with characters mapped to code points above
U+FFFF. </p> </section>
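<example><title>Example: forming a surrogate pair</title> <p>The surrogate pair arithmetic
can be sketched as follows. This is an illustration in plain Python of the standard UTF-16
calculation, not Symbian platform code. The function name <codeph>to_surrogate_pair</codeph> is
hypothetical. </p>
<codeblock xml:space="preserve"><![CDATA[
```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into its high and low 16-bit surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))
```
]]></codeblock>
<p>Each half of the pair is a 16-bit value in the range U+D800 to U+DFFF, which is reserved
for surrogates, so it cannot be confused with a BMP character. </p> </example>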
<section id="GUID-21DF5FEF-2446-4D23-8139-869A0CD7B514"><title>UTF-16</title> <p>UTF-16 is one of the Unicode encoding formats.
It encodes each character in the BMP as a single 16-bit value, and each character outside the BMP as a surrogate pair of two 16-bit values. </p> <p>The
text-processing subsystem of the Symbian platform uses the UTF-16 Unicode format.
This means that any input to the text-processing subsystem must be in UTF-16.
Character converters can be used to convert text from other encoding
formats to UTF-16. </p> </section>
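<example><title>Example: converting foreign text to UTF-16</title> <p>The sketch below
illustrates, in general terms, what converting text from another encoding format to UTF-16 involves.
It uses Python's built-in codecs as a stand-in and does not show the Charconv API. The function
name <codeph>to_utf16_code_units</codeph> is hypothetical. </p>
<codeblock xml:space="preserve"><![CDATA[
```python
import struct
from typing import List


def to_utf16_code_units(data: bytes, source_encoding: str) -> List[int]:
    """Decode bytes in a foreign encoding and return the UTF-16 16-bit values."""
    text = data.decode(source_encoding)
    encoded = text.encode("utf-16-le")
    count = len(encoded) // 2
    # Unpack the little-endian byte pairs into a list of 16-bit integers.
    return list(struct.unpack("<%dH" % count, encoded))


latin1_units = to_utf16_code_units(b"caf\xe9", "iso-8859-1")
supplementary_units = to_utf16_code_units(b"\xf0\x9f\x98\x80", "utf-8")
print([hex(unit) for unit in latin1_units])
print([hex(unit) for unit in supplementary_units])
```
]]></codeblock>
<p>The four ISO-8859-1 bytes become four 16-bit values, while the single supplementary
character in the UTF-8 input becomes two 16-bit values. </p> </example>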
<section id="GUID-786FEE95-D7A5-4E41-AB41-C8D54BFB8C54"><title>Transformation formats</title> <p>The UCS-2 format of the
Unicode character set encodes each character as 2 bytes (16 bits in total). However,
it does not specify which of the two bytes is the most significant. The byte order,
or endianness, is left to the discretion of each operating system. </p> <p>While
this is not important within a system, it does mean that text encoded as UCS-2
cannot easily be shared between systems that use different byte orders. To
overcome this problem, transformation
formats are used for sharing Unicode text. A transformation format defines the exact
sequence of bytes for each character, so the text cannot be misinterpreted by computers that use a different byte
order. </p> <p>The two transformation formats, UTF-7 and UTF-8, are described
below. For the full definitions, see The Unicode Standard
published by the Unicode Consortium for UTF-8, and RFC 2152 for UTF-7. </p> <p> <b>UTF-7</b> </p> <p>UTF-7
allows Unicode characters to be encoded and transmitted as 8-bit bytes, of
which only 7 bits are used. UTF-7 divides the set of Unicode characters into
three subsets, which are encoded and transmitted differently. </p> <ul>
<li id="GUID-8E1A1C8B-8234-57C3-93D4-5A0A4E8C1374"><p>Set D is the set of
characters that are encoded as a single byte. It includes lowercase and uppercase
A to Z, the numeric digits, and nine other characters. </p> </li>
<li id="GUID-3E19560B-4087-575E-A091-64FCFD24C811"><p>Set O includes the characters <b>!
" # $ % &amp; * ; &lt; = &gt; @ [ ] ^ _ { | }</b>. These
characters can be encoded as a single byte, or with the modified <keyword>base
64</keyword> encoding used for set B characters. When
encoded as a single byte, set O characters can be misinterpreted by some applications;
encoding them as modified base 64 overcomes this problem. </p> </li>
<li id="GUID-01E822AE-71CC-5F0B-BC60-53F914600A5E"><p>Set B comprises the
remaining characters, which are encoded as an <keyword>escape byte</keyword> followed
by 2 or 3 bytes. The encoding format is a modified form of base 64 encoding. </p> </li>
</ul> <p> <b>UTF-8</b> </p> <p>UTF-8 encodes and transmits Unicode characters
as a string of 8-bit bytes. The ASCII characters 0 to 127 are encoded
without change as single bytes; their most significant bit is zero, which signals that
they have not been changed. Unicode characters U+0080 to U+07FF are encoded
in two bytes, and the remaining BMP characters (U+0800 to U+FFFF), except for the surrogates,
are encoded in three bytes. The Unicode surrogate characters are supported
by the Character Conversion API, but are not currently supported by all Symbian
platform components. </p> <p>A variant of UTF-8 used internally by Java differs
from standard UTF-8 in two ways. First, the NULL character
(0x0000) is encoded in the two-byte format. Second, only the one-, two-
and three-byte formats are used, not the four-byte format that standard UTF-8
uses for characters represented by Unicode surrogate pairs. An argument to <codeph>ConvertFromUnicodeToUtf8</codeph> controls
whether the generated UTF-8 is the Java variant. Support for this
was removed in v6.0. </p> </section>
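<example><title>Example: UTF-8 byte lengths and byte order</title> <p>The UTF-8 rules
described above can be sketched as follows. This is an illustrative Python implementation of the one-, two-
and three-byte forms for BMP characters, not the Charconv implementation. The function name <codeph>encode_bmp_utf8</codeph> and
its <codeph>java_variant</codeph> parameter are hypothetical and do not reflect the actual signature
of <codeph>ConvertFromUnicodeToUtf8</codeph>. The last lines use Python's built-in codecs to contrast
the byte-order dependence of 16-bit encodings with the 7-bit output of UTF-7. </p>
<codeblock xml:space="preserve"><![CDATA[
```python
def encode_bmp_utf8(code_point: int, java_variant: bool = False) -> bytes:
    """Encode one BMP code point as UTF-8 (hypothetical helper for illustration)."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only covers the BMP")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not encoded on their own")
    if code_point == 0 and java_variant:
        return b"\xc0\x80"                 # Java variant: NULL in two-byte form
    if code_point <= 0x7F:
        return bytes([code_point])         # ASCII: unchanged, top bit is zero
    if code_point <= 0x7FF:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xE0 | (code_point >> 12),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Byte order matters for 16-bit encodings but not for UTF-8 or UTF-7.
EURO = "\u20ac"
utf16_big_endian = EURO.encode("utf-16-be")
utf16_little_endian = EURO.encode("utf-16-le")
utf8_euro = encode_bmp_utf8(ord(EURO))
utf7_sample = "A\u2262\u0391.".encode("utf-7")

print(utf16_big_endian.hex(), utf16_little_endian.hex(), utf8_euro.hex())
print(utf7_sample)
```
]]></codeblock>
<p>The two 16-bit encodings of the same character differ only in byte order, whereas the UTF-8
and UTF-7 byte sequences are the same on every system. </p> </example>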
<section id="GUID-5E593C5A-882B-5B11-AD6E-CFD10EA6700B"><title>Charconv converter</title> <p>Each
converter implements a conversion between a single foreign character encoding
and UTF-16, and is identified by a Unique Identifier (UID). The Symbian platform
provides the following two types of converter: </p> <ul>
<li id="GUID-8A1AFA7F-E330-5309-9ED7-26A4A46411CB"><p>Converters that are built into the Framework
component and used by most languages. </p> </li>
<li id="GUID-E7A3FD30-37E0-5B01-B2BE-7DE313045D9F"><p>Converters that are implemented as ECom
plug-ins in the Plug-ins component and used by certain languages. </p> </li>
</ul> </section>
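<example><title>Example: converters identified by UID</title> <p>The idea of one converter
per foreign encoding, each identified by a UID, can be sketched as a simple lookup table. This is
a conceptual model in Python only and does not show the Charconv API. The class <codeph>Converter</codeph>,
the function <codeph>convert_to_utf16</codeph> and the UID values are all hypothetical placeholders,
and Python codecs stand in for the real conversion logic. </p>
<codeblock xml:space="preserve"><![CDATA[
```python
from typing import Dict, NamedTuple


class Converter(NamedTuple):
    """One converter: a single foreign encoding paired with UTF-16."""
    uid: int      # placeholder value, not a real Charconv UID
    codec: str    # Python codec used as a stand-in for the conversion logic

    def to_utf16(self, foreign: bytes) -> bytes:
        return foreign.decode(self.codec).encode("utf-16-le")

    def from_utf16(self, utf16: bytes) -> bytes:
        return utf16.decode("utf-16-le").encode(self.codec)


# Placeholder UIDs for illustration only.
REGISTRY: Dict[int, Converter] = {
    converter.uid: converter
    for converter in (Converter(0x0001, "iso-8859-1"),
                      Converter(0x0002, "shift_jis"))
}


def convert_to_utf16(uid: int, foreign: bytes) -> bytes:
    """Look up a converter by UID and convert foreign text to UTF-16."""
    try:
        converter = REGISTRY[uid]
    except KeyError:
        raise LookupError("no converter registered for UID %#06x" % uid)
    return converter.to_utf16(foreign)


utf16_text = convert_to_utf16(0x0001, b"caf\xe9")
print(utf16_text.hex())
```
]]></codeblock>
<p>In this model, supporting a further encoding means registering one more converter under its own
identifier, without changing the code that performs the lookup. </p> </example>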
</conbody></concept>