Symbian3/SDK/Source/GUID-06EDE5E8-04EA-5A74-ADE2-E5B8C49AB292.dita
changeset 7 51a74ef9ed63
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/Symbian3/SDK/Source/GUID-06EDE5E8-04EA-5A74-ADE2-E5B8C49AB292.dita	Wed Mar 31 11:11:55 2010 +0100
@@ -0,0 +1,97 @@
+<?xml version="1.0" encoding="utf-8"?>
+<!-- Copyright (c) 2007-2010 Nokia Corporation and/or its subsidiary(-ies) All rights reserved. -->
+<!-- This component and the accompanying materials are made available under the terms of the License 
+"Eclipse Public License v1.0" which accompanies this distribution, 
+and is available at the URL "http://www.eclipse.org/legal/epl-v10.html". -->
+<!-- Initial Contributors:
+    Nokia Corporation - initial contribution.
+Contributors: 
+-->
+<!DOCTYPE concept
+  PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="GUID-06EDE5E8-04EA-5A74-ADE2-E5B8C49AB292" xml:lang="en"><title>Character
+Conversion (Charconv) Framework Concepts</title><prolog><metadata><keywords/></metadata></prolog><conbody>
+<p>This section describes the terminology used often in character conversions,
+such as BMP and Charconv converters. </p>
+<section id="GUID-F964FD3C-D80B-4DBB-A99D-71CC60C362FC"><title>Character sets </title> <p>Textual data in electronic devices
+is stored in terms of a character set. A character set is a group of characters,
+each of which is encoded as a different number. The appearance of each character
+is not a property of the character set, but rather of the font. So a character
+may be rendered using many different glyphs, but will always have the same
+numeric value within its character set. Other properties which can also be
+included in a character set’s definition are the direction of writing, and
+the way in which sets of characters are combined. </p> </section>
+<section id="GUID-58021C48-1A3D-41C8-8B82-16C0481BFDCB"><title>Unicode, UCS and UCS-2</title> <p>Character sets, and the
+ways of encoding them, have proliferated with the increasing acceptance of
+computers and communicators throughout the world. This has led to an international
+standard character set, which encompasses all commonly used character sets,
+including Eastern ideograms, in a single character set, Unicode, defined by
+the Unicode Consortium (http://www.unicode.org). </p> <p>UCS is the name for
+Unicode Character Set. Unicode characters are generally encoded using one
+16-bit value but written to files in two bytes. This is referred to as UCS-2
+encoding formats. There are also other Unicode encoding formats such as UTF-16
+and UTF-8 for different purposes. For the full definition of these formats,
+see The Unicode Standard published by the Unicode Consortium. </p> </section>
+<section id="GUID-24F61FEA-C3FE-4CBB-BDA2-4FF741288B63"><title>BMP</title> <p>Unicode points between U+0000 to U+FFFF are
+called Basic Multilingual Plane (BMP). BMP covers almost all characters in
+different languages. Code points outside the BMP must be encoded using a "surrogate
+pair", which consists of two 16-bit values. The Symbian platform
+currently does not support scripts with characters mapped to code points above
+U+FFFF. Code points above U+FFFF are also known as supplementary characters. </p> </section>
+<section id="GUID-21DF5FEF-2446-4D23-8139-869A0CD7B514"><title> UTF-16</title> <p>UTF-16 is one of the Unicode encoding formats.
+It supports characters within and outside BMP using a number of 16-bit characters. </p> <p>In
+the text-processing subsystem, the Symbian platform uses UTF-16 Unicode format.
+This means that any input to the text-processing subsystem must be in UTF-16.
+Different character converters can be used to convert text from other encoding
+formats to UTF-16. </p> </section>
+<section id="GUID-786FEE95-D7A5-4E41-AB41-C8D54BFB8C54"><title>Transformation formats </title> <p>The UCS-2 format of the
+Unicode character set encodes each character as 2 bytes (16 bits total). However
+it does not specify which of the bytes is most significant. The byte order,
+or endian-ness, is left up to the discretion of a particular operating system. </p> <p>While
+this is not important within a system, it does mean that text encoded as UCS-2
+cannot easily be shared between systems using a different endian-ness. To
+overcome this problem the Unicode Consortium has defined two transformation
+formats for sharing Unicode text. The transformation formats explicitly specify
+byte order, and cannot be misinterpreted by computers using a different byte
+order. </p> <p>The two transformation formats, UTF-7 and UTF-8, are described
+below. For the full definition of these formats, see The Unicode Standard
+published by the Unicode Consortium. </p> <p> <b>UTF-7</b>  </p> <p>UTF-7
+allows Unicode characters to be encoded and transmitted as 8-bit bytes, of
+which only 7 bits are used. UTF-7 divides the set of Unicode characters into
+three subsets, which are encoded and transmitted differently. </p> <ul>
+<li id="GUID-8E1A1C8B-8234-57C3-93D4-5A0A4E8C1374"><p>Set D, is the set of
+characters which are encoded as a single byte. It includes lower and upper
+case A to Z, the numeric digits, and nine other characters. </p> </li>
+<li id="GUID-3E19560B-4087-575E-A091-64FCFD24C811"><p>Set O includes the characters <b>!
+" # $ % &amp; * ; &lt; = &gt; @ [ ] ^ _ </b> <b>{</b> <b> | </b> <b>}</b>. These
+characters can be encoded as a single byte, or with the modified <keyword>base
+                64</keyword>  encoding used for set B characters. When
+encoded as a single byte, set O characters can be misinterpreted by some applications —
+encoding as modified base 64 overcomes this problem. </p> </li>
+<li id="GUID-01E822AE-71CC-5F0B-BC60-53F914600A5E"><p>Set B comprises the
+remaining characters, which are encoded as an <keyword>escape byte</keyword> followed
+by 2 or 3 bytes. The encoding format is a modified form of base 64 encoding. </p> </li>
+</ul> <p> <b>UTF-8</b>  </p> <p>UTF-8 encodes and transmits Unicode characters
+as a string of 8-bit bytes. All the ASCII characters 0 to 127 are encoded
+without change; the most significant bit being set to zero is a signal that
+they have not been changed. Unicode characters U0080 to U07FF are encoded
+in two bytes, the remaining Unicode characters — except for the surrogates —
+are encoded in three bytes. The Unicode surrogate characters are supported
+by the Character Conversion API, but are not currently supported by all Symbian
+platform components. </p> <p>A variant of UTF-8 used internally by Java differs
+from standard UTF-8 in two ways. First, the specific case of the NULL character
+(0x0000) is encoded in the two-byte format, and second, only the one-, two-
+and three-byte formats are used, not the four-byte format which is normally
+used for Unicode surrogate-pairs. An argument to <codeph>ConvertFromUnicodeToUtf8</codeph> controls
+whether the UTF-8 generated by this is the Java variant. Support for this
+was removed in v6.0. </p> </section>
+<section id="GUID-5E593C5A-882B-5B11-AD6E-CFD10EA6700B"><title>Charconv converter</title> <p>Each
+converter implements a conversion between a single foreign character encoding
+and UTF-16, and is identified by a Unique Identifier (UID). The Symbian platform
+provides the following two types of converter: </p> <ul>
+<li id="GUID-8A1AFA7F-E330-5309-9ED7-26A4A46411CB"><p>built into the Framework
+component and used by most languages </p> </li>
+<li id="GUID-E7A3FD30-37E0-5B01-B2BE-7DE313045D9F"><p>implemented as Ecom
+plug-ins in the Plug-ins component and used by certain languages. </p> </li>
+</ul> </section>
+</conbody></concept>
\ No newline at end of file