Character Conversion (Charconv) Framework Concepts

This section describes terminology that is often used in character conversion, such as BMP and Charconv converters.

Character sets

Textual data in electronic devices is stored in terms of a character set. A character set is a group of characters, each of which is encoded as a different number. The appearance of each character is not a property of the character set, but rather of the font. A character may therefore be rendered using many different glyphs, but it always has the same numeric value within its character set. Other properties that can be included in a character set’s definition are the direction of writing and the way in which sets of characters are combined.
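The point that a character's number belongs to the character set, and not to the character itself, can be illustrated with a short Python sketch. The two character sets used here (Latin-1 and code page 437) are assumptions chosen only for the example; neither is named in this section, and the helper `code_in` is hypothetical.

```python
def code_in(charset: str, ch: str) -> int:
    """Return the single-byte code that the given character set assigns to ch."""
    encoded = ch.encode(charset)
    if len(encoded) != 1:
        raise ValueError(f"{ch!r} is not a single-byte character in {charset}")
    return encoded[0]


# The same character, looked up in two different character sets.
latin1_code = code_in("latin-1", "é")   # ISO 8859-1
cp437_code = code_in("cp437", "é")      # original IBM PC code page
unicode_code = ord("é")                 # Unicode code point

print(hex(latin1_code), hex(cp437_code), hex(unicode_code))
```

The glyph drawn for the character plays no part here: only the numeric value changes from one character set to another.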

Unicode, UCS and UCS-2

Character sets, and the ways of encoding them, have proliferated with the increasing acceptance of computers and communicators throughout the world. This has led to an international standard character set, Unicode, which encompasses all commonly used character sets, including Eastern ideograms, in a single character set. Unicode is defined by the Unicode Consortium (http://www.unicode.org).

UCS stands for Universal Character Set, the character set defined by ISO/IEC 10646, which shares its code points with Unicode. In the UCS-2 encoding format, each character is encoded as one 16-bit value, which is written to files as two bytes. There are also other Unicode encoding formats, such as UTF-16 and UTF-8, for different purposes. For the full definition of these formats, see The Unicode Standard published by the Unicode Consortium.

BMP

Unicode code points in the range U+0000 to U+FFFF form the Basic Multilingual Plane (BMP). The BMP covers almost all characters in different languages. In UTF-16, code points outside the BMP must be encoded using a "surrogate pair", which consists of two 16-bit values. Code points above U+FFFF are also known as supplementary characters. The Symbian platform currently does not support scripts with characters mapped to code points above U+FFFF.
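The surrogate-pair arithmetic defined by UTF-16 can be sketched in a few lines of Python. The function names `is_bmp` and `to_surrogate_pair` are hypothetical and are not part of any Symbian API; the sketch only shows how a supplementary code point is split into two 16-bit values.

```python
from typing import Tuple


def is_bmp(code_point: int) -> bool:
    """True if the code point lies in the Basic Multilingual Plane."""
    return 0x0000 <= code_point <= 0xFFFF


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogate values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")
```

For U+1F600 this gives the pair D83D DE00, which is how that supplementary character appears in a UTF-16 stream.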

UTF-16

UTF-16 is one of the Unicode encoding formats. It encodes each character inside the BMP as a single 16-bit value, and each character outside the BMP as a pair of 16-bit values.

In the text-processing subsystem, the Symbian platform uses UTF-16 Unicode format. This means that any input to the text-processing subsystem must be in UTF-16. Different character converters can be used to convert text from other encoding formats to UTF-16.
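The idea of converting text from a foreign encoding into UTF-16 can be sketched with Python's built-in codecs standing in for the platform's converters. This is only an illustration under that assumption: the function `to_utf16_units` is hypothetical, and the source encodings (Latin-1 and Shift-JIS) are chosen for the example.

```python
import struct
from typing import List


def to_utf16_units(data: bytes, source_encoding: str) -> List[int]:
    """Decode bytes in a foreign encoding and return UTF-16 code units."""
    text = data.decode(source_encoding)
    raw = text.encode("utf-16-le")                  # two bytes per code unit
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


units_latin1 = to_utf16_units(b"caf\xe9", "latin-1")
units_sjis = to_utf16_units(b"\x82\xa0", "shift_jis")

print([f"{u:04X}" for u in units_latin1])
print([f"{u:04X}" for u in units_sjis])
```

Whatever the source encoding, the result is a sequence of 16-bit values that a UTF-16 text subsystem can consume.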

Transformation formats

The UCS-2 format of the Unicode character set encodes each character as 2 bytes (16 bits total). However, it does not specify which of the bytes is most significant. The byte order, or endianness, is left to the discretion of a particular operating system.

While this is not important within a system, it does mean that text encoded as UCS-2 cannot easily be shared between systems that use different byte orders. To overcome this problem, the Unicode Consortium has defined two transformation formats for sharing Unicode text. The transformation formats explicitly specify byte order, and so cannot be misinterpreted by computers that use a different byte order.
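The byte-order problem can be made concrete with a small Python sketch. The helper `ucs2_bytes` is hypothetical; it simply writes one 16-bit value in either byte order and then shows what happens when the reader assumes the wrong one.

```python
import struct


def ucs2_bytes(code_point: int, big_endian: bool) -> bytes:
    """Write one BMP code point as two bytes in the requested byte order."""
    if not 0x0000 <= code_point <= 0xFFFF:
        raise ValueError("UCS-2 can only hold 16-bit values")
    return struct.pack(">H" if big_endian else "<H", code_point)


euro = 0x20AC                                  # EURO SIGN
be = ucs2_bytes(euro, big_endian=True)
le = ucs2_bytes(euro, big_endian=False)

# A big-endian reader given little-endian bytes sees a different character.
misread = struct.unpack(">H", le)[0]

print(be.hex(), le.hex(), f"{misread:04X}")
```

The same character produces the bytes 20 AC on one system and AC 20 on another, and reading the second form with the first system's byte order yields U+AC20 instead of U+20AC.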

The two transformation formats, UTF-7 and UTF-8, are described below. For the full definition of these formats, see The Unicode Standard published by the Unicode Consortium.

UTF-7

UTF-7 allows Unicode characters to be encoded and transmitted as 8-bit bytes, of which only 7 bits are used. UTF-7 divides the set of Unicode characters into three subsets, which are encoded and transmitted differently.

Set D is the set of characters that are encoded as a single byte. It includes lower and upper case A to Z, the numeric digits, and nine other characters.
Set O includes the characters ! " # $ % & * ; < = > @ [ ] ^ _ { | }. These characters can be encoded as a single byte, or with the modified base 64 encoding used for set B characters. When encoded as a single byte, set O characters can be misinterpreted by some applications; encoding them as modified base 64 overcomes this problem.
Set B comprises the remaining characters, which are encoded as an escape byte followed by 2 or 3 bytes. The encoding format is a modified form of base 64 encoding.
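The three sets can be seen in action using Python's standard `utf-7` codec, used here as a stand-in for a UTF-7 converter. The output of the platform's own converter may differ in detail (for example, in whether set O characters are written directly), so treat this as a sketch of the format and not of the Charconv implementation.

```python
# Set D: a letter is passed through as a single byte.
direct = "A".encode("utf-7")

# Set B: the euro sign is written as an escape byte ('+') followed by
# modified base 64 characters; this encoder then closes the run with '-'.
base64_form = "€".encode("utf-7")

# Set O: '!' may arrive either as a single byte or in modified base 64.
# A decoder has to accept both forms.
set_o_direct = b"!".decode("utf-7")
set_o_base64 = b"+ACE-".decode("utf-7")

print(direct, base64_form, set_o_direct, set_o_base64)
```

Every byte in the encoded output is below 128, which is what makes the format usable on 7-bit transports.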

UTF-8

UTF-8 encodes and transmits Unicode characters as a string of 8-bit bytes. All the ASCII characters 0 to 127 are encoded without change; the most significant bit being set to zero is a signal that they have not been changed. Unicode characters U+0080 to U+07FF are encoded in two bytes, and the remaining characters in the BMP, except for the surrogates, are encoded in three bytes. The Unicode surrogate characters are supported by the Character Conversion API, but are not currently supported by all Symbian platform components.
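The one-, two- and three-byte ranges described above can be sketched as a small encoder. The function `utf8_encode_bmp` is hypothetical and deliberately limited to BMP code points; it is checked here against Python's own UTF-8 encoder.

```python
def utf8_encode_bmp(code_point: int) -> bytes:
    """Encode one BMP code point as UTF-8 (one, two or three bytes)."""
    if not 0x0000 <= code_point <= 0xFFFF:
        raise ValueError("only BMP code points are handled in this sketch")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not encoded on their own")
    if code_point < 0x80:
        # ASCII: unchanged, most significant bit is zero.
        return bytes([code_point])
    if code_point < 0x800:
        # U+0080 to U+07FF: two bytes.
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    # Remaining BMP characters: three bytes.
    return bytes([0xE0 | (code_point >> 12),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for cp in (0x41, 0xE9, 0x20AC):
    print(f"U+{cp:04X} -> {utf8_encode_bmp(cp).hex()}")
```

Because each byte carries its own marker bits, the byte sequence is the same on every system, regardless of that system's native byte order.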

A variant of UTF-8 used internally by Java differs from standard UTF-8 in two ways. First, the NULL character (0x0000) is encoded in the two-byte format. Second, only the one-, two- and three-byte formats are used, not the four-byte format that is normally used for Unicode surrogate pairs. An argument to ConvertFromUnicodeToUtf8 controls whether the UTF-8 generated is the Java variant. Support for this was removed in v6.0.
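The two differences can be sketched as follows. The function `encode_java_utf8` is hypothetical and is not the ConvertFromUnicodeToUtf8 API; it only illustrates the variant's rules by encoding each 16-bit UTF-16 value separately.

```python
import struct


def encode_java_utf8(text: str) -> bytes:
    """Encode text in the Java variant of UTF-8 described above."""
    raw = text.encode("utf-16-be")
    units = struct.unpack(f">{len(raw) // 2}H", raw)   # 16-bit values
    out = bytearray()
    for unit in units:
        if 0 < unit < 0x80:
            out.append(unit)                           # one-byte format
        elif unit < 0x800:
            # Two-byte format; this branch also handles 0x0000 (C0 80).
            out += bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
        else:
            # Three-byte format, used even for each half of a surrogate pair.
            out += bytes([0xE0 | (unit >> 12),
                          0x80 | ((unit >> 6) & 0x3F),
                          0x80 | (unit & 0x3F)])
    return bytes(out)


print(encode_java_utf8("\x00").hex())          # variant encoding of NULL
print(encode_java_utf8("\U0001F600").hex())    # two three-byte sequences
print("\U0001F600".encode("utf-8").hex())      # standard four-byte form
```

For text that contains neither NULL nor supplementary characters, the variant and standard UTF-8 produce identical bytes.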

Charconv converter

Each converter implements a conversion between a single foreign character encoding and UTF-16, and is identified by a Unique Identifier (UID). The Symbian platform provides the following two types of converter:

built into the Framework component and used by most languages
implemented as Ecom plug-ins in the Plug-ins component and used by certain languages.
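The arrangement of converters keyed by UID, with built-in converters consulted alongside plug-ins, can be pictured with a toy registry. Everything in this sketch is an assumption made for illustration: the UID values are invented, the names `BUILT_IN`, `PLUGINS` and `find_converter` are hypothetical, and Python codecs stand in for real converters. It does not reflect the actual Charconv or Ecom interfaces.

```python
from typing import Dict, NamedTuple


class Converter(NamedTuple):
    """One converter: a single foreign encoding paired with UTF-16."""
    name: str
    codec: str   # Python codec standing in for the conversion logic

    def to_utf16(self, foreign: bytes) -> bytes:
        return foreign.decode(self.codec).encode("utf-16-le")

    def from_utf16(self, utf16: bytes) -> bytes:
        return utf16.decode("utf-16-le").encode(self.codec)


# Invented UIDs, for illustration only.
BUILT_IN: Dict[int, Converter] = {0x00000001: Converter("UTF-8", "utf-8")}
PLUGINS: Dict[int, Converter] = {0x00000002: Converter("Shift-JIS", "shift_jis")}


def find_converter(uid: int) -> Converter:
    """Look up a converter by UID, checking built-ins before plug-ins."""
    for table in (BUILT_IN, PLUGINS):
        if uid in table:
            return table[uid]
    raise KeyError(f"no converter registered for UID {uid:#010x}")


sjis = find_converter(0x00000002)
utf16 = sjis.to_utf16(b"\x82\xa0")
print(sjis.name, utf16.hex())
```

The caller needs only the UID; whether the converter is built in or supplied as a plug-in is hidden behind the lookup.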
