Symbian3/SDK/Source/GUID-FE94596E-B5BB-51FE-BE38-069840323915.dita
author Dominic Pinkman <dominic.pinkman@nokia.com>
Fri, 11 Jun 2010 12:39:03 +0100
changeset 8 ae94777fff8f
parent 7 51a74ef9ed63
child 13 48780e181b38
permissions -rw-r--r--
Week 23 contribution of SDK documentation content. See release notes for details. Fixes bugs Bug 2714, Bug 462.

<?xml version="1.0" encoding="utf-8"?>
<!-- Copyright (c) 2007-2010 Nokia Corporation and/or its subsidiary(-ies) All rights reserved. -->
<!-- This component and the accompanying materials are made available under the terms of the License 
"Eclipse Public License v1.0" which accompanies this distribution, 
and is available at the URL "http://www.eclipse.org/legal/epl-v10.html". -->
<!-- Initial Contributors:
    Nokia Corporation - initial contribution.
Contributors: 
-->
<!DOCTYPE concept
  PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="GUID-FE94596E-B5BB-51FE-BE38-069840323915" xml:lang="en"><title>Encoding
Types</title><prolog><metadata><keywords/></metadata></prolog><conbody>
<p>This topic describes the types of SMS encoding. </p>
<section id="GUID-F7D1E6C8-9605-57FA-9788-AF7FC72BD94C"><title>7-bit GSM encoding</title> <p>7-bit
GSM encoding supports the GSM 7-bit default alphabet and GSM 7-bit default
alphabet extension table through an escape mechanism. </p> <p>Figure 1 </p> <fig id="GUID-CDEE59FC-F035-5B75-8838-96E94A6714E8">
<title>              Escape mechanism            </title>
<image href="GUID-08A6B93F-92CD-5182-B142-D353E78016F3_d0e406599_href.png" placement="inline"/>
</fig> <p>The GSM 7-bit default alphabet consists of 128 characters. Each
character is represented by 7 bits. 10 extra characters are defined in the
GSM 7-bit default extension table. These characters are represented by an
escape mechanism using the escape character (0x1B). For example, 0x1B65 maps
to the Euro sign € (U+20AC). If an escape character byte is followed by a
character that is not included in the 10 characters, the escape character
is just ignored. This means 0x1B41 maps to Latin capital letter A (U+0041). </p> <p>For
more information about the GSM 7-bit default table, extension table and escape
mechanism, see 3GPP TS 23.038 V8.1.0. </p> </section>
<section id="GUID-918FF2E3-B9F4-5C61-8DBA-F9143DB16460"><title>Lossy 7-bit
encoding</title> <p>Lossy 7-bit encoding enlarges the character set supported
by 7-bit GSM encoding. Some Unicode Characters do not exist in the target
7-bit set. These characters are converted to ones that do exist in the target
7-bit set and closely resemble the original, intended character. A lossy encoding
using a 7-bit encoding is more cost effective than a UCS-2 encoding. </p> <p> <b>Example
of 7-bit encoding</b>  </p> <p>Accented Latin characters are not supported
by 7-bit GSM encoding. Figure 2 describes how an accented Latin characters
Á, is sent by SMS. Á has a Unicode value of 0x00C1. When it is processed by
the Lossy converter the character is converted from the Unicode to 7-bit code
letter A. A has a 7-bit code of 0x41. The SMS receiver reads A instead of
Á. By substituting the character that is similar enough to the original, the
reader can understand the word. The process of converting Á to A is called
a lossy conversion. </p> <p> <b>Note</b>: The 7-bit code of A (0x41) can only
be decoded back to the same Unicode letter A instead of Á. </p> <p>Figure
2 </p> <fig id="GUID-ACFF9511-D5E0-5558-8008-4CD48EE0B7A1">
<title>              Lossy conversion            </title>
<image href="GUID-8862E271-ABA4-5A25-8990-C0B3931E370D_d0e406639_href.png" placement="inline"/>
</fig> </section>
<section id="GUID-D2F0E6BE-932E-545D-A0C8-39017E3D67B4"><title>16-bit Unicode
encoding</title> <p>Unicode is an international standard character set. It
includes the characters of every language. In Unicode, each character is usually
encoded in two 8-bit bytes, and takes up more space than 7-bit encoding. </p> </section>
<section id="GUID-93B3DDF2-8EB1-5853-9DFD-3ABF42ADCB40"><title>National language
encoding</title> <p>According to 3GPP TS 23.038 V8.1.0, National Language
Encoding supports additional characters for certain languages which cannot
be represented in the GSM default 7-bit alphabet. It defines two mechanisms
for doing this: </p> <ul>
<li id="GUID-9ECCA8BD-0BA0-5AE2-B2D6-4677D2CD1BD7"><p>Locking shift mechanism–the
default GSM table is replaced with a table containing the character set needed
for a language. The table is referred to as locking shift table. </p> </li>
<li id="GUID-3900D849-350A-5722-9759-D1D768FE6A84"><p>Single shift mechanism–the
GSM extension table is replaced with a table containing the character set
needed for a language. The table is referred to as single shift table. </p> </li>
</ul> <p>When the locking shift mechanism is used, the escape table can be
the existing GSM extension table or it can be the escape table used by the
single shift mechanism. This supports three possible mappings as shown in
Figure 3: </p> <ul>
<li id="GUID-34ECF450-6265-58E2-9CB6-00E0C5DDA6F8"><p>The GSM 7-bit default
escapes to language-specific escape table. It is referred to as GSM-single. </p> </li>
<li id="GUID-6E8A53BF-0572-5DE2-8D41-FB588B6FB812"><p>The Language-specific
basic table escapes to GSM 7-bit default extension table. It is referred to
as locking-GSM ext. </p> </li>
<li id="GUID-830569B1-8ACD-5924-AF7F-15705FEF76B0"><p>The Language-specific
basic table escapes to language-specific extension table. It is referred to
as locking-single. </p> </li>
</ul> <p>Figure 3 </p> <fig id="GUID-541CED9A-2450-5C9D-AADF-93EE59E4D77E">
<title>              National language encoding            </title>
<image href="GUID-44347376-702D-5648-8938-EB55AFA329EC_d0e406701_href.png" placement="inline"/>
</fig><p>The single shift mechanism is useful when a message contains only
a few characters outside the default GSM table. It is however inefficient
when a message contains many unsupported characters, because each escaped
character must occupy 2 bytes. GSM-single supports more characters than locking-GSM
ext, but these characters are in the single table, which takes 2 bytes. Locking-single
is used more for the decoding purpose in case the extra characters can come
from the locking or single table. </p><p>The locking or single table is not
a complete replacement. For example, the locking table for Turkish redefines
only 8-character codes from the default GSM table, as shown in table 1. The
escape table for Turkish adds 7 characters to the GSM extension, as shown
in table 2. </p><table id="GUID-4AE6F58D-A5DA-4AD9-B39E-A61AA378F3F6"><title>Table 1</title>
<tgroup cols="3"><colspec colname="col1"/><colspec colname="col2"/><colspec colname="col3"/>
<thead>
<row>
<entry><p>GSM 7-Bit Code</p></entry>
<entry><p>Turkish Locking Shift Table</p></entry>
<entry><p>GSM 7-Bit Default Table</p></entry>
</row>
</thead>
<tbody>
<row>
<entry><p><codeph>0x40</codeph></p></entry>
<entry><p>I LATIN CAPITAL LETTER I WITH DOT ABOVE</p></entry>
<entry><p>¡ INVERTED EXCLAMATION MARK </p></entry>
</row>
<row>
<entry><p><codeph>0x60</codeph></p></entry>
<entry><p>ç LATIN SMALL LETTER C WITH CEDILLA</p></entry>
<entry><p>¿ INVERTED QUESTION MARK</p></entry>
</row>
<row>
<entry><p><codeph>0x04</codeph></p></entry>
<entry><p>€ EURO SIGN</p></entry>
<entry><p>è LATIN SMALL LETTER E WITH GRAVE</p></entry>
</row>
<row>
<entry><p><codeph>0x07</codeph></p></entry>
<entry><p>i LATIN SMALL LETTER DOTLESS</p></entry>
<entry><p>ì LATIN SMALL LETTER I WITH GRAVE</p></entry>
</row>
<row>
<entry><p><codeph>0x0B</codeph></p></entry>
<entry><p>G LATIN CAPITAL LETTER G WITH BREVE</p></entry>
<entry><p>Ø LATIN CAPITAL LETTER O WITH STROKE</p></entry>
</row>
<row>
<entry><p><codeph>0x0C</codeph></p></entry>
<entry><p>g LATIN SMALL LETTER G WITH BREVE</p></entry>
<entry><p>ø LATIN SMALL LETTER O WITH STROKE</p></entry>
</row>
<row>
<entry><p><codeph>0x1C</codeph></p></entry>
<entry><p>S LATIN CAPITAL LETTER S WITH CEDILLA *</p></entry>
<entry><p>Æ LATIN CAPITAL LETTER AE</p></entry>
</row>
<row>
<entry><p><codeph>0x1D</codeph></p></entry>
<entry><p>s LATIN SMALL LETTER S WITH CEDILLA *</p></entry>
<entry><p>æ LATIN SMALL LETTER AE</p></entry>
</row>
</tbody>
</tgroup>
</table> <table id="GUID-EC345039-0CB5-4F51-8CFA-83286790AC75"><title>Table 2</title>
<tgroup cols="3"><colspec colname="col1"/><colspec colname="col2"/><colspec colname="col3"/>
<thead>
<row>
<entry><p>GSM 7-Bit Code</p></entry>
<entry><p>Turkish Single Shift Table</p></entry>
<entry><p>GSM 7-Bit Extension Table</p></entry>
</row>
</thead>
<tbody>
<row>
<entry><p><codeph>0x1B49</codeph></p></entry>
<entry><p>I LATIN CAPITAL LETTER I WITH DOT ABOVE</p></entry>
<entry><p/></entry>
</row>
<row>
<entry><p><codeph>0x1B63</codeph></p></entry>
<entry><p>ç LATIN SMALL LETTER C WITH CEDILLA</p></entry>
<entry><p/></entry>
</row>
<row>
<entry><p><codeph>0x1B69</codeph></p></entry>
<entry><p>i LATIN SMALL LETTER DOTLESS</p></entry>
<entry><p/></entry>
</row>
<row>
<entry><p><codeph>0x1B47</codeph></p></entry>
<entry><p>G LATIN CAPITAL LETTER G WITH BREVE</p></entry>
<entry><p/></entry>
</row>
<row>
<entry><p><codeph>0x1B67</codeph></p></entry>
<entry><p>g LATIN SMALL LETTER G WITH BREVE</p></entry>
<entry><p/></entry>
</row>
<row>
<entry><p><codeph>0x1B53</codeph></p></entry>
<entry><p>S LATIN CAPITAL LETTER S WITH CEDILLA *</p></entry>
<entry><p/></entry>
</row>
<row>
<entry><p><codeph>0x1B73</codeph></p></entry>
<entry><p>s LATIN SMALL LETTER S WITH CEDILLA *</p></entry>
<entry><p/></entry>
</row>
</tbody>
</tgroup>
</table><p>For more information about the National Language Identifier, Single
or Locking mechanism, see 3GPP TS 23.038 V8.1.0: National Language Identifier.</p></section>
<section><title>See also</title> <p> <xref href="GUID-0BC9A9A1-DB99-5095-8390-E1C1B04D0080.dita">SMS
Encodings and Converters Overview</xref> </p> </section>
</conbody></concept>