<?xml version="1.0" encoding="utf-8"?>
<!-- Copyright (c) 2007-2010 Nokia Corporation and/or its subsidiary(-ies) All rights reserved. -->
<!-- This component and the accompanying materials are made available under the terms of the License
"Eclipse Public License v1.0" which accompanies this distribution,
and is available at the URL "http://www.eclipse.org/legal/epl-v10.html". -->
<!-- Initial Contributors:
    Nokia Corporation - initial contribution.
Contributors:
-->
<!DOCTYPE concept
  PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="GUID-FE94596E-B5BB-51FE-BE38-069840323915" xml:lang="en"><title>Encoding
Types</title><prolog><metadata><keywords/></metadata></prolog><conbody>
<p>This topic describes the types of SMS encoding: 7-bit GSM encoding, lossy 7-bit
encoding, 16-bit Unicode encoding and national language encoding. </p>
<section id="GUID-F7D1E6C8-9605-57FA-9788-AF7FC72BD94C"><title>7-bit GSM encoding</title> <p>7-bit
GSM encoding supports the GSM 7-bit default alphabet and, through an escape
mechanism, the GSM 7-bit default alphabet extension table. </p> <p>Figure 1 </p> <fig id="GUID-CDEE59FC-F035-5B75-8838-96E94A6714E8">
<title> Escape mechanism </title>
<image href="GUID-08A6B93F-92CD-5182-B142-D353E78016F3_d0e406761_href.png" placement="inline"/>
</fig> <p>The GSM 7-bit default alphabet consists of 128 characters, each
represented by 7 bits. Ten extra characters are defined in the
GSM 7-bit default alphabet extension table. Each of these characters is represented by
the escape character (0x1B) followed by a second 7-bit code. For example, 0x1B65 maps
to the Euro sign € (U+20AC). If the escape character is followed by a
code that is not one of the 10 defined in the extension table, the escape character
is ignored and the code is looked up in the default alphabet. For example, 0x1B41
maps to Latin capital letter A (U+0041). </p> <p>For
more information about the GSM 7-bit default table, extension table and escape
mechanism, see 3GPP TS 23.038 V8.1.0. </p> </section>
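<example><title>Sketch: decoding with the escape mechanism</title>
<p>The escape rules described for 7-bit GSM encoding can be sketched in a few lines of Python.
This is an illustrative sketch only, not the platform converter: the names
<codeph>GSM_DEFAULT</codeph>, <codeph>GSM_EXTENSION</codeph> and <codeph>decode_gsm7</codeph>
are hypothetical, the input is assumed to be a list of already unpacked 7-bit codes, and dropping
a trailing escape code is a choice made for this sketch. </p>
<codeblock xml:space="preserve"><![CDATA[
```python
# Illustrative sketch only; these names are hypothetical, not a platform API.

# GSM 7-bit default alphabet, indexed by 7-bit code (0x00..0x7F).
GSM_DEFAULT = (
    "@£$¥èéùìòÇ\nØø\rÅå"
    "Δ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ"
    " !\"#¤%&'()*+,-./"
    "0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNO"
    "PQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmno"
    "pqrstuvwxyzäöñüà"
)

# The 10 characters of the default extension table, keyed by the code
# that follows the escape character.
GSM_EXTENSION = {
    0x0A: "\f", 0x14: "^", 0x28: "{", 0x29: "}", 0x2F: "\\",
    0x3C: "[", 0x3D: "~", 0x3E: "]", 0x40: "|", 0x65: "\u20ac",
}

ESCAPE = 0x1B


def decode_gsm7(codes):
    """Decode a sequence of unpacked 7-bit codes into a Unicode string."""
    out = []
    i = 0
    while i < len(codes):
        code = codes[i]
        if code == ESCAPE:
            if i + 1 < len(codes):
                nxt = codes[i + 1]
                # Use the extension table when it defines the code; otherwise
                # ignore the escape and fall back to the default alphabet.
                out.append(GSM_EXTENSION.get(nxt, GSM_DEFAULT[nxt]))
            # A trailing escape with nothing after it is dropped in this sketch.
            i += 2
            continue
        out.append(GSM_DEFAULT[code])
        i += 1
    return "".join(out)
```
]]></codeblock>
<p>With these tables, <codeph>decode_gsm7([0x1B, 0x65])</codeph> yields the Euro sign, and
<codeph>decode_gsm7([0x1B, 0x41])</codeph> yields A because 0x41 is not defined in the extension table. </p>
</example>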
<section id="GUID-918FF2E3-B9F4-5C61-8DBA-F9143DB16460"><title>Lossy 7-bit
encoding</title> <p>Lossy 7-bit encoding enlarges the set of characters that can be sent
using 7-bit GSM encoding. Some Unicode characters do not exist in the target
7-bit set. These characters are converted to characters that do exist in the target
7-bit set and closely resemble the original, intended character. Lossy
7-bit encoding is more cost effective than UCS-2 encoding, because each character
takes fewer bits. </p> <p> <b>Example
of lossy 7-bit encoding</b> </p> <p>Many accented Latin characters are not supported
by 7-bit GSM encoding. Figure 2 shows how one such character,
Á, is sent by SMS. Á has a Unicode value of 0x00C1. When it is processed by
the lossy converter, it is converted from Unicode to the 7-bit code for the
letter A, which is 0x41. The SMS receiver reads A instead of
Á. Because the substituted character is similar enough to the original, the
reader can still understand the word. The process of converting Á to A is called
a lossy conversion. </p> <p> <b>Note</b>: The 7-bit code of A (0x41) is always
decoded back to the Unicode letter A, never to Á, so the original character
cannot be recovered. </p> <p>Figure
2 </p> <fig id="GUID-ACFF9511-D5E0-5558-8008-4CD48EE0B7A1">
<title> Lossy conversion </title>
<image href="GUID-8862E271-ABA4-5A25-8990-C0B3931E370D_d0e406801_href.png" placement="inline"/>
</fig> </section>
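<example><title>Sketch: lossy substitution</title>
<p>The following Python sketch illustrates the idea of a lossy conversion. It is not the
converter described in this topic, whose substitution table is not given here. As a stand-in,
it assumes that Unicode canonical decomposition (NFD) is an acceptable way to find the base
letter, and it uses a small hypothetical <codeph>SUPPORTED</codeph> set in place of the real
7-bit character set. The names <codeph>to_lossy</codeph> and <codeph>fallback</codeph> are
also hypothetical. </p>
<codeblock xml:space="preserve"><![CDATA[
```python
import string
import unicodedata

# Hypothetical stand-in for the target 7-bit set, for illustration only.
SUPPORTED = set(string.ascii_letters + string.digits + " ")


def to_lossy(text, supported=SUPPORTED, fallback="?"):
    """Replace unsupported characters with a similar supported one."""
    out = []
    for ch in text:
        if ch in supported:
            out.append(ch)
            continue
        # Decompose the character (for example, Á becomes A followed by a
        # combining acute accent) and keep only the base character.
        base = unicodedata.normalize("NFD", ch)[0]
        out.append(base if base in supported else fallback)
    return "".join(out)
```
]]></codeblock>
<p>Here <codeph>to_lossy("Á")</codeph> returns A. As in the note above, nothing in the
result records that the original character was Á. </p>
</example>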
<section id="GUID-D2F0E6BE-932E-545D-A0C8-39017E3D67B4"><title>16-bit Unicode
encoding</title> <p>Unicode is an international standard character set that covers
the characters of most of the world's writing systems. In the UCS-2 encoding, each character is
encoded in two 8-bit bytes, so the text takes up more space than in 7-bit encoding. </p> </section>
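<example><title>Sketch: space used by 7-bit and 16-bit encoding</title>
<p>The difference in space can be made concrete with a little arithmetic. The sketch below
assumes a user data size of 140 octets for a single SMS. That figure is not stated in this
topic and is used here as an assumption, and all function names are hypothetical. Python's
<codeph>utf-16-be</codeph> codec is used as a stand-in for UCS-2, which matches it only for
characters in the Basic Multilingual Plane. </p>
<codeblock xml:space="preserve"><![CDATA[
```python
# Assumption for this sketch: user data size of a single SMS, in octets.
PAYLOAD_OCTETS = 140


def max_chars_7bit(payload=PAYLOAD_OCTETS):
    """Number of 7-bit codes that fit when packed into the payload."""
    return payload * 8 // 7


def max_chars_ucs2(payload=PAYLOAD_OCTETS):
    """Number of 16-bit characters that fit into the payload."""
    return payload // 2


def packed_7bit_octets(code_count):
    """Octets needed to pack code_count 7-bit codes (rounded up)."""
    return (code_count * 7 + 7) // 8


def ucs2_octets(text):
    """Octets needed for text as 16-bit units (BMP characters only)."""
    return len(text.encode("utf-16-be"))
```
]]></codeblock>
<p>Under that assumption a message holds 160 characters in 7-bit encoding but only 70 in
16-bit encoding. </p>
</example>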
<section id="GUID-93B3DDF2-8EB1-5853-9DFD-3ABF42ADCB40"><title>National language
encoding</title> <p>According to 3GPP TS 23.038 V8.1.0, national language
encoding supports additional characters for certain languages that cannot
be represented in the GSM 7-bit default alphabet. The specification defines two mechanisms
for doing this: </p> <ul>
<li id="GUID-9ECCA8BD-0BA0-5AE2-B2D6-4677D2CD1BD7"><p>Locking shift mechanism: the
default GSM table is replaced with a table containing the character set needed
for a language. This table is referred to as the locking shift table. </p> </li>
<li id="GUID-3900D849-350A-5722-9759-D1D768FE6A84"><p>Single shift mechanism: the
GSM extension table is replaced with a table containing the character set
needed for a language. This table is referred to as the single shift table. </p> </li>
</ul> <p>When the locking shift mechanism is used, the escape table can be
either the existing GSM extension table or the escape table used by the
single shift mechanism. This gives three possible mappings, as shown in
Figure 3: </p> <ul>
<li id="GUID-34ECF450-6265-58E2-9CB6-00E0C5DDA6F8"><p>The GSM 7-bit default
table escapes to the language-specific single shift table. This mapping is referred to as GSM-single. </p> </li>
<li id="GUID-6E8A53BF-0572-5DE2-8D41-FB588B6FB812"><p>The language-specific
basic table escapes to the GSM 7-bit default extension table. This mapping is referred to
as locking-GSM ext. </p> </li>
<li id="GUID-830569B1-8ACD-5924-AF7F-15705FEF76B0"><p>The language-specific
basic table escapes to the language-specific single shift table. This mapping is referred to
as locking-single. </p> </li>
</ul> <p>Figure 3 </p> <fig id="GUID-541CED9A-2450-5C9D-AADF-93EE59E4D77E">
<title> National language encoding </title>
<image href="GUID-44347376-702D-5648-8938-EB55AFA329EC_d0e406863_href.png" placement="inline"/>
</fig><p>The single shift mechanism is useful when a message contains only
a few characters outside the default GSM table. It is, however, inefficient
when a message contains many such characters, because each escaped
character occupies 2 bytes (the escape code plus the character code). GSM-single
supports more characters than locking-GSM
ext, but the additional characters are in the single shift table, so each takes 2 bytes. Locking-single
is mainly useful for decoding, where the extra characters can come
from either the locking shift table or the single shift table. </p><p>The locking shift or single shift table is not
a complete replacement. For example, the locking shift table for Turkish redefines
only 8 character codes from the default GSM table, as shown in Table 1. The
single shift table for Turkish adds 7 characters to the GSM extension table, as shown
in Table 2. </p><table id="GUID-4AE6F58D-A5DA-4AD9-B39E-A61AA378F3F6"><title>Table 1</title>
<tgroup cols="3"><colspec colname="col1"/><colspec colname="col2"/><colspec colname="col3"/>
<thead>
<row>
<entry><p>GSM 7-Bit Code</p></entry>
<entry><p>Turkish Locking Shift Table</p></entry>
<entry><p>GSM 7-Bit Default Table</p></entry>
</row>
</thead>
<tbody>
<row>
<entry><p><codeph>0x40</codeph></p></entry>
<entry><p>İ LATIN CAPITAL LETTER I WITH DOT ABOVE</p></entry>
<entry><p>¡ INVERTED EXCLAMATION MARK</p></entry>
</row>
<row>
<entry><p><codeph>0x60</codeph></p></entry>
<entry><p>ç LATIN SMALL LETTER C WITH CEDILLA</p></entry>
<entry><p>¿ INVERTED QUESTION MARK</p></entry>
</row>
<row>
<entry><p><codeph>0x04</codeph></p></entry>
<entry><p>€ EURO SIGN</p></entry>
<entry><p>è LATIN SMALL LETTER E WITH GRAVE</p></entry>
</row>
<row>
<entry><p><codeph>0x07</codeph></p></entry>
<entry><p>ı LATIN SMALL LETTER DOTLESS I</p></entry>
<entry><p>ì LATIN SMALL LETTER I WITH GRAVE</p></entry>
</row>
<row>
<entry><p><codeph>0x0B</codeph></p></entry>
<entry><p>Ğ LATIN CAPITAL LETTER G WITH BREVE</p></entry>
<entry><p>Ø LATIN CAPITAL LETTER O WITH STROKE</p></entry>
</row>
<row>
<entry><p><codeph>0x0C</codeph></p></entry>
<entry><p>ğ LATIN SMALL LETTER G WITH BREVE</p></entry>
<entry><p>ø LATIN SMALL LETTER O WITH STROKE</p></entry>
</row>
<row>
<entry><p><codeph>0x1C</codeph></p></entry>
<entry><p>Ş LATIN CAPITAL LETTER S WITH CEDILLA *</p></entry>
<entry><p>Æ LATIN CAPITAL LETTER AE</p></entry>
</row>
<row>
<entry><p><codeph>0x1D</codeph></p></entry>
<entry><p>ş LATIN SMALL LETTER S WITH CEDILLA *</p></entry>
<entry><p>æ LATIN SMALL LETTER AE</p></entry>
</row>
</tbody>
</tgroup>
</table> <table id="GUID-EC345039-0CB5-4F51-8CFA-83286790AC75"><title>Table 2</title>
<tgroup cols="3"><colspec colname="col1"/><colspec colname="col2"/><colspec colname="col3"/>
<thead>
<row>
<entry><p>GSM 7-Bit Code</p></entry>
<entry><p>Turkish Single Shift Table</p></entry>
<entry><p>GSM 7-Bit Extension Table</p></entry>
</row>
</thead>
<tbody>
<row>
<entry><p><codeph>0x1B49</codeph></p></entry>
<entry><p>İ LATIN CAPITAL LETTER I WITH DOT ABOVE</p></entry>
<entry><p/></entry>
</row>
<row>
<entry><p><codeph>0x1B63</codeph></p></entry>
<entry><p>ç LATIN SMALL LETTER C WITH CEDILLA</p></entry>
<entry><p/></entry>
</row>
<row>
<entry><p><codeph>0x1B69</codeph></p></entry>
<entry><p>ı LATIN SMALL LETTER DOTLESS I</p></entry>
<entry><p/></entry>
</row>
<row>
<entry><p><codeph>0x1B47</codeph></p></entry>
<entry><p>Ğ LATIN CAPITAL LETTER G WITH BREVE</p></entry>
<entry><p/></entry>
</row>
<row>
<entry><p><codeph>0x1B67</codeph></p></entry>
<entry><p>ğ LATIN SMALL LETTER G WITH BREVE</p></entry>
<entry><p/></entry>
</row>
<row>
<entry><p><codeph>0x1B53</codeph></p></entry>
<entry><p>Ş LATIN CAPITAL LETTER S WITH CEDILLA *</p></entry>
<entry><p/></entry>
</row>
<row>
<entry><p><codeph>0x1B73</codeph></p></entry>
<entry><p>ş LATIN SMALL LETTER S WITH CEDILLA *</p></entry>
<entry><p/></entry>
</row>
</tbody>
</tgroup>
</table><p>For more information about the National Language Identifier and the single
shift and locking shift mechanisms, see 3GPP TS 23.038 V8.1.0: National Language Identifier.</p></section>
<section><title>See also</title> <p> <xref href="GUID-0BC9A9A1-DB99-5095-8390-E1C1B04D0080.dita">SMS
Encodings and Converters Overview</xref> </p> </section>
</conbody></concept>
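<example><title>Sketch: cost of the three mappings for Turkish text</title>
<p>The trade-off between the three mappings can be illustrated by counting how many 7-bit codes
a piece of Turkish text needs under each one. The Python sketch below is a simplification, not
the platform converter. It takes the Turkish characters from Table 1 and Table 2 and assumes
that ASCII letters, digits and the space are present in every basic table. It also assumes
that the Turkish single shift table keeps the GSM extension characters, as suggested by the
statement that it adds 7 characters to the GSM extension. The names
<codeph>MAPPINGS</codeph> and <codeph>codes_needed</codeph> are hypothetical. </p>
<codeblock xml:space="preserve"><![CDATA[
```python
import string

# Simplification: treat ASCII letters, digits and space as present in
# every basic table (GSM default and Turkish locking shift).
COMMON_BASIC = set(string.ascii_letters + string.digits + " ")

# Characters redefined by the Turkish locking shift table (Table 1).
TURKISH_LOCKING = set("\u0130\u00e7\u20ac\u0131\u011e\u011f\u015e\u015f")

# Characters added by the Turkish single shift table (Table 2).
TURKISH_SINGLE = set("\u0130\u00e7\u0131\u011e\u011f\u015e\u015f")

# Printable characters of the GSM 7-bit default extension table.
GSM_EXTENSION = set("^{}\\[~]|\u20ac")

# For each mapping: (extra characters in the basic table, escaped characters).
MAPPINGS = {
    "GSM-single": (set(), GSM_EXTENSION | TURKISH_SINGLE),
    "locking-GSM ext": (TURKISH_LOCKING, GSM_EXTENSION),
    "locking-single": (TURKISH_LOCKING, GSM_EXTENSION | TURKISH_SINGLE),
}


def codes_needed(text, mapping):
    """Count the 7-bit codes needed to encode text under a mapping."""
    basic_extra, escaped = MAPPINGS[mapping]
    total = 0
    for ch in text:
        if ch in COMMON_BASIC or ch in basic_extra:
            total += 1  # one code from the basic table
        elif ch in escaped:
            total += 2  # escape code plus character code
        else:
            raise ValueError(f"{ch!r} is not covered by this sketch")
    return total
```
]]></codeblock>
<p>For the four-letter Turkish word ışık, this sketch counts 7 codes under GSM-single but 4 under
locking-GSM ext, which reflects why the single shift mechanism is inefficient for text with many
national characters. </p>
</example>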