Symbian3/PDK/Source/GUID-2624060D-A5E7-590A-9FA0-471AE42A9BE9.dita
changeset 1 25a17d01db0c
child 3 46218c8b8afa
equal deleted inserted replaced
0:89d6a7a84779 1:25a17d01db0c
       
     1 <?xml version="1.0" encoding="utf-8"?>
       
     2 <!-- Copyright (c) 2007-2010 Nokia Corporation and/or its subsidiary(-ies) All rights reserved. -->
       
     3 <!-- This component and the accompanying materials are made available under the terms of the License 
       
     4 "Eclipse Public License v1.0" which accompanies this distribution, 
       
     5 and is available at the URL "http://www.eclipse.org/legal/epl-v10.html". -->
       
     6 <!-- Initial Contributors:
       
     7     Nokia Corporation - initial contribution.
       
     8 Contributors: 
       
     9 -->
       
    10 <!DOCTYPE concept
       
    11   PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
       
    12 <concept id="GUID-2624060D-A5E7-590A-9FA0-471AE42A9BE9" xml:lang="en"><title>Cnvtool
       
    13 Control File</title><prolog><metadata><keywords/></metadata></prolog><conbody>
       
    14 <p>The control file is a text file which specifies the conversion algorithms
       
    15 used to convert (both ways) between ranges of characters. It is one of the
       
    16 input files used by <filepath>cnvtool</filepath> to create a Charconv plug-in
       
    17 DLL. </p>
       
    18 <p>The control file also specifies the code(s) of the character(s) to use
       
    19 to replace unconvertible Unicode characters, the endian-ness of the foreign
       
    20 character set (if single characters may be encoded by more than one byte)
       
    21 and the preferred character to use when a character has multiple equivalents
       
    22 in the target character set. </p>
       
    23 <p>The control file is case-insensitive. Comments begin with a # and extend
       
    24 to the end of the line. Additional blank lines and leading and trailing whitespace
       
    25 are ignored. </p>
       
    26 <section><title>Syntax</title> <p>There are four sections in the control file:
       
    27 the header, the foreign variable-byte data, the foreign-to-Unicode data and
       
    28 the Unicode-to-foreign data. </p> <p><b>The header</b> </p> <p>The header
       
    29 consists of two lines in fixed order. Their format is as follows (alternatives
       
    30 are separated by a <codeph>|</codeph>, single space characters represent single
       
    31 or multiple whitespace characters): </p> <codeblock id="GUID-1AE937FB-E7A5-5ECD-BCD3-2A0C24C8869D" xml:space="preserve">Endianness Unspecified|FixedLittleEndian|FixedBigEndian</codeblock> <codeblock id="GUID-1E1209FC-855C-5764-92C0-F4F984FA727F" xml:space="preserve">ReplacementForUnconvertibleUnicodeCharacters &lt;see-below&gt;</codeblock> <p>The
       
    32 value of <codeph>Endianness</codeph> is only an issue for foreign character
       
    33 sets where single characters may be encoded by more than one byte. The value
       
    34 of <codeph>ReplacementForUnconvertibleUnicodeCharacters</codeph> is a series
       
    35 of one or more hexadecimal numbers (not greater than 0xff) separated by whitespace,
       
    36 each prefixed with 0x. These byte values are output for each Unicode character
       
    37 that has no equivalent in the foreign character set (when converting from
       
    38 Unicode to foreign). </p> <p><b>The foreign variable-byte data</b> </p> <p>This
       
    39 section is contained within the following lines: </p> <codeblock id="GUID-8AE97A28-7224-53BE-9D65-AAEA02191665" xml:space="preserve">StartForeignVariableByteData</codeblock> <codeblock id="GUID-769E3B0B-B3E3-5685-90A7-77BEEB715CEB" xml:space="preserve">EndForeignVariableByteData</codeblock> <p>In
       
    40 between these lines are one or more lines, each consisting of two hexadecimal
       
    41 numbers (each prefixed with 0x and not greater than 0xff), followed by a decimal
       
    42 number. All three numbers are separated by whitespace. </p> <p>The two hexadecimal
       
    43 numbers are the start and end of the range of values for the initial foreign
       
    44 byte (inclusive). The decimal number is the number of subsequent bytes to
       
    45 make up a foreign character code. The way these bytes are put together to
       
    46 make the foreign character code is determined by the value of <codeph>Endianness</codeph> in
       
    47 the header of the control file. For example, if the foreign character set
       
    48 uses only a single byte per character and its first character has code 0x07
       
    49 and its last character has code 0xe6, the foreign variable-byte data would
       
    50 be: </p> <codeblock id="GUID-72F31554-6AB7-52F8-ACCA-640C9BAC9F1B" xml:space="preserve">StartForeignVariableByteData
       
    51 0x07 0xe6 0
       
    52 EndForeignVariableByteData</codeblock> <p><b>The foreign-to-Unicode data</b> </p> <p>This
       
    53 section is contained within the following lines: </p> <codeblock id="GUID-C09DAF0F-AC55-5F44-8183-CD50C52A0F9E" xml:space="preserve">StartForeignToUnicodeData</codeblock> <codeblock id="GUID-1AE31BA2-F49E-5287-BC2C-FD9D28ACB3F5" xml:space="preserve">EndForeignToUnicodeData</codeblock> <p>In
       
    54 between these two lines are one or more lines in format A (defined below).
       
    55 These may be optionally followed by one or more lines in format B (defined
       
    56 below), in which case the lines in format A and format B are separated by
       
    57 the line: </p> <codeblock id="GUID-1D099346-7111-5367-BE2C-6015F121D5B9" xml:space="preserve">ConflictResolution</codeblock> <p>Each
       
    58 line in format A indicates the conversion algorithm to be used for a particular
       
    59 range of foreign character codes. Lines in format A contain the following
       
    60 fields, each separated by whitespace: </p> <ul>
       
    61 <li id="GUID-BEA311EB-3633-57CF-8136-FD5DF64AE054"><p>first field and second
       
    62 field–reserved for future use and must be set to zero </p> </li>
       
    63 <li id="GUID-7A8CA63C-6C27-5E7B-96A4-BC463BE55AF2"><p>first input character
       
    64 code in the range–a hexadecimal number prefixed with 0x </p> </li>
       
    65 <li id="GUID-9CC95C68-0DB5-53A3-AE25-AA70D293C096"><p>last input character
       
    66 code in the range–a hexadecimal number prefixed with 0x </p> </li>
       
    67 <li id="GUID-B18A63F4-0CC8-518B-ADE7-31A86D3DC2F0"><p><xref href="GUID-2624060D-A5E7-590A-9FA0-471AE42A9BE9.dita#GUID-2624060D-A5E7-590A-9FA0-471AE42A9BE9/GUID-29B10367-F31D-5756-9DAA-8E4840BAB042">algorithm</xref> –one of <codeph>Direct</codeph>, <codeph>Offset</codeph>, <codeph>IndexedTable16</codeph> or <codeph>KeyedTable1616</codeph>  </p> </li>
       
    68 <li id="GUID-B8E84CFA-1047-5E98-A831-85F6073BDFD6"><p>parameters–if not applicable
       
    69 to any of the current choice of algorithms, set this to <codeph>{}</codeph>. </p> </li>
       
    70 </ul> <p>Lines in format B, if present, consist of two hexadecimal numbers,
       
    71 prefixed with 0x, separated by whitespace. The first of these is a foreign
       
    72 character code which has multiple equivalents in Unicode (according to the
       
    73 data in the source file), and the second is the code of the preferred Unicode
       
    74 character to which the foreign character should be converted. </p> <p><b>The
       
    75 Unicode-to-foreign data</b> </p> <p>This section is structured similarly to
       
    76 the foreign-to-Unicode data section. It is contained within the following
       
    77 lines: </p> <codeblock id="GUID-5F9DD42E-8676-53CB-B6B4-8CA8574DFD6A" xml:space="preserve">StartUnicodeToForeignData</codeblock> <codeblock id="GUID-5CEAE8E0-D098-54CF-B7AE-71766ADEC3C5" xml:space="preserve">EndUnicodeToForeignData</codeblock> <p>In
       
    78 between these two lines are one or more lines in format C (defined below).
       
    79 These may be optionally followed by one or more lines in format D (defined
       
    80 below), in which case the lines in format C and format D are separated by
       
    81 the line: </p> <codeblock id="GUID-DA74F595-BD77-540B-940B-6536A83940A3" xml:space="preserve">ConflictResolution</codeblock> <p>Format
       
    82 C is very similar to format A with one exception, which is an additional field
       
    83 to specify the size of the output character code in bytes (as this is a foreign
       
    84 character code). Each line in format C indicates the conversion algorithm
       
    85 to be used for a particular range of Unicode character codes. Lines in format
       
    86 C contains the following fields, each separated by whitespace: </p> <ul>
       
    87 <li id="GUID-0622493A-08F8-58E1-A7E3-F7F331C14DA7"><p>first field and second
       
    88 field–reserved for future use and must be set to zero </p> </li>
       
    89 <li id="GUID-DDC93F91-6911-591F-BDA6-4E4099180D12"><p>first input character
       
    90 code in the range–a hexadecimal number prefixed with 0x </p> </li>
       
    91 <li id="GUID-80C572A0-DB92-5281-AFA0-1215FA899289"><p>last input character
       
    92 code in the range–a hexadecimal number prefixed with 0x </p> </li>
       
    93 <li id="GUID-010F91D8-C408-5EF8-BEBD-DE5DE57D040F"><p><xref href="GUID-2624060D-A5E7-590A-9FA0-471AE42A9BE9.dita#GUID-2624060D-A5E7-590A-9FA0-471AE42A9BE9/GUID-29B10367-F31D-5756-9DAA-8E4840BAB042">algorithm</xref> –one of <codeph>Direct</codeph>, <codeph>Offset</codeph>, <codeph>IndexedTable16</codeph> or <codeph>KeyedTable1616</codeph>  </p> </li>
       
    94 <li id="GUID-4D6EEC55-A3D1-5193-B9B1-B16CDB4A903F"><p>size of the output character
       
    95 code in bytes (not present in format A)–a decimal number </p> </li>
       
    96 <li id="GUID-C3DF518F-5624-5618-B2BF-4CF230DAE3DE"><p>parameters–if not applicable
       
    97 to any of the current choice of algorithms, set this to <codeph>{}</codeph>. </p> </li>
       
    98 </ul> <p>Format D is analogous to format B (described above). Like format
       
    99 B, it consists of two hexadecimal numbers prefixed with 0x, separated by whitespace.
       
   100 However, the first of these is a Unicode character code which has multiple
       
   101 equivalents in the foreign character set (according to the data in the source
       
   102 file), and the second is the code of the preferred foreign character to which
       
   103 the Unicode character should be converted. </p> </section>
       
   104 <section><title>Multiple SCnvConversionData data structures</title> <p>The <filepath>cnvtool</filepath> generates
       
   105 the main <codeph>SCnvConversionData</codeph> data structure using the input
       
   106 from the source file and the control file. The <codeph>SCnvConversionData</codeph> data
       
   107 structure contains the character set conversion data. </p> <codeblock id="GUID-C7959166-28B4-5C77-9979-B0BDDEF85928" xml:space="preserve">
       
   108 ....
       
   109 GLDEF_D const SCnvConversionData conversionData=
       
   110     {
       
   111     SCnvConversionData::EFixedBigEndian,
       
   112         {
       
   113         ARRAY_LENGTH(foreignVariableByteDataRanges),
       
   114         foreignVariableByteDataRanges
       
   115         },
       
   116         {
       
   117         ARRAY_LENGTH(foreignToUnicodeDataRanges),
       
   118         foreignToUnicodeDataRanges
       
   119         },
       
   120         {
       
   121         ARRAY_LENGTH(unicodeToForeignDataRanges),
       
   122         unicodeToForeignDataRanges
       
   123         },
       
   124     NULL,
       
   125     NULL
       
   126     };
       
   127 ...
       
   128 </codeblock> <p>It is sometimes desirable for further objects to be generated
       
   129 which provide a view of a subset of the main <codeph>SCnvConversionData</codeph> object.
       
   130 This is possible by inserting an extra pair of lines of the following form
       
   131 in both the foreign-to-Unicode data and the Unicode-to-foreign data sections
       
   132 in the control file: </p> <codeblock id="GUID-6D436751-1392-52C3-95CF-87DC6C440284" xml:space="preserve">StartAdditionalSubsetTable &lt;name-of-SCnvConversionData-object&gt;
       
   133 ...
       
   134 EndAdditionalSubsetTable &lt;name-of-SCnvConversionData-object&gt;</codeblock> <p>These
       
   135 lines must be placed around the above pair with a name (<codeph>name-of-SCnvConversionData-object</codeph>).
       
   136 Only one pair of these lines can occur in each of the foreign-to-Unicode data
       
   137 and the Unicode-to-foreign data sections, and if a pair occurs in one, it
       
   138 must also occur in the other. Accessing one of these <codeph>SCnvConversionData</codeph> objects
       
   139 from handwritten C++ files is done by adding the following line at the top
       
   140 of the relevant C++ file. The named object can then be used as required. </p> <codeblock id="GUID-B6A0EFA5-66B8-5EF3-A42A-DEAF334F664F" xml:space="preserve">GLREF_D const SCnvConversionData &lt;name-of-SCnvConversionData-object&gt;;</codeblock> <p>Below
       
   141 is an example control file with subset tables defined in both the foreign-to-Unicode
       
   142 data and the Unicode-to-foreign data sections: </p> <codeblock id="GUID-EEB02661-CB3D-5270-8FEA-A86259D4CAC4" xml:space="preserve">
       
   143 ...
       
   144 StartForeignToUnicodeData
       
   145 #        IncludePriority    SearchPriority    FirstInputCharacterCodeInRange    LastInputCharacterCodeInRange    Algorithm            Parameters
       
   146     StartAdditionalSubsetTable jisRomanConversionData
       
   147         6                6                0x00                            0x5b                            Direct                {}        # ASCII characters [1]
       
   148         5                2                0x5c                            0x5c                            Offset                {}        # yen sign
       
   149         4                5                0x5d                            0x7d                            Direct                {}        # ASCII characters [2]
       
   150         3                1                0x7e                            0x7e                            Offset                {}        # overline
       
   151         2                4                0x7f                            0x7f                            Direct                {}        # ASCII characters [3]
       
   152     EndAdditionalSubsetTable jisRomanConversionData
       
   153     StartAdditionalSubsetTable halfWidthKatakana8ConversionData
       
   154         1                3                0xa1                            0xdf                            Offset                {}        # half-width katakana
       
   155     EndAdditionalSubsetTable halfWidthKatakana8ConversionData
       
   156 EndForeignToUnicodeData
       
   157 
       
   158 StartUnicodeToForeignData
       
   159 #        IncludePriority    SearchPriority    FirstInputCharacterCodeInRange    LastInputCharacterCodeInRange    Algorithm            SizeOfOutputCharacterCodeInBytes    Parameters
       
   160     StartAdditionalSubsetTable jisRomanConversionData
       
   161         6                1                0x0000                            0x005b                            Direct                1                                    {}        # ASCII characters [1]
       
   162         5                2                0x005d                            0x007d                            Direct                1                                    {}        # ASCII characters [2]
       
   163         4                3                0x007f                            0x007f                            Direct                1                                    {}        # ASCII characters [3]
       
   164         3                5                0x00a5                            0x00a5                            Offset                1                                    {}        # yen sign
       
   165         2                6                0x203e                            0x203e                            Offset                1                                    {}        # overline
       
   166     EndAdditionalSubsetTable jisRomanConversionData
       
   167     StartAdditionalSubsetTable halfWidthKatakana8ConversionData
       
   168         1                4                0xff61                            0xff9f                            Offset                1                                    {}        # half-width katakana
       
   169     EndAdditionalSubsetTable halfWidthKatakana8ConversionData
       
   170 EndUnicodeToForeignData
       
   171 ...</codeblock> <p>The generated C++ source file by <filepath>cnvtool</filepath> contains
       
   172 multiple <codeph>SCnvConversionData</codeph> data structures: </p> <codeblock id="GUID-A3664842-0671-5C04-9E73-726B64BC8D92" xml:space="preserve">GLDEF_D const SCnvConversionData conversionData=
       
   173     {
       
   174     SCnvConversionData::EFixedBigEndian,
       
   175         {
       
   176         ARRAY_LENGTH(foreignVariableByteDataRanges),
       
   177         foreignVariableByteDataRanges
       
   178         },
       
   179         {
       
   180         ARRAY_LENGTH(foreignToUnicodeDataRanges),
       
   181         foreignToUnicodeDataRanges
       
   182         },
       
   183         {
       
   184         ARRAY_LENGTH(unicodeToForeignDataRanges),
       
   185         unicodeToForeignDataRanges
       
   186         },
       
   187     NULL,
       
   188     NULL
       
   189     };
       
   190 
       
   191 GLREF_D const SCnvConversionData jisRomanConversionData;
       
   192 GLDEF_D const SCnvConversionData jisRomanConversionData=
       
   193     {
       
   194     SCnvConversionData::EFixedBigEndian,
       
   195         {
       
   196         ARRAY_LENGTH(foreignVariableByteDataRanges),
       
   197         foreignVariableByteDataRanges
       
   198         },
       
   199         {
       
   200         5-0,
       
   201         foreignToUnicodeDataRanges+0
       
   202         },
       
   203         {
       
   204         5-0,
       
   205         unicodeToForeignDataRanges+0
       
   206         }
       
   207     };
       
   208 
       
   209 GLREF_D const SCnvConversionData halfWidthKatakana8ConversionData;
       
   210 GLDEF_D const SCnvConversionData halfWidthKatakana8ConversionData=
       
   211     {
       
   212     SCnvConversionData::EFixedBigEndian,
       
   213         {
       
   214         ARRAY_LENGTH(foreignVariableByteDataRanges),
       
   215         foreignVariableByteDataRanges
       
   216         },
       
   217         {
       
   218         6-5,
       
   219         foreignToUnicodeDataRanges+5
       
   220         },
       
   221         {
       
   222         6-5,
       
   223         unicodeToForeignDataRanges+5
       
   224         }
       
   225     };
       
   226 </codeblock> <p>Using this technique means that two (or more) foreign character
       
   227 sets–where one is a subset of the other(s)–can share the same conversion data.
       
   228 This conversion data would need to be in a shared-library DLL which the two
       
   229 (or more) plug-in DLLs would both link to. </p> </section>
       
   230 <section id="GUID-29B10367-F31D-5756-9DAA-8E4840BAB042"><title>Conversion
       
   231 algorithm</title> <p>There are four possible conversion algorithms: </p> <ul>
       
   232 <li id="GUID-028531BA-31F0-5D0F-9F1F-CDCDB362F80C"><p> <codeph>Direct</codeph> is
       
   233 where each character in the range has the same encoding in Unicode as in the
       
   234 foreign character set, </p> </li>
       
   235 <li id="GUID-F6124B2A-AA6F-560D-8BAC-1BF88E40EBF4"><p> <codeph>Offset</codeph> is
       
   236 where the offset from the foreign encoding to the Unicode encoding is the
       
   237 same for each character in the range, </p> </li>
       
   238 <li id="GUID-FF176CC1-0BA1-57EA-9F37-34C41601A037"><p> <codeph>Indexed table
       
   239 (16)</codeph> is where a contiguous block of foreign character codes maps
       
   240 onto a random collection of Unicode character codes (the 16 refers to the
       
   241 fact that each Unicode character code must use no more than 16 bits), </p> </li>
       
   242 <li id="GUID-202AA338-1610-5F81-B249-74678F6D42C5"><p> <codeph>Keyed table
       
   243 (16-16)</codeph> is where a sparse collection of foreign character codes map
       
   244 onto a random collection of Unicode character codes (the 16 refers to the
       
   245 fact that each foreign character code and each Unicode character code must
       
   246 use no more than 16 bits). </p> </li>
       
   247 </ul> <p>Of the four conversion algorithms listed above, the keyed table is
       
   248 the most general and can be used for any foreign character set. However, it
       
   249 is the algorithm requiring the most storage space, as well as being the slowest
       
   250 (a binary search is required), therefore it is best avoided if possible. The
       
   251 indexed table also requires storage space (although less than the keyed table),
       
   252 but is much faster. The direct and offset algorithms are the fastest and require
       
   253 negligible storage. It is thus necessary to choose appropriate algorithms
       
   254 to minimize storage and to maximize speed of conversion. </p> <p>Ranges of
       
   255 characters in the control file are permitted to overlap. This is useful as
       
   256 it means that a keyed table whose range is the entire range of the foreign
       
   257 character set (or the Unicode character set) can be used at the end of the
       
   258 foreign-to-Unicode data (or Unicode-to-foreign data) to <b>catch</b> all the
       
   259 characters that were not <b>caught</b> by the preceding ranges, which will
       
   260 have used better algorithms. </p> </section>
       
   261 </conbody></concept>