Cnvtool Control File

The control file is a text file that specifies the conversion algorithms used to convert (both ways) between ranges of characters. It is one of the input files used by cnvtool to create a Charconv plug-in DLL.

The control file also specifies the code(s) of the character(s) to use to replace unconvertible Unicode characters, the endianness of the foreign character set (if single characters may be encoded by more than one byte) and the preferred character to use when a character has multiple equivalents in the target character set.

The control file is case-insensitive. Comments begin with a # and extend to the end of the line. Additional blank lines and leading and trailing whitespace are ignored.
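The three lexical rules above can be sketched as a small line-normalizing function. This is only an illustration in standard C++; NormalizeControlLine is a hypothetical helper, not part of cnvtool, and it assumes that folding the whole line to lower case is an acceptable way to model case-insensitivity.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not part of cnvtool): reduces one raw control-file
// line to its significant content. Returns an empty string for blank or
// comment-only lines.
std::string NormalizeControlLine(std::string line)
{
    // A comment starts at the first '#' and runs to the end of the line.
    const std::string::size_type hash = line.find('#');
    if (hash != std::string::npos)
        line.erase(hash);

    // Leading and trailing whitespace is ignored.
    const char* const kWhitespace = " \t\r\n";
    const std::string::size_type first = line.find_first_not_of(kWhitespace);
    if (first == std::string::npos)
        return std::string();
    const std::string::size_type last = line.find_last_not_of(kWhitespace);
    line = line.substr(first, last - first + 1);

    // The file is case-insensitive, so fold everything to lower case.
    std::transform(line.begin(), line.end(), line.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return line;
}
```

A caller would skip any line for which the result is empty and compare keywords against their lower-case spellings.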

Syntax

There are four sections in the control file: the header, the foreign variable-byte data, the foreign-to-Unicode data and the Unicode-to-foreign data.

The header

The header consists of two lines in fixed order. Their format is as follows (alternatives are separated by a |; a single space represents one or more whitespace characters):

Endianness Unspecified|FixedLittleEndian|FixedBigEndian
ReplacementForUnconvertibleUnicodeCharacters <see-below>

The value of Endianness is only an issue for foreign character sets where single characters may be encoded by more than one byte. The value of ReplacementForUnconvertibleUnicodeCharacters is a series of one or more hexadecimal numbers (not greater than 0xff) separated by whitespace, each prefixed with 0x. These byte values are output for each Unicode character that has no equivalent in the foreign character set (when converting from Unicode to foreign).
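The constraints on the replacement value (0x prefix, hexadecimal, no byte above 0xff, at least one byte) can be checked with a few lines of standard C++. The sketch below is an assumption about how such validation might look; ParseReplacementBytes is a hypothetical name and is not taken from cnvtool.

```cpp
#include <cctype>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parses the value part of the
// ReplacementForUnconvertibleUnicodeCharacters line, for example "0x1a".
// Returns false if a token lacks the 0x prefix, contains a non-hexadecimal
// digit, exceeds 0xff, or if no byte is given. On failure the contents of
// `bytes` are unspecified.
bool ParseReplacementBytes(const std::string& value, std::vector<std::uint8_t>& bytes)
{
    bytes.clear();
    std::istringstream tokens(value);
    std::string token;
    while (tokens >> token) // tokens are separated by any amount of whitespace
    {
        if (token.size() < 3 || token[0] != '0' || (token[1] != 'x' && token[1] != 'X'))
            return false;
        unsigned parsed = 0;
        for (std::string::size_type i = 2; i < token.size(); ++i)
        {
            const unsigned char c = static_cast<unsigned char>(token[i]);
            if (!std::isxdigit(c))
                return false;
            const unsigned digit = std::isdigit(c) ? c - '0' : std::tolower(c) - 'a' + 10;
            parsed = parsed * 16 + digit;
            if (parsed > 0xff) // each value must fit in one byte
                return false;
        }
        bytes.push_back(static_cast<std::uint8_t>(parsed));
    }
    return !bytes.empty();
}
```

The resulting byte sequence is what a converter would emit in place of each unconvertible Unicode character.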

The foreign variable-byte data

This section is contained within the following lines:

StartForeignVariableByteData
EndForeignVariableByteData

In between these lines are one or more lines, each consisting of two hexadecimal numbers (each prefixed with 0x and not greater than 0xff), followed by a decimal number. All three numbers are separated by whitespace.

The two hexadecimal numbers are the start and end of the range of values for the initial foreign byte (inclusive). The decimal number is the number of subsequent bytes that make up a foreign character code. The way these bytes are put together to make the foreign character code is determined by the value of Endianness in the header of the control file. For example, if the foreign character set uses only a single byte per character and its first character has code 0x07 and its last character has code 0xe6, the foreign variable-byte data would be:

StartForeignVariableByteData
0x07 0xe6 0
EndForeignVariableByteData
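How the initial-byte ranges and the Endianness value combine to delimit and assemble a character code can be sketched as follows. The types and the function ReadForeignCharacter are hypothetical and written in standard C++ rather than against the Charconv API. Treating Unspecified like big-endian for multi-byte codes is an assumption made only so the sketch is total.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Endianness { Unspecified, FixedLittleEndian, FixedBigEndian };

// One line of the foreign variable-byte data section (hypothetical type).
struct VariableByteRange
{
    std::uint8_t firstInitialByte; // start of the initial-byte range (inclusive)
    std::uint8_t lastInitialByte;  // end of the initial-byte range (inclusive)
    unsigned subsequentBytes;      // bytes that follow the initial byte
};

// Reads one foreign character code starting at input[pos]. On success stores
// the code, advances pos past the character and returns true. Returns false
// if the initial byte is in no range or the input is truncated.
bool ReadForeignCharacter(const std::vector<VariableByteRange>& ranges,
                          Endianness endianness,
                          const std::vector<std::uint8_t>& input,
                          std::size_t& pos,
                          std::uint32_t& code)
{
    if (pos >= input.size())
        return false;
    const std::uint8_t initial = input[pos];
    for (const VariableByteRange& range : ranges)
    {
        if (initial < range.firstInitialByte || initial > range.lastInitialByte)
            continue;
        assert(range.subsequentBytes <= 3); // the sketch packs codes into 32 bits
        const std::size_t length = 1 + range.subsequentBytes;
        if (length > input.size() - pos)
            return false; // character is cut off
        std::uint32_t result = 0;
        for (std::size_t i = 0; i < length; ++i)
        {
            const std::uint32_t byte = input[pos + i];
            if (endianness == Endianness::FixedLittleEndian)
                result |= byte << (8 * i);     // first byte is least significant
            else
                result = (result << 8) | byte; // first byte is most significant
        }
        code = result;
        pos += length;
        return true;
    }
    return false;
}
```

With the single range 0x07 0xe6 0 from the example above, every character is one byte long and the Endianness value has no effect.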

The foreign-to-Unicode data

This section is contained within the following lines:

StartForeignToUnicodeData
EndForeignToUnicodeData

In between these two lines are one or more lines in format A (defined below). These may be optionally followed by one or more lines in format B (defined below), in which case the lines in format A and format B are separated by the line:

ConflictResolution

Each line in format A indicates the conversion algorithm to be used for a particular range of foreign character codes. Lines in format A contain the following fields, each separated by whitespace:

first field and second field: reserved for future use and must be set to zero
first input character code in the range: a hexadecimal number prefixed with 0x
last input character code in the range: a hexadecimal number prefixed with 0x
algorithm: one of Direct, Offset, IndexedTable16 or KeyedTable1616
parameters: if not applicable to any of the current choice of algorithms, set this to {}.

Lines in format B, if present, consist of two hexadecimal numbers, prefixed with 0x, separated by whitespace. The first of these is a foreign character code which has multiple equivalents in Unicode (according to the data in the source file), and the second is the code of the preferred Unicode character to which the foreign character should be converted.
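The effect of a format B line can be illustrated with a small mapping builder. This is a sketch under stated assumptions, not a description of cnvtool's internals: BuildForeignToUnicodeMap is a hypothetical function, and keeping the first pair seen when no preference is given is an arbitrary choice made for the example.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

typedef std::map<std::uint32_t, std::uint32_t> CodeMap;

// Hypothetical sketch: builds a foreign-to-Unicode mapping from the
// (foreign, Unicode) pairs of a source file. `preferred` holds the format B
// lines, keyed by foreign character code.
CodeMap BuildForeignToUnicodeMap(
    const std::vector<std::pair<std::uint32_t, std::uint32_t> >& sourcePairs,
    const CodeMap& preferred)
{
    CodeMap result;
    for (const auto& pair : sourcePairs)
    {
        const CodeMap::const_iterator choice = preferred.find(pair.first);
        if (choice != preferred.end())
            result[pair.first] = choice->second; // conflict resolved explicitly
        else
            result.insert(pair); // insert() keeps an existing entry (assumption)
    }
    return result;
}
```

For instance, if a source file listed foreign code 0x5c against both 0x005c and 0x00a5 (illustrative values), the format B line 0x5c 0x00a5 would select the second.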

The Unicode-to-foreign data

This section is structured similarly to the foreign-to-Unicode data section. It is contained within the following lines:

StartUnicodeToForeignData
EndUnicodeToForeignData

In between these two lines are one or more lines in format C (defined below). These may be optionally followed by one or more lines in format D (defined below), in which case the lines in format C and format D are separated by the line:

ConflictResolution

Format C is the same as format A except for one additional field, which specifies the size of the output character code in bytes (as this is a foreign character code). Each line in format C indicates the conversion algorithm to be used for a particular range of Unicode character codes. Lines in format C contain the following fields, each separated by whitespace:

first field and second field: reserved for future use and must be set to zero
first input character code in the range: a hexadecimal number prefixed with 0x
last input character code in the range: a hexadecimal number prefixed with 0x
algorithm: one of Direct, Offset, IndexedTable16 or KeyedTable1616
size of the output character code in bytes (not present in format A): a decimal number
parameters: if not applicable to any of the current choice of algorithms, set this to {}.

Format D is analogous to format B (described above). Like format B, it consists of two hexadecimal numbers prefixed with 0x, separated by whitespace. However, the first of these is a Unicode character code which has multiple equivalents in the foreign character set (according to the data in the source file), and the second is the code of the preferred foreign character to which the Unicode character should be converted.
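The field layout of a format C line can be made concrete with a small parser. The sketch below is illustrative only: FormatCLine and ParseFormatCLine are hypothetical names, the line is assumed to have had its comment already removed, and the parameters field is kept as an uninterpreted string because its syntax is not described here.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical record for one format C line.
struct FormatCLine
{
    unsigned firstField;
    unsigned secondField;
    unsigned long firstInput;   // first Unicode character code in the range
    unsigned long lastInput;    // last Unicode character code in the range
    std::string algorithm;      // Direct, Offset, IndexedTable16 or KeyedTable1616
    unsigned outputSizeInBytes; // the field that format A does not have
    std::string parameters;     // kept verbatim, for example "{}"
};

// Parses a hexadecimal number that must carry the 0x prefix.
static bool ParseHexCode(const std::string& token, unsigned long& code)
{
    if (token.size() < 3 || token[0] != '0' || (token[1] != 'x' && token[1] != 'X'))
        return false;
    try
    {
        std::size_t used = 0;
        code = std::stoul(token, &used, 16); // base 16 accepts the 0x prefix
        return used == token.size();
    }
    catch (const std::logic_error&) // std::invalid_argument or std::out_of_range
    {
        return false;
    }
}

bool ParseFormatCLine(const std::string& line, FormatCLine& out)
{
    out.parameters.clear();
    std::istringstream fields(line);
    std::string first, last;
    if (!(fields >> out.firstField >> out.secondField >> first >> last
                 >> out.algorithm >> out.outputSizeInBytes))
        return false;
    if (!ParseHexCode(first, out.firstInput) || !ParseHexCode(last, out.lastInput))
        return false;
    if (out.firstInput > out.lastInput)
        return false;
    // Everything after the size field is the parameters field.
    std::getline(fields >> std::ws, out.parameters);
    const std::string::size_type end = out.parameters.find_last_not_of(" \t\r");
    out.parameters.erase(end == std::string::npos ? 0 : end + 1);
    return !out.parameters.empty();
}
```

A format A parser would look the same without the outputSizeInBytes field.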

Multiple SCnvConversionData data structures

cnvtool generates the main SCnvConversionData data structure using the input from the source file and the control file. The SCnvConversionData data structure contains the character set conversion data.

....
GLDEF_D const SCnvConversionData conversionData=
    {
    SCnvConversionData::EFixedBigEndian,
        {
        ARRAY_LENGTH(foreignVariableByteDataRanges),
        foreignVariableByteDataRanges
        },
        {
        ARRAY_LENGTH(foreignToUnicodeDataRanges),
        foreignToUnicodeDataRanges
        },
        {
        ARRAY_LENGTH(unicodeToForeignDataRanges),
        unicodeToForeignDataRanges
        },
    NULL,
    NULL
    };
...

It is sometimes desirable for further objects to be generated which provide a view of a subset of the main SCnvConversionData object. This is possible by inserting an extra pair of lines of the following form in both the foreign-to-Unicode data and the Unicode-to-foreign data sections in the control file:

StartAdditionalSubsetTable <name-of-SCnvConversionData-object>
...
EndAdditionalSubsetTable <name-of-SCnvConversionData-object>

These lines must be placed around the range lines that belong to the subset, and both carry the name of the object (name-of-SCnvConversionData-object). Only one pair of these lines can occur in each of the foreign-to-Unicode data and the Unicode-to-foreign data sections, and if a pair occurs in one, it must also occur in the other. To access one of these SCnvConversionData objects from a handwritten C++ file, add the following line at the top of that file. The named object can then be used as required.

GLREF_D const SCnvConversionData <name-of-SCnvConversionData-object>;

Below is an example control file with subset tables defined in both the foreign-to-Unicode data and the Unicode-to-foreign data sections:

...
StartForeignToUnicodeData
# IncludePriority SearchPriority FirstInputCharacterCodeInRange LastInputCharacterCodeInRange Algorithm Parameters
    StartAdditionalSubsetTable jisRomanConversionData
        6 6 0x00 0x5b Direct {} # ASCII characters [1]
        5 2 0x5c 0x5c Offset {} # yen sign
        4 5 0x5d 0x7d Direct {} # ASCII characters [2]
        3 1 0x7e 0x7e Offset {} # overline
        2 4 0x7f 0x7f Direct {} # ASCII characters [3]
    EndAdditionalSubsetTable jisRomanConversionData
    StartAdditionalSubsetTable halfWidthKatakana8ConversionData
        1 3 0xa1 0xdf Offset {} # half-width katakana
    EndAdditionalSubsetTable halfWidthKatakana8ConversionData
EndForeignToUnicodeData

StartUnicodeToForeignData
# IncludePriority SearchPriority FirstInputCharacterCodeInRange LastInputCharacterCodeInRange Algorithm SizeOfOutputCharacterCodeInBytes Parameters
    StartAdditionalSubsetTable jisRomanConversionData
        6 1 0x0000 0x005b Direct 1 {} # ASCII characters [1]
        5 2 0x005d 0x007d Direct 1 {} # ASCII characters [2]
        4 3 0x007f 0x007f Direct 1 {} # ASCII characters [3]
        3 5 0x00a5 0x00a5 Offset 1 {} # yen sign
        2 6 0x203e 0x203e Offset 1 {} # overline
    EndAdditionalSubsetTable jisRomanConversionData
    StartAdditionalSubsetTable halfWidthKatakana8ConversionData
        1 4 0xff61 0xff9f Offset 1 {} # half-width katakana
    EndAdditionalSubsetTable halfWidthKatakana8ConversionData
EndUnicodeToForeignData
...

The C++ source file generated by cnvtool contains multiple SCnvConversionData data structures:

GLDEF_D const SCnvConversionData conversionData=
    {
    SCnvConversionData::EFixedBigEndian,
        {
        ARRAY_LENGTH(foreignVariableByteDataRanges),
        foreignVariableByteDataRanges
        },
        {
        ARRAY_LENGTH(foreignToUnicodeDataRanges),
        foreignToUnicodeDataRanges
        },
        {
        ARRAY_LENGTH(unicodeToForeignDataRanges),
        unicodeToForeignDataRanges
        },
    NULL,
    NULL
    };

GLREF_D const SCnvConversionData jisRomanConversionData;
GLDEF_D const SCnvConversionData jisRomanConversionData=
    {
    SCnvConversionData::EFixedBigEndian,
        {
        ARRAY_LENGTH(foreignVariableByteDataRanges),
        foreignVariableByteDataRanges
        },
        {
        5-0,
        foreignToUnicodeDataRanges+0
        },
        {
        5-0,
        unicodeToForeignDataRanges+0
        }
    };

GLREF_D const SCnvConversionData halfWidthKatakana8ConversionData;
GLDEF_D const SCnvConversionData halfWidthKatakana8ConversionData=
    {
    SCnvConversionData::EFixedBigEndian,
        {
        ARRAY_LENGTH(foreignVariableByteDataRanges),
        foreignVariableByteDataRanges
        },
        {
        6-5,
        foreignToUnicodeDataRanges+5
        },
        {
        6-5,
        unicodeToForeignDataRanges+5
        }
    };

Using this technique means that two (or more) foreign character sets, where one is a subset of the other(s), can share the same conversion data. This conversion data would need to be in a shared-library DLL which the two (or more) plug-in DLLs would both link to.
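The count-and-pointer pairs in the generated code (for example 5-0 with foreignToUnicodeDataRanges+0, and 6-5 with foreignToUnicodeDataRanges+5) describe views into one shared array. The standard C++ sketch below mimics that idea with simplified, hypothetical types (Range, RangeView). Only the range bounds are modelled, and storing the ranges in control-file order is an assumption.

```cpp
#include <cstddef>

// Simplified stand-in for one conversion range (bounds only).
struct Range
{
    unsigned first;
    unsigned last;
};

// A subset table is just a count and a pointer into the shared array.
struct RangeView
{
    std::size_t count;
    const Range* ranges;
};

// The foreign-to-Unicode ranges from the example control file.
static const Range foreignToUnicodeRanges[] =
{
    {0x00, 0x5b}, {0x5c, 0x5c}, {0x5d, 0x7d}, {0x7e, 0x7e}, {0x7f, 0x7f}, // JIS Roman
    {0xa1, 0xdf}                                                          // half-width katakana
};

// Builds the view covering array entries [begin, end), mirroring the
// "end-begin, array+begin" pattern in the generated code.
RangeView MakeSubset(const Range* base, std::size_t begin, std::size_t end)
{
    const RangeView view = { end - begin, base + begin };
    return view;
}

// True if `code` falls inside any range of the view.
bool Covers(const RangeView& view, unsigned code)
{
    for (std::size_t i = 0; i < view.count; ++i)
        if (code >= view.ranges[i].first && code <= view.ranges[i].last)
            return true;
    return false;
}
```

Because a view copies no range data, every character set that uses it reads the same underlying array.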

Conversion algorithm

There are four possible conversion algorithms:

Direct is where each character in the range has the same encoding in Unicode as in the foreign character set,
Offset is where the offset from the foreign encoding to the Unicode encoding is the same for each character in the range,
Indexed table (16) is where a contiguous block of foreign character codes maps onto an arbitrary collection of Unicode character codes (the 16 refers to the fact that each Unicode character code must use no more than 16 bits),
Keyed table (16-16) is where a sparse collection of foreign character codes maps onto an arbitrary collection of Unicode character codes (the 16-16 refers to the fact that each foreign character code and each Unicode character code must use no more than 16 bits).
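The four algorithms can be sketched for the foreign-to-Unicode direction in standard C++. The function names and table types below are hypothetical and do not reflect how Charconv stores its data. The keyed table is assumed to be sorted by key so that a binary search can be used.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Direct: the Unicode code equals the foreign code.
std::uint16_t ConvertDirect(std::uint16_t foreign)
{
    return foreign;
}

// Offset: one constant offset applies to every character in the range.
std::uint16_t ConvertOffset(std::uint16_t foreign, std::int32_t offset)
{
    return static_cast<std::uint16_t>(foreign + offset);
}

// Indexed table (16): a contiguous block of foreign codes, starting at
// firstInRange, indexes straight into a table of 16-bit Unicode codes.
bool ConvertIndexedTable16(std::uint16_t foreign, std::uint16_t firstInRange,
                           const std::vector<std::uint16_t>& table,
                           std::uint16_t& unicode)
{
    if (foreign < firstInRange)
        return false;
    const std::size_t index = static_cast<std::size_t>(foreign - firstInRange);
    if (index >= table.size())
        return false;
    unicode = table[index];
    return true;
}

// Keyed table (16-16): sparse (foreign, Unicode) pairs, sorted by foreign code.
struct KeyedEntry
{
    std::uint16_t key;
    std::uint16_t value;
};

bool ConvertKeyedTable1616(std::uint16_t foreign,
                           const std::vector<KeyedEntry>& sortedTable,
                           std::uint16_t& unicode)
{
    // Binary search for the first entry whose key is not less than `foreign`.
    const std::vector<KeyedEntry>::const_iterator it = std::lower_bound(
        sortedTable.begin(), sortedTable.end(), foreign,
        [](const KeyedEntry& entry, std::uint16_t key) { return entry.key < key; });
    if (it == sortedTable.end() || it->key != foreign)
        return false;
    unicode = it->value;
    return true;
}
```

For the half-width katakana range in the example control file (0xa1 to 0xdf mapping to 0xff61 to 0xff9f), the offset is 0xff61 - 0xa1 = 0xfec0, so ConvertOffset(0xdf, 0xfec0) gives 0xff9f.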

Of the four conversion algorithms listed above, the keyed table is the most general and can be used for any foreign character set. However, it requires the most storage space and is also the slowest (a binary search is required), so it is best avoided where possible. The indexed table also requires storage space (although less than the keyed table), but is much faster. The direct and offset algorithms are the fastest and require negligible storage. Choose the algorithm for each range accordingly, to minimize storage and maximize conversion speed.

Ranges of characters in the control file are permitted to overlap. This is useful as it means that a keyed table whose range is the entire range of the foreign character set (or the Unicode character set) can be used at the end of the foreign-to-Unicode data (or Unicode-to-foreign data) to catch all the characters that were not caught by the preceding ranges, which will have used better algorithms.
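The catch-all arrangement can be sketched as an ordered search in which a later, wider range picks up whatever the earlier ranges did not convert. Everything in this standard C++ example is hypothetical. In particular, it assumes ranges are tried in the order listed and stands in for the keyed table with a two-entry function.

```cpp
#include <cstdint>
#include <vector>

// One conversion range: bounds plus the routine that implements its algorithm.
struct ConversionRange
{
    std::uint32_t first;
    std::uint32_t last;
    bool (*convert)(std::uint32_t input, std::uint32_t& output);
};

// Stand-in for the Direct algorithm.
static bool DirectConvert(std::uint32_t input, std::uint32_t& output)
{
    output = input;
    return true;
}

// Stand-in for a keyed table that knows only two sparse characters.
static bool KeyedFallbackConvert(std::uint32_t input, std::uint32_t& output)
{
    if (input == 0x5c) { output = 0x00a5; return true; } // yen sign
    if (input == 0x7e) { output = 0x203e; return true; } // overline
    return false;
}

// Tries each range in order; overlapping ranges are allowed, and the first
// range that both contains the input and converts it wins.
bool ConvertWithRanges(const std::vector<ConversionRange>& ranges,
                       std::uint32_t input, std::uint32_t& output)
{
    for (const ConversionRange& range : ranges)
    {
        if (input < range.first || input > range.last)
            continue;
        if (range.convert(input, output))
            return true;
    }
    return false; // unconvertible character
}
```

Listing the fast Direct ranges first and the catch-all range last keeps the common case cheap while still covering the sparse characters.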
