Folding and collation (comparing strings)

There are two techniques that may be used to modify the characters in a descriptor prior to performing operations such as comparisons on text strings:

folding
collation

Folding

Folding is a relatively simple way of normalising text for comparison by removing case distinctions, converting accented characters to characters without accents, and so on. Folding is used for tolerant comparisons, i.e. comparisons that are biased towards a match.

For example, the file system uses folding to decide whether two file names are identical. Because folding is locale-independent, the file system can itself be locale-independent.

Note that there is no guarantee that folding is culturally appropriate, so it should not be used for comparing strings in natural language; collation is the correct functionality for this.

Variants of member functions that fold are provided where appropriate, for example TDesC16::CompareF() for folded comparison.
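The idea behind a folded comparison can be sketched in standard C++. This is only an illustration, not the Symbian implementation: FoldChar() and CompareFolded() are hypothetical helpers, and the folding table is an assumption that covers ASCII letters and a handful of accented Latin-1 letters, whereas the real folding data behind TDesC16::CompareF() covers far more of Unicode.

```cpp
#include <cstddef>
#include <string>

// Hypothetical, deliberately tiny folding table: lower-cases ASCII letters
// and strips the accent from a few Latin-1 letters. Everything else is
// returned unchanged.
char16_t FoldChar(char16_t c) {
    if (c >= u'A' && c <= u'Z') {
        return static_cast<char16_t>(c - u'A' + u'a');
    }
    switch (c) {
        case 0x00C9: case 0x00E9: return u'e';  // E acute, e acute
        case 0x00C0: case 0x00E0: return u'a';  // A grave, a grave
        case 0x00DC: case 0x00FC: return u'u';  // U diaeresis, u diaeresis
        default: return c;
    }
}

// Compares two strings after folding each character.
// Returns a negative value, 0, or a positive value, in the style of the
// descriptor Compare functions.
int CompareFolded(const std::u16string& lhs, const std::u16string& rhs) {
    const std::size_t n = lhs.size() < rhs.size() ? lhs.size() : rhs.size();
    for (std::size_t i = 0; i < n; ++i) {
        const char16_t a = FoldChar(lhs[i]);
        const char16_t b = FoldChar(rhs[i]);
        if (a != b) {
            return a < b ? -1 : 1;
        }
    }
    if (lhs.size() == rhs.size()) {
        return 0;
    }
    return lhs.size() < rhs.size() ? -1 : 1;
}
```

With this sketch, CompareFolded(u"README.TXT", u"readme.txt") returns 0, which is the kind of tolerant match a file system wants. The same table applied to natural-language text would ignore distinctions that matter in some languages, which is why collation is preferred there.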

See also:

TDesC16::CompareF()
TDesC16::MatchF()
TDesC16::FindF()
TDesC16::LocateF()
TDesC16::LocateReverseF()

Collation

Collation is a more powerful way to compare strings than folding and produces a dictionary-like ('lexicographic') ordering. Folding cannot remove piece accents or deal with correspondences that are not one-to-one, like the mapping from German upper case SS to lower case ß. In addition, folding cannot optionally ignore punctuation.

For languages using the Latin script, for example, collation is about deciding whether to ignore punctuation, whether to fold upper and lower case, how to treat accents, and so on. In a given locale there is usually a standard set of collation rules that can be used.

Collation should always be used for comparing strings in natural language.

Variants of member functions that use collation are provided where appropriate, for example TDesC16::CompareC() for collated comparison.

Comparing and sorting strings

The TDesC16::CompareC() variant prototyped as:

TInt CompareC(const TDesC16& aDes, TInt aMaxLevel, const TCollationMethod* aCollationMethod) const;

returns 0 if the two strings match.

There are many ways in which two strings can match, even when they do not have the same length:

if one string includes combining characters, but the collation level is set to 0 (which means that accents are ignored)
if one string contains "pre-composed" versions of accented characters and the other contains "decomposed" versions of the same characters
if one string contains a ligature that, in a collation table, matches multiple characters in the other string and the collation level is set to less than 3 (for example "æ" might match "ae")
if one string contains a "surrogate pair" (a single character encoded as two 16-bit values, 32 bits in total) which happens to match a normal character at the level specified
if the collation method does not have its "ignore none" flag set and the collation level is set to less than 3, then spaces and punctuation are ignored; this means that one string could be much longer than the other simply because it contains a large number of spaces
if one string contains the Hangul representation of Korean and the other contains the Jamo representation of the same Korean, and the collation level is set to less than 3.
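A few of these cases can be illustrated with a minimal sketch in standard C++. It is not how CompareC() is implemented: PrimaryKey() and MatchesAtLevel0() are hypothetical helpers, and the rules are hard-coded assumptions for a handful of characters (the combining acute accent U+0301, pre-composed é, the ligature æ, spaces and two punctuation marks). A real collation table is locale-specific and much larger.

```cpp
#include <string>

// Reduces a string to a level-0 style key using a few hard-coded,
// hypothetical rules. When ignoreNone is false, spaces and some
// punctuation are dropped, mirroring a collation method whose
// "ignore none" flag is not set.
std::u16string PrimaryKey(const std::u16string& text, bool ignoreNone) {
    std::u16string key;
    for (char16_t c : text) {
        if (c == 0x0301) {
            continue;  // combining acute accent: ignored at level 0
        }
        if (!ignoreNone && (c == u' ' || c == u'-' || c == u',')) {
            continue;  // ignorable space or punctuation
        }
        if (c == 0x00E9) {
            key += u'e';   // pre-composed e acute: base letter only
            continue;
        }
        if (c == 0x00E6) {
            key += u"ae";  // ligature expands to two characters
            continue;
        }
        if (c >= u'A' && c <= u'Z') {
            c = static_cast<char16_t>(c - u'A' + u'a');  // case ignored at level 0
        }
        key += c;
    }
    return key;
}

// True if the two strings reduce to the same level-0 key.
bool MatchesAtLevel0(const std::u16string& a, const std::u16string& b,
                     bool ignoreNone = false) {
    return PrimaryKey(a, ignoreNone) == PrimaryKey(b, ignoreNone);
}
```

In this sketch the pre-composed string u"caf\u00E9" (four 16-bit values) matches the decomposed string u"cafe\u0301" (five 16-bit values), and u"a   b" matches u"ab" unless ignoreNone is true.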

The collation level is an integer that takes one of the values 0, 1, 2 or 3, and determines how strictly two strings must match. This value is passed as the second parameter to CompareC(). The values have the following meanings:

0 - only test the character identity; accents and case are ignored
1 - test the character identity and accents; case is ignored
2 - test the character identity, accents and case
3 - test the Unicode value as well as the character identity, accents and case.
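The way the levels build on one another can be sketched as a multi-pass comparison in standard C++. This is an illustration under simplifying assumptions, not the Symbian algorithm: Key, KeyFor(), LevelValue() and CompareAtLevel() are hypothetical names, every character is assumed to map to exactly one key (so ligatures and ignorable characters are not modelled), and the key table covers only ASCII letters, é/É, and the micro sign and Greek mu.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// One hypothetical collation key per character.
struct Key {
    char16_t identity;  // level 0: which letter it is
    int accent;         // level 1: 0 = unaccented, 1 = acute
    int letterCase;     // level 2: 0 = lower case, 1 = upper case
};

Key KeyFor(char16_t c) {
    if (c >= u'a' && c <= u'z') return {c, 0, 0};
    if (c >= u'A' && c <= u'Z') return {static_cast<char16_t>(c - u'A' + u'a'), 0, 1};
    if (c == 0x00E9) return {u'e', 1, 0};                    // e acute
    if (c == 0x00C9) return {u'e', 1, 1};                    // E acute
    if (c == 0x00B5 || c == 0x03BC) return {0x03BC, 0, 0};   // micro sign, Greek mu
    return {c, 0, 0};
}

// The value that a character contributes at a given level.
int LevelValue(char16_t c, int level) {
    const Key k = KeyFor(c);
    switch (level) {
        case 0: return k.identity;
        case 1: return k.accent;
        case 2: return k.letterCase;
        default: return c;  // level 3: the raw Unicode value
    }
}

// Compares level by level: a difference at a lower level always wins,
// and higher levels are consulted only to break ties.
int CompareAtLevel(const std::u16string& a, const std::u16string& b, int maxLevel) {
    assert(maxLevel >= 0 && maxLevel <= 3);
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (int level = 0; level <= maxLevel; ++level) {
        for (std::size_t i = 0; i < n; ++i) {
            const int x = LevelValue(a[i], level);
            const int y = LevelValue(b[i], level);
            if (x != y) return x < y ? -1 : 1;
        }
        // One key per character in this sketch, so a shorter string that
        // is a prefix of the other sorts first.
        if (a.size() != b.size()) return a.size() < b.size() ? -1 : 1;
    }
    return 0;
}
```

Under these assumptions, "résumé" and "RESUME" compare equal at level 0 but not at level 1, "resume" and "RESUME" compare equal at level 1 but not at level 2, and the micro sign and Greek mu compare equal at level 2 but not at level 3.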

At levels 0-2:

ligatures (e.g. "æ") are the same as their decomposed equivalents (e.g. "ae")
script variants are the same (for example "R" matches the mathematical real number symbol (Unicode 211D))
the "micro" symbol (Unicode 00B5) matches Greek "mu" (Unicode 03BC).

At level 3 these are treated differently.

If the aim is to sort strings, then level 3 must be used. For any strings a and b, if a < b for some level of collation, then a < b for all higher levels of collation as well. Using a collation level lower than 3 therefore cannot change the order of strings that already differ at that level; it only causes similar strings to sort in an arbitrary order relative to each other. In standard English, sorting at level 3 gives the following order:

bat < bee < BEE < bus

The case of the B only affects the comparison after all the letter identities have been found to be the same. This is usually what people are trying to achieve by using collation levels lower than 3 for sorting, but doing so is never necessary.
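The ordering above can be reproduced with a small two-pass comparator in standard C++. This is a sketch for ASCII text only, not the collation algorithm used by CompareC(): Lower(), CollatedLess() and SortCollated() are hypothetical helpers, and the rule that lower case sorts before upper case is an assumption taken from the example order shown above.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// ASCII-only lower-casing, used here as a stand-in for "letter identity".
char Lower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Pass 1 compares letter identities over the whole string.
// Pass 2 runs only if pass 1 found no difference, and uses case as a
// tie-breaker (lower case first).
bool CollatedLess(const std::string& a, const std::string& b) {
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        const char la = Lower(a[i]);
        const char lb = Lower(b[i]);
        if (la != lb) return la < lb;
    }
    if (a.size() != b.size()) return a.size() < b.size();
    for (std::size_t i = 0; i < n; ++i) {
        // Same letter, different case: the lower-case one sorts first.
        if (a[i] != b[i]) return a[i] == Lower(a[i]);
    }
    return false;  // identical strings
}

// Returns a sorted copy of the input.
std::vector<std::string> SortCollated(std::vector<std::string> words) {
    std::sort(words.begin(), words.end(), CollatedLess);
    return words;
}
```

Sorting {"bus", "BEE", "bat", "bee"} with this comparator yields bat, bee, BEE, bus. A plain binary comparison would instead put "BEE" first, because upper-case ASCII letters have lower values than lower-case ones.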

The sort order can be affected by setting flags in the TCollationMethod object.

Note that when strings match at level 3, they do not necessarily have the same binary representation, or even the same length. Unicode contains many strings that are regarded as equivalent, even though they have different binary representations.

See also:

TDesC16::CompareC()
TDesC16::MatchC()
TDesC16::FindC()
