Folding and collation (comparing strings)

There are two techniques that may be used to modify the characters in a descriptor prior to performing operations such as comparisons on text strings:

Folding

Folding is a relatively simple way of normalising text for comparison by removing case distinctions, converting accented characters to characters without accents, and so on. Folding is used for tolerant comparisons, that is, comparisons that are biased towards a match.

For example, the file system uses folding to decide whether two file names are identical. Because folding is locale-independent, the file system can be locale-independent too.

Note that folding is not guaranteed to be culturally appropriate in any way, and it should not be used for comparing strings in natural language; collation is the correct functionality for this.

Variants of member functions that fold are provided where appropriate, for example TDesC16::CompareF() for folded comparison.
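The idea behind a folded comparison can be sketched in portable C++. This is only an illustration, not the Symbian implementation: FoldChar() and FoldedCompare() are hypothetical names, and the fold table here covers just ASCII letters and a handful of accented Latin-1 letters, whereas a real fold table is far larger.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: folds one UTF-16 code unit by removing case and, for a
// few accented Latin-1 letters only, the accent. Everything else is unchanged.
char16_t FoldChar(char16_t c) {
    if (c >= u'A' && c <= u'Z') return static_cast<char16_t>(c - u'A' + u'a');
    switch (c) {
        case u'\u00C9': case u'\u00E9': return u'e';  // E acute, e acute
        case u'\u00C0': case u'\u00E0': return u'a';  // A grave, a grave
        case u'\u00DC': case u'\u00FC': return u'u';  // U diaeresis, u diaeresis
        default: return c;
    }
}

// Hypothetical folded comparison: returns a negative value, 0 or a positive
// value, comparing the folded form of each code unit in turn.
int FoldedCompare(const std::u16string& a, const std::u16string& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        const char16_t fa = FoldChar(a[i]);
        const char16_t fb = FoldChar(b[i]);
        if (fa != fb) return fa < fb ? -1 : 1;
    }
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;
}
```

With this sketch, two file names that differ only in case compare as equal, which is the tolerant behaviour described above. Because the mapping is strictly one code unit to one code unit, it cannot express correspondences such as SS and ß.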

See also:

TDesC16::CompareF()
TDesC16::MatchF()
TDesC16::FindF()
TDesC16::LocateF()
TDesC16::LocateReverseF()

Collation

Collation is a much better and more powerful way to compare strings, and it produces a dictionary-like ('lexicographic') ordering. Folding cannot remove piece accents or deal with correspondences that are not one-to-one, like the mapping from German upper case SS to lower case ß. In addition, folding cannot optionally ignore punctuation.

For languages using the Latin script, for example, collation is about deciding whether to ignore punctuation, whether to fold upper and lower case, how to treat accents, and so on. In a given locale there is usually a standard set of collation rules that can be used.

Collation should always be used for comparing strings in natural language.

Variants of member functions that use collation are provided where appropriate, for example TDesC16::CompareC() for collated comparison.

Comparing and sorting strings

The TDesC16::CompareC() variant prototyped as:

TInt CompareC(const TDesC16& aDes, TInt aMaxLevel, const TCollationMethod* aCollationMethod) const;

returns 0 if the two strings match.

There are many ways in which two strings can match, even when they do not have the same length:

if one string includes combining characters, but the collation level is set to 0 (which means that accents are ignored)
if one string contains "pre-composed" versions of accented characters and the other contains "decomposed" versions of the same characters
if one string contains a ligature that, in a collation table, matches multiple characters in the other string and the collation level is set to less than 3 (for example "æ" might match "ae")
if one string contains a "surrogate pair" (a single character encoded as two 16-bit code units) which happens to match a normal character at the level specified
if the collation method does not have its "ignore none" flag set and the collation level is set to less than 3, then spaces and punctuation are ignored; this means that one string could be much longer than the other just by adding a large number of spaces
if one string contains the Hangul representation of Korean and the other contains the Jamo representation of the same Korean and the collation level is set to less than 3.

The collation level is an integer that can take one of the values 0, 1, 2 or 3, and determines how strictly two strings must agree in order to match. This value is passed as the second parameter to CompareC(). The values have the following meanings:

0 - only test the character identity; accents and case are ignored
1 - test the character identity and accents; case is ignored
2 - test the character identity, accents and case
3 - test the Unicode value as well as the character identity, accents and case.
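The effect of the four levels can be modelled with a small portable C++ sketch. This is a toy model under stated assumptions, not the Symbian collation engine: the Key structure, KeyFor(), Component() and CompareAtLevel() are all hypothetical, and the table covers only a-z, A-Z, e with an acute accent, the micro sign and Greek mu. Each character is given one key with four components, and a lower-numbered component always takes priority over a higher-numbered one.

```cpp
#include <cstddef>
#include <string>

// Hypothetical collation key: one component per collation level.
struct Key {
    int identity;   // level 0: which base letter this is
    int accent;     // level 1: 0 = no accent, 1 = acute
    int upper;      // level 2: 0 = lower case, 1 = upper case
    char16_t code;  // level 3: the Unicode value itself
};

// Hypothetical table lookup; unknown characters get a unique identity.
Key KeyFor(char16_t c) {
    if (c >= u'a' && c <= u'z') return {c - u'a', 0, 0, c};
    if (c >= u'A' && c <= u'Z') return {c - u'A', 0, 1, c};
    if (c == u'\u00E9') return {u'e' - u'a', 1, 0, c};    // e acute
    if (c == u'\u00C9') return {u'e' - u'a', 1, 1, c};    // E acute
    if (c == u'\u00B5' || c == u'\u03BC') return {100, 0, 0, c};  // micro, mu
    return {1000 + c, 0, 0, c};
}

int Component(const Key& k, int level) {
    switch (level) {
        case 0: return k.identity;
        case 1: return k.accent;
        case 2: return k.upper;
        default: return k.code;
    }
}

// Compares level by level: a difference at a lower level decides the result
// before any higher level is consulted. Returns <0, 0 or >0.
int CompareAtLevel(const std::u16string& a, const std::u16string& b, int maxLevel) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (int level = 0; level <= maxLevel; ++level) {
        for (std::size_t i = 0; i < n; ++i) {
            const int ca = Component(KeyFor(a[i]), level);
            const int cb = Component(KeyFor(b[i]), level);
            if (ca != cb) return ca < cb ? -1 : 1;
        }
        // In this toy model every character has exactly one key, so a length
        // difference counts as an identity difference.
        if (level == 0 && a.size() != b.size()) return a.size() < b.size() ? -1 : 1;
    }
    return 0;
}
```

In this model "cafe" and "CAFÉ" match at level 0, "cafe" and "CAFE" match at level 1 but not at level 2, and the micro sign matches Greek mu at every level below 3. A real collation table can also map one character to several keys or several characters to one key, which is how strings of different lengths come to match.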

At levels 0-2:

ligatures (e.g. "æ") are the same as their decomposed equivalents (e.g. "ae")
script variants are the same (for example "R" matches the mathematical real number symbol (Unicode 211D))
the "micro" symbol (Unicode 00B5) matches Greek "mu" (Unicode 03BC).

At level 3, the two members of each of these pairs are treated as different.

If the aim is to sort strings, then level 3 must be used. For any strings a and b, if a < b for some level of collation, then a < b for all higher levels of collation as well. It is therefore impossible to change the order of two strings by using a collation level lower than 3; a lower level only causes strings that match at that level to sort in an arbitrary order relative to one another. In standard English, sorting at level 3 gives the following order:

bat < bee < BEE < bus

The case of the B only affects the comparison after all the letter identities have been found to be the same. This is usually what people are trying to achieve by using collation levels lower than 3 for sorting, so lowering the level is never necessary.
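The ordering above can be reproduced with a short portable C++ sketch. It is an illustration for ASCII words only, not the Symbian collation code: LessFullStrength() and SortWords() are hypothetical names, and placing lower case before upper case on a tie is an assumption taken from the example order shown above.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical full-strength ordering for ASCII words: letter identity decides
// first, and case is consulted only when every letter is otherwise identical.
bool LessFullStrength(const std::string& a, const std::string& b) {
    std::string la = a, lb = b;
    auto lower = [](unsigned char c) { return static_cast<char>(std::tolower(c)); };
    std::transform(la.begin(), la.end(), la.begin(), lower);
    std::transform(lb.begin(), lb.end(), lb.begin(), lower);
    if (la != lb) return la < lb;  // identities differ: case plays no part
    return a > b;                  // tie: ASCII 'b' > 'B', so lower case sorts first
}

// Returns a sorted copy of the input words.
std::vector<std::string> SortWords(std::vector<std::string> words) {
    std::sort(words.begin(), words.end(), LessFullStrength);
    return words;
}
```

Sorting the words "bus", "BEE", "bat" and "bee" with this comparator keeps "bee" and "BEE" adjacent and in a fixed relative order. A comparator that stopped after the case-blind step would still place them between "bat" and "bus", but their order relative to each other would be left unspecified.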

The sort order can be affected by setting flags in the TCollationMethod object.

Note that when strings match at level 3, they do not necessarily have the same binary representation, or even the same length. Unicode contains many strings that are regarded as equivalent, even though they have different binary representations.

See also:

TDesC16::CompareC()
TDesC16::MatchC()
TDesC16::FindC()