diff -r 4816d766a08a -r f345bda72bc4 Symbian3/PDK/Source/GUID-C501E703-E39D-598C-B962-7A32AC9091DD.dita --- a/Symbian3/PDK/Source/GUID-C501E703-E39D-598C-B962-7A32AC9091DD.dita Tue Mar 30 11:42:04 2010 +0100 +++ b/Symbian3/PDK/Source/GUID-C501E703-E39D-598C-B962-7A32AC9091DD.dita Tue Mar 30 11:56:28 2010 +0100 @@ -1,105 +1,105 @@ - - - - - -Folding -and collation (comparing strings)Describes descriptor folding and descriptor collation. -

There are two techniques that may be used to modify the characters in a -descriptor prior to performing operations such as comparisons on text strings:

- -
Folding

Folding is a relatively simple way of normalising -text for comparison by removing case distinctions, converting accented characters -to characters without accents etc. Folding is used for tolerant comparisons, -i.e. comparisons that are biased towards a match.

For example, the -file system uses folding to decide whether two file names are identical or -not. Folding is locale-independent behaviour, and means that the file system, -for example, can be locale-independent.

Note -that folding is not guaranteed to be culturally appropriate in any way, -and it should not be used for comparing strings in natural language; collation is -the correct functionality for this.

Variants of member functions -that fold are provided where appropriate. For example, TDesC16::CompareF() for -folded comparison.

See also:

TDesC16::CompareF(), TDesC16::MatchF(), TDesC16::FindF(), TDesC16::LocateF(), TDesC16::LocateReverseF()

-
Collation

Collation -is a much better and more powerful way to compare strings and produces a dictionary-like -('lexicographic') ordering. Folding cannot remove piece accents or deal with -correspondences that are not one-to-one like the mapping from German upper -case SS to lower case ß. In addition, folding cannot optionally ignore punctuation.

For -languages using the Latin script, for example, collation is about deciding -whether to ignore punctuation, whether to fold upper and lower case, how to -treat accents, and so on. In a given locale there is usually a standard set -of collation rules that can be used.

Collation should always be -used for comparing strings in natural language.

Variants of member -functions that use collation are provided where appropriate. For example, TDesC16::CompareC() for -collated comparison.

Comparing -and sorting strings

The TDesC16::CompareC() variant -prototyped as:

TInt CompareC(const TDesC16& aDes, TInt aMaxLevel, const TCollationMethod* aCollationMethod) const;

returns 0 if the two strings match.

There are many ways in which -two strings can match, even when they do not have the same length:

    -
  • if one string includes -combining characters, but the collation level is set to 0 (which means that -accents are ignored)

  • -
  • if one string contains -"pre-composed" versions of accented characters and the other contains "decomposed" -versions of the same character

  • -
  • if one string contains -a ligature that, in a collation table, matches multiple characters in the -other string and the collation level is set to less than 3 (for example "æ" -might match "ae")

  • -
  • if one string contains -a "surrogate pair" (a 32-bit encoded character) which happens to match a normal -character at the level specified

  • -
  • if the collation method -does not have its "ignore none" flag set and the collation level is set to -less than 3, then spaces and punctuation are ignored; this means that one -string could be much longer than the other just by adding a large number of -spaces

  • -
  • if one string were to -contain the Hangul representation of Korean and the other were to contain -the Jamo representation of the same Korean and the collation level is set -to less than 3.

  • -

The collation level is an integer that can take one of the values: -0, 1, 2 or 3, and determines how tightly the matching of two strings should -be. This value is passed as the second parameter to CompareC(). -The values have the following meanings:

    -
  • 0 - only test the character -identity; accents and case are ignored

  • -
  • 1 - test the character -identity and accents; case is ignored

  • -
  • 2 - test the character -identity, accents and case

  • -
  • 3 - test the Unicode -value as well as the character identity, accents and case.

  • -

At levels 0-2:

    -
  • ligatures (e.g. "æ") -are the same as their decomposed equivalents (e.g. "ae")

  • -
  • script variants are -the same (for example "R" matches the mathematical real number symbol (Unicode -211D))

  • -
  • the "micro" symbol (Unicode -00B5) matches Greek "mu" (Unicode 03BC).

  • -

At level 3 these are treated differently.

If the aim is to sort strings, -then level 3 must be used. For any strings a and b, -if a < b for some level of collation, -then a < b for all higher -levels of collation as well. Using a collation level lower than 3 therefore cannot -change the relative order of strings that already differ at that level; it only causes -strings that differ solely at higher levels to sort in an arbitrary order. In standard English, sorting at -level 3 gives the following order:

bat < bee < BEE < bus

The -case of the B only affects the comparison after all the letter identities -have been found to be the same. This is usually what people are trying to -achieve by using lower collation levels than 3 for sorting, but a lower level is never necessary for this purpose.

The -sort order can be affected by setting flags in the TCollationMethod object.

Note -that when strings match at level 3, they do not necessarily have the same -binary representation, or even the same length. Unicode contains many strings -that are regarded as equivalent, even though they have different binary representations.

- See also:

TDesC16::CompareC(), TDesC16::MatchC(), TDesC16::FindC()

+ + + + + +Folding +and collation (comparing strings)Describes descriptor folding and descriptor collation. +

There are two techniques that may be used to modify the characters in a +descriptor prior to performing operations such as comparisons on text strings:

+
    +
  • folding

  • +
  • collation

  • +
+
Folding

Folding is a relatively simple way of normalising +text for comparison by removing case distinctions, converting accented characters +to characters without accents etc. Folding is used for tolerant comparisons, +i.e. comparisons that are biased towards a match.

For example, the +file system uses folding to decide whether two file names are identical or +not. Folding is locale-independent behaviour, and means that the file system, +for example, can be locale-independent.

Note +that folding is not guaranteed to be culturally appropriate in any way, +and it should not be used for comparing strings in natural language; collation is +the correct functionality for this.

Variants of member functions +that fold are provided where appropriate. For example, TDesC16::CompareF() for +folded comparison.

See also:

TDesC16::CompareF(), TDesC16::MatchF(), TDesC16::FindF(), TDesC16::LocateF(), TDesC16::LocateReverseF()
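The idea of a folded comparison can be sketched in standard C++. This is a toy model only, not the implementation behind TDesC16::CompareF(): the names FoldChar and CompareFolded are hypothetical, and the fold table covers ASCII case plus a few Latin-1 accented letters chosen for illustration.

```cpp
#include <cstddef>
#include <string>

// Toy fold (assumption: illustrative table only, not a real platform table).
// Maps ASCII upper case to lower case and strips the accent from a few
// Latin-1 letters; every other character is returned unchanged.
inline char16_t FoldChar(char16_t c) {
    if (c >= u'A' && c <= u'Z') {
        return static_cast<char16_t>(c - u'A' + u'a');
    }
    switch (c) {
        case 0x00C0: case 0x00E0: return u'a';  // A/a with grave
        case 0x00C9: case 0x00E9: return u'e';  // E/e with acute
        case 0x00DC: case 0x00FC: return u'u';  // U/u with diaeresis
        default: return c;
    }
}

// Compares two strings after folding each character.
// Returns a negative value, 0, or a positive value, like a Compare function.
inline int CompareFolded(const std::u16string& a, const std::u16string& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        const char16_t x = FoldChar(a[i]);
        const char16_t y = FoldChar(b[i]);
        if (x != y) return x < y ? -1 : 1;
    }
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;
}
```

With this sketch, CompareFolded(u"README.TXT", u"readme.txt") returns 0, which is the kind of tolerant, locale-independent match described above for file names. The fold is a fixed one-to-one character mapping, so it has the limitations that the Collation section goes on to describe.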

+
Collation

Collation +is a much better and more powerful way to compare strings and produces a dictionary-like +('lexicographic') ordering. Folding cannot remove piece accents or deal with +correspondences that are not one-to-one like the mapping from German upper +case SS to lower case ß. In addition, folding cannot optionally ignore punctuation.

For +languages using the Latin script, for example, collation is about deciding +whether to ignore punctuation, whether to fold upper and lower case, how to +treat accents, and so on. In a given locale there is usually a standard set +of collation rules that can be used.

Collation should always be +used for comparing strings in natural language.

Variants of member +functions that use collation are provided where appropriate. For example, TDesC16::CompareC() for +collated comparison.

Comparing +and sorting strings

The TDesC16::CompareC() variant +prototyped as:

TInt CompareC(const TDesC16& aDes, TInt aMaxLevel, const TCollationMethod* aCollationMethod) const;

returns 0 if the two strings match.

There are many ways in which +two strings can match, even when they do not have the same length:

    +
  • if one string includes +combining characters, but the collation level is set to 0 (which means that +accents are ignored)

  • +
  • if one string contains +"pre-composed" versions of accented characters and the other contains "decomposed" +versions of the same character

  • +
  • if one string contains +a ligature that, in a collation table, matches multiple characters in the +other string and the collation level is set to less than 3 (for example "æ" +might match "ae")

  • +
  • if one string contains +a "surrogate pair" (a 32-bit encoded character) which happens to match a normal +character at the level specified

  • +
  • if the collation method +does not have its "ignore none" flag set and the collation level is set to +less than 3, then spaces and punctuation are ignored; this means that one +string could be much longer than the other just by adding a large number of +spaces

  • +
  • if one string were to +contain the Hangul representation of Korean and the other were to contain +the Jamo representation of the same Korean and the collation level is set +to less than 3.

  • +

The collation level is an integer that can take one of the values: +0, 1, 2 or 3, and determines how tightly the matching of two strings should +be. This value is passed as the second parameter to CompareC(). +The values have the following meanings:

    +
  • 0 - only test the character +identity; accents and case are ignored

  • +
  • 1 - test the character +identity and accents; case is ignored

  • +
  • 2 - test the character +identity, accents and case

  • +
  • 3 - test the Unicode +value as well as the character identity, accents and case.

  • +

At levels 0-2:

    +
  • ligatures (e.g. "æ") +are the same as their decomposed equivalents (e.g. "ae")

  • +
  • script variants are +the same (for example "R" matches the mathematical real number symbol (Unicode +211D))

  • +
  • the "micro" symbol (Unicode +00B5) matches Greek "mu" (Unicode 03BC).

  • +

At level 3 these are treated differently.
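The four collation levels can be modelled with a small standard C++ sketch. It is an illustration under stated assumptions, not the CompareC() implementation: Element, Expand and CompareAtLevel are hypothetical names, and the expansion table handles only ASCII letters, "é"/"É" and the ligature "æ".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One collation element: letter identity, accent flag, case flag.
struct Element {
    char16_t base;
    int accent;  // 0 = none, 1 = accented
    int upper;   // 0 = lower case, 1 = upper case
};

// Toy expansion (assumption: tiny illustrative table).
inline std::vector<Element> Expand(const std::u16string& s) {
    std::vector<Element> out;
    for (char16_t c : s) {
        if (c >= u'A' && c <= u'Z') {
            out.push_back({static_cast<char16_t>(c - u'A' + u'a'), 0, 1});
        } else if (c == 0x00E9) {          // e with acute
            out.push_back({u'e', 1, 0});
        } else if (c == 0x00C9) {          // E with acute
            out.push_back({u'e', 1, 1});
        } else if (c == 0x00E6) {          // ligature ae: two elements
            out.push_back({u'a', 0, 0});
            out.push_back({u'e', 0, 0});
        } else {
            out.push_back({c, 0, 0});
        }
    }
    return out;
}

// Compares one field of two element sequences; returns -1, 0 or 1.
template <typename Field>
int CompareField(const std::vector<Element>& a, const std::vector<Element>& b,
                 Field field) {
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        const auto x = field(a[i]);
        const auto y = field(b[i]);
        if (x != y) return x < y ? -1 : 1;
    }
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;
}

// Level 0: identity; 1: plus accents; 2: plus case; 3: plus code-unit values.
inline int CompareAtLevel(const std::u16string& a, const std::u16string& b,
                          int maxLevel) {
    assert(maxLevel >= 0 && maxLevel <= 3);
    const std::vector<Element> ea = Expand(a);
    const std::vector<Element> eb = Expand(b);
    int r = CompareField(ea, eb, [](const Element& e) { return e.base; });
    if (r != 0 || maxLevel == 0) return r;
    r = CompareField(ea, eb, [](const Element& e) { return e.accent; });
    if (r != 0 || maxLevel == 1) return r;
    r = CompareField(ea, eb, [](const Element& e) { return e.upper; });
    if (r != 0 || maxLevel == 2) return r;
    const int raw = a.compare(b);  // level 3: the raw encoded values decide
    return raw < 0 ? -1 : (raw > 0 ? 1 : 0);
}
```

In this model "æon" and "aeon" compare equal at levels 0 to 2 and differ only at level 3, while "cafe" and "CAFE" compare equal at levels 0 and 1 and differ from level 2 upwards. Each higher level is consulted only when all lower levels have found no difference.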

If the aim is to sort strings, +then level 3 must be used. For any strings a and b, +if a < b for some level of collation, +then a < b for all higher +levels of collation as well. Using a collation level lower than 3 therefore cannot +change the relative order of strings that already differ at that level; it only causes +strings that differ solely at higher levels to sort in an arbitrary order. In standard English, sorting at +level 3 gives the following order:

bat < bee < BEE < bus

The +case of the B only affects the comparison after all the letter identities +have been found to be the same. This is usually what people are trying to +achieve by using lower collation levels than 3 for sorting, but a lower level is never necessary for this purpose.
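The ordering bat < bee < BEE < bus can be reproduced with a two-pass comparator in standard C++. This is a sketch of the principle only, restricted to ASCII and assuming lower case sorts before upper case on a tie; LettersThenCaseLess and SortLettersThenCase are hypothetical names, not Symbian functions.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Returns true if a sorts before b.
// Pass 1 compares letter identity only; pass 2 uses case as a tie-break.
inline bool LettersThenCaseLess(const std::string& a, const std::string& b) {
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        const int x = std::tolower(static_cast<unsigned char>(a[i]));
        const int y = std::tolower(static_cast<unsigned char>(b[i]));
        if (x != y) return x < y;
    }
    if (a.size() != b.size()) return a.size() < b.size();
    // All letter identities match: the first case difference decides,
    // with lower case placed first (an assumption of this sketch).
    for (std::size_t i = 0; i < n; ++i) {
        const bool ua = std::isupper(static_cast<unsigned char>(a[i])) != 0;
        const bool ub = std::isupper(static_cast<unsigned char>(b[i])) != 0;
        if (ua != ub) return !ua;
    }
    return false;
}

// Sorts a copy of the input with the comparator above.
inline std::vector<std::string> SortLettersThenCase(std::vector<std::string> words) {
    std::sort(words.begin(), words.end(), LettersThenCaseLess);
    return words;
}
```

Sorting {"bus", "BEE", "bat", "bee"} with this comparator yields bat, bee, BEE, bus. A plain byte-wise sort of the same input would place BEE first, because the ASCII code for "B" is lower than the code for "b".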

The +sort order can be affected by setting flags in the TCollationMethod object.

Note +that when strings match at level 3, they do not necessarily have the same +binary representation, or even the same length. Unicode contains many strings +that are regarded as equivalent, even though they have different binary representations.

+ See also:

TDesC16::CompareC(), TDesC16::MatchC(), TDesC16::FindC()

\ No newline at end of file