<?xml version="1.0" encoding="utf-8"?>
<!-- Copyright (c) 2007-2010 Nokia Corporation and/or its subsidiary(-ies) All rights reserved. -->
<!-- This component and the accompanying materials are made available under the terms of the License
"Eclipse Public License v1.0" which accompanies this distribution,
and is available at the URL "http://www.eclipse.org/legal/epl-v10.html". -->
<!-- Initial Contributors:
Nokia Corporation - initial contribution.
Contributors:
-->
<!DOCTYPE concept
PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="GUID-C501E703-E39D-598C-B962-7A32AC9091DD" xml:lang="en"><title>Folding
and collation (comparing strings)</title><shortdesc>Describes descriptor folding and descriptor collation.</shortdesc><prolog><metadata><keywords/></metadata></prolog><conbody>
<p>There are two techniques that may be used to modify the characters in a
descriptor prior to performing operations such as comparisons on text strings: </p>
<ul>
<li id="GUID-C550B19E-0312-52F4-936C-95C53D4D4FA9"><p>folding </p> </li>
<li id="GUID-77D0249A-F9B7-5189-808F-C4FB88BED5B3"><p>collation </p> </li>
</ul>
<section id="GUID-4AD769A8-A90B-4BE5-B514-DCE9C808C4A8"><title>Folding</title> <p>Folding is a relatively simple way of normalising
text for comparison by removing case distinctions, converting accented characters
to characters without accents etc. Folding is used for tolerant comparisons,
i.e. comparisons that are biased towards a match. </p> <p>For example, the
file system uses folding to decide whether two file names are identical or
not. Folding is locale-independent behaviour, and means that the file system,
for example, can be locale-independent. </p> <p> <i> It is important to note
that there can be no guarantee that folding is in any way culturally appropriate,
and should not be used for comparing strings in natural language; </i> <xref href="GUID-C501E703-E39D-598C-B962-7A32AC9091DD.dita#GUID-C501E703-E39D-598C-B962-7A32AC9091DD/GUID-F93D3C40-FDB4-5D92-A90C-736BB0225982">collation</xref> <i>is
the correct functionality for this.</i> </p> <p>Variants of member functions
that fold are provided where appropriate. For example, <xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-7BDF7FA1-39FF-35D2-97DE-12A223514345"><apiname>TDesC16::CompareF()</apiname></xref> for
folded comparison. </p> <p>See also: </p><p><xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-7BDF7FA1-39FF-35D2-97DE-12A223514345"><apiname>TDesC16::CompareF()</apiname></xref><xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-57DED784-A51D-308B-888C-968EFB35B732"><apiname>TDesC16::MatchF()</apiname></xref><xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-D4BDA3FC-E11A-392B-A8A5-B468AC800396"><apiname>TDesC16::FindF()</apiname></xref><xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-F88740FB-C90A-30AF-AA19-E2260EB39A47"><apiname>TDesC16::LocateF()</apiname></xref> <xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-F88740FB-C90A-30AF-AA19-E2260EB39A47"><apiname>TDesC16::LocateF()</apiname></xref> <xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-BE28DE82-AEF1-3E71-A0E1-7A053095B5B0"><apiname>TDesC16::LocateReverseF()</apiname></xref></p> </section>
<section id="GUID-F93D3C40-FDB4-5D92-A90C-736BB0225982"><title>Collation</title> <p>Collation
is a much better and more powerful way to compare strings and produces a dictionary-like
('lexicographic') ordering. Folding cannot remove piece accents or deal with
correspondences that are not one-to-one like the mapping from German upper
case SS to lower case ß. In addition, folding cannot optionally ignore punctuation. </p> <p>For
languages using the Latin script, for example, collation is about deciding
whether to ignore punctuation, whether to fold upper and lower case, how to
treat accents, and so on. In a given locale there is usually a standard set
of collation rules that can be used. </p> <p> <i>Collation should always be
used for comparing strings in natural language.</i> </p> <p>Variants of member
functions that use collation are provided where appropriate. For example, <xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-8B44C890-6E64-37CF-B3D9-AEF9EFCBA284"><apiname>TDesC16::CompareC()</apiname></xref> for
collated comparison. </p> <p><b>Comparing
and sorting strings</b> </p> <p>The <xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-8B44C890-6E64-37CF-B3D9-AEF9EFCBA284"><apiname>TDesC16::CompareC()</apiname></xref> variant
prototyped as: </p> <codeblock id="GUID-42E8C509-DA19-50AD-8A80-D381F812639A" xml:space="preserve">TInt CompareC(const TDesC16& aDes, TInt aMaxLevel, const TCollationMethod* aCollationMethod) const;</codeblock> <p>returns 0, if two strings match. </p> <p>There are many ways in which
two strings can match, even when they do not have the same length: </p> <ul>
<li id="GUID-8BD9DEF8-3B24-576D-B634-0E38BA2D5859"><p>if one string includes
combining characters, but the collation level is set to 0 (which means that
accents are ignored) </p> </li>
<li id="GUID-999CCABF-D8DC-5967-90F1-7739EDCAEFF9"><p>if one string contains
"pre-composed" versions of accented characters and the other contains "decomposed"
versions of the same character </p> </li>
<li id="GUID-9491E1CF-9BA3-52D6-8996-FC32BB54538A"><p>if one string contains
a ligature that, in a collation table, matches multiple characters in the
other string and the collation level is set to less than 3 (for example "æ"
might match "ae") </p> </li>
<li id="GUID-2D4FF198-7C41-5838-97FB-8A23A2D3DD49"><p>if one string contains
a "surrogate pair" (a 32-bit encoded character) which happens to match a normal
character at the level specified </p> </li>
<li id="GUID-66B770C7-3215-559A-B70D-68FF3AC9DDAC"><p>if the collation method
does not have its "ignore none" flag set and the collation level is set to
less than 3, then spaces and punctuation are ignored; this means that one
string could be much longer than the other just by adding a large number of
spaces </p> </li>
<li id="GUID-33026C64-3740-527A-97E0-DCBB6D7F087D"><p>if one string were to
contain the Hangul representation of Korean and the other were to contain
the Jamo representation of the same Korean and the collation level is set
to less than 3. </p> </li>
</ul> <p>The collation level is an integer that can take one of the values:
0, 1, 2 or 3, and determines how tightly the matching of two strings should
be. This value is passed as the second parameter to <codeph>CompareC()</codeph>.
The values have the following meanings: </p> <ul>
<li id="GUID-EDA2DAB4-E6B4-5815-ADC1-1BDF216D0C2E"><p>0 - only test the character
identity; accents and case are ignored </p> </li>
<li id="GUID-B0300A5F-6AF5-5722-B538-E0DBDE25576B"><p>1 - test the character
identity and accents; case is ignored </p> </li>
<li id="GUID-B6B7E302-120D-5240-B2AE-C14B19A36FF6"><p>2 - test the character
identity, accents and case </p> </li>
<li id="GUID-22E919E8-819D-5DE2-BABA-ECCC6F105B3C"><p>3 - test the Unicode
value as well as the character identity, accents and case. </p> </li>
</ul> <p>At levels 0-2: </p> <ul>
<li id="GUID-E08F7C36-2E53-5348-A8D4-6A1E420CEB76"><p>ligatures (e.g. "æ")
are the same as their decomposed equivalents (e.g. "ae") </p> </li>
<li id="GUID-937B9898-878A-59BE-B185-A5347038F14C"><p>script variants are
the same (for example "R" matches the mathematical real number symbol (Unicode
211D) </p> </li>
<li id="GUID-6D1AF8EB-E8E8-565E-BA1E-CAED0382D801"><p>the "micro" symbol (Unicode
00B5) matches Greek "mu" (Unicode 03BC)). </p> </li>
</ul> <p>At level 3 these are treated differently. </p> <p>If the aim is to <b>sort</b> strings,
then <b>level 3 must be used</b>. For any strings <codeph>a</codeph> and <codeph>b</codeph>,
if <codeph>a</codeph> < <codeph>b</codeph> for some level of collation,
then <codeph> a</codeph> < <codeph>b</codeph> for all higher
levels of collation as well. It is impossible, therefore, to affect the order
that is generated by using lower collation levels than 3. This just causes
similar strings to sort in a random order. In standard English, sorting at
level 3 gives the following order: </p> <p>bat < bee < BEE < bus </p> <p>The
case of the B only affects the comparison after all the letter identities
have been found to be the same - this is usually what people are trying to
achieve by using lower collation levels than 3 for sorting. It is never necessary. </p> <p>The
sort order can be affected by setting flags in the <xref href="GUID-78C4965C-BFCD-3E7E-8F46-2EE3D1BAF6EC.dita"><apiname>TCollationMethod</apiname></xref> object. </p> <p>Note
that when strings match at level 3, they do not necessarily have the same
binary representation, or even the same length. Unicode contains many strings
that are regarded as equivalent, even though they have different binary representations. </p><p>
See also: </p><p><xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-8B44C890-6E64-37CF-B3D9-AEF9EFCBA284"><apiname>TDesC16::CompareC()</apiname></xref><xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-ACEEA02F-2594-3C61-B7A9-E96F0737C3AE"><apiname>TDesC16::MatchC()</apiname></xref><xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-33D33034-0757-31F9-B3A2-BA351AADC816"><apiname>TDesC16::FindC()</apiname></xref></p> </section>
</conbody></concept>