Symbian3/SDK/Source/GUID-C501E703-E39D-598C-B962-7A32AC9091DD.dita
changeset 0 89d6a7a84779
equal deleted inserted replaced
-1:000000000000 0:89d6a7a84779
       
     1 <?xml version="1.0" encoding="utf-8"?>
       
     2 <!-- Copyright (c) 2007-2010 Nokia Corporation and/or its subsidiary(-ies) All rights reserved. -->
       
     3 <!-- This component and the accompanying materials are made available under the terms of the License 
       
     4 "Eclipse Public License v1.0" which accompanies this distribution, 
       
     5 and is available at the URL "http://www.eclipse.org/legal/epl-v10.html". -->
       
     6 <!-- Initial Contributors:
       
     7     Nokia Corporation - initial contribution.
       
     8 Contributors: 
       
     9 -->
       
    10 <!DOCTYPE concept
       
    11   PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
       
    12 <concept id="GUID-C501E703-E39D-598C-B962-7A32AC9091DD" xml:lang="en"><title>Folding
       
    13 and collation (comparing strings)</title><shortdesc>Describes descriptor folding and descriptor collation.</shortdesc><prolog><metadata><keywords/></metadata></prolog><conbody>
       
    14 <p>There are two techniques that may be used to modify the characters in a
       
    15 descriptor prior to performing operations such as comparisons on text strings: </p>
       
    16 <ul>
       
    17 <li id="GUID-C550B19E-0312-52F4-936C-95C53D4D4FA9"><p>folding </p> </li>
       
    18 <li id="GUID-77D0249A-F9B7-5189-808F-C4FB88BED5B3"><p>collation </p> </li>
       
    19 </ul>
       
    20 <section id="GUID-4AD769A8-A90B-4BE5-B514-DCE9C808C4A8"><title>Folding</title> <p>Folding is a relatively simple way of normalising
       
    21 text for comparison by removing case distinctions, converting accented characters
       
    22 to characters without accents etc. Folding is used for tolerant comparisons,
       
    23 i.e. comparisons that are biased towards a match. </p> <p>For example, the
       
    24 file system uses folding to decide whether two file names are identical or
       
    25 not. Folding is locale-independent behaviour, and means that the file system,
       
    26 for example, can be locale-independent. </p> <p> <i> It is important to note
       
    27 that there can be no guarantee that folding is in any way culturally appropriate,
       
    28 and should not be used for comparing strings in natural language; </i> <xref href="GUID-C501E703-E39D-598C-B962-7A32AC9091DD.dita#GUID-C501E703-E39D-598C-B962-7A32AC9091DD/GUID-F93D3C40-FDB4-5D92-A90C-736BB0225982">collation</xref> <i>is
       
    29 the correct functionality for this.</i>  </p> <p>Variants of member functions
       
    30 that fold are provided where appropriate. For example, <xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-7BDF7FA1-39FF-35D2-97DE-12A223514345"><apiname>TDesC16::CompareF()</apiname></xref> for
       
    31 folded comparison. </p> <p>See also: </p><p><xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-7BDF7FA1-39FF-35D2-97DE-12A223514345"><apiname>TDesC16::CompareF()</apiname></xref><xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-57DED784-A51D-308B-888C-968EFB35B732"><apiname>TDesC16::MatchF()</apiname></xref><xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-D4BDA3FC-E11A-392B-A8A5-B468AC800396"><apiname>TDesC16::FindF()</apiname></xref><xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-F88740FB-C90A-30AF-AA19-E2260EB39A47"><apiname>TDesC16::LocateF()</apiname></xref> <xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-F88740FB-C90A-30AF-AA19-E2260EB39A47"><apiname>TDesC16::LocateF()</apiname></xref> <xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-BE28DE82-AEF1-3E71-A0E1-7A053095B5B0"><apiname>TDesC16::LocateReverseF()</apiname></xref></p> </section>
       
    32 <section id="GUID-F93D3C40-FDB4-5D92-A90C-736BB0225982"><title>Collation</title> <p>Collation
       
    33 is a much better and more powerful way to compare strings and produces a dictionary-like
       
    34 ('lexicographic') ordering. Folding cannot remove piece accents or deal with
       
    35 correspondences that are not one-to-one like the mapping from German upper
       
    36 case SS to lower case ß. In addition, folding cannot optionally ignore punctuation. </p> <p>For
       
    37 languages using the Latin script, for example, collation is about deciding
       
    38 whether to ignore punctuation, whether to fold upper and lower case, how to
       
    39 treat accents, and so on. In a given locale there is usually a standard set
       
    40 of collation rules that can be used. </p> <p> <i>Collation should always be
       
    41 used for comparing strings in natural language.</i>  </p> <p>Variants of member
       
    42 functions that use collation are provided where appropriate. For example, <xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-8B44C890-6E64-37CF-B3D9-AEF9EFCBA284"><apiname>TDesC16::CompareC()</apiname></xref> for
       
    43 collated comparison. </p> <p><b>Comparing
       
    44 and sorting strings</b> </p> <p>The <xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-8B44C890-6E64-37CF-B3D9-AEF9EFCBA284"><apiname>TDesC16::CompareC()</apiname></xref> variant
       
    45 prototyped as: </p> <codeblock id="GUID-42E8C509-DA19-50AD-8A80-D381F812639A" xml:space="preserve">TInt CompareC(const TDesC16&amp; aDes, TInt aMaxLevel, const TCollationMethod* aCollationMethod) const;</codeblock> <p>returns 0, if two strings match. </p> <p>There are many ways in which
       
    46 two strings can match, even when they do not have the same length: </p> <ul>
       
    47 <li id="GUID-8BD9DEF8-3B24-576D-B634-0E38BA2D5859"><p>if one string includes
       
    48 combining characters, but the collation level is set to 0 (which means that
       
    49 accents are ignored) </p> </li>
       
    50 <li id="GUID-999CCABF-D8DC-5967-90F1-7739EDCAEFF9"><p>if one string contains
       
    51 "pre-composed" versions of accented characters and the other contains "decomposed"
       
    52 versions of the same character </p> </li>
       
    53 <li id="GUID-9491E1CF-9BA3-52D6-8996-FC32BB54538A"><p>if one string contains
       
    54 a ligature that, in a collation table, matches multiple characters in the
       
    55 other string and the collation level is set to less than 3 (for example "æ"
       
    56 might match "ae") </p> </li>
       
    57 <li id="GUID-2D4FF198-7C41-5838-97FB-8A23A2D3DD49"><p>if one string contains
       
    58 a "surrogate pair" (a 32-bit encoded character) which happens to match a normal
       
    59 character at the level specified </p> </li>
       
    60 <li id="GUID-66B770C7-3215-559A-B70D-68FF3AC9DDAC"><p>if the collation method
       
    61 does not have its "ignore none" flag set and the collation level is set to
       
    62 less than 3, then spaces and punctuation are ignored; this means that one
       
    63 string could be much longer than the other just by adding a large number of
       
    64 spaces </p> </li>
       
    65 <li id="GUID-33026C64-3740-527A-97E0-DCBB6D7F087D"><p>if one string were to
       
    66 contain the Hangul representation of Korean and the other were to contain
       
    67 the Jamo representation of the same Korean and the collation level is set
       
    68 to less than 3. </p> </li>
       
    69 </ul> <p>The collation level is an integer that can take one of the values:
       
    70 0, 1, 2 or 3, and determines how tightly the matching of two strings should
       
    71 be. This value is passed as the second parameter to <codeph>CompareC()</codeph>.
       
    72 The values have the following meanings: </p> <ul>
       
    73 <li id="GUID-EDA2DAB4-E6B4-5815-ADC1-1BDF216D0C2E"><p>0 - only test the character
       
    74 identity; accents and case are ignored </p> </li>
       
    75 <li id="GUID-B0300A5F-6AF5-5722-B538-E0DBDE25576B"><p>1 - test the character
       
    76 identity and accents; case is ignored </p> </li>
       
    77 <li id="GUID-B6B7E302-120D-5240-B2AE-C14B19A36FF6"><p>2 - test the character
       
    78 identity, accents and case </p> </li>
       
    79 <li id="GUID-22E919E8-819D-5DE2-BABA-ECCC6F105B3C"><p>3 - test the Unicode
       
    80 value as well as the character identity, accents and case. </p> </li>
       
    81 </ul> <p>At levels 0-2: </p> <ul>
       
    82 <li id="GUID-E08F7C36-2E53-5348-A8D4-6A1E420CEB76"><p>ligatures (e.g. "æ")
       
    83 are the same as their decomposed equivalents (e.g. "ae") </p> </li>
       
    84 <li id="GUID-937B9898-878A-59BE-B185-A5347038F14C"><p>script variants are
       
    85 the same (for example "R" matches the mathematical real number symbol (Unicode
       
    86 211D) </p> </li>
       
    87 <li id="GUID-6D1AF8EB-E8E8-565E-BA1E-CAED0382D801"><p>the "micro" symbol (Unicode
       
    88 00B5) matches Greek "mu" (Unicode 03BC)). </p> </li>
       
    89 </ul> <p>At level 3 these are treated differently. </p> <p>If the aim is to <b>sort</b> strings,
       
    90 then <b>level 3 must be used</b>. For any strings <codeph>a</codeph> and <codeph>b</codeph>,
       
    91 if <codeph>a</codeph> &lt; <codeph>b</codeph> for some level of collation,
       
    92 then <codeph>              a</codeph> &lt; <codeph>b</codeph> for all higher
       
    93 levels of collation as well. It is impossible, therefore, to affect the order
       
    94 that is generated by using lower collation levels than 3. This just causes
       
    95 similar strings to sort in a random order. In standard English, sorting at
       
    96 level 3 gives the following order: </p> <p>bat &lt; bee &lt; BEE &lt; bus </p> <p>The
       
    97 case of the B only affects the comparison after all the letter identities
       
    98 have been found to be the same - this is usually what people are trying to
       
    99 achieve by using lower collation levels than 3 for sorting. It is never necessary. </p> <p>The
       
   100 sort order can be affected by setting flags in the <xref href="GUID-78C4965C-BFCD-3E7E-8F46-2EE3D1BAF6EC.dita"><apiname>TCollationMethod</apiname></xref> object. </p> <p>Note
       
   101 that when strings match at level 3, they do not necessarily have the same
       
   102 binary representation, or even the same length. Unicode contains many strings
       
   103 that are regarded as equivalent, even though they have different binary representations. </p><p> 
       
   104        See also:  </p><p><xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-8B44C890-6E64-37CF-B3D9-AEF9EFCBA284"><apiname>TDesC16::CompareC()</apiname></xref><xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-ACEEA02F-2594-3C61-B7A9-E96F0737C3AE"><apiname>TDesC16::MatchC()</apiname></xref><xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-33D33034-0757-31F9-B3A2-BA351AADC816"><apiname>TDesC16::FindC()</apiname></xref></p> </section>
       
   105 </conbody></concept>