<?xml version="1.0" encoding="utf-8"?>
<!-- Copyright (c) 2007-2010 Nokia Corporation and/or its subsidiary(-ies) All rights reserved. -->
<!-- This component and the accompanying materials are made available under the terms of the License
"Eclipse Public License v1.0" which accompanies this distribution,
and is available at the URL "http://www.eclipse.org/legal/epl-v10.html". -->
<!-- Initial Contributors:
    Nokia Corporation - initial contribution.
Contributors:
-->
<!DOCTYPE concept
  PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="GUID-C501E703-E39D-598C-B962-7A32AC9091DD" xml:lang="en"><title>Folding
and collation (comparing strings)</title><shortdesc>Describes descriptor folding and descriptor collation.</shortdesc><prolog><metadata><keywords/></metadata></prolog><conbody>
<p>There are two techniques that may be used to modify the characters in a
descriptor prior to performing operations such as comparisons on text strings: </p>
<ul>
<li id="GUID-C550B19E-0312-52F4-936C-95C53D4D4FA9"><p>folding </p> </li>
<li id="GUID-77D0249A-F9B7-5189-808F-C4FB88BED5B3"><p>collation </p> </li>
</ul>
<section id="GUID-4AD769A8-A90B-4BE5-B514-DCE9C808C4A8"><title>Folding</title> <p>Folding is a relatively simple way of normalising
text for comparison by removing case distinctions, converting accented characters
to characters without accents, and so on. Folding is used for tolerant comparisons,
that is, comparisons that are biased towards a match. </p> <p>For example, the
file system uses folding to decide whether two file names are identical.
Folding is locale-independent, which means that the file system,
for example, can also be locale-independent. </p> <p> <i>Note
that there is no guarantee that folding is in any way culturally appropriate,
so it should not be used for comparing strings in natural language; </i> <xref href="GUID-C501E703-E39D-598C-B962-7A32AC9091DD.dita#GUID-C501E703-E39D-598C-B962-7A32AC9091DD/GUID-F93D3C40-FDB4-5D92-A90C-736BB0225982">collation</xref> <i>is
the correct functionality for this.</i> </p> <p>Variants of member functions
that fold are provided where appropriate, for example <xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-7BDF7FA1-39FF-35D2-97DE-12A223514345"><apiname>TDesC16::CompareF()</apiname></xref> for
folded comparison. </p> <p>See also: </p><p><xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-7BDF7FA1-39FF-35D2-97DE-12A223514345"><apiname>TDesC16::CompareF()</apiname></xref>, <xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-57DED784-A51D-308B-888C-968EFB35B732"><apiname>TDesC16::MatchF()</apiname></xref>, <xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-D4BDA3FC-E11A-392B-A8A5-B468AC800396"><apiname>TDesC16::FindF()</apiname></xref>, <xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-F88740FB-C90A-30AF-AA19-E2260EB39A47"><apiname>TDesC16::LocateF()</apiname></xref>, <xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-BE28DE82-AEF1-3E71-A0E1-7A053095B5B0"><apiname>TDesC16::LocateReverseF()</apiname></xref></p> </section>
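<example><title>Sketch of a folded comparison</title> <p>The idea of folding can be illustrated with a minimal sketch in standard C++. This is not the Symbian implementation: <codeph>FoldChar()</codeph> and <codeph>CompareFolded()</codeph> are hypothetical helpers, and the small fold table is an assumption chosen for illustration only. The sketch removes case distinctions and a few accents one character at a time, then compares the results. </p> <codeblock xml:space="preserve"><![CDATA[
```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: fold one character by removing case and a few accents.
// The mappings below are illustrative only, not a real folding table.
char32_t FoldChar(char32_t c) {
    switch (c) {
        case U'\u00E9': case U'\u00C9': return U'e';  // e with acute
        case U'\u00E0': case U'\u00C0': return U'a';  // a with grave
        case U'\u00FC': case U'\u00DC': return U'u';  // u with diaeresis
        default: break;
    }
    if (c >= U'A' && c <= U'Z') {
        return static_cast<char32_t>(c - U'A' + U'a');  // remove case
    }
    return c;
}

// Hypothetical helper: returns 0 on a match, otherwise a negative or
// positive value, in the style of a descriptor comparison function.
int CompareFolded(const std::u32string& a, const std::u32string& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        const char32_t fa = FoldChar(a[i]);
        const char32_t fb = FoldChar(b[i]);
        if (fa != fb) return fa < fb ? -1 : 1;
    }
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;
}
```
]]></codeblock> <p>With these helpers, <codeph>CompareFolded(U"README.TXT", U"readme.txt")</codeph> reports a match. Because the mapping works on one character at a time, "STRASSE" and "straße" do not match, which is the kind of correspondence that needs collation. </p> </example>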
<section id="GUID-F93D3C40-FDB4-5D92-A90C-736BB0225982"><title>Collation</title> <p>Collation
is a more powerful way to compare strings than folding, and produces a dictionary-like
('lexicographic') ordering. Folding cannot remove piece accents or deal with
correspondences that are not one-to-one, like the mapping from German upper
case SS to lower case ß. In addition, folding cannot optionally ignore punctuation. </p> <p>For
languages using the Latin script, for example, collation is about deciding
whether to ignore punctuation, whether to fold upper and lower case, how to
treat accents, and so on. In a given locale there is usually a standard set
of collation rules that can be used. </p> <p> <i>Collation should always be
used for comparing strings in natural language.</i> </p> <p>Variants of member
functions that use collation are provided where appropriate, for example <xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-8B44C890-6E64-37CF-B3D9-AEF9EFCBA284"><apiname>TDesC16::CompareC()</apiname></xref> for
collated comparison. </p> <p><b>Comparing
and sorting strings</b> </p> <p>The <xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-8B44C890-6E64-37CF-B3D9-AEF9EFCBA284"><apiname>TDesC16::CompareC()</apiname></xref> variant
prototyped as: </p> <codeblock id="GUID-42E8C509-DA19-50AD-8A80-D381F812639A" xml:space="preserve">TInt CompareC(const TDesC16&amp; aDes, TInt aMaxLevel, const TCollationMethod* aCollationMethod) const;</codeblock> <p>returns 0 if the two strings match. </p> <p>There are many ways in which
two strings can match, even when they do not have the same length: </p> <ul>
<li id="GUID-8BD9DEF8-3B24-576D-B634-0E38BA2D5859"><p>if one string includes
combining characters, but the collation level is set to 0 (which means that
accents are ignored) </p> </li>
<li id="GUID-999CCABF-D8DC-5967-90F1-7739EDCAEFF9"><p>if one string contains
"pre-composed" versions of accented characters and the other contains "decomposed"
versions of the same characters </p> </li>
<li id="GUID-9491E1CF-9BA3-52D6-8996-FC32BB54538A"><p>if one string contains
a ligature that, in a collation table, matches multiple characters in the
other string and the collation level is set to less than 3 (for example "æ"
might match "ae") </p> </li>
<li id="GUID-2D4FF198-7C41-5838-97FB-8A23A2D3DD49"><p>if one string contains
a "surrogate pair" (two 16-bit code units that together encode a single character) which happens to match a normal
character at the level specified </p> </li>
<li id="GUID-66B770C7-3215-559A-B70D-68FF3AC9DDAC"><p>if the collation method
does not have its "ignore none" flag set and the collation level is set to
less than 3, then spaces and punctuation are ignored; this means that one
string could be much longer than the other just by adding a large number of
spaces </p> </li>
<li id="GUID-33026C64-3740-527A-97E0-DCBB6D7F087D"><p>if one string contains
the Hangul representation of Korean text, the other contains
the Jamo representation of the same text, and the collation level is set
to less than 3. </p> </li>
</ul> <p>The collation level is an integer that can take one of the values
0, 1, 2 or 3, and determines how strictly the two strings must match.
This value is passed as the second parameter to <codeph>CompareC()</codeph>.
The values have the following meanings: </p> <ul>
<li id="GUID-EDA2DAB4-E6B4-5815-ADC1-1BDF216D0C2E"><p>0 - only test the character
identity; accents and case are ignored </p> </li>
<li id="GUID-B0300A5F-6AF5-5722-B538-E0DBDE25576B"><p>1 - test the character
identity and accents; case is ignored </p> </li>
<li id="GUID-B6B7E302-120D-5240-B2AE-C14B19A36FF6"><p>2 - test the character
identity, accents and case </p> </li>
<li id="GUID-22E919E8-819D-5DE2-BABA-ECCC6F105B3C"><p>3 - test the Unicode
value as well as the character identity, accents and case. </p> </li>
</ul> <p>At levels 0-2: </p> <ul>
<li id="GUID-E08F7C36-2E53-5348-A8D4-6A1E420CEB76"><p>ligatures (e.g. "æ")
are the same as their decomposed equivalents (e.g. "ae") </p> </li>
<li id="GUID-937B9898-878A-59BE-B185-A5347038F14C"><p>script variants are
the same (for example, "R" matches the mathematical real number symbol, Unicode
211D) </p> </li>
<li id="GUID-6D1AF8EB-E8E8-565E-BA1E-CAED0382D801"><p>the "micro" symbol (Unicode
00B5) matches Greek "mu" (Unicode 03BC). </p> </li>
</ul> <p>At level 3 these are treated differently. </p> <p>If the aim is to <b>sort</b> strings,
then <b>level 3 must be used</b>. For any strings <codeph>a</codeph> and <codeph>b</codeph>,
if <codeph>a</codeph> &lt; <codeph>b</codeph> for some level of collation,
then <codeph>a</codeph> &lt; <codeph>b</codeph> for all higher
levels of collation as well. It is therefore impossible to change the order
by using a collation level lower than 3. Doing so just causes
strings that compare as equal at the lower level to sort in an arbitrary order. In standard English, sorting at
level 3 gives the following order: </p> <p>bat &lt; bee &lt; BEE &lt; bus </p> <p>The
case of the B only affects the comparison after all the letter identities
have been found to be the same. This is usually what people are trying to
achieve by using collation levels lower than 3 for sorting, so a lower level is never necessary. </p> <p>The
sort order can be affected by setting flags in the <xref href="GUID-78C4965C-BFCD-3E7E-8F46-2EE3D1BAF6EC.dita"><apiname>TCollationMethod</apiname></xref> object. </p> <p>Note
that when strings match at level 3, they do not necessarily have the same
binary representation, or even the same length. Unicode contains many strings
that are regarded as equivalent, even though they have different binary representations. </p><p>
See also: </p><p><xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-8B44C890-6E64-37CF-B3D9-AEF9EFCBA284"><apiname>TDesC16::CompareC()</apiname></xref>, <xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-ACEEA02F-2594-3C61-B7A9-E96F0737C3AE"><apiname>TDesC16::MatchC()</apiname></xref>, <xref href="GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23.dita#GUID-440FF2B4-353B-3097-A2BA-5887D10B8B23/GUID-33D33034-0757-31F9-B3A2-BA351AADC816"><apiname>TDesC16::FindC()</apiname></xref></p> </section>
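<example><title>Sketch of comparison by collation level</title> <p>The way the collation levels build on each other can be modelled with a minimal sketch in standard C++. This is a simplified model, not the Symbian implementation: <codeph>Key</codeph>, <codeph>KeyFor()</codeph>, <codeph>CompareLevels()</codeph> and <codeph>SortAtLevel3()</codeph> are hypothetical names, and the weights are assumptions that cover only ASCII letters and "é". Real collation also uses locale tables, handles ligatures and decomposed characters, and can ignore punctuation, none of which is modelled here. </p> <codeblock xml:space="preserve"><![CDATA[
```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical collation key for one character: one weight per level.
struct Key {
    int identity;    // level 0: which letter it is
    int accent;      // level 1: 0 = no accent, 1 = acute
    int letterCase;  // level 2: 0 = lower case, 1 = upper case
    int code;        // level 3: the Unicode value itself
};

// Illustrative weights only; a real implementation reads a collation table.
Key KeyFor(char32_t c) {
    const int code = static_cast<int>(c);
    if (c == U'\u00E9') return {'e', 1, 0, code};  // e with acute
    if (c == U'\u00C9') return {'e', 1, 1, code};  // E with acute
    if (c >= U'A' && c <= U'Z') return {code - 'A' + 'a', 0, 1, code};
    return {code, 0, 0, code};
}

int WeightAt(const Key& k, int level) {
    switch (level) {
        case 0: return k.identity;
        case 1: return k.accent;
        case 2: return k.letterCase;
        default: return k.code;
    }
}

// Compares whole strings at level 0 first; a higher level is consulted
// only if every lower level found the strings to be the same.
int CompareLevels(const std::u32string& a, const std::u32string& b, int maxLevel) {
    for (int level = 0; level <= maxLevel; ++level) {
        const std::size_t n = std::min(a.size(), b.size());
        for (std::size_t i = 0; i < n; ++i) {
            const int wa = WeightAt(KeyFor(a[i]), level);
            const int wb = WeightAt(KeyFor(b[i]), level);
            if (wa != wb) return wa < wb ? -1 : 1;
        }
        if (a.size() != b.size()) return a.size() < b.size() ? -1 : 1;
    }
    return 0;
}

// Sorting uses the highest level so that the resulting order is fully determined.
std::vector<std::u32string> SortAtLevel3(std::vector<std::u32string> v) {
    std::sort(v.begin(), v.end(),
              [](const std::u32string& x, const std::u32string& y) {
                  return CompareLevels(x, y, 3) < 0;
              });
    return v;
}
```
]]></codeblock> <p>In this model "bee" and "BEE" match at levels 0 and 1 and differ only from level 2 upwards, and sorting "bus", "BEE", "bat" and "bee" with <codeph>SortAtLevel3()</codeph> gives bat, bee, BEE, bus. Case is consulted only after the letter identities have been found to be the same. </p> </example>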
</conbody></concept>