buildframework/helium/external/python/lib/common/Sphinx-0.5.1-py2.5.egg/sphinx/util/stemmer.py
author wbernard
Wed, 23 Dec 2009 19:29:07 +0200
changeset 179 d8ac696cc51f
permissions -rw-r--r--
helium_7.0-r14027
# -*- coding: utf-8 -*-
"""
    sphinx.util.stemmer
    ~~~~~~~~~~~~~~~~~~~

    Porter Stemming Algorithm

    This is the Porter stemming algorithm, ported to Python from the
    version coded up in ANSI C by the author. It may be regarded
    as canonical, in that it follows the algorithm presented in

    Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14,
    no. 3, pp 130-137,

    only differing from it at the points marked --DEPARTURE-- below.

    See also http://www.tartarus.org/~martin/PorterStemmer

    The algorithm as described in the paper could be exactly replicated
    by adjusting the points of DEPARTURE, but this is barely necessary,
    because (a) the points of DEPARTURE are definitely improvements, and
    (b) no encoding of the Porter stemmer I have seen is anything like
    as exact as this version, even with the points of DEPARTURE!

    Release 1: January 2001

    :copyright: 2001 by Vivake Gupta <v@nano.com>.
    :license: Public Domain ("can be used free of charge for any purpose").
"""

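# Illustrative sketch, NOT part of the original module.  The "measure" that
# PorterStemmer.m() computes further down is easiest to see on whole words:
# it counts how many times a vowel run is followed by a consonant run, i.e.
# the m in the form <c>(vc)^m<v>.  The helper names `_is_consonant` and
# `measure` are hypothetical and exist only for this example; the sketch
# assumes a lower-case ASCII word and always scans the entire word, whereas
# m() only looks at b[k0..j].

def _is_consonant(word, i):
    """Return True if word[i] is a consonant under Porter's rules."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a consonant at the start of the word or directly after a
        # vowel; after a consonant it behaves as a vowel.
        return i == 0 or not _is_consonant(word, i - 1)
    return True


def measure(word):
    """Count the vowel-run -> consonant-run transitions in word."""
    count = 0
    prev_was_vowel = False
    for i in range(len(word)):
        is_cons = _is_consonant(word, i)
        if is_cons and prev_was_vowel:
            count += 1
        prev_was_vowel = not is_cons
    return count

# For instance: measure("tree") == 0, measure("trouble") == 1 and
# measure("troubles") == 2.
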
class PorterStemmer(object):

    def __init__(self):
        """The main part of the stemming algorithm starts here.
        b is a buffer holding a word to be stemmed. The letters are in b[k0],
        b[k0+1] ... ending at b[k]. In fact k0 = 0 in this demo program. k is
        readjusted downwards as the stemming progresses. Zero termination is
        not in fact used in the algorithm.

        Note that only lower case sequences are stemmed. Forcing to lower case
        should be done before stem(...) is called.
        """

        self.b = ""  # buffer for word to be stemmed
        self.k = 0   # index of the last letter of the word in b
        self.k0 = 0  # index of the first letter of the word in b
        self.j = 0   # j is a general offset into the string

    def cons(self, i):
        """cons(i) is TRUE <=> b[i] is a consonant."""
        if self.b[i] == 'a' or self.b[i] == 'e' or self.b[i] == 'i' \
            or self.b[i] == 'o' or self.b[i] == 'u':
            return 0
        if self.b[i] == 'y':
            # 'y' is a consonant at the start of the word or after a vowel.
            if i == self.k0:
                return 1
            else:
                return (not self.cons(i - 1))
        return 1

    def m(self):
        """m() measures the number of consonant sequences between k0 and j.
        If c is a consonant sequence and v a vowel sequence, and <..>
        indicates arbitrary presence,

           <c><v>       gives 0
           <c>vc<v>     gives 1
           <c>vcvc<v>   gives 2
           <c>vcvcvc<v> gives 3
           ....
        """
        n = 0
        i = self.k0
        # Skip the optional leading consonant sequence.
        while 1:
            if i > self.j:
                return n
            if not self.cons(i):
                break
            i = i + 1
        i = i + 1
        while 1:
            # Skip a vowel sequence; it must end in a consonant to count.
            while 1:
                if i > self.j:
                    return n
                if self.cons(i):
                    break
                i = i + 1
            i = i + 1
            n = n + 1
            # Skip the consonant sequence that completed this vc pair.
            while 1:
                if i > self.j:
                    return n
                if not self.cons(i):
                    break
                i = i + 1
            i = i + 1

    def vowelinstem(self):
        """vowelinstem() is TRUE <=> k0,...j contains a vowel"""
        for i in range(self.k0, self.j + 1):
            if not self.cons(i):
                return 1
        return 0

    def doublec(self, j):
        """doublec(j) is TRUE <=> j,(j-1) contain a double consonant."""
        if j < (self.k0 + 1):
            return 0
        if (self.b[j] != self.b[j-1]):
            return 0
        return self.cons(j)

    def cvc(self, i):
        """cvc(i) is TRUE <=> i-2,i-1,i has the form consonant - vowel - consonant
        and also if the second c is not w, x or y. This is used when trying to
        restore an e at the end of a short word, e.g.

           cav(e), lov(e), hop(e), crim(e), but
           snow, box, tray.
        """
        if i < (self.k0 + 2) or not self.cons(i) or self.cons(i-1) or not self.cons(i-2):
            return 0
        ch = self.b[i]
        if ch == 'w' or ch == 'x' or ch == 'y':
            return 0
        return 1

    def ends(self, s):
        """ends(s) is TRUE <=> k0,...k ends with the string s."""
        length = len(s)
        if s[length - 1] != self.b[self.k]: # tiny speed-up
            return 0
        if length > (self.k - self.k0 + 1):
            return 0
        if self.b[self.k-length+1:self.k+1] != s:
            return 0
        # On a match, j marks the last letter of the stem preceding s.
        self.j = self.k - length
        return 1

    def setto(self, s):
        """setto(s) sets (j+1),...k to the characters in the string s, readjusting k."""
        length = len(s)
        self.b = self.b[:self.j+1] + s + self.b[self.j+length+1:]
        self.k = self.j + length

    def r(self, s):
        """r(s) replaces the suffix matched by the last ends() call with s,
        provided the remaining stem has m() > 0. Used by step2() and step3().
        """
        if self.m() > 0:
            self.setto(s)

    def step1ab(self):
        """step1ab() gets rid of plurals and -ed or -ing. e.g.

           caresses  ->  caress
           ponies    ->  poni
           ties      ->  ti
           caress    ->  caress
           cats      ->  cat

           feed      ->  feed
           agreed    ->  agree
           disabled  ->  disable

           matting   ->  mat
           mating    ->  mate
           meeting   ->  meet
           milling   ->  mill
           messing   ->  mess

           meetings  ->  meet
        """
        if self.b[self.k] == 's':
            if self.ends("sses"):
                self.k = self.k - 2
            elif self.ends("ies"):
                self.setto("i")
            elif self.b[self.k - 1] != 's':
                self.k = self.k - 1
        if self.ends("eed"):
            if self.m() > 0:
                self.k = self.k - 1
        elif (self.ends("ed") or self.ends("ing")) and self.vowelinstem():
            self.k = self.j
            if self.ends("at"):   self.setto("ate")
            elif self.ends("bl"): self.setto("ble")
            elif self.ends("iz"): self.setto("ize")
            elif self.doublec(self.k):
                self.k = self.k - 1
                ch = self.b[self.k]
                if ch == 'l' or ch == 's' or ch == 'z':
                    self.k = self.k + 1
            elif (self.m() == 1 and self.cvc(self.k)):
                self.setto("e")

    def step1c(self):
        """step1c() turns terminal y to i when there is another vowel in the stem."""
        if (self.ends("y") and self.vowelinstem()):
            self.b = self.b[:self.k] + 'i' + self.b[self.k+1:]

    def step2(self):
        """step2() maps double suffixes to single ones.
        So -ization ( = -ize plus -ation) maps to -ize etc. Note that the
        string before the suffix must give m() > 0.
        """
        if self.b[self.k - 1] == 'a':
            if self.ends("ational"):   self.r("ate")
            elif self.ends("tional"):  self.r("tion")
        elif self.b[self.k - 1] == 'c':
            if self.ends("enci"):      self.r("ence")
            elif self.ends("anci"):    self.r("ance")
        elif self.b[self.k - 1] == 'e':
            if self.ends("izer"):      self.r("ize")
        elif self.b[self.k - 1] == 'l':
            if self.ends("bli"):       self.r("ble") # --DEPARTURE--
            # To match the published algorithm, replace this phrase with
            #   if self.ends("abli"):      self.r("able")
            elif self.ends("alli"):    self.r("al")
            elif self.ends("entli"):   self.r("ent")
            elif self.ends("eli"):     self.r("e")
            elif self.ends("ousli"):   self.r("ous")
        elif self.b[self.k - 1] == 'o':
            if self.ends("ization"):   self.r("ize")
            elif self.ends("ation"):   self.r("ate")
            elif self.ends("ator"):    self.r("ate")
        elif self.b[self.k - 1] == 's':
            if self.ends("alism"):     self.r("al")
            elif self.ends("iveness"): self.r("ive")
            elif self.ends("fulness"): self.r("ful")
            elif self.ends("ousness"): self.r("ous")
        elif self.b[self.k - 1] == 't':
            if self.ends("aliti"):     self.r("al")
            elif self.ends("iviti"):   self.r("ive")
            elif self.ends("biliti"):  self.r("ble")
        elif self.b[self.k - 1] == 'g': # --DEPARTURE--
            if self.ends("logi"):      self.r("log")
        # To match the published algorithm, delete this phrase

    def step3(self):
        """step3() deals with -ic-, -full, -ness etc. Similar strategy to step2."""
        if self.b[self.k] == 'e':
            if self.ends("icate"):     self.r("ic")
            elif self.ends("ative"):   self.r("")
            elif self.ends("alize"):   self.r("al")
        elif self.b[self.k] == 'i':
            if self.ends("iciti"):     self.r("ic")
        elif self.b[self.k] == 'l':
            if self.ends("ical"):      self.r("ic")
            elif self.ends("ful"):     self.r("")
        elif self.b[self.k] == 's':
            if self.ends("ness"):      self.r("")

    def step4(self):
        """step4() takes off -ant, -ence etc., in context <c>vcvc<v>."""
        if self.b[self.k - 1] == 'a':
            if self.ends("al"): pass
            else: return
        elif self.b[self.k - 1] == 'c':
            if self.ends("ance"): pass
            elif self.ends("ence"): pass
            else: return
        elif self.b[self.k - 1] == 'e':
            if self.ends("er"): pass
            else: return
        elif self.b[self.k - 1] == 'i':
            if self.ends("ic"): pass
            else: return
        elif self.b[self.k - 1] == 'l':
            if self.ends("able"): pass
            elif self.ends("ible"): pass
            else: return
        elif self.b[self.k - 1] == 'n':
            if self.ends("ant"): pass
            elif self.ends("ement"): pass
            elif self.ends("ment"): pass
            elif self.ends("ent"): pass
            else: return
        elif self.b[self.k - 1] == 'o':
            if self.ends("ion") and (self.b[self.j] == 's' \
                or self.b[self.j] == 't'): pass
            elif self.ends("ou"): pass
            # takes care of -ous
            else: return
        elif self.b[self.k - 1] == 's':
            if self.ends("ism"): pass
            else: return
        elif self.b[self.k - 1] == 't':
            if self.ends("ate"): pass
            elif self.ends("iti"): pass
            else: return
        elif self.b[self.k - 1] == 'u':
            if self.ends("ous"): pass
            else: return
        elif self.b[self.k - 1] == 'v':
            if self.ends("ive"): pass
            else: return
        elif self.b[self.k - 1] == 'z':
            if self.ends("ize"): pass
            else: return
        else:
            return
        if self.m() > 1:
            self.k = self.j
d8ac696cc51f helium_7.0-r14027
wbernard
parents:
diff changeset
   303
d8ac696cc51f helium_7.0-r14027
wbernard
parents:
diff changeset
   304
    def step5(self):
        """step5() removes a final -e if m() > 1, or if m() == 1 and the
        stem before the -e does not end consonant-vowel-consonant, and
        changes -ll to -l if m() > 1.
        """
        self.j = self.k
        if self.b[self.k] == 'e':
            a = self.m()
            if a > 1 or (a == 1 and not self.cvc(self.k - 1)):
                self.k = self.k - 1
        if self.b[self.k] == 'l' and self.doublec(self.k) and self.m() > 1:
            self.k = self.k - 1

    def stem(self, p, i, j):
        """In stem(p, i, j), p is a string (a char pointer in the original
        C version), and the text to be stemmed runs from p[i] to p[j]
        inclusive. Typically i is zero and j is the offset of the last
        character, len(p) - 1. The stemmer works out the new end-point k
        and returns the stemmed text p[i:k+1]. Stemming never increases
        word length, so i <= k <= j.
        """
        # copy the parameters into instance attributes
        self.b = p
        self.k = j
        self.k0 = i
        if self.k <= self.k0 + 1:
            return self.b # --DEPARTURE--

        # With this line, strings of length 1 or 2 don't go through the
        # stemming process, although no mention is made of this in the
        # published algorithm. Remove the line to match the published
        # algorithm. Note that p is returned whole here, not p[i:j+1].

        self.step1ab()
        self.step1c()
        self.step2()
        self.step3()
        self.step4()
        self.step5()
        return self.b[self.k0:self.k+1]
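
The step5() rule above can be tried on its own with a small standalone sketch. It is not part of this module: the function names `is_consonant`, `measure`, `ends_cvc` and the free function `step5` are assumptions made for illustration, and the sketch works on plain strings rather than on the `b`, `k0`, `k`, `j` attributes the class uses. It assumes lower-case input, as the stemmer itself does.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' counts as a consonant at the start of the word
        # or directly after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Number of vowel-run to consonant-run transitions, i.e. the m in
    [C](VC)^m[V]."""
    kinds = [is_consonant(word, i) for i in range(len(word))]
    # Each VC group contributes exactly one vowel -> consonant step.
    return sum(1 for prev, cur in zip(kinds, kinds[1:]) if not prev and cur)


def ends_cvc(stem):
    """True if stem ends consonant-vowel-consonant and the final
    consonant is not w, x or y."""
    n = len(stem)
    if n < 3:
        return False
    if not (is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)):
        return False
    return stem[-1] not in "wxy"


def step5(word):
    """Standalone sketch of the step 5 rule on a plain string."""
    # As in the class, the measure is taken over the word as it was
    # when step 5 started (self.j = self.k).
    m = measure(word)
    if word.endswith("e"):
        if m > 1 or (m == 1 and not ends_cvc(word[:-1])):
            word = word[:-1]
    if word.endswith("ll") and m > 1:
        word = word[:-1]
    return word


if __name__ == "__main__":
    for w in ("probate", "rate", "cease", "controll", "roll"):
        print(w, "->", step5(w))
```

The sample words are the ones Porter's paper uses for this step: "probate" and "cease" lose their final -e, "rate" keeps it because "rat" ends consonant-vowel-consonant, and "controll" loses one "l" while "roll" does not because its measure is only 1.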