A Stop List For General Text
1. Introduction
A stop list, or negative dictionary, is a device used in automatic indexing to filter out words that would
make poor index terms [1]. Traditionally [2], stop lists are supposed to have included only the most frequently
occurring words. In practice, however, stop lists have tended to include infrequently occurring words, and
have not included many frequently occurring words. Infrequently occurring words seem to have been
included because stop list compilers have not, for whatever reason, consulted empirical studies of word
frequencies. Frequently occurring words seem to have been left out for the same reason, and also because
many of them might still be important as index terms.
This paper reports an exercise in generating a stop list for general text based on the Brown corpus [3] of
1,014,000 words drawn from a broad range of literature in English. We start with a list of tokens occurring
more than 300 times in the Brown corpus. From this list of 278 words, 32 are culled on the grounds that
they are too important as potential index terms. Twenty-six words are then added to the list in the belief
that they may occur very frequently in certain kinds of literature. Finally, 149 words are added to the list
because the finite-state-machine-based filter in which this list is intended to be used is able to filter them at
almost no cost. The final product is a list of 421 stop words that should be maximally efficient and
effective in filtering the most frequently occurring and semantically neutral words in general literature in
English.
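The construction just described (threshold by frequency, cull, then add) can be sketched as follows. This is only an illustrative sketch: the function name `build_stop_list`, its parameters, and the toy token list are assumptions made for the example and are not taken from the paper, and the toy data is not the Brown corpus.

```python
from collections import Counter


def build_stop_list(tokens, threshold, culled, added):
    """Hypothetical helper: frequency candidates, minus culled, plus added."""
    counts = Counter(t.lower() for t in tokens)
    # Step 1: keep every token occurring more than `threshold` times.
    candidates = {w for w, n in counts.items() if n > threshold}
    # Step 2: cull words judged too important as index terms.
    # Step 3: add words chosen on other grounds.
    return sorted((candidates - set(culled)) | set(added))


# Toy input standing in for a corpus (assumption, for illustration only).
tokens = "the cat and the dog and the bird saw the man and a man".split()
stop_words = build_stop_list(tokens, threshold=1,
                             culled={"man"}, added={"a", "an"})
print(stop_words)

# Size bookkeeping using the counts reported in the text.
assert 278 - 32 + 26 + 149 == 421
```

The final assertion simply checks that the counts given above are mutually consistent: 278 candidates, less 32 culled, plus 26 and 149 added, gives 421.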
potential index terms to be filtered from our indexes. Choice of these words was again an arbitrary
decision, although we suspect that our choices will not be controversial. Thirty-two words were culled
from the list in this way; they are listed in alphabetical order in Appendix B, along with their frequency of
occurrence and their rank in the original list.
Altogether, this list has 149 words in it. These words are listed in alphabetical order in Appendix D. With
these words, the stop list is brought to its final size of 421 words. A minimal DFA recognizer for this list
will contain 317 states and 552 arcs, which is remarkably small considering that the list contains 421 words
and 2450 characters.
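To illustrate why related word forms are cheap for a DFA recognizer, the sketch below builds a trie and merges equivalent states bottom-up, which yields a minimal acyclic DFA (without a dead state). The paper does not say how its recognizer was constructed, so this is one standard approach, not the author's method; the names `build_trie`, `minimize`, and `dfa_size` are hypothetical. The figures of 317 states and 552 arcs cannot be reproduced here because the full list is not included in this excerpt.

```python
def build_trie(words):
    """Return a trie as nested dicts; the key '' marks an accepting state."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True
    return root


def minimize(node, registry):
    """Merge equivalent subtrees bottom-up; return the canonical state id."""
    final = "" in node
    arcs = tuple(sorted((ch, minimize(child, registry))
                        for ch, child in node.items() if ch != ""))
    key = (final, arcs)
    if key not in registry:
        registry[key] = len(registry)
    return registry[key]


def dfa_size(words):
    """Return (states, arcs) of the minimal acyclic DFA accepting `words`."""
    registry = {}
    minimize(build_trie(words), registry)
    states = len(registry)
    arcs = sum(len(state_arcs) for _, state_arcs in registry)
    return states, arcs


base = ["ask", "back"]
extended = base + ["asked", "asking", "asks", "backed", "backing", "backs"]
print(dfa_size(base))      # (6, 6)
print(dfa_size(extended))  # (10, 12)
```

On this toy input, six extra words cost only four extra states and six extra arcs, because "ask" and "back" share the same set of endings and the merged automaton stores that set once.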
6. Conclusion
Selecting a stop word list is more difficult than it appears, especially if its members are chosen based on
empirical data about word usage in English. The stop list that we have generated can serve as the basis for
stop lists for specialized data bases, or as a list for general English literature.
a 23073 5
about 1816 59
above 296 - extra fluff
across 282 - extra fluff
after 1070 89
again 580 156
against 627 142
all 3002 37
almost 433 189
alone 195 - cheap with "along"
along 355 246
already 274 - extra fluff
also 1070 88
although 323 261
always 456 180
among 369 235
an 3727 29
and 28872 3
another 690 130
any 1348 76
anybody 45 - (body, one, thing, where) endings
anyone 146 - (body, one, thing, where) endings
anything 280 - (body, one, thing, where) endings
anywhere 39 - (body, one, thing, where) endings
are 4372 25
area 318 265
areas 236 - cheap with "area"
around 567 158
as 7254 14
ask 128 - (ed, ing, s) endings
asked 397 207
asking 67 - (ed, ing, s) endings
asks 18 - (ed, ing, s) endings
at 5377 19
away 458 179
b 140 - free
back 936 98
backed 24 - (ed, ing, s) endings
backing 8 - (ed, ing, s) endings
backs 15 - (ed, ing, s) endings
be 6361 18
because 883 104
become 359 242
becomes 104 - cheap with "become"
became 246 - cheap with "become"
been 2470 46
before 1018 92
began 312 272
behind 258 - extra fluff
facts 87 - cheap with "fact"
far 426 192
felt 357 245
few 601 151
find 397 205
finds 59 - cheap with "find"
first 1361 74
for 9495 11
four 348 248
from 4371 26
full 230 - extra fluff
fully 80 - (ly) ending
further 218 - extra fluff
furthered 3 - (ed, ing, s) endings
furthering 2 - (ed, ing, s) endings
furthers 0 - (ed, ing, s) endings
g 55 - free
gave 285 - extra fluff
general 414 197
generally 132 - (ly) ending
get 742 124
gets 66 - cheap with "get"
give 387 212
given 375 226
gives 114 - cheap with "give"
go 613 149
going 395 209
good 789 116
goods 57 - cheap with "good"
got 338 253
great 615 145
greater 188 - (er, est) endings
greatest 88 - (er, est) endings
group 382 216
grouped 5 - (ed, ing, s) endings
grouping 4 - (ed, ing, s) endings
groups 125 - (ed, ing, s) endings
h 80 - free
had 5133 23
has 2430 47
have 3925 28
having 279 - extra fluff
he 9500 10
her 3032 36
herself 125 - cheap with "her"
here 761 122
high 499 168
higher 161 - (er, est) endings
highest 4 - (er, est) endings
him 2572 45
himself 596 152
his 6891 16
how 823 113
however 552 160
i 5149 21
if 2199 51
important 369 233
in 21337 6
interest 324 260
interested 101 - (ed, ing, s) endings
interesting 82 - (ed, ing, s) endings
interests 83 - (ed, ing, s) endings
into 1784 61
is 10066 8
it 8730 12
its 1854 58
itself 304 280
j 130 - free
just 872 106
k 30 - free
keep 260 - extra fluff
keeps 21 - cheap with "keep"
kind 312 270
knew 394 211
know 674 136
known 245 - extra fluff
knows 99 - cheap with "know"
l 60 - free
large 361 238
largely 68 - (ly) ending
last 676 135
later 397 204
latest 35 - (er, est) endings
least 343 251
less 337 254
let 377 222
lets 5 - cheap with "let"
like 1294 80
likely 151 - (ly) ending
long 753 123
longer 193 - (er, est) endings
longest 6 - (er, est) endings
m 70 - free
made 1110 87
make 794 115
making 231 - extra fluff
man 1281 81
many 1027 91
may 1398 72
me 1173 84
member 137 - free
members 318 264
men 770 120
might 672 137
more 2202 50
most 1159 86
mostly 44 - (ly) ending
mr 833 111
mrs 536 162
much 937 97
must 1013 93
my 1306 78
myself 129 - cheap with "my"
n 35 - free
necessary 222 - extra fluff
need 360 239
needed 187 - (ed, ing, s) endings
needing 5 - (ed, ing, s) endings
needs 59 - (ed, ing, s) endings
never 697 129
new 1635 66
newer 20 - (er, est) endings
newest 15 - (er, est) endings
next 395 208
no 2143 52
non 146 - cheap with "not"
not 6976 15
nobody 79 - (body, one, thing, where) endings
noone 0 - (body, one, thing, where) endings
nothing 412 198
now 1314 77
nowhere 30 - (body, one, thing, where) endings
number 472 175
numbers 125 - cheap with "number"
o 35 - free
of 36432 2
off 639 140
often 369 232
old 561 159
older 93 - (er, est) endings
oldest 14 - (er, est) endings
on 6742 17
once 499 167
one 3298 33
only 1748 64
open 318 263
opened 131 - (ed, ing, s) endings
opening 83 - (ed, ing, s) endings
opens 16 - (ed, ing, s) endings
or 4204 27
order 376 223
ordered 69 - (ed, ing, s) endings
ordering 13 - (ed, ing, s) endings
orders 58 - (ed, ing, s) endings
other 1701 65
others 325 259
our 1233 83
out 2082 53
over 1235 82
p 120 - free
part 499 166
parted 5 - (ed, ing, s) endings
parting 3 - (ed, ing, s) endings
parts 113 - (ed, ing, s) endings
per 371 230
w 90 - free
want 328 257
wanted 226 - (ed, ing, s) endings
wanting 16 - (ed, ing, s) endings
wants 72 - (ed, ing, s) endings
was 9806 9
way 910 100
ways 128 - cheap with "way"
we 2628 44
well 897 102
wells 6 - cheap with "well"
went 508 165
were 3283 34
what 1961 55
when 2333 49
where 946 96
whether 286 - extra fluff
which 3560 32
while 680 133
who 2380 48
whole 309 278
whose 251 - extra fluff
why 404 201
will 2798 40
with 7289 13
within 359 241
without 583 155
work 763 121
worked 128 - (ed, ing, s) endings
working 151 - (ed, ing, s) endings
works 130 - (ed, ing, s) endings
would 3062 35
y 20 - free
year 699 128
years 957 95
yet 283 - extra fluff
you 3668 30
young 378 219
younger 41 - (er, est) endings
youngest 14 - (er, est) endings
your 912 99
yours 25 - cheap with "your"
REFERENCES