[SPARK-14050][ML] Add multiple languages support and additional methods for Stop Words Remover

## What changes were proposed in this pull request?

This PR continues the work from apache#11871 with the following changes:
* load English stopwords as default
* convert stopwords to list in Python
* update some tests and doc
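
For illustration only, the first two points can be sketched in plain Python without Spark. This is a minimal sketch, not the code in this patch: `DEFAULT_STOP_WORDS`, `load_default_stop_words`, and `remove_stop_words` are hypothetical names, and the word lists are short excerpts of the resource files added in this commit.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical, heavily abbreviated excerpts of the per-language resource
# files added in this commit; the real lists are much longer.
DEFAULT_STOP_WORDS: Dict[str, List[str]] = {
    "english": ["i", "me", "my", "the", "a", "an", "and", "is"],
    "danish": ["og", "i", "jeg", "det", "at", "en"],
    "dutch": ["de", "en", "van", "ik", "te", "dat"],
}


def load_default_stop_words(language: str = "english") -> List[str]:
    """Return the stop words for `language` as a plain Python list.

    English is the default, mirroring "load English stopwords as default".
    """
    try:
        # Copy into a new list so callers cannot mutate the shared defaults.
        return list(DEFAULT_STOP_WORDS[language])
    except KeyError:
        raise ValueError(f"no default stop words for language: {language!r}")


def remove_stop_words(
    tokens: Iterable[str],
    stop_words: Optional[Iterable[str]] = None,
    case_sensitive: bool = False,
) -> List[str]:
    """Drop every token that appears in `stop_words`, keeping token order."""
    words = load_default_stop_words() if stop_words is None else list(stop_words)
    if case_sensitive:
        lookup = set(words)
        return [t for t in tokens if t not in lookup]
    # Case-insensitive matching: compare lowercased forms on both sides.
    lookup = {w.lower() for w in words}
    return [t for t in tokens if t.lower() not in lookup]


if __name__ == "__main__":
    print(remove_stop_words(["The", "quick", "fox", "is", "a", "hunter"]))
    print(remove_stop_words(["jeg", "læser"], load_default_stop_words("danish")))
```

Returning a fresh `list` rather than a shared or lazily evaluated collection keeps the defaults safe from accidental mutation and makes the result easy to extend with custom words.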

## How was this patch tested?

Unit tests.

Closes apache#11871

cc: burakkose srowen

Author: Burak Köse <[email protected]>
Author: Xiangrui Meng <[email protected]>
Author: Burak KOSE <[email protected]>

Closes apache#12843 from mengxr/SPARK-14050.
burakkose authored and mengxr committed May 6, 2016
1 parent 5c8fad7 commit e20cd9f
Showing 20 changed files with 2,614 additions and 87 deletions.
24 changes: 24 additions & 0 deletions licenses/LICENSE-postgresql.txt
@@ -0,0 +1,24 @@
PostgreSQL Database Management System
(formerly known as Postgres, then as Postgres95)

Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group

Portions Copyright (c) 1994, The Regents of the University of California

Permission to use, copy, modify, and distribute this software and its
documentation for any purpose, without fee, and without a written agreement
is hereby granted, provided that the above copyright notice and this
paragraph and the following two paragraphs appear in all copies.

IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY FOR
DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.

THE UNIVERSITY OF CALIFORNIA SPECIFICALLY DISCLAIMS ANY WARRANTIES,
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS
ON AN "AS IS" BASIS, AND THE UNIVERSITY OF CALIFORNIA HAS NO OBLIGATIONS TO
PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.

@@ -0,0 +1,12 @@
Stopwords Corpus

This corpus contains lists of stop words for several languages. These
are high-frequency grammatical words which are usually ignored in text
retrieval applications.

They were obtained from:
http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/

The English list has been augmented
https://github.com/nltk/nltk_data/issues/22

@@ -0,0 +1,94 @@
og
i
jeg
det
at
en
den
til
er
som
de
med
han
af
for
ikke
der
var
mig
sig
men
et
har
om
vi
min
havde
ham
hun
nu
over
da
fra
du
ud
sin
dem
os
op
man
hans
hvor
eller
hvad
skal
selv
her
alle
vil
blev
kunne
ind
når
være
dog
noget
ville
jo
deres
efter
ned
skulle
denne
end
dette
mit
også
under
have
dig
anden
hende
mine
alt
meget
sit
sine
vor
mod
disse
hvis
din
nogle
hos
blive
mange
ad
bliver
hendes
været
thi
jer
sådan
@@ -0,0 +1,101 @@
de
en
van
ik
te
dat
die
in
een
hij
het
niet
zijn
is
was
op
aan
met
als
voor
had
er
maar
om
hem
dan
zou
of
wat
mijn
men
dit
zo
door
over
ze
zich
bij
ook
tot
je
mij
uit
der
daar
haar
naar
heb
hoe
heeft
hebben
deze
u
want
nog
zal
me
zij
nu
ge
geen
omdat
iets
worden
toch
al
waren
veel
meer
doen
toen
moet
ben
zonder
kan
hun
dus
alles
onder
ja
eens
hier
wie
werd
altijd
doch
wordt
wezen
kunnen
ons
zelf
tegen
na
reeds
wil
kon
niets
uw
iemand
geweest
andere
@@ -0,0 +1,153 @@
i
me
my
myself
we
our
ours
ourselves
you
your
yours
yourself
yourselves
he
him
his
himself
she
her
hers
herself
it
its
itself
they
them
their
theirs
themselves
what
which
who
whom
this
that
these
those
am
is
are
was
were
be
been
being
have
has
had
having
do
does
did
doing
a
an
the
and
but
if
or
because
as
until
while
of
at
by
for
with
about
against
between
into
through
during
before
after
above
below
to
from
up
down
in
out
on
off
over
under
again
further
then
once
here
there
when
where
why
how
all
any
both
each
few
more
most
other
some
such
no
nor
not
only
own
same
so
than
too
very
s
t
can
will
just
don
should
now
d
ll
m
o
re
ve
y
ain
aren
couldn
didn
doesn
hadn
hasn
haven
isn
ma
mightn
mustn
needn
shan
shouldn
wasn
weren
won
wouldn