[SPARK-14050][ML] Add multiple languages support and additional methods for Stop Words Remover

## What changes were proposed in this pull request?

This PR continues the work from apache#11871 with the following changes:
* load English stopwords as default
* convert stopwords to list in Python
* update some tests and doc

## How was this patch tested?

Unit tests.

Closes apache#11871

cc: burakkose srowen

Author: Burak Köse <[email protected]>
Author: Xiangrui Meng <[email protected]>
Author: Burak KOSE <[email protected]>

Closes apache#12843 from mengxr/SPARK-14050.
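The "load English stopwords as default" behaviour can be pictured with a minimal pure-Python sketch. This is not the Spark implementation: the names `DEFAULT_ENGLISH_STOP_WORDS` and `remove_stop_words` and the `case_sensitive` flag are hypothetical, and the default list is only a small excerpt of the `english.txt` file added in this commit.

```python
from typing import Iterable, List, Optional

# Hypothetical stand-in for the bundled English list (small excerpt only).
DEFAULT_ENGLISH_STOP_WORDS = ["i", "me", "my", "the", "a", "an", "and", "is", "of"]


def remove_stop_words(
    tokens: Iterable[str],
    stop_words: Optional[Iterable[str]] = None,
    case_sensitive: bool = False,
) -> List[str]:
    """Return the tokens with stop words filtered out, preserving order."""
    # Fall back to the English list when the caller supplies none.
    words = list(DEFAULT_ENGLISH_STOP_WORDS if stop_words is None else stop_words)
    if case_sensitive:
        blocked = set(words)
        return [t for t in tokens if t not in blocked]
    # Case-insensitive matching: compare lowercased forms on both sides.
    blocked = {w.lower() for w in words}
    return [t for t in tokens if t.lower() not in blocked]
```

Calling `remove_stop_words(tokens)` with no list uses the English default, while passing `stop_words=[...]` swaps in another language's list without changing the filtering logic.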
Showing 20 changed files with 2,614 additions and 87 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
PostgreSQL Database Management System | ||
(formerly known as Postgres, then as Postgres95) | ||
| ||
Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group | ||
| ||
Portions Copyright (c) 1994, The Regents of the University of California | ||
| ||
Permission to use, copy, modify, and distribute this software and its | ||
documentation for any purpose, without fee, and without a written agreement | ||
is hereby granted, provided that the above copyright notice and this | ||
paragraph and the following two paragraphs appear in all copies. | ||
| ||
IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY FOR | ||
DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING | ||
LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS | ||
DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED OF THE | ||
POSSIBILITY OF SUCH DAMAGE. | ||
| ||
THE UNIVERSITY OF CALIFORNIA SPECIFICALLY DISCLAIMS ANY WARRANTIES, | ||
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY | ||
AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS | ||
ON AN "AS IS" BASIS, AND THE UNIVERSITY OF CALIFORNIA HAS NO OBLIGATIONS TO | ||
PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. | ||
| ||
12 changes: 12 additions & 0 deletions
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/README
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
Stopwords Corpus | ||
| ||
This corpus contains lists of stop words for several languages. These | ||
are high-frequency grammatical words which are usually ignored in text | ||
retrieval applications. | ||
| ||
They were obtained from: | ||
http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/ | ||
| ||
The English list has been augmented | ||
https://github.com/nltk/nltk_data/issues/22 | ||
| ||
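The per-language files that follow hold one stop word per line. A minimal sketch of reading such a file might look like the following; the function name `load_stop_words` is hypothetical, and UTF-8 encoding is an assumption based on the non-ASCII entries (for example `på`), not something the README states.

```python
from pathlib import Path
from typing import List


def load_stop_words(path: str) -> List[str]:
    """Read a stop-word file containing one word per line."""
    # Assumption: the file is UTF-8 encoded.
    text = Path(path).read_text(encoding="utf-8")
    # Strip surrounding whitespace and skip blank lines.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Returning a plain list keeps the order of the file, which is convenient when the words are later passed on as a parameter value.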
94 changes: 94 additions & 0 deletions
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/danish.txt
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,94 @@ | ||
og | ||
i | ||
jeg | ||
det | ||
at | ||
en | ||
den | ||
til | ||
er | ||
som | ||
på | ||
de | ||
med | ||
han | ||
af | ||
for | ||
ikke | ||
der | ||
var | ||
mig | ||
sig | ||
men | ||
et | ||
har | ||
om | ||
vi | ||
min | ||
havde | ||
ham | ||
hun | ||
nu | ||
over | ||
da | ||
fra | ||
du | ||
ud | ||
sin | ||
dem | ||
os | ||
op | ||
man | ||
hans | ||
hvor | ||
eller | ||
hvad | ||
skal | ||
selv | ||
her | ||
alle | ||
vil | ||
blev | ||
kunne | ||
ind | ||
når | ||
være | ||
dog | ||
noget | ||
ville | ||
jo | ||
deres | ||
efter | ||
ned | ||
skulle | ||
denne | ||
end | ||
dette | ||
mit | ||
også | ||
under | ||
have | ||
dig | ||
anden | ||
hende | ||
mine | ||
alt | ||
meget | ||
sit | ||
sine | ||
vor | ||
mod | ||
disse | ||
hvis | ||
din | ||
nogle | ||
hos | ||
blive | ||
mange | ||
ad | ||
bliver | ||
hendes | ||
været | ||
thi | ||
jer | ||
sådan | ||
101 changes: 101 additions & 0 deletions
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/dutch.txt
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,101 @@ | ||
de | ||
en | ||
van | ||
ik | ||
te | ||
dat | ||
die | ||
in | ||
een | ||
hij | ||
het | ||
niet | ||
zijn | ||
is | ||
was | ||
op | ||
aan | ||
met | ||
als | ||
voor | ||
had | ||
er | ||
maar | ||
om | ||
hem | ||
dan | ||
zou | ||
of | ||
wat | ||
mijn | ||
men | ||
dit | ||
zo | ||
door | ||
over | ||
ze | ||
zich | ||
bij | ||
ook | ||
tot | ||
je | ||
mij | ||
uit | ||
der | ||
daar | ||
haar | ||
naar | ||
heb | ||
hoe | ||
heeft | ||
hebben | ||
deze | ||
u | ||
want | ||
nog | ||
zal | ||
me | ||
zij | ||
nu | ||
ge | ||
geen | ||
omdat | ||
iets | ||
worden | ||
toch | ||
al | ||
waren | ||
veel | ||
meer | ||
doen | ||
toen | ||
moet | ||
ben | ||
zonder | ||
kan | ||
hun | ||
dus | ||
alles | ||
onder | ||
ja | ||
eens | ||
hier | ||
wie | ||
werd | ||
altijd | ||
doch | ||
wordt | ||
wezen | ||
kunnen | ||
ons | ||
zelf | ||
tegen | ||
na | ||
reeds | ||
wil | ||
kon | ||
niets | ||
uw | ||
iemand | ||
geweest | ||
andere | ||
153 changes: 153 additions & 0 deletions
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/english.txt
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,153 @@ | ||
i | ||
me | ||
my | ||
myself | ||
we | ||
our | ||
ours | ||
ourselves | ||
you | ||
your | ||
yours | ||
yourself | ||
yourselves | ||
he | ||
him | ||
his | ||
himself | ||
she | ||
her | ||
hers | ||
herself | ||
it | ||
its | ||
itself | ||
they | ||
them | ||
their | ||
theirs | ||
themselves | ||
what | ||
which | ||
who | ||
whom | ||
this | ||
that | ||
these | ||
those | ||
am | ||
is | ||
are | ||
was | ||
were | ||
be | ||
been | ||
being | ||
have | ||
has | ||
had | ||
having | ||
do | ||
does | ||
did | ||
doing | ||
a | ||
an | ||
the | ||
and | ||
but | ||
if | ||
or | ||
because | ||
as | ||
until | ||
while | ||
of | ||
at | ||
by | ||
for | ||
with | ||
about | ||
against | ||
between | ||
into | ||
through | ||
during | ||
before | ||
after | ||
above | ||
below | ||
to | ||
from | ||
up | ||
down | ||
in | ||
out | ||
on | ||
off | ||
over | ||
under | ||
again | ||
further | ||
then | ||
once | ||
here | ||
there | ||
when | ||
where | ||
why | ||
how | ||
all | ||
any | ||
both | ||
each | ||
few | ||
more | ||
most | ||
other | ||
some | ||
such | ||
no | ||
nor | ||
not | ||
only | ||
own | ||
same | ||
so | ||
than | ||
too | ||
very | ||
s | ||
t | ||
can | ||
will | ||
just | ||
don | ||
should | ||
now | ||
d | ||
ll | ||
m | ||
o | ||
re | ||
ve | ||
y | ||
ain | ||
aren | ||
couldn | ||
didn | ||
doesn | ||
hadn | ||
hasn | ||
haven | ||
isn | ||
ma | ||
mightn | ||
mustn | ||
needn | ||
shan | ||
shouldn | ||
wasn | ||
weren | ||
won | ||
wouldn | ||