Merge remote branch 'quesada/master' into dedan
Conflicts:
	.gitignore
	CHANGELOG.txt
	MANIFEST.in
	README.txt
	TODO.txt
	docs/_sources/apiref.txt
	docs/_sources/dist_lda.txt
	docs/_sources/dist_lsi.txt
	docs/_sources/distributed.txt
	docs/_sources/index.txt
	docs/_sources/install.txt
	docs/_sources/intro.txt
	docs/_sources/models/models.txt
	docs/_sources/tut1.txt
	docs/_sources/tut2.txt
	docs/_sources/tut3.txt
	docs/_sources/tutorial.txt
	docs/_sources/utils.txt
	docs/_sources/wiki.txt
	docs/_static/default.css
	docs/apiref.html
	docs/corpora/bleicorpus.html
	docs/corpora/corpora.html
	docs/corpora/dictionary.html
	docs/corpora/dmlcorpus.html
	docs/corpora/lowcorpus.html
	docs/corpora/mmcorpus.html
	docs/corpora/svmlightcorpus.html
	docs/corpora/wikicorpus.html
	docs/dist_lda.html
	docs/dist_lsi.html
	docs/distributed.html
	docs/genindex.html
	docs/index.html
	docs/install.html
	docs/interfaces.html
	docs/intro.html
	docs/matutils.html
	docs/models/lda_dispatcher.html
	docs/models/lda_worker.html
	docs/models/ldamodel.html
	docs/models/lsi_dispatcher.html
	docs/models/lsi_worker.html
	docs/models/lsimodel.html
	docs/models/models.html
	docs/models/rpmodel.html
	docs/models/tfidfmodel.html
	docs/modindex.html
	docs/objects.inv
	docs/py-modindex.html
	docs/search.html
	docs/searchindex.js
	docs/similarities/docsim.html
	docs/src/_static/default.css
	docs/src/_templates/indexsidebar.html
	docs/src/_templates/layout.html
	docs/src/apiref.rst
	docs/src/conf.py
	docs/src/dist_lda.rst
	docs/src/dist_lsi.rst
	docs/src/distributed.rst
	docs/src/index.rst
	docs/src/install.rst
	docs/src/intro.rst
	docs/src/models/models.rst
	docs/src/tut1.rst
	docs/src/tut2.rst
	docs/src/tut3.rst
	docs/src/tutorial.rst
	docs/src/utils.rst
	docs/src/wiki.rst
	docs/tut1.html
	docs/tut2.html
	docs/tut3.html
	docs/tutorial.html
	docs/utils.html
	docs/wiki.html
	ez_setup.py
	setup.py
	src/gensim/__init__.py
	src/gensim/corpora/__init__.py
	src/gensim/corpora/bleicorpus.py
	src/gensim/corpora/dictionary.py
	src/gensim/corpora/dmlcorpus.py
	src/gensim/corpora/lowcorpus.py
	src/gensim/corpora/mmcorpus.py
	src/gensim/corpora/sources.py
	src/gensim/corpora/svmlightcorpus.py
	src/gensim/corpora/wikicorpus.py
	src/gensim/dmlcz/gensim_build.py
	src/gensim/dmlcz/gensim_genmodel.py
	src/gensim/dmlcz/gensim_xml.py
	src/gensim/dmlcz/runall.sh
	src/gensim/interfaces.py
	src/gensim/matutils.py
	src/gensim/models/__init__.py
	src/gensim/models/lda_dispatcher.py
	src/gensim/models/lda_worker.py
	src/gensim/models/ldamodel.py
	src/gensim/models/lsi_dispatcher.py
	src/gensim/models/lsi_worker.py
	src/gensim/models/lsimodel.py
	src/gensim/models/rpmodel.py
	src/gensim/models/tfidfmodel.py
	src/gensim/similarities/docsim.py
	src/gensim/test/test_corpora.py
	src/gensim/test/test_models.py
	src/gensim/test/testcorpus.low
	src/gensim/test/testcorpus.mm
	src/gensim/test/testcorpus.svmlight
	src/gensim/utils.py
piskvorky committed Mar 11, 2011
2 parents 74c3aa1 + 5e150aa commit d96f100
Showing 41 changed files with 33,099 additions and 47 deletions.
7 changes: 7 additions & 0 deletions .gitignore
@@ -41,3 +41,10 @@ Thumbs.db
.pydevproject
.settings/
docs/src/_build/
gensim.egg-info
*,cover
.idea
*.dict
*.index
.coverage
data
10 changes: 10 additions & 0 deletions MANIFEST.in
@@ -1,7 +1,17 @@
<<<<<<< HEAD
recursive-include docs *
recursive-include src/gensim/test testcorpus*
recursive-include src *.sh
prune docs/src*
include COPYING
include COPYING.LESSER
include ez_setup.py
=======
recursive-include docs *
recursive-include src/gensim/test testcorpus*
recursive-include src *.sh
prune docs/src*
include COPYING
include COPYING.LESSER
include ez_setup.py
>>>>>>> quesada/master
60 changes: 60 additions & 0 deletions README.git.txt
@@ -0,0 +1,60 @@
This is my working version of gensim. I keep it synchronized with the upstream
svn repository at assembla.
I have added some functional tests and utility functions to it, but the main
reason I'm using the library is to replicate Explicit Semantic Analysis (ESA)
(Gabrilovich & Markovitch, 2006, 2007b, 2009).

For other implementations try:
C#: http://www.srcco.de/v/wikipedia-esa
Java: airhead research library. However, the lack of sparse matrix support in
Java linear algebra libraries makes Java a poor choice.

Currently (as of 27 Aug 2010), gensim can parse Wikipedia from XML wiki dumps quite efficiently.
However, our ESA code uses a different parsing that we did earlier (following the
method section of the paper).

Here we use a parsing from March 2008.

Our parsings have three advantages:
1- They consider centrality measures, which is currently not easy to do with
the XML dumps directly
2-
3- We did unsupervised named entity recognition (NER) parsing using OpenNLP.
This is parallelized on 8 cores using Java code, see ri.larkc.eu:8087/tools.
We could have used

NOTE:
Because example corpora are big, the repository ignores the data folder. Our
parsing is available online at: (TODO)
download it and place it under (TODO)

folder structure:

/acme
contains my working scripts

/data/corpora
contains corpora.

/parsing
tfidf/preprocessing/porter in /parsing adapted from Mathieu Blondel:
git clone http://www.mblondel.org/code/tfidf.git

how to replicate the paper
--------------------------
code is in /acme/lee-wiki

First you need to create the tfidf space.
To do so, set the flag createCorpus = True.
The corpus creation takes about 1 hr, with profuse logging.
This is faster than parsing the corpus from XML (about 16 hrs) because we do not
do any XML filtering, stopword removal, etc. (that is already done in the .cor file).

Once the sparse matrix is on disk, it's faster to read the serialized objects than to
parse the corpus again.
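
The build-once, reload-later workflow described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in /acme/lee-wiki: the helper `load_or_build`, the stand-in `toy_build` and the cache file name are hypothetical, and `pickle` is assumed here as the serialization format.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build, create_corpus=False):
    """Return the object cached at cache_path; rebuild it only if asked or missing."""
    if create_corpus or not os.path.exists(cache_path):
        obj = build()  # slow path: stands in for the long corpus creation
        with open(cache_path, "wb") as fout:
            pickle.dump(obj, fout, protocol=pickle.HIGHEST_PROTOCOL)
        return obj
    with open(cache_path, "rb") as fin:  # fast path: read the serialized object
        return pickle.load(fin)


def toy_build():
    # Hypothetical stand-in for the tfidf space: sparse rows as {term_id: weight}.
    return [{0: 0.5, 3: 1.2}, {1: 0.7}]


cache_file = os.path.join(tempfile.mkdtemp(), "tfidf_space.pkl")
first = load_or_build(cache_file, toy_build, create_corpus=True)  # builds and saves
second = load_or_build(cache_file, toy_build)                     # loads from disk
```

The flag mirrors createCorpus: set it to True once to force a rebuild, then leave it False so later runs only read the cached file.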

References:
------------
E. Gabrilovich and S. Markovitch (2009) "Wikipedia-based Semantic Interpretation
for Natural Language Processing", Journal of Artificial Intelligence Research, Volume 34, pages 443-498,
doi:10.1613/jair.2669
8 changes: 8 additions & 0 deletions parsing/__init__.py
@@ -0,0 +1,8 @@
"""
This package contains functions to preprocess raw text
"""

# bring the preprocessing helpers directly into the package namespace, to save some typing
# (explicit relative imports work on Python 2.5+ and Python 3)
from .porter import PorterStemmer
from .tfidf import tfidf
from .preprocessing import *
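
For readers unfamiliar with the tf-idf weighting that this package exposes, here is a minimal standalone sketch. It is not the code adapted from Mathieu Blondel and does not use the `tfidf` function imported above; the function name `tfidf_vectors` and the particular weighting (raw term count times log(N / document frequency)) are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map a list of token lists to a list of sparse {token: weight} dicts."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term counts within this document
        vectors.append(
            {tok: count * math.log(n_docs / df[tok]) for tok, count in tf.items()}
        )
    return vectors


docs = [["wiki", "semantic", "wiki"], ["semantic", "analysis"]]
vecs = tfidf_vectors(docs)
```

Tokens that occur in every document (here "semantic") get weight zero under this variant, while tokens concentrated in few documents are weighted up.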