Wikipedia2Vec

Wikipedia2Vec is a tool for obtaining embeddings (i.e., vector representations) of words and entities (i.e., concepts that have corresponding pages in Wikipedia) from Wikipedia. It is developed and maintained by Studio Ousia.

This tool learns embeddings of words and entities simultaneously and places similar words and entities close to one another in a continuous vector space. Embeddings can be trained with a single command that takes a publicly available Wikipedia dump as input.
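To make "close to one another in a continuous vector space" concrete, here is a minimal sketch that compares toy vectors with cosine similarity. The three-dimensional vectors, the labels, and the choice of cosine similarity as the closeness measure are illustrative assumptions, not values or behavior taken from Wikipedia2Vec itself.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, vectors):
    """Return the label (other than `query`) whose vector is most similar to `query`'s."""
    candidates = (label for label in vectors if label != query)
    return max(candidates, key=lambda label: cosine_similarity(vectors[query], vectors[label]))


# Hypothetical toy embeddings: one word, one entity, and one unrelated word.
toy_vectors = {
    "word:tokyo": [0.9, 0.1, 0.0],
    "entity:Tokyo": [0.8, 0.2, 0.1],
    "word:banana": [0.0, 0.1, 0.9],
}

print(nearest("word:tokyo", toy_vectors))
```

Because words and entities share one space, a word can be compared directly with an entity, as in the lookup above.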

This tool implements the conventional skip-gram model to learn the embeddings of words, and its extension proposed in Yamada et al. (2016) to learn the embeddings of entities.
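As a rough illustration of the word side of this model, the sketch below generates the (target, context) training pairs that a conventional skip-gram model learns from. It covers only pair generation with a fixed window; the function name `skipgram_pairs` and the window size are hypothetical, and the sketch does not reflect the entity extension or how Wikipedia2Vec implements training internally.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbors within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a token is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = ["tokyo", "is", "the", "capital", "of", "japan"]
for target, context in skipgram_pairs(tokens, window=1):
    print(target, "->", context)
```

A skip-gram model is then trained to predict each context word from its target word, which is what pushes words that appear in similar contexts toward similar vectors.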

An empirical comparison between Wikipedia2Vec and existing embedding tools (i.e., FastText, Gensim, RDF2Vec, and Wiki2vec) is available here.

Documentation is available online at http://wikipedia2vec.github.io/.

Basic Usage

Wikipedia2Vec can be installed via PyPI:

% pip install wikipedia2vec

With this tool, embeddings can be learned by running a train command with a Wikipedia dump as input. For example, the following commands download the latest English Wikipedia dump and learn embeddings from this dump:

% wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
% wikipedia2vec train enwiki-latest-pages-articles.xml.bz2 MODEL_FILE

Then, the learned embeddings are written to MODEL_FILE. Note that this command can take many optional parameters. Please refer to our documentation for further details.
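Once training finishes, the model file can be queried from Python. The following is a minimal sketch that assumes the Python API described in the project documentation (`Wikipedia2Vec.load`, `get_word_vector`, `get_entity_vector`, `get_entity`, and `most_similar`); please check the documentation for the exact signatures in your installed version. The helper name `query_model` and its arguments are hypothetical and exist only for this example.

```python
def query_model(model_file, word, entity_title, count=5):
    """Load a trained model and return vectors plus the nearest neighbors of an entity.

    Assumes the wikipedia2vec package is installed and `model_file` was
    produced by `wikipedia2vec train`.
    """
    # Imported here so the sketch can be read without the package installed.
    from wikipedia2vec import Wikipedia2Vec

    model = Wikipedia2Vec.load(model_file)

    # Words and entities are looked up through separate methods;
    # a lookup for an item missing from the vocabulary raises KeyError.
    word_vector = model.get_word_vector(word)
    entity_vector = model.get_entity_vector(entity_title)

    # most_similar returns (item, similarity) pairs covering both words and entities.
    neighbors = model.most_similar(model.get_entity(entity_title), count)
    return word_vector, entity_vector, neighbors
```

For example, `query_model("MODEL_FILE", "tokyo", "Tokyo")` would return the vector for the word, the vector for the entity, and the items closest to the entity, provided both are in the model's vocabulary.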

Pretrained Embeddings

Pretrained embeddings for 12 languages (i.e., English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Polish, Portuguese, Russian, and Spanish) can be downloaded from this page.

Use Cases

Wikipedia2Vec has been used in the following applications:

Entity linking: Yamada et al., 2016, Eshel et al., 2017, Chen et al., 2019.
Named entity recognition: Sato et al., 2017, Lara-Clares and Garcia-Serrano, 2019.
Question answering: Yamada et al., 2017.
Entity typing: Yamada et al., 2018.
Text classification: Yamada et al., 2018, Yamada and Shindo, 2019.
Paraphrase detection: Duong et al., 2018.
Knowledge graph completion: Shah et al., 2019.
Fake news detection: Singh et al., 2019.
Plot analysis of movies: Papalampidi et al., 2019.
Enhancement of BERT using Wikipedia knowledge: Poerner et al., 2019.

References

If you use Wikipedia2Vec in a scientific publication, please cite the following paper:

Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Wikipedia2Vec: An Optimized Tool for Learning Embeddings of Words and Entities from Wikipedia.

@article{yamada2018wikipedia2vec,
  title={Wikipedia2Vec: An Optimized Tool for Learning Embeddings of Words and Entities from Wikipedia},
  author={Yamada, Ikuya and Asai, Akari and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu},
  journal={arXiv preprint 1812.06280},
  year={2018}
}

Wikipedia2Vec is an official implementation of the embedding model proposed in the following paper:

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation.

@inproceedings{yamada2016joint,
  title={Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation},
  author={Yamada, Ikuya and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu},
  booktitle={Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning},
  year={2016},
  publisher={Association for Computational Linguistics},
  doi={10.18653/v1/K16-1025},
  pages={250--259}
}

The text classification model implemented in the examples/text_classification directory was proposed in the following paper:

Ikuya Yamada, Hiroyuki Shindo, Neural Attentive Bag-of-Entities Model for Text Classification.

@inproceedings{yamada2019neural,
  title={Neural Attentive Bag-of-Entities Model for Text Classification},
  author={Yamada, Ikuya and Shindo, Hiroyuki},
  booktitle={Proceedings of The 23rd SIGNLL Conference on Computational Natural Language Learning},
  year={2019},
  publisher={Association for Computational Linguistics}
}

License

Apache License 2.0
