Skip to content

Implementing Facebook's FastText with java

License

Notifications You must be signed in to change notification settings

michalmisiewicz/fastText4j

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FastText4j

Implementing Facebook's FastText with java. Fasttext is a library for text representation and classification by facebookresearch. It implements text classification and word embedding learning.

中文文档

Features:

  • Implementing with java(kotlin)
  • Well-designed API
  • Compatible with original C++ model file (include quantizer compression model)
  • Provides training api (almost the same performance)
  • Support for java file formats( can read file use mmap),read big model file with less memory

Installing

Gradle

compile 'com.mayabot:fastText4j:1.2.2'

Maven

<dependency>
  <groupId>com.mayabot</groupId>
  <artifactId>fastText4j</artifactId>
  <version>1.2.2</version>
</dependency>

Tutorial

1. Train model

GitHub

  • ModelName.sup supervised
  • ModelName.sg skipgram
  • ModelName.cow cbow
//Word representation learning
FastText fastText = FastText.train(new File("train.data"), ModelName.sg);

// Text classification

FastText fastText = FastText.train(new File("train.data"), ModelName.sup);

data.txt is also encoded in utf-8 with one sample each line. And it needs to do word spliting beforehand as well. There is a string starting with __label__ in each line,representing the classifying target, such as __label__正面. Each sample could have multiple label. Through the attribute 'label' in TrainArgs, you can customise the head.

2. save model

save model to java format

fastText.saveModel("path/data.model");

3. load model

public Fasttext loadModel(String modelPath, boolean mmap)

//load from java format 
FastText fastText = FastText.loadModel("path/data.model",true);

//load from c++ format
FastText fastText = FastText.loadFasttextBinModel("path/wiki.bin") 

4. quantizer compression

FastText quantize(FastText fastText , int dsub=2, boolean qnorm=false)

//load from java format 
FastText quantizerFastText = FastText.quantize(fastText,2,false);

5.Predict

//predict the result of a word
List<FloatStringPair> predict = fastText.predict(Arrays.asList("fastText在预测标签时使用了非线性激活函数".split(" ")), 5);

6.Nearest Neighbor Search

List<FloatStringPair> predict = fastText.nearestNeighbor("中国",5);

7.Analogies

By giving three words A, B and C, return the nearest words in terms of semantic distance and their similarity list, under the condition of (A - B + C).

List<FloatStringPair> predict = fastText.analogies("国王","皇后","男",5);

Ag News example

test agnews data set, train and predict by fastText4j

Result:

   Read 5M words
   Number of words:  95812
   Number of labels: 4
   Progress: 100.00% words/sec/thread:  5792774 lr: 0.00000 loss: 0.28018 ETA: 0h 0m 0s
   Train use time 5275 ms
   total=7600
   right=6889
   rate 0.9064473684210527

Parameters of TrainArgs

The parameters is consistant with the C++ version :

The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurences [1]
  -minCountLabel      minimal number of label occurences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]

The following arguments for training are optional:
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]

The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]

Resource

Official pre-trained model

Recent state-of-the-art English word vectors.
Word vectors for 157 languages trained on Wikipedia and Crawl.
Models for language identification and various supervised tasks.

References

Please cite 1 if using this code for learning word representations or 2 if using for text classification.

Enriching Word Vectors with Subword Information

[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2017enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={Transactions of the Association for Computational Linguistics},
  volume={5},
  year={2017},
  issn={2307-387X},
  pages={135--146}
}

Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@InProceedings{joulin2017bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  month={April},
  year={2017},
  publisher={Association for Computational Linguistics},
  pages={427--431},
}

FastText.zip: Compressing text classification models

[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}

(* These authors contributed equally.)

License

fastText is BSD-licensed. facebook provide an additional patent grant.

About

Implementing Facebook's FastText with java

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Kotlin 93.1%
  • Java 6.9%