Welcome to: https://chunshan-theta.github.io/NLPLab/
Researching the core of NLP. Contact: [email protected]
Implementation of a sentiment model for Chinese
Implementation of the word2vec model
1. Load the training data
> Load stop words ( word2vec/stop_words.txt.py )
> Load the training articles ( word2vec/wiki/ or word2vec/TextForTrain/ )
Main steps:
* Remove special characters: keep only Chinese characters
* Convert Simplified to Traditional Chinese (../nstools/)
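The character-cleaning step above can be sketched with a regular expression that keeps only CJK characters; `keep_chinese_only` is an illustrative helper, not the repo's actual function.

```python
import re

# Minimal sketch: keep only Chinese (CJK) characters, dropping
# punctuation, digits, and Latin text. The repo's exact cleaning rule
# may differ.
def keep_chinese_only(text):
    # \u4e00-\u9fff covers the main CJK Unified Ideographs block.
    return "".join(re.findall(r"[\u4e00-\u9fff]+", text))

print(keep_chinese_only("Word2vec 是一種 NLP 模型, 2013年提出!"))  # → 是一種模型年提出
```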
2. Build the dictionary and replace rare words with the UNKNOWWORD token.
> Build the dictionary
> Replace rare words
Main steps:
* Set the vocabulary size for the training model
* Use collections.Counter().most_common() to select the most frequent words
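The dictionary step can be sketched as follows; `build_dictionary`, the toy token list, and the vocabulary size are illustrative, with index 0 reserved for the UNKNOWWORD placeholder.

```python
import collections

# Illustrative sketch of dictionary building, assuming a token list and
# a fixed vocabulary size; names here are not the repo's.
def build_dictionary(words, vocab_size):
    # Reserve index 0 for the rare-word placeholder.
    counts = [("UNKNOWWORD", 0)]
    counts.extend(collections.Counter(words).most_common(vocab_size - 1))
    dictionary = {word: index for index, (word, _) in enumerate(counts)}
    # Map every rare word (absent from the dictionary) to UNKNOWWORD's index.
    data = [dictionary.get(word, 0) for word in words]
    return data, dictionary

words = ["我", "愛", "我", "的", "貓", "貓", "狗"]
data, dictionary = build_dictionary(words, vocab_size=4)
# data → [1, 3, 1, 0, 2, 2, 0]: "的" and "狗" fall below the cutoff
```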
3. Generate training batches for the skip-gram model.
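A minimal sketch of skip-gram batch generation, assuming (centre, context) pairs are drawn from a sliding window over the integer-encoded data; the repo's generator (batch size, number of skips) may differ.

```python
# For each centre word, emit (centre, context) pairs from a window
# around it. Window size is an illustrative parameter.
def generate_batch(data, window_size=1):
    batch, labels = [], []
    for i, center in enumerate(data):
        for j in range(max(0, i - window_size), min(len(data), i + window_size + 1)):
            if j != i:
                batch.append(center)    # centre word
                labels.append(data[j])  # a context word from the window
    return batch, labels

batch, labels = generate_batch([0, 1, 2, 3], window_size=1)
# batch  → [0, 1, 1, 2, 2, 3]
# labels → [1, 0, 2, 1, 3, 2]
```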
4. Build and train a skip-gram model.
> Loss: tf.nn.nce_loss()
> Optimizer: tf.train.AdamOptimizer(learning_rate=1.0).minimize()
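As a rough illustration of what an NCE-style loss optimises, here is a toy negative-sampling update in NumPy rather than the repo's TensorFlow graph; vocabulary size, embedding dimension, learning rate, and the sampled negatives are all arbitrary choices.

```python
import numpy as np

# Toy sketch of the idea behind NCE / negative sampling: for a
# (centre, context) pair, raise the true pair's score and lower the
# scores of a few sampled negative words.
rng = np.random.default_rng(0)
vocab_size, dim = 10, 8
emb_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # centre-word vectors
emb_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negatives, lr=0.1):
    """One negative-sampling update for a single (centre, context) pair."""
    v = emb_in[center].copy()
    grad_v = np.zeros(dim)
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        g = sigmoid(v @ emb_out[word]) - label   # logistic-loss gradient
        grad_v += g * emb_out[word]
        emb_out[word] -= lr * g * v
    emb_in[center] -= lr * grad_v

before = sigmoid(emb_in[2] @ emb_out[5])
for _ in range(20):
    sgns_step(center=2, context=5, negatives=[1, 7])
after = sigmoid(emb_in[2] @ emb_out[5])
# after > before: repeated updates raise the true pair's probability
```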
5. Begin training
> Training stage
> TensorBoard logs (written to word2vec/TB/)
> Output to a JSON text file: result_Json
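The JSON export step can be sketched as below; the embedding values and the exact result_Json layout are assumptions, not the repo's actual format.

```python
import json

# Hypothetical sketch: map each word to its learned vector and write
# the result as a JSON text file, then read it back.
embeddings = {"我": [0.12, -0.30], "貓": [0.05, 0.44]}  # toy 2-d vectors
with open("result_Json", "w", encoding="utf-8") as f:
    json.dump(embeddings, f, ensure_ascii=False)

with open("result_Json", encoding="utf-8") as f:
    restored = json.load(f)
```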
Getting the data from websites:
1: Scientific articles
2: Positive and negative reviews
Testing the NLP model
Settings for Traditional Chinese
Converting Simplified to Traditional Chinese
Implementation of the TextRank model
Implementation of the tf-idf model
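A hedged sketch of TextRank keyword extraction: build a co-occurrence graph over words within a sliding window, then run a PageRank-style power iteration. Window size, damping factor, and iteration count are illustrative choices, not the repo's settings.

```python
# Co-occurrence graph + power iteration; all parameters are illustrative.
def textrank(words, window=2, damping=0.85, iters=50):
    # Undirected co-occurrence edges within the window.
    neighbours = {w: set() for w in words}
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                neighbours[w].add(words[j])
                neighbours[words[j]].add(w)
    nodes = list(neighbours)
    rank = {w: 1.0 / len(nodes) for w in nodes}
    for _ in range(iters):
        # PageRank-style update: a word inherits rank from its neighbours.
        rank = {
            w: (1 - damping) / len(nodes) + damping * sum(
                rank[u] / len(neighbours[u]) for u in neighbours[w] if neighbours[u]
            )
            for w in nodes
        }
    return rank

rank = textrank(["貓", "愛", "魚", "貓", "追", "魚"])
# The most connected words ("貓", "魚") score highest
```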
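A minimal tf-idf sketch over tokenised documents; the weighting variant used here (raw term frequency, unsmoothed idf) is an assumption about the repo's implementation.

```python
import math

# Score each term in each document by term frequency times inverse
# document frequency.
def tf_idf(docs):
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    scores = []
    for doc in docs:
        tf = {t: doc.count(t) / len(doc) for t in set(doc)}
        scores.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return scores

docs = [["貓", "狗", "貓"], ["狗", "魚"]]
scores = tf_idf(docs)
# "貓" appears only in the first document, so it gets a positive score
# there; "狗" appears in every document, so its idf (and score) is zero.
```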