A simple Spark LDA example. This project contains a basic document clustering example that also includes data cleaning.
Document clustering is performed with the following steps:
- Spark RegexTokenizer: tokenization
- Stanford NLP Morphology: stemming and lemmatization
- Spark StopWordsRemover: removing stop words and punctuation
- Spark TF-IDF: computing term frequencies and TF-IDF weights
- Spark LDA: clustering the documents
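The cleaning and weighting steps above can be illustrated with a minimal pure-Python sketch of tokenization, stop word removal, and TF-IDF. This is only a conceptual illustration with made-up sample documents and a tiny stop word set, not the project's Spark code: the project uses `RegexTokenizer`, `StopWordsRemover`, and Spark's TF-IDF instead, and the LDA clustering step is omitted here.

```python
import math
import re
from collections import Counter

# Tiny sample corpus and stop word list, invented for illustration only.
docs = [
    "Spark makes large scale data processing simple",
    "Topic models such as LDA cluster documents by topic",
    "Stop words are removed before computing TF-IDF",
]
STOP_WORDS = {"the", "a", "as", "are", "by", "such", "is", "before"}

def tokenize(text):
    # Regex tokenization, analogous to Spark's RegexTokenizer.
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    # Analogous to Spark's StopWordsRemover.
    return [t for t in tokens if t not in STOP_WORDS]

def tf_idf(token_docs):
    # Classic TF-IDF: tf(t, d) * log(N / df(t)).
    # (Spark's IDF uses a smoothed variant, log((N + 1) / (df + 1)).)
    n = len(token_docs)
    df = Counter()
    for doc in token_docs:
        df.update(set(doc))
    return [
        {term: (count / len(doc)) * math.log(n / df[term])
         for term, count in Counter(doc).items()}
        for doc in token_docs
    ]

token_docs = [remove_stop_words(tokenize(d)) for d in docs]
scores = tf_idf(token_docs)
```

Terms that occur in every document receive an IDF of zero, so they contribute nothing to the resulting vectors; the remaining weighted vectors are what the LDA stage would cluster.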