knowledge repo
ethen8181 committed Nov 28, 2017
1 parent a834184 commit ffe8ca9
Showing 37 changed files with 101 additions and 598,609 deletions.
1 change: 1 addition & 0 deletions Procfile
@@ -0,0 +1 @@
web: knowledge_repo --repo knowledge-repo deploy
12 changes: 0 additions & 12 deletions README.md
@@ -130,18 +130,6 @@ Genetic Algorithm. Math-free explanation and code from scratch.
- Start from a simple optimization problem and extend it to the traveling salesman problem (tsp).
- View [[nbviewer](http://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/ga/ga.ipynb)]

#### h2o : 2016.01.24

Walking through [H2O 2015 World Training GitBook](http://learn.h2o.ai/content/index.html). The walkthrough does basically zero feature engineering with the example dataset, as it is just browsing through its function calls and parameters. Apart from that, [H2O Resources](http://www.h2o.ai/resources/) also contains booklets on each of the models.

- R's API:
- h2o’s deep learning. [[Rmarkdown](http://ethen8181.github.io/machine-learning/h2o/h2o_deep_learning/h2o_deep_learning.html)]
- h2o’s Ensemble Tree. [[Rmarkdown](http://ethen8181.github.io/machine-learning/h2o/h2o_ensemble_tree/h2o_ensemble_tree.html)]
- h2o’s Generalized Linear Model. [[Rmarkdown](http://ethen8181.github.io/machine-learning/h2o/h2o_glm/h2o_glm.html)]
- h2o’s super learner. [[R code](https://github.com/ethen8181/machine-learning/blob/master/h2o/h2o_super_learner/h2o_super_learner.R)]
- Python's API:
- h2o's deep learning, Ensemble Tree. [[nbviewer](http://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/h2o/h2o_python.ipynb)]

#### unbalanced : 2015.11.25

Choosing the optimal cutoff value for logistic regression using cost-sensitive mistakes (i.e. when the cost of misclassification differs between the two classes) when the dataset consists of unbalanced binary classes, e.g. the majority of the data points in the dataset have a positive outcome while only a few have a negative one, or vice versa. The notion can be extended to any other classification algorithm that can predict class probabilities; this documentation just uses logistic regression for illustration purposes.
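A minimal sketch of the idea (the synthetic data, the 9:1 class ratio, and the 5:1 cost ratio below are illustrative assumptions, not values from the documentation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic unbalanced binary data; the 9:1 class ratio is arbitrary
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=1)
prob = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

# assume a false negative costs 5x as much as a false positive
cost_fp, cost_fn = 1.0, 5.0
cutoffs = np.linspace(0.01, 0.99, 99)
costs = [cost_fp * np.sum((prob > c) & (y == 0)) +
         cost_fn * np.sum((prob <= c) & (y == 1))
         for c in cutoffs]
print('cost-minimizing cutoff:', cutoffs[np.argmin(costs)])
```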
4 changes: 2 additions & 2 deletions association_rule/apriori.ipynb
@@ -286,9 +286,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"A large portion of the content is based on [Introduction to Data Mining Chapter6: Association Analysis](http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf), this documentation simply adds an educational implementation of the algorithm from scratch.\n",
"# Association Rule\n",
"\n",
"# Association Analysis\n",
"A large portion of the content is based on [Introduction to Data Mining Chapter6: Association Analysis](http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf), this documentation simply adds an educational implementation of the algorithm from scratch.\n",
"\n",
"Many business enterprise accumulates marketing-basket transactions data. For example, a typical marketing-basket transactions may look like:\n",
"\n",
2 changes: 1 addition & 1 deletion clustering/GMM/GMM.ipynb
@@ -418,7 +418,7 @@
"plt.figure(figsize = (15, 6))\n",
"for i in range(3):\n",
" plt.subplot(1, 3, i + 1)\n",
" z = multivariate_normal( [0, 0], covariances[i] ).pdf(position)\n",
" z = multivariate_normal([0, 0], covariances[i]).pdf(position)\n",
" plt.contour(x, y, z)\n",
" plt.title('{}, {}'.format(titles[i], covariances[i]))\n",
" plt.xlim([-4, 4])\n",
67 changes: 49 additions & 18 deletions clustering/topic_model/LDA.ipynb
@@ -382,8 +382,7 @@
" id2word = texts_dictionary, \n",
" num_topics = 2,\n",
" passes = 5,\n",
" iterations = 50\n",
")"
" iterations = 50)"
]
},
{
@@ -472,7 +471,8 @@
"# we need to convert it to bag of words format first\n",
"bow_water = ['bank', 'water', 'bank']\n",
"bow = texts_model.id2word.doc2bow(bow_water) \n",
"doc_topics, word_topics, phi_values = texts_model.get_document_topics(bow, per_word_topics = True)\n",
"doc_topics, word_topics, phi_values = texts_model.get_document_topics(\n",
" bow, per_word_topics = True)\n",
"\n",
"# note that doc_topics equivalent to simply calling model[bow]\n",
"print('document topics: ', doc_topics)\n",
@@ -519,8 +519,9 @@
],
"source": [
"bow_finance = ['bank', 'finance']\n",
"bow = texts_model.id2word.doc2bow(bow_finance) # convert to bag of words format first\n",
"doc_topics, word_topics, phi_values = texts_model.get_document_topics(bow, per_word_topics = True)\n",
"bow = texts_model.id2word.doc2bow(bow_finance) # convert to bag of words format first\n",
"doc_topics, word_topics, phi_values = texts_model.get_document_topics(\n",
" bow, per_word_topics = True)\n",
"word_topics"
]
},
@@ -550,8 +551,9 @@
}
],
"source": [
"bow = texts_model.id2word.doc2bow([ 'the', 'bank', 'by', 'the', 'river', 'bank' ])\n",
"doc_topics, word_topics, phi_values = texts_model.get_document_topics(bow, per_word_topics = True)\n",
"bow = texts_model.id2word.doc2bow(['the', 'bank', 'by', 'the', 'river', 'bank'])\n",
"doc_topics, word_topics, phi_values = texts_model.get_document_topics(\n",
" bow, per_word_topics = True)\n",
"word_topics"
]
},
@@ -687,13 +689,16 @@
" if stopword in dictionary.token2id]\n",
"dictionary.filter_tokens(stop_ids)\n",
"\n",
"# filter out words that appear in less than 2 documents (appear only once)\n",
"# filter out words that appear in less than 2 documents (appear only once),\n",
"# there's also a no_above argument that we could specify, e.g.\n",
"# no_above = 0.5 would remove words that appear in more than 50% of the documents\n",
"dictionary.filter_extremes(no_below = 2)\n",
"\n",
"# remove gaps in id sequence after words that were removed\n",
"dictionary.compactify()\n",
"print('number of unique tokens: ', len(dictionary))\n",
"\n",
"# convert words to the \"learned\" word id\n",
"corpus = [dictionary.doc2bow(text) for text in texts]"
]
},
@@ -707,10 +712,18 @@
">\n",
"> With gensim we can run online LDA, which is an algorithm that takes a chunk of documents, updates the LDA model, takes another chunk, updates the model etc. Online LDA can be contrasted with batch LDA, which processes the whole corpus (one full pass), then updates the model, then another pass, another update... The difference is that given a reasonably stationary document stream (not much topic drift), the online updates over the smaller chunks (subcorpora) are pretty good in themselves, so that the model estimation converges faster. As a result, we will perhaps only need a single full pass over the corpus: if the corpus has 3 million articles, and we update once after every 10,000 articles, this means we will have done 300 updates in one pass, quite likely enough to have a very accurate topics estimate.\n",
"\n",
"The default parameter for the `LdaModel` is chunksize=2000, passes=1, update_every=1. \n",
"The default parameter for the `LdaModel` is chunksize=2000, passes=1, update_every=1.\n",
"\n",
"- The model will update once (`update_every`) every 1 chunk (10,000 documents). We can set `update_every` to 0 if we wanted to perform batch LDA.\n",
"- `passes` corresponds to how many times each mini-batch will be given to LDA for training. Setting it to higher value allows LDA to see our corpus multiple times and is very handy for smaller corpora.\n",
"- passes: Number of passes through the entire corpus. Setting it to higher value allows LDA to see our corpus multiple times and is very handy for smaller corpora.\n",
"- chunksize: Number of documents to load into memory at a time and process E step of EM.\n",
"- update_every: Number of chunks to process prior to moving onto the M step of EM.\n",
"\n",
"We'll not be discussing the EM algorithm here, but in general a chunksize of 100k and update_every set to 1 is equivalent to a chunksize of 50k and update_every set to 2. The primary difference is that we will save some memory using the smaller chunksize, but we will be doing multiple loading/processing steps prior to moving onto the maximization step. Passes are not related to chunksize or update_every. Passes is the number of times we want to go through the entire corpus. Below are a few examples of different combinations of the 3 parameters and the number of online training updates which will occur while training LDA.\n",
"\n",
- chunksize = 100k, update_every = 1, corpus = 1M docs, passes = 1: 10 updates total
- chunksize = 50k, update_every = 2, corpus = 1M docs, passes = 1: 10 updates total
- chunksize = 100k, update_every = 1, corpus = 1M docs, passes = 2: 20 updates total
- chunksize = 100k, update_every = 1, corpus = 1M docs, passes = 4: 40 updates total
"\n",
"The method used to fit the LDA model is a randomized algorithm, which means that it involves steps that are random. Because of these random steps, the algorithm will be expected to yield slighty different output for different runs on the same data. Hence to make sure that the output are consistent and to save some time, We will save the model without having to rebuild it every single time.\n",
"\n",
@@ -731,12 +744,14 @@
"metadata": {},
"outputs": [],
"source": [
"# load the model if we've already trained it before, it\n",
"# takes around 2 min 2 sec if we were to train it from scratch\n",
"# load the model if we've already trained it before\n",
"n_topics = 10\n",
"path = 'topic_model.lda'\n",
"if not os.path.isfile(path):\n",
" topic_model = LdaModel(corpus, id2word = dictionary, num_topics = 10, iterations = 200)\n",
" # training LDA can take some time, we could set\n",
" # eval_every = None to not evaluate the model perplexity\n",
" topic_model = LdaModel(\n",
" corpus, id2word = dictionary, num_topics = 10, iterations = 200)\n",
" topic_model.save(path)\n",
"\n",
"topic_model = LdaModel.load(path)"
@@ -973,7 +988,9 @@
"\n",
"# top 100 words by weight in each topic\n",
"top_n_words = 100\n",
"topics = topic_model.show_topics(num_topics = n_topics, num_words = top_n_words, formatted = False)\n",
"topics = topic_model.show_topics(\n",
" num_topics = n_topics, num_words = top_n_words, formatted = False)\n",
"\n",
"for _, infos in topics:\n",
" probs = [prob for _, prob in infos]\n",
" plt.plot(range(top_n_words), probs)\n",
@@ -1348,11 +1365,23 @@
"source": [
"From these two plots we can see that the low `eta` model results in higher weight placed on the top words and lower weight placed on the bottom words for each topic (or more intuitively, topics are composed of few words). On the other hand, the high `eta` model places relatively less weight on the top words and more weight on the bottom words. Thus increasing `eta` results in topics that have a smoother distribution of weight across all the words in the vocabulary.\n",
"\n",
"We have now seen how the hyperparameters influence the characteristics of our LDA topic model, but we haven't said anything about which settings are best. We know that these parameters are responsible for controlling the smoothness of the topic distributions for documents (`alpha`) and word distributions for topics (`eta`), but there's no simple conversion between smoothness of these distributions and quality of the topic model. \n",
"We have now seen how the hyperparameters influence the characteristics of our LDA topic model, but we haven't said anything about which settings are best. We know that these parameters are responsible for controlling the smoothness of the topic distributions for documents (`alpha`) and word distributions for topics (`eta`), but there's no simple conversion between smoothness of these distributions and quality of the topic model.\n",
"\n",
"## End Note\n",
"\n",
"**Hyperparamter:** Just like with all other models, there is no universally \"best\" choice for these hyperparameters. Finding a good topic model really requires some exploration of the output to see if it make sense (as we did by looking at the top words for each topics and checking some topic predictions for documents). If top words looks like complete gibberish, consider looking at the documents that got assigned to that topic and see if that helps decipher the contextual meaning of the topic. Or simply, scratch the whole thing and re-run the model, but during the re-run, add in uninterpretable words that appeared in the topic's top words to the stop words list so that they won't distort the interpretation again. If it still doesn't work ..., then try lemmatizing the words or use feature selection methods (the simplest being setting a cap on the number of words/tokens that the document-term matrix can use). If that still doesn't work ..., well, machine learning is garbage in garbage out, so maybe the data is simply way too outdated or messy to the utilized.\n",
"\n",
"**Word Representation:** Although LDA assumes the documents to be in bag of words (bow) representation, from this post [Quora: Why is the performance improved by using TFIDF instead of bag-of-words in LDA clustering?](https://www.quora.com/Why-is-the-performance-improved-by-using-TFIDF-instead-of-bag-of-words-in-LDA-clustering), it seems like people have also found sucess when using tf-idf representation as it can be considered a weighted bag of words.\n",
"\n",
"**[Memory Considerations](https://github.com/RaRe-Technologies/gensim/wiki/Recipes-&-FAQ#q5-i-am-getting-out-of-memory-errors-with-lsi-how-much-memory-do-i-need):** Gensim can only do so much to limit the amount of memory used by our analysis. Our program may take an extended amount of time, or possibly crash, if we do not take into account the amount of memory it will consume. Prior to training our model, we can get a ballpark estimate of memory use with the following formula:\n",
"\n",
`8 bytes * num_terms * num_topics * 3`
"\n",
"Just like with all other models, there is no universally \"best\" choice for these hyperparameters. Finding a good topic model really requires some exploration of the output to see if it make sense (as we did by looking at the top words for each topics and checking some topic predictions for documents). If top words looks like complete gibberish, consider looking at the documents that got assigned to that topic and see if that helps decipher the contextual meaning of the topic. Or simply, scratch the whole thing and re-run the model, but during the re-run, add in uninterpretable words that appeared in the topic's top words to the stop words list so that they won't distort the interpretation again. If it still doesn't work ..., then try lemmatizing the words or use feature selection methods (the simplest being setting a cap on the number of words/tokens that the document-term matrix can use). If that still doesn't work ..., well, machine learning is garbage in garbage out, so maybe the data is simply way too outdated or messy to the utilized.\n",
"- 8 bytes: size of double precision float\n",
"- num_terms: number of terms in the dictionary\n",
"- num_topics: number of topics\n",
"\n",
"Side note: Although LDA assumes the documents to be in bag of words (bow) representation, from this post [Quora: Why is the performance improved by using TFIDF instead of bag-of-words in LDA clustering?](https://www.quora.com/Why-is-the-performance-improved-by-using-TFIDF-instead-of-bag-of-words-in-LDA-clustering), it seems like people have also found sucess when using tf-idf representation."
"The magic number 3: The 8 bytes * num_terms * num_topic accounts for the model output, but Gensim will need to make temporary copies while modeling. The scaling factor of 3 gives you an idea of how much memory Gensim will be consuming while running with the temporary copies present. One quick way to quick down the memory usage is to limit the size of the token. After constructing the dictionary, we can do `print(dictionary)` to see the size of the our token and perform filtering to reduce the size if needed."
]
},
{
@@ -1361,6 +1390,8 @@
"source": [
"# Reference\n",
"\n",
"- [Blog: Gensim LDA: Tips and Tricks](http://miningthedetails.com/blog/python/lda/GensimLDA/)\n",
"- [Notebook: Pre-processing and training LDA](http://nbviewer.jupyter.org/github/RaRe-Technologies/gensim/blob/develop/docs/notebooks/lda_training_tips.ipynb)\n",
"- [Coursera Washington Clustering & Retrieval](https://www.coursera.org/learn/ml-clustering-and-retrieval)\n",
"- [gensim documentation: Corpora and Vector Spaces](http://nbviewer.jupyter.org/github/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Corpora_and_Vector_Spaces.ipynb)\n",
"- [gensim documentation: New Term Topics Methods and Document Coloring](http://nbviewer.jupyter.org/github/RaRe-Technologies/gensim/blob/develop/docs/notebooks/topic_methods.ipynb)\n",