An experiment with scikit-learn and kafka

velotiotech/kafka-ml

 
 

Bag of Words and tf-idf implementation in scikit-learn

This project is a simple implementation of bag of words and tf-idf. It performs document classification on a dataset with the categories listed below.

Categories for classification

  1. talk.politics.misc
  2. misc.forsale
  3. rec.motorcycles
  4. comp.sys.mac.hardware
  5. sci.med
  6. talk.religion.misc
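
To illustrate how bag of words and tf-idf can feed a classifier in scikit-learn, here is a minimal sketch. It is not the project's actual code: the training sentences are made up, only two of the categories above are used, and the choice of `MultinomialNB` is an assumption, since this README does not say which classifier the project uses. The pickle round trip shows one way a trained model could be saved and reloaded.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny hand-written corpus (illustrative only, not the project's dataset).
train_docs = [
    "the engine and brakes on my motorcycle need service",
    "riding a motorcycle with a new helmet",
    "the doctor prescribed medicine for the infection",
    "symptoms of the disease and the medicine dosage",
]
train_labels = ["rec.motorcycles", "rec.motorcycles", "sci.med", "sci.med"]

model = Pipeline([
    ("bow", CountVectorizer()),     # bag of words: per-document term counts
    ("tfidf", TfidfTransformer()),  # reweight counts by inverse document frequency
    ("clf", MultinomialNB()),       # assumption: the README names no classifier
])
model.fit(train_docs, train_labels)

# Round-trip through pickle, as one might do to persist the trained model.
restored = pickle.loads(pickle.dumps(model))

new_docs = ["my motorcycle engine stalls", "the doctor recommended medicine"]
predicted = list(restored.predict(new_docs))
for doc, category in zip(new_docs, predicted):
    print(f'"{doc}" => {category}')
```

Words the model never saw during training (such as "stalls") are simply ignored by the vectorizer, so a real model needs a far larger training set than this one.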

Requirements -

  1. scikit-learn
  2. pickle (part of the Python standard library, no separate install needed)
  3. Kafka

Install Kafka

  1. wget http://www-us.apache.org/dist/kafka/1.0.0/kafka_2.11-1.0.0.tgz
  2. tar -xvf kafka_2.11-1.0.0.tgz

Test Documents -

  1. test_bike_doc.txt
  2. test_med_doc.txt
  3. test_mac_doc.txt

How to run the project -

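Start Kafka (run these commands from the extracted kafka_2.11-1.0.0 directory; the two servers run in the foreground, so use a separate terminal for each)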

  1. bin/zookeeper-server-start.sh config/zookeeper.properties
  2. bin/kafka-server-start.sh config/server.properties
  3. bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic velotio

Start Kafka producer

  4. python twitter_kafka_prodcer.py

Start message classifier

  5. python doc_classifier.py
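
The producer script itself is not reproduced in this README. As a rough sketch of what publishing text messages to the `velotio` topic can look like, the example below assumes the kafka-python client and a broker on `localhost:9092` (the port in Kafka's default `config/server.properties`). The helper names `encode_message`, `publish` and `main` are hypothetical, and the sample messages are made up.

```python
TOPIC = "velotio"                     # topic created in step 3
BOOTSTRAP_SERVERS = "localhost:9092"  # assumption: Kafka's default listener


def encode_message(text):
    """Collapse runs of whitespace and encode the message as UTF-8 bytes."""
    return " ".join(text.split()).encode("utf-8")


def publish(messages, producer, topic=TOPIC):
    """Send each non-empty message to the topic and return how many were sent."""
    sent = 0
    for text in messages:
        payload = encode_message(text)
        if payload:  # skip messages that are empty after normalization
            producer.send(topic, value=payload)
            sent += 1
    producer.flush()  # block until buffered messages have been delivered
    return sent


def main():
    # Imported here so the helpers above work without a Kafka client installed.
    from kafka import KafkaProducer  # assumption: the kafka-python package

    producer = KafkaProducer(bootstrap_servers=BOOTSTRAP_SERVERS)
    count = publish(
        ["My bike needs a new chain", "Which Mac has the best display?"],
        producer,
    )
    print(f"sent {count} messages to {TOPIC}")
    producer.close()
```

Calling `main()` requires the Kafka server from step 2 to be running. `publish` accepts any object with `send` and `flush` methods, which makes it easy to try out without a broker.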

Understanding the output -

Each output line pairs a message with its predicted category:

"message" => category

NOTE - This project is under development. Help it grow by opening issues and pull requests! :)
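
For readers who want a picture of the classifier step, here is a hedged sketch of a consumer loop that prints results in the `"message" => category` form. It assumes the kafka-python client and a model previously saved with pickle; the file name `model.pkl` and the helper names are hypothetical, and the real `doc_classifier.py` may work differently.

```python
def format_result(message, category):
    """Render one result in the "message" => category form."""
    return f'"{message}" => {category}'


def classify_stream(records, predict):
    """Yield one formatted line per Kafka record.

    records: iterable of objects with a bytes `.value` attribute,
             which is what a kafka-python consumer yields by default.
    predict: callable taking a list of strings and returning a sequence
             of labels, e.g. the `predict` method of a scikit-learn model.
    """
    for record in records:
        text = record.value.decode("utf-8")
        category = predict([text])[0]
        yield format_result(text, category)


def main():
    import pickle

    from kafka import KafkaConsumer  # assumption: the kafka-python package

    with open("model.pkl", "rb") as fh:  # hypothetical file name
        model = pickle.load(fh)

    consumer = KafkaConsumer("velotio", bootstrap_servers="localhost:9092")
    for line in classify_stream(consumer, model.predict):
        print(line)
```

Keeping the loop in `classify_stream` separate from the Kafka connection means it can be exercised with plain Python objects before pointing it at a live topic.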
