This is an old script from my IBM Watson days that I'm trying to keep fresh :)
- 🎯 Efficient Processing: Tokenize and filter articles (now a little faster; a minimal sketch follows this list)
- 🧠 Pre-trained Embeddings: Load pre-trained word embeddings via Gensim
- 🔮 Data Augmentation: Expand your dataset
- 💾 Storage: Efficient I/O with HDF5
- 🛠 Customizable: Hopefully! ;)
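To give a flavor of the tokenize-and-filter step, here is a minimal sketch using NLTK. It is not the code in `analitika.py`; only the `WHITELIST` variable name comes from the script, and its value below is just an example.

```python
# Minimal sketch of the tokenize-and-filter idea, not the actual analitika.py code.
# The WHITELIST value below is an example; the script defines its own.
import nltk

nltk.download("punkt", quiet=True)  # tokenizer model used by word_tokenize

WHITELIST = "abcdefghijklmnopqrstuvwxyz0123456789 "  # example: allowed characters

def clean_and_tokenize(article):
    """Lowercase the article, drop characters outside WHITELIST, then tokenize."""
    filtered = "".join(ch for ch in article.lower() if ch in WHITELIST)
    return nltk.word_tokenize(filtered)

print(clean_and_tokenize("Hello, Watson! 42 answers."))
# ['hello', 'watson', '42', 'answers']
```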
- Clone the repo: `git clone https://github.com/0101011/analitika.git`
- Install dependencies: `pip install -r requirements.txt`
- Run the script: `python analitika.py`
- Place your `raw_data.json` in the `data/` directory
- (Optional) Add pre-trained embeddings to `data/`
- Run the script: `python analitika.py`
- Find the processed data in `data/` as HDF5 and pickle files (see the loading sketch below)
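If you want to inspect the processed output from Python, something like the following works; the file names under `data/` are assumptions here, so adjust them to whatever the script actually writes.

```python
# Peek at the processed data. File names are assumptions; adjust to your output.
import pickle
import h5py

with h5py.File("data/processed.h5", "r") as f:
    print(list(f.keys()))          # datasets stored in the HDF5 file
    name = list(f.keys())[0]
    print(f[name][:5])             # first few rows of one dataset

with open("data/metadata.pkl", "rb") as f:
    meta = pickle.load(f)
print(type(meta))                  # e.g. a dict with vocab / index mappings
```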
Customize the script by modifying these variables (an illustrative example follows the list):
- `WHITELIST`: Allowed characters
- `VOCAB_SIZE`: Maximum vocabulary size
- `limit`: Length constraints for articles
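For illustration, those variables might look like this near the top of `analitika.py`; the values and the shape of `limit` are assumptions, only the names come from the script.

```python
# Illustrative values only; check analitika.py for the real defaults.
WHITELIST = "0123456789abcdefghijklmnopqrstuvwxyz ?.!,'"  # characters to keep
VOCAB_SIZE = 20000  # cap on vocabulary size; rarer tokens fall back to an unknown token

# Length constraints for articles (assumed here to be a min/max pair in tokens)
limit = {
    "min_length": 20,
    "max_length": 500,
}
```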
Here are some ways you can contribute:
- 💡 My goal has been to turn this into a standalone package or CLI tool. Maybe we'll come up with something together.
This project is licensed under the MIT License - see the LICENSE file for details.
- NLTK for natural language processing
- Gensim for word embeddings
- h5py (HDF5 for Python) for efficient data storage
Made with ❤️ by [Your Name]