Telegram Data Clustering contest solution by Mindful Squirrel

Zert0X/tgcontest

 
 

TGNews

Links

Demo

Install

Prerequisites: CMake, Boost

$ sudo apt-get install cmake libboost-all-dev build-essential libjsoncpp-dev uuid-dev protobuf-compiler libprotobuf-dev

For macOS:

$ brew install boost jsoncpp ossp-uuid protobuf
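Before building, it can help to confirm that the command-line tools used in this README are on the PATH. The following is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper name, and the tool list is simply taken from the commands shown in this README. It only checks executables, so it cannot tell whether libraries such as Boost or jsoncpp are installed.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not shipped with tgcontest):
# print the names of the given tools that are not found on PATH.
missing_tools() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            # Append with a separating space only when the list is non-empty.
            missing="${missing:+$missing }$tool"
        fi
    done
    printf '%s\n' "$missing"
}

# Executables invoked by the install and build commands in this README.
result="$(missing_tools git wget unzip cmake make protoc)"
if [ -n "$result" ]; then
    echo "Missing tools: $result"
else
    echo "All required tools found"
fi
```

If anything is reported missing, rerun the package installation command for your platform.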

If you already have the zip archive, skip ahead to building the binary.

To download code and models:

$ git clone https://github.com/IlyaGusev/tgcontest
$ cd tgcontest
$ git submodule update --init --recursive
$ bash download_models.sh
$ wget https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.5.0%2Bcpu.zip
$ unzip libtorch-cxx11-abi-shared-with-deps-1.5.0+cpu.zip

On macOS, download this archive instead: https://download.pytorch.org/libtorch/cpu/libtorch-macos-1.5.0.zip

To build the binary (from the "tgcontest" directory):

$ mkdir build && cd build && Torch_DIR="../libtorch" cmake -DCMAKE_BUILD_TYPE=Release .. && make -j4
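The build command above can also be wrapped so that it is safe to rerun and picks the number of parallel jobs from the machine. This is a sketch under assumptions: `detect_jobs` and `build_tgnews` are hypothetical names, and falling back to 4 jobs simply mirrors the `-j4` used above. The cmake invocation itself is unchanged.

```shell
#!/bin/sh
# Hypothetical helper: print a job count for make.
# Tries nproc (Linux), then sysctl hw.ncpu (macOS), then falls back to 4.
detect_jobs() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    elif command -v sysctl >/dev/null 2>&1 && sysctl -n hw.ncpu >/dev/null 2>&1; then
        sysctl -n hw.ncpu
    else
        echo 4
    fi
}

# Hypothetical wrapper around the build command from this README.
# Runs in a subshell so the caller's working directory is not changed.
build_tgnews() {
    (
        mkdir -p build && cd build \
            && Torch_DIR="../libtorch" cmake -DCMAKE_BUILD_TYPE=Release .. \
            && make -j"$(detect_jobs)"
    )
}
```

To use it, source the file and call `build_tgnews` from the "tgcontest" directory. `mkdir -p` lets the function be rerun when the build directory already exists.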

To download datasets:

$ bash download_data.sh

Run on sample:

$ ./build/tgnews top data --ndocs 10000

Training

  • Russian FastText vectors training: VectorsRu.ipynb
  • Russian FastText category classifier training: CatTrainRu.ipynb
  • Russian text embedder with triplet loss training (v3): Colab notebook
  • English FastText vectors training: VectorsEn.ipynb
  • English FastText category classifier training: CatTrainEn.ipynb
  • English text embedder with triplet loss training (v3): Colab notebook
  • PageRank rating calculation: PageRankRating.ipynb
  • Russian ELMo-based sentence embedder training (not used): Colab notebook
  • XLM-RoBERTa pseudo-labeling for categorization: Colab notebook

Models

Data

Markup

Misc

Other contestants

Contacts
