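MNBVC (Massive Never-ending BT Vast Chinese corpus): an ultra-large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures and even "Martian script" (火星文) data. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wikis, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.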
A professional list on Large (Language) Models and Foundation Models (LLM, LM, FM) for Time Series, Spatiotemporal, and Event Data.
ChatYuan: Large Language Model for Dialogue in Chinese and English
ICME 2022 paper "Improving Image Paragraph Captioning with Dual Relations" code
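Abu quantitative trading system (stocks, options, futures, Bitcoin, machine learning): an open-source quantitative trading and quantitative investment framework based on Python
Python module to generate all regular expression matches
Keras implementation of transformers for humans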
"Few-shot Text Classification with Distributional Signatures" ICLR 2020
A PyTorch implementation of the method found in "Adversarially Robust Few-Shot Learning: A Meta-Learning Approach"
BERT-based Seq2Seq architecture trained on SQuAD to generate questions given a text and an answer.
A Large-Scale Few-Shot Relation Extraction Dataset
Attention-based Induction Networks for Few-Shot Text Classification
DBSCAN clustering algorithm on top of Apache Spark
Focal loss for multi-class classification
thunderboom / ML-NLP
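Forked from NLP-LOVE/ML-NLP
This project covers the knowledge points and code implementations commonly tested in machine learning, deep learning, and NLP interviews, which are also the theoretical foundations every algorithm engineer must master.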
Must-read papers on neural relation extraction (NRE)
Novel deep learning research work with PaddlePaddle
100+ Chinese Word Vectors: over a hundred sets of pre-trained Chinese word vectors