
NLP: Paradise for migrant workers

The Most Powerful NLP-Weapon Arsenal

NLP Migrant Workers' Paradise: Almost the Most Complete Chinese NLP Resource Library

In the process of getting started and becoming familiar with NLP, I used a lot of packages on GitHub, so I sorted them out and shared them here.

Many of the packages are very interesting and well worth collecting! If you find them useful, please share and star :star:, thank you!

This list is updated irregularly on an ongoing basis. Welcome to watch and fork! ❤️❤️❤️

🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥

* ChatGPT-like model evaluation comparison
* ChatGPT-like information
* Open source framework similar to ChatGPT
* LLM training, inference, low-resource, and efficient training
* Prompt engineering
* ChatGPT-like document Q&A
* ChatGPT-like industry applications
* ChatGPT-like course materials
* LLM security issues
* Multimodal LLM
* LLM dataset

🍆 🍒 🍐 🍊	🌻 🍓 🍈 🍅 🍍

* Corpus
* Lexicons and lexical tools
* Pre-trained language models
* Information extraction
* Knowledge graph
* Text generation
* Text summarization
* Intelligent question answering
* Text correction
* Document processing
* Table processing
* Text matching
* Text data augmentation
* Text retrieval
* Reading comprehension
* Sentiment analysis
* Common regular expressions
* Speech processing
* Event extraction
* Machine translation
* Number conversion
* Coreference resolution
* Text clustering
* Text classification
* Knowledge reasoning
* Explainable NLP
* Text adversarial attack
* Text visualization
* Text annotation tools
* Comprehensive tools
* Fun and entertaining tools
* Courses, reports, interviews, etc.
* Competitions
* Financial NLP
* Medical NLP
* Legal NLP
* Text to image
* Other

Comparison of ChatGPT-like model evaluation

Resource Name	Description	Link
ChatALL: chat with multiple AI bots at the same time (including products from Tsinghua University and iFlytek)	A tool for talking to multiple AI chatbots at once (such as ChatGPT, Bing Chat, Bard, Alpaca, Vicuna, Claude, ChatGLM, MOSS, iFlytek Spark, ERNIE, etc.). It sends a prompt to the different bots in parallel to help users find the best answer	github-ChatALL
Chatbot Arena	Benchmarking LLMs with Elo ratings in real-world scenarios. Chatbot Arena is a benchmark platform for large language models (LLMs) that runs anonymous, randomized head-to-head battles and ranks models with the Elo rating system widely used in competitive games such as chess. Elo ratings for 9 popular open source LLMs were released and a leaderboard was launched. The platform uses the FastChat multi-model serving system to provide an interactive interface in multiple languages, and the data comes from user votes. The advantages of Chatbot Arena are summarized, along with plans to provide better sampling algorithms, rankings, and serving systems	As of May 3, 2023
ChatGPT-like model evaluation summary	Large language models (LLMs) have received widespread attention. These powerful models can understand complex information and provide human-like responses to a variety of questions. Among them, GPT-3 and GPT-4 performed best, and Flan-T5 and Lit-LLaMA also performed well. Note, however, that commercial use of these models may require payment and data sharing	blog
A review of Large Language Models (LLMs)		blog
Latest Research on Large Model Evaluation	Long text modeling has always been one of ChatGPT's most impressive capabilities. We use paragraph translation as an experimental scenario to conduct a comprehensive and fine-grained test of large models' paragraph-level modeling capabilities.	paper
Chinese large model evaluation tools & rankings	C-Eval is a comprehensive Chinese evaluation suite for foundation models. It contains 13,948 multiple-choice questions covering 52 different subjects and four difficulty levels. See the project website or paper for more details.	github paper
OpenCompass large model evaluation	OpenCompass is an open-source, efficient, and comprehensive large-model evaluation system and open platform developed by Shanghai Artificial Intelligence Laboratory. It provides a complete, open-source, and reproducible evaluation framework, and supports one-stop evaluation of large language models, multimodal models, and other models. Using distributed technology, even models with hundreds of billions of parameters can be evaluated within a few hours. Based on multiple highly recognized datasets across different dimensions, it provides a variety of evaluation methods, including zero-shot, few-shot, and chain-of-thought evaluation, to fully quantify the model's capabilities in each dimension.	github website
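
Chatbot Arena's leaderboard rests on Elo ratings computed from pairwise user votes. As a rough illustration, here is the standard Elo update in Python. This is a sketch of the textbook formula, not Chatbot Arena's actual code; the K-factor of 32, the starting rating of 1000, and the function names are assumptions chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one battle.

    score_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def rate_battles(battles, initial=1000.0, k=32.0):
    """Process (model_a, model_b, score_a) votes in order and return ratings."""
    ratings = {}
    for model_a, model_b, score_a in battles:
        ra = ratings.get(model_a, initial)
        rb = ratings.get(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return ratings
```

Because each update adds to one model exactly what it subtracts from the other, the total rating across models stays constant, and an upset win against a higher-rated model moves ratings more than an expected win.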

ChatGPT-like information

Resource Name	Description	Link
Open LLMs: Open Large Language Models (LLMs) for commercial use	A list of open LLMs available for commercial use	github
LLM Zoo: A marketplace for data, models, and benchmarks for large language models	LLM Zoo: democratizing ChatGPT - a project that provides data, models, and evaluation benchmark for large language models	github
Large Language Model (LLM) Data Collection	List of related papers, including research work on guidance, reasoning, decision making, continuous improvement, and self-improvement	LLM information collection
DecryptPrompt	A summary of prompt and LLM papers, open source data and models, and AIGC applications	github
SmartGPT	Designed to provide large language models (especially GPT-3.5 and GPT-4) with the ability to complete complex tasks by breaking them down into smaller problems and using the Internet and other external sources to collect information. Features include modular design, easy configuration, and high support for plug-ins. SmartGPT operates based on the concept of "Autos", including two types, "Runner" and "Assistant", both equipped with LLM agents that handle planning, reasoning, and task execution. In addition, SmartGPT also has a memory management system, as well as a plug-in system that can define various commands	github-SmartGPT
OpenGPT	A framework for creating instruction-based datasets and training conversational domain-expert large language models (LLMs). It has been successfully applied to train the health care conversational model NHS-LLM, using data from the UK National Health Service (NHS) website to generate a large number of question-answer pairs and unique conversations.	github-OpenGPT
PaLM 2 Technical Report	Google has released PaLM 2, a new language model with better multilingual and reasoning capabilities that is also more compute-efficient than its predecessor, PaLM. PaLM 2 combines a number of research advances, including compute-optimal scaling of model and data size, more diverse and multilingual datasets, and more effective model architectures and objective functions. PaLM 2 achieves state-of-the-art performance on a variety of tasks and capabilities, including language proficiency tests, classification and question answering, reasoning, programming, translation, and natural language generation. It also demonstrates strong multilingual capabilities, handling hundreds of languages and translating and interpreting between them. The report also addresses responsible use, including controlling toxicity at inference time, reducing memorization, and assessing potential harm and bias.	PaLM 2 Technical Report
DB-GPT	An open source experimental project based on vicuna-13b and FastChat that uses langchain and llama-index for in-context learning and question answering. The project is deployed fully locally to ensure data privacy and security, and can connect directly to private databases to process private data. Its functions include SQL generation, SQL diagnosis, database knowledge question answering, etc.	github-DB-GPT
A large list of Transformers related literature resources	Contains a variety of Transformer models, such as BERT, GPT, Transformer-XL, etc. These models have been widely used in many natural language processing tasks. In addition, the list also provides relevant papers and code links for these models, providing a good reference resource for researchers and developers in the field of natural language processing.	github
The Ultimate Guide to GPT-4	A guide to using GPT-3 and GPT-4, with more than 100 resources to help you use them to work and live more efficiently. It covers ChatGPT basics and advanced topics, using GPT-3 for language learning and teaching, using GPT-4, how to upgrade to the ChatGPT+ plan to access GPT-4, and how to use GPT-4 for free. It also includes guides on using ChatGPT for business, productivity, benefits, money, and more.	link
Efficient fine-tuning of LLM parameters based on LoRA		link
Complex Reasoning: The North Star Capability of Large Language Models	In the GPT-4 release blog, the authors wrote: "In a casual conversation, the difference between GPT-3.5 and GPT-4 may be subtle. When the complexity of the task reaches a sufficient threshold, the difference will become apparent." This means that complex tasks are likely to be the key differentiating factor between large and small language models. In this article, we will carefully analyze and discuss how to make large language models have powerful complex reasoning capabilities.	blog
Are the emergent abilities of large language models a mirage?	The emergent abilities of large language models have long been regarded as a magical phenomenon, as if sheer scale worked miracles, but this paper argues that they may be just an illusion.	paper
Probabilistic Summary of Large Language Models	Very detailed explanation and summary of LLM science	paper
A brief history of the LLaMA model	LLaMA is a language model released by Meta. It uses the Transformer architecture and comes in multiple sizes, the largest with 65B parameters. Like GPT, it can be fine-tuned further and suits a variety of tasks. Unlike GPT, LLaMA is open source and can be run locally. Models built on LLaMA include Alpaca, Vicuna, Koala, GPT4-x-Alpaca, and WizardLM, each with different training data and performance.	blog
Complex Reasoning with Large Language Models	Discusses how to train language models with strong complex reasoning capabilities and how to prompt them effectively to unleash their full potential. Noting the similarities between developing language models and developing programs, it proposes three training stages: continued training, supervised fine-tuning, and reinforcement learning. It introduces a set of tasks for evaluating the reasoning capabilities of large language models, and discusses how prompt engineering can give the model varied learning opportunities so that it achieves better results, ultimately achieving intelligence.	link
Large language model evolution tree		paper
Hung-yi Lee (Li Hongyi): How to replicate your own ChatGPT on a tight budget with low resources		blog
Essential resources for training ChatGPT: A complete guide to corpus, models, and code libraries		Resource link paper address
GitHub treasure library, which organizes various open source projects related to GPT		github
ChatGPT Chinese Guide		gitlab
Discusses the applications, advantages, limitations, and future directions of ChatGPT in natural language processing	Highlights ethical considerations and prompt engineering techniques for using this technology	paper
List of literature resources related to large language models		github
Literature Review on Large Language Models (Chinese Version)		github
A large list of ChatGPT related resources		github
Pre-Training to Learn in Context		paper
Langchain Architecture Diagram		image
Numbers every LLM developer should know		github
How to build powerful complex reasoning capabilities using large language models		blog
LLMs Nine-story Demon Tower	Shares hands-on experience and lessons learned from working with open source LLMs (ChatGLM, Chinese-LLaMA-Alpaca, MiniGPT-4, FastChat, LLaMA, gpt4all, etc.)	github

ChatGPT-like open source framework

Resource Name	Description	Link
LLM-As-Chatbot	This project makes all the LLMs available on the market into Chatbots, which can be run directly on Google Colab without having to build them yourself. It is very suitable for friends who want to experience LLM. I just tried it and it is really super simple. Some LLMs require more video memory, so it is best to have a Colab Pro subscription.	github
OpenBuddy	A powerful open source multilingual chatbot model aimed at global users, with a focus on conversational AI and fluent multilingual support, including English, Chinese and other languages. It is fine-tuned from Meta's LLaMA model, with an expanded vocabulary, added common characters, and enhanced token embeddings. With these improvements and a multi-turn conversation dataset, OpenBuddy can answer questions and translate between various languages. OpenBuddy's mission is to provide a free, open and offline AI model that can run on users' devices regardless of their language or cultural background. Currently, a demo version of OpenBuddy-13B can be found on the Discord server. Its key features include multilingual conversational AI (including Chinese, English, Japanese, Korean, French, etc.), an enhanced vocabulary with support for common CJK characters, and two model versions: 7B and 13B	github-OpenBuddy
Panda: Overseas Chinese open source large language model	Based on LLaMA-7B, -13B, -33B, and -65B with continued pre-training in the Chinese domain, using nearly 15M data entries; reasoning ability is evaluated on Chinese benchmarks	github - PandaLM
Dromedary: An open source self-aligned language model that can be trained with minimal human supervision		github-Dromedary
LaMini-LM is a collection of small and efficient language models for distillation	A collection of small, efficient language models distilled from ChatGPT, trained on a large dataset of 2.58M instructions	github
LLaMA-Adapter V2	LLaMA-Adapter V2 from Shanghai Artificial Intelligence Laboratory, with only 14M parameters injected, can be trained in 1 hour. The comparison results are really amazing, and it has multimodal functions (interpretation and question-answering of images)	github
HuggingChat	Hugging Face launched the first open source alternative to ChatGPT: HuggingChat. Based on the Open Assistant model, it supports Chinese conversations and code writing, but does not yet support replies in Chinese. The app is live and can be accessed directly without a proxy.	link
Open-Chinese-LLaMA	A Chinese large language model base built on LLaMA-7B through incremental pre-training on Chinese datasets	github
OpenLLaMA	An open-source reproduction of the LLaMA model, trained on the RedPajama dataset, using the same preprocessing steps and hyperparameters, model structure, context length, training steps, learning rate schedule, and optimizer as LLaMA. PyTorch and Jax weights for OpenLLaMA are available on Huggingface Hub. OpenLLaMA shows similar performance to LLaMA and GPT-J in various tasks, and performs better in some tasks.	github
replit-code-v1-3b	Released under the CC BY-SA 4.0 license, which means commercial use is allowed	link
MOSS	MOSS is an open source conversational language model that supports Chinese and English and multiple plug-ins. The moss-moon series models have 16 billion parameters and can run on a single A100/A800 or two 3090 GPUs at FP16 precision, and on a single 3090 GPU at INT4/8 precision. The MOSS base language model is pre-trained on about 700 billion Chinese, English and code words, and is subsequently fine-tuned with conversational instructions, plugin-augmented learning and human preference training, giving it multi-turn conversation ability and the ability to use multiple plug-ins.	github
RedPajama	1.2 Trillion Tokens Dataset	link
chinese_llama_alpaca_lora extraction framework		github
Scaling Transformer to 1M tokens and beyond with RMT	The paper proposes a technique called RMT (Recurrent Memory Transformer), which may extend the Transformer's token limit to 1 million or more.	github
Open Assistant	Contains a large number of AI-generated and manually annotated corpora and a variety of models based on LLaMA and Pythia. The released dataset includes more than 161K high-quality, human assistant-type interactive dialogue corpora in up to 35 languages	data model
ChatGLM Efficient Tuning	Efficient ChatGLM fine-tuning based on PEFT	github
Dolly Introduction		news
Baize: An open source chat model with parameter-efficient tuning on self-chat data	Baize is an open source chat model that can conduct multi-turn conversations. It was created by generating a high-quality multi-turn chat corpus from ChatGPT conversing with itself, then enhancing LLaMA (an open source large language model) with parameter-efficient tuning on that corpus. The Baize model shows good multi-turn conversation performance with minimal potential risks. It can run on a single GPU, making it accessible to a wider range of researchers. The Baize model and data are for research purposes only.	Paper address Source code address
GPTrillion--No open source code found	GPTrillion, a large model containing 1.5 trillion (1.5T) parameters, is now open source, claiming to be the world's largest open source LLM	google_doc
Cerebras-GPT-13B (commercially available)		hugging_face
Chinese-ChatLLaMA	Chinese ChatLLaMA dialogue model with pre-training and instruction fine-tuning datasets, built on the TencentPretrain multimodal pre-training framework; supports simplified and traditional Chinese, English, Japanese and other languages	github
Lit-LLaMA	A fully open source independent LLaMA implementation based on the Apache 2.0 license, built on nanoGPT, aims to address the limitations of the original LLaMA code under the GPL license to enable wider academic and commercial applications	github
MosaicML	MPT-7B-StoryWriter, 65K tokens, you can throw the entire "The Great Gatsby" into it at once.	huggingface
Langchain	Large Language Models (LLMs) are becoming a transformative technology, enabling developers to build applications that were previously impossible. However, using these standalone LLMs alone is often not enough to create a truly powerful application - the real power comes from being able to combine them with other computational or knowledge sources.	github
Guidance	Guidance lets you control modern language models more effectively and efficiently than traditional prompting or chaining. Guidance programs allow you to interleave generation, prompting, and logic control into a single continuous stream, matching the way language models actually process text. Simple output structures like "Chain of Thought" and its many variants (e.g. ART, Auto-CoT, etc.) have been shown to improve the performance of language models. The advent of more powerful language models (like GPT-4) has made richer structures possible, and Guidance makes it easier and cheaper to build such structures.	github
WizardLM	Gives large pre-trained language models the ability to follow complex instructions; the WizardLM-7B model is trained on the full set of evolved instructions (about 300k)	github
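
The MOSS entry above quotes hardware requirements at FP16 and INT4/8 precision. A back-of-the-envelope estimate of weight memory shows where such numbers come from. The sketch below counts only the model weights (activations, KV cache, and framework overhead are ignored, so real usage is higher), and the function name is illustrative rather than taken from any listed project.

```python
# Bits used to store one parameter at each precision.
BITS_PER_PARAM = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    bits = BITS_PER_PARAM[precision]
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    # A 16-billion-parameter model, as in the moss-moon series.
    for precision in ("fp16", "int8", "int4"):
        print(precision, weight_memory_gb(16e9, precision), "GB")
```

For 16 billion parameters this gives roughly 32 GB at FP16, 16 GB at INT8, and 8 GB at INT4. That is consistent with the entry: FP16 weights exceed the 24 GB of a single RTX 3090 but fit across two, while the quantized weights fit on one.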

LLM training, inference, low-resource, and efficient training

Resource Name	Description	Link
QLoRA--Guanaco	An efficient fine-tuning method that can fine-tune a 65B-parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning task performance. QLoRA back-propagates gradients through a frozen, 4-bit quantized pre-trained language model into low-rank adapters (LoRA)	github
Chinese-Guanaco	A Chinese low-resource quantized training/deployment solution	github
DeepSpeed Chat: One-click RLHF training		github
LLMTune: Fine-tuning a large 65B+ LLM on a consumer GPU	Enables 4-bit fine-tuning of models as large as the 65B LLaMA on common consumer-grade GPUs. LLMTune also implements the LoRA and GPTQ algorithms to compress and quantize LLMs, and handles large models through data parallelism. It provides both a command-line interface and a Python library	github
Fine-tuning based on ChatGLM-6B+LoRA on the instruction dataset	Built on DeepSpeed; supports multi-GPU fine-tuning, which is 8-9 times faster than a single GPU. For detailed settings, see "Fine-tuning 3. LoRA fine-tuning based on DeepSpeed"	github
Microsoft releases DeepSpeed Chat, an RLHF training tool		github
LlamaChat: A chatbot based on LLaMa on Mac		github
ChatGPT/GPT4 open source "alternatives"		github
Practical tips and tricks for training large machine learning models	Helps you train large models (>1B parameters), avoid instabilities, and save failed experiments without restarting from scratch	link
Instruction Tuning with GPT-4		paper
xturing	A Python package for fine-tuning LLM models efficiently, quickly, and easily. It supports multiple models such as LLaMA, GPT-J, GPT-2, etc. It can be trained using single GPU and multi-GPU. It uses efficient fine-tuning techniques such as LoRA to reduce hardware costs by up to 90% and complete model training in a short time.	github
GPT4All	An open source project for running a GPT-style model locally on a MacBook. It is built on the LLaMA-7B large language model; the data, code and demo are all open source, and the conversation style is closer to that of an AI assistant	github
Fine-tuning ChatGPT-like models with Alpaca-LoRA		link
LMFlow	A scalable, convenient and efficient toolbox for fine-tuning large machine learning models	github
Wenda: Large language model calling platform	Currently supports ChatGLM-6B, ChatRWKV, ChatYuan, and chatPDF (self-built knowledge base search) under the ChatGLM-6B model	github
Micro Agent	Small autonomous agent open source project, powered by LLM (OpenAI GPT-4), can write software for you, just set a "purpose" and let it work on its own	github
Llama-X	An open academic research project that, through joint community effort, aims to progressively improve LLaMA to the level of a SOTA LLM, avoiding duplicated work and making faster incremental progress together	github
Chinese-LLaMA-Alpaca	Chinese LLaMA & Alpaca LLMs - Open-source Chinese LLaMA model pre-trained with Chinese text data; open-source Chinese Alpaca model further fine-tuned with instructions; quickly deploy and experience the quantized version of the model locally using a laptop (personal PC)	github
Efficient Alpaca	An open source project built on LLaMA that aims to improve on Stanford Alpaca by fine-tuning the LLaMA-7B model to consume fewer resources, run inference faster, and be more accessible to researchers	github
ChatGLM-6B-Slim	ChatGLM-6B with 20K image tokens removed; same performance, but lower GPU memory usage	github
Chinese-Vicuna	A Chinese low-resource llama+lora solution	github
Alpaca-LoRA	Reproducing Stanford Alpaca's results on consumer hardware using LoRA	github
LLM Accelerator	Foundation models play an increasingly important role in many applications. Most large language models are trained autoregressively. Autoregressive generation produces high-quality text, but it also leads to high inference cost and long latency. Given the huge parameter counts involved, reducing cost and latency is a key issue in large-scale deployment. To address it, researchers at Microsoft Research Asia proposed LLM Accelerator, a method that uses reference text to losslessly accelerate large language model inference, achieving a two- to three-fold speedup in typical application scenarios.	blog
Large Language Model (LLM) Fine-tuning Technical Notes		github
PyLLMs	A concise Python library for connecting to various LLMs (OpenAI, Anthropic, Google, AI21, Cohere, Aleph Alpha, HuggingfaceHub), with built-in model performance benchmarks. Well suited to rapid prototyping and evaluation of different models. Features: connect to top LLMs with a small amount of code; response metadata (tokens processed, cost, latency) standardized across models; multi-model support, getting completions from different models at the same time; LLM benchmarks that evaluate the quality, speed and cost of models	github
Accelerating Large Language Models with Mixed Precision	By using low-precision floating-point operations, training and inference speed can be increased by up to 3 times without affecting model accuracy	blog
New LLM training method Federate	Duke University and Microsoft jointly released a new LLM training method, Federated GPT. It distributes the originally centralized training across different edge devices. After training completes, the sub-models are uploaded to a central server and merged.	github
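
Several entries in this table (QLoRA, LLMTune, Alpaca-LoRA, Chinese-Vicuna, xturing) rely on LoRA, which keeps the pre-trained weight matrix frozen and trains only a low-rank update added to it. The pure-Python sketch below illustrates the idea under stated assumptions: a row-vector convention (`h = xW + (alpha / rank) * (xA)B`), nested lists instead of tensors, and illustrative function names. Real implementations, such as the PEFT library used by ChatGLM Efficient Tuning, operate on GPU tensors.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_forward(x, w, a, b, alpha, rank):
    """Forward pass of a LoRA-adapted linear layer.

    x: inputs (n x d_in); w: frozen weights (d_in x d_out);
    a: trainable down-projection (d_in x rank);
    b: trainable up-projection (rank x d_out), typically initialized to zeros.
    """
    base = matmul(x, w)                # frozen pre-trained path
    update = matmul(matmul(x, a), b)   # trainable low-rank path
    scale = alpha / rank
    return [[p + scale * q for p, q in zip(row_base, row_upd)]
            for row_base, row_upd in zip(base, update)]


def trainable_params(d_in, d_out, rank):
    """Return (parameters in the full matrix, parameters in the LoRA adapter)."""
    return d_in * d_out, rank * (d_in + d_out)
```

With `b` initialized to zeros the adapted layer reproduces the frozen layer exactly, so fine-tuning starts from the pre-trained behavior. For a 4096 x 4096 layer at rank 8, `trainable_params` gives 16,777,216 full parameters versus 65,536 adapter parameters, which is why LoRA fits on far smaller GPUs.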

Prompt Engineering

Resource Name	Description	Link
OpenBuprompt-engineering-note	Prompt engineering notes (course summary) for the ChatGPT Prompt Engineering course for developers. The notes cover how language models work and prompt engineering practice, and show how to use a language model API in applications for various tasks. Topics include summarizing, inferring, transforming, expanding, and building chatbots, along with how to design good prompts and build custom chatbots.	github - OpenBuprompt
Prompt Engineering Guide		link
AIGC Prompt Engineering Learning Station Learn Prompt	ChatGPT/Midjourney/Runway	link
Prompts Featured - ChatGPT User Guide	ChatGPT usage guide to improve the playability and usability of ChatGPT	github
An unofficial list of resources for using ChatGPT.	Aims to aggregate resources such as apps, web apps, browser extensions, CLI tools, bots, integrations, packages, articles, etc. that use ChatGPT	github
Snack Prompt: ChatGPT prompt sharing community		link
ChatGPT Questioning Tips	How to ask ChatGPT questions to get high-quality answers: a complete guide to prompt engineering techniques	github
Prompt-Engineering-Guide-Chinese - Prompt-Engineering-Guide	Derived from the English version, with AIGC prompts added	github
OpenPrompt	An open shared prompt community, everyone recommends useful prompts	github
GPT-Prompts	Teach you how to generate prompts with GPT	github
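
The prompt engineering resources above commonly recommend a clear instruction, delimiters around user-supplied text, optional few-shot examples, and an explicit output format. The sketch below assembles a prompt along those lines. The function name, the `###` delimiter, and the wording are assumptions for illustration and are not taken from any listed course or repository.

```python
def build_prompt(instruction, text, examples=(), output_format=None, delimiter="###"):
    """Assemble a prompt from an instruction, few-shot examples, and delimited input.

    examples is an iterable of (sample_input, sample_output) pairs.
    """
    parts = [instruction.strip()]
    for sample_input, sample_output in examples:
        parts.append(f"Input: {sample_input}\nOutput: {sample_output}")
    if output_format:
        parts.append(f"Respond in this format: {output_format}")
    # Delimiters separate the instruction from untrusted or free-form text.
    parts.append(f"{delimiter}\n{text.strip()}\n{delimiter}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(
        "Classify the sentiment of the text between the ### markers.",
        "The battery life is fantastic.",
        examples=[("I love it.", "positive"), ("It broke in a day.", "negative")],
        output_format="one word, either positive or negative",
    ))
```

Keeping the template in one function makes it easy to vary a single element (for example, the number of few-shot examples) and compare model outputs.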

ChatGPT-like document Q&A

Resource Name	Description	Link
privateGPT	A privately deployed document Q&A platform based on GPT4All-J. It works without an Internet connection, so no data leaves the user's environment. It provides an API that lets users run interactive Q&A and text generation over their own documents. The platform also supports custom training data and model parameters to meet individual needs.	github-privateGPT
Auto-evaluator	Automatic evaluation of document question answering	github
PDF GPT	An open source PDF document chat solution based on GPT. Main features: one-on-one conversation with PDF documents; automatic content chunking, with embeddings generated by a powerful Deep Averaging Network encoder; semantic search over the PDF content, passing the most relevant embeddings to OpenAI; custom logic to generate more accurate responses, faster than OpenAI.	github
Redis-LLM-Document-Chat	Interact with PDF documents using LlamaIndex, Redis, and OpenAI. Contains a Jupyter notebook that demonstrates how to use Redis as a vector database to store and retrieve document vectors. It also shows how to use LlamaIndex to perform semantic search in documents and how to leverage OpenAI to provide a chatbot-like experience.	github
doc-chatbot	A document chatbot implemented by GPT-4 + Pinecone + LangChain + MongoDB, which can chat with multiple files, multiple topics and multiple windows, and the chat history is saved by MongoDB	github
document.ai	A universal local knowledge base solution based on vector database and GPT3.5	github
DocsGPT	DocsGPT is a cutting-edge open source solution that simplifies the process of finding information in project documentation. By integrating a powerful GPT model, developers can easily ask questions about a project and get accurate answers.	github
ChatGPT Retrieval Plugin	The ChatGPT retrieval plugin repository provides a flexible solution for semantic search and retrieval of personal or organizational documents using natural language queries.	github
LlamaIndex	LlamaIndex (GPT Index) is a data framework for your LLM application.	github
chatWeb	ChatWeb can crawl any web page or extract text from PDF, DOCX, and TXT files, generate an embedded summary, and answer your questions based on the text content. It is built on the gpt3.5 chat and embedding APIs together with a vector database.	github
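
Most document Q&A projects listed above (PDF GPT, Redis-LLM-Document-Chat, document.ai, chatWeb) follow the same pattern: split the document into chunks, embed each chunk, retrieve the chunks most similar to the question, and pass them to the LLM as context. The sketch below illustrates only the retrieval step. It uses a toy bag-of-words vector in place of a real embedding model and an in-memory list in place of a vector database; all function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=50):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_chunks(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_context_prompt(question, chunks, k=2):
    """Build the prompt that would be sent to the LLM with retrieved context."""
    context = "\n---\n".join(top_k_chunks(question, chunks, k))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")
```

In a real deployment `embed` would call an embedding model and the similarity search would run inside a vector store such as Redis or Pinecone, but the control flow stays the same.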

ChatGPT-like industry applications

Resource Name	Description	Link
Sentiment analysis of news reports	Using ChatGPT for sentiment analysis of news reports on listed companies generated a 500% return in the stock market (trading options) within 15 months in a backtest on historical data. The paper explores ChatGPT's potential for predicting stock market returns from sentiment analysis of news headlines, finds that its sentiment analysis outperforms traditional methods and correlates positively with stock market returns, and argues that ChatGPT has great value in finance and economics, offering insights and suggestions for future research and application	paper
Programming language generation model StarCoder	BigCode is a collaboration between ServiceNow Inc. and Hugging Face Inc. StarCoder has multiple versions. The core version, StarCoderBase, has 15.5 billion parameters, supports more than 80 programming languages, and has an 8,192-token context window. The video shows its VS Code plugin in action.	github
CodeGen2: Lessons for Training LLMs on Programming and Natural Languages	code generation	paper
MedicalGPT-zh: Chinese Medical General Language Model	The Chinese medical universal language model is based on the medical consensus and clinical guidelines of 28 departments to improve the model's medical field knowledge and dialogue capabilities	github
MagicSlides	AI-generated slide decks are something many people dream of. The free version can make 3 presentations per month and supports 2,500 words of input.	link
SalesGPT	Use LLM to implement a context-aware sales assistant that automates sales development rep activities, such as outbound sales calls	github
HuaTuo: LLaMA fine-tuning model based on Chinese medical knowledge		github
ai-code-translator	Helps you translate code from one language to another, something ChatGPT is really good at. GPT-4 in particular delivers very high translation quality and supports longer contexts.	github
ChatGenTitle	A paper title generation model fine-tuned on the LLaMA model using information from millions of arXiv papers	github
Regex.ai	A WYSIWYG, AI-based regular expression automatic generation tool. Just select the data, it can help you write regular expressions and provide multiple ways to extract data.	video
ChatDoctor	A medical chat model fine-tuned from LLaMA on medical domain knowledge. The medical data covers about 700 diseases and about 5,000 conversation records between doctors and patients.	paper
CodeGPT	The key to improving programming skills is data. CodeGPT is a code dialogue dataset for GPT generated by GPT. Now 32K Chinese data are publicly available, making the model better at programming	github
LaWGPT	A series of open source large language models based on Chinese legal knowledge	github
LangChain-ChatGLM-Webui	Inspired by langchain-ChatGLM, the WebUI made with LangChain and ChatGLM-6B series models provides large model applications based on local knowledge. Currently, it supports uploading text format files such as txt, docx, md, pdf, etc., and provides model files including ChatGLM-6B series, Belle series, and Embedding models such as GanymedeNil/text2vec-large-chinese, nghuyong/ernie-3.0-base-zh, nghuyong/ernie-3.0-nano-zh.	github

ChatGPT-like course materials

Resource Name	Description	Link
Databricks	(The author of the Dolly model) has released two free courses on edX, the second of which is about how the LLM is structured.	link
Large Language Model Technology Sharing Series	Natural Language Processing Laboratory, Northeastern University	video
How does GPT-4 work? How can we use GPT-4 to build intelligent programs?	Harvard University CS50 Open Course	video
Prompt Engineering Best Practices: Summary of Andrew Ng's New Prompt Engineering Course + LangChain Experience Summary		medium_blog
Fine-tuning the LLM model	If you are interested in fine-tuning the LLM model, be sure to follow this YouTube blogger, who has made public the fine-tuning methods for almost all LLM models on the market.	YouTuber Sam Witteveen
Transformer Architecture	Easy-to-understand introduction	youtube1 youtube2 youtube3
Video on the Transformer multi-head mechanism	If you want to understand every detail of the Transformer, including the mathematical principles behind it, watch this video. It is a very detailed analysis.	youtube
Introduction to Large Language Models	Introduces the concepts, usage scenarios, and prompt tuning of large language models (LLMs), along with Google's Gen AI development tools.	Introduction to Large Language Model

Safety issues of LLM

Resource Name	Description	Link
Research on the Security of LLM Model		link
Chatbot Injections & Exploit	A collection of chatbot injection and exploit examples that helps people understand the potential weaknesses and vulnerabilities of chatbots. Injections and attacks include command injection, character encoding, social engineering, emojis, Unicode, etc. Some of the examples include a list of emojis that can be used to attack chatbots.	github
GPTSecurity	A community covering cutting-edge academic research and practical experience sharing, integrating knowledge on security applications such as Generative Pre-trained Transformer (GPT), Artificial Intelligence Generated Content (AIGC), and Large Language Model (LLM). Here you can find the latest research papers, blog posts, practical tools, and preset instructions (Prompts) on GPT/AIGC/LLM.	github

Multimodal LLM

Resource Name	Description	Link
DeepFloyd IF	The latest open source text-to-image model with high realism and language understanding capabilities, consisting of a frozen text encoder and three sequential pixel diffusion modules, is an efficient model that surpasses the current state-of-the-art models and achieves a zero-shot FID score of 6.66 on the COCO dataset.	github
Multi-modal GPT	Trains a chatbot that can follow visual and language instructions at the same time. Built on the OpenFlamingo multimodal model, it uses various open datasets to create visual instruction data, and trains jointly on visual and language instructions to improve model performance.	github
AudioGPT	"Understanding and Generating Speech, Music, Sound, and Talking Head" by AIGC-Audio	github
text2image-prompt-generator	A small model trained with 250,000 Midjourney prompts based on GPT-2 can generate high-quality Midjourney prompts	link data
6 free text-to-image services other than Midjourney		Bing Image Creator, Playground AI, DreamStudio, Pixlr, Leonardo AI, Craiyon
BARK	A very powerful TTS (text-to-speech) project. The feature of this project is that it can add prompt words to the text, such as "laugh". This prompt word will become the sound of laughter and then synthesize it into the speech. It can also mix "male voice" and "female voice", so that you don't need to do the splicing operation again.	github
whisper	Whisper is the best and fastest library I have ever used for speech-to-text (STT, also known as ASR). I didn't expect that such a fast model could be optimized 70x. I plan to deploy this model and make it available to everyone for transcription of large speech files and translation. This model is multilingual and can automatically identify the language, which is really powerful.	github
OFA-Chinese: Chinese Multimodal Unified Pre-training Model	Chinese OFA model with transformers structure	github
Text-to-image open source model testing ground	Generates images from input text using stable-diffusion 1.5, stable-diffusion 2.1, DALL-E, kandinsky-2 and other models, which is convenient for testing and comparison	link
LLMScore	LLMScore is a new framework that provides evaluation scores with multi-granular compositionality. It uses a large language model (LLM) to evaluate text-to-image generation models. First, the image is converted into image-level and object-level visual descriptions, and then the evaluation instructions are fed into the LLM to measure the alignment of the synthesized image with the text, and finally a score and explanation are generated. Our extensive analysis shows that LLMScore has the highest correlation with human judgment on a wide range of datasets, significantly outperforming the commonly used text-image matching metrics CLIP and BLIP.	paper github
VisualGLM-6B	VisualGLM-6B is an open source, multimodal conversational language model that supports images, Chinese, and English. The language model is based on ChatGLM-6B and has 6.2 billion parameters. The image part builds a bridge between the visual model and the language model by training BLIP2-Qformer. The overall model has a total of 7.8 billion parameters.	github

LLM Dataset

Resource Name	Description	Link
Ambiguous Dataset	Whether a model can correctly resolve ambiguity is an important indicator of large language model quality, but there has been no standardized way to measure it. This paper proposes a dataset of 1,645 examples covering different types of ambiguity, together with a corresponding evaluation method.	github paper
thu instruction training data	We designed a process to automatically generate diverse, high-quality multi-turn instruction dialogue data (UltraChat) and carried out careful manual post-processing. All English data has now been open sourced, totaling more than 1.5 million records, making it one of the largest high-quality instruction datasets in the open source community.	github
Multimodal dataset MMC4	580 million images, 100 million documents, 40 billion tokens	github
EleutherAI Data	An 800 GB text corpus, integrated and free to download. I don't know how good a model trained on it would be, but I plan to try it.	pile data paper
UltraChat	Large-scale, information-rich, and diverse multi-turn conversation data	github
ConvFinQA Financial Data Question Answering		github
The botbots dataset	A dataset containing conversations from two ChatGPT instances (gpt-3.5-turbo), CLT commands and dialogue prompts from GPT-4, covering a variety of contexts and tasks, with a generation cost of about $35, which can be used for research and training smaller dialogue models (such as Alpaca)	github
alpaca_chinese_dataset - A manually tuned Chinese conversation dataset		github
CodeGPT-data	The key to improving programming ability is data. CodeGPT is a code dialogue dataset generated by GPT for training GPT-style models. 32K Chinese samples are now publicly available to help models get better at programming.	github

Corpus

Resource Name	Description	Link
Name Corpus		wainshine/Chinese-Names-Corpus
Chinese-Word-Vectors	Various Chinese word vectors	github repo
Chinese chat corpus	The database collects Douban multi-round, PTT gossip corpus, Qingyun corpus, TV drama dialogue corpus, Tieba forum reply corpus, Weibo corpus, Xiaohuangji corpus	link
Chinese rumor data	In this data file, each line is one rumor record in JSON format.	github
Chinese Question Answering Dataset		Link extraction code 2dva
WeChat public account corpus	3G corpus, including some WeChat official account articles captured from the web, with HTML removed and only plain text. Each article is in JSON format, with name being the WeChat official account name, account being the WeChat official account ID, title being the title, and content being the text.	github
Chinese natural language processing corpus and datasets		github
Task-based dialogue English dataset	【The Most Complete Task-based Dialogue Dataset】 mainly introduces a complete set of task-based dialogue datasets, which covers the main information of all commonly used datasets in the field of task-based dialogue. In addition, in order to help researchers better grasp the context of the progress of the field, we provide the state-of-the-art experimental results on several datasets in the form of Leaderboard.	github
Speech recognition corpus generation tool	Creating Automatic Speech Recognition (ASR) corpora from online videos with audio/captions	github
LitBankNLP dataset	A corpus of 100 labeled English novels to support natural language processing and computational humanities tasks	github
Chinese ULMFiT	Sentiment Analysis Text Classification Corpus and Model	github
Administrative division data of provinces, cities, districts and towns with pinyin annotations		github
Education Industry News Automatic Summarization Corpus		github
Chinese Natural Language Processing Dataset		github
Wikipedia Massively Parallel Text Corpus	85 languages, 1,620 language pairs, 135M parallel sentences	github
Ancient Poetry Library		github repo More complete ancient poetry library
Low memory loading of Wikipedia data	Loading 17GB+ English Wikipedia corpus with the new version of nlp library only takes up 9MB of memory and the traversal speed is 2-3 Gbit/s	github
Couplet data	700,000 couplets	github
Color Dictionary Dataset		github
42GB of JD Customer Service Dialogue Data (CSDD)		github
700,000 couplet data		link
Username blacklist		github
Dependency parsing corpus	40,000 sentences of high-quality annotated data	Homepage
People's Daily Corpus Processing Toolset		github
Fake news dataset fake news corpus		github
Poetry Quality Evaluation/Fine-Grained Emotional Poetry Corpus		github
Open tasks related to Chinese natural language processing	Datasets and current best results	github
Chinese Abbreviation Dataset		github
Chinese Task Benchmark Assessment	Representative datasets - Benchmark (pre-trained) models - Corpus - Baseline - Toolkit - Leaderboard	github
Chinese Rumor Database		github
CLUEDatasetSearch	Chinese and English NLP datasets Search all Chinese NLP datasets, with commonly used English NLP datasets	github
Multi-document summarization dataset		github
Politeness transfer task: make everyone "polite"	Converts impolite sentences into polite ones while preserving meaning; provides a dataset of 139M+ instances	Paper and code
Cantonese/English Conversation Bilingual Corpus		github
List of Chinese NLP datasets		github
Named entity recognition dataset for person, place, and organization names		github
Chinese Language Comprehension Assessment Benchmark	Including representative datasets & benchmark models & corpora & rankings	github
OpenCLaP multi-domain open source Chinese pre-trained language model repository	Civil documents, criminal documents, Baidu Encyclopedia	github
Chinese whole word masking BERT and two reading comprehension datasets	DRCD dataset: released by Delta Research Institute in Taiwan, China. It is an extractive reading comprehension dataset in Traditional Chinese, in the same format as SQuAD. CMRC 2018 dataset: Chinese machine reading comprehension data released by the Harbin Institute of Technology iFlytek Joint Laboratory. Given a question, the system must extract a span from the passage as the answer; the format is the same as SQuAD.	github
Dakshina Dataset	Latin/native script parallel dataset for twelve South Asian languages	github
OPUS-100	Multilingual (100 languages) parallel corpus centered on English	github
Chinese reading comprehension dataset		github
Chinese Natural Language Processing Vector Collection		github
Chinese Language Comprehension Assessment Benchmark	Includes representative datasets, benchmark (pre-trained) models, corpora, and leaderboards	github
A large list of NLP datasets/benchmark tasks		github
700,000 couplet data		github
Classical Chinese (ancient Chinese) - Modern Chinese Parallel Corpus	The short chapters include short ancient books such as "The Analects of Confucius", "Mencius" and "Zuo Zhuan", which have been merged with "Zizhi Tongjian"	github
COLDataset, Chinese offensive language detection dataset	Covers topics such as race, gender, and region. Data will be released after the paper is published.	paper
GAOKAO-bench: Using Chinese college entrance examination questions as a dataset	An evaluation framework that uses Chinese college entrance examination (Gaokao) questions to assess the language comprehension and logical reasoning of large language models. It includes 1,781 multiple-choice questions, 218 fill-in-the-blank questions, and 812 free-response questions.	github
Zero to NLP - Chinese NLP application data, models, training, inference		github
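
The WeChat public account corpus entry above describes each article as a JSON object with `name`, `account`, `title`, and `content` fields. The sketch below shows one way such a file could be read, assuming one JSON object per line (the layout the Chinese rumor data entry describes for its own file). The `iter_articles` helper and the sample records are hypothetical and are not taken from either repository.

```python
import io
import json


def iter_articles(lines):
    """Yield one dict per non-empty line, assuming one JSON object per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Hypothetical sample using the field names listed in the corpus description.
sample = io.StringIO(
    '{"name": "Example Account", "account": "example_id", '
    '"title": "First post", "content": "Hello NLP"}\n'
    "\n"
    '{"name": "Example Account", "account": "example_id", '
    '"title": "Second post", "content": "More text"}\n'
)

articles = list(iter_articles(sample))
titles = [article["title"] for article in articles]
print(titles)
```

For a real file, passing `open(path, encoding="utf-8")` in place of `sample` would stream the records without loading the whole corpus into memory.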

Thesaurus and lexical tools

Resource Name	Description	Link
textfilter	Chinese and English sensitive word filtering	observerss/textfilter
Name extraction function	Chinese (modern, ancient) names, Japanese names, Chinese surnames and given names, forms of address (various terms for aunt, etc.), English->Chinese names (John Lee), idiom dictionary	cocoNLP
Chinese abbreviations database	NPC: National People's Congress; China: People's Republic of China; Women's Tennis: Women's/n Tennis/n Match/vn	github
Chinese Character Dictionary	Decomposition table for Chinese characters with up to three decompositions per character, e.g. 拆: (I) 手 斥, (II) 扌 斥, (III) 才 斥	kfcd/chaizi
Vocabulary sentiment value	Spring water: 0.400704566541 Abundant: 0.37006739587	rainarch/SentiBridge
Chinese vocabulary, stop words, sensitive words		dongxiexidian/Chinese
python-pinyin	Convert Chinese characters to Pinyin	mozillazg/python-pinyin
zhtools	Convert between Traditional and Simplified Chinese	skydark/nstools
English-like Chinese pronunciation engine	say wo i ni #say: I love you	tinyfool/ChineseWithEnglish
chinese_dictionary	Synonyms, antonyms, negation thesaurus	guotong1988/chinese_dictionary
wordninja	Split and extract words from English strings without spaces	wordninja
Car brands, car parts related words		data
THU's vocabulary	IT thesaurus, financial thesaurus, idiom thesaurus, place name thesaurus, historical celebrity thesaurus, poetry thesaurus, medical thesaurus, diet thesaurus, legal thesaurus, automobile thesaurus, animal thesaurus	link
Crime legal terms and classification model	Contains 856 crime knowledge graphs, crime prediction based on a 2.8 million crime training database, 13 types of question classification and legal information Q&A function based on 200,000 legal Q&A pairs	github
Word segmentation corpus + code		Baidu Netdisk link - Extraction code pea6
Chinese word segmentation + part-of-speech tagging based on Bi-LSTM + CRF	Keras implementation	link
Chinese word segmentation and part-of-speech tagging based on Universal Transformer + CRF		link
Fast Neural Network Segmentation Package	java version
chinese-xinhua	Chinese Xinhua Dictionary database and API, including commonly used allegorical sayings, idioms, words and Chinese characters	github
SpaCy Chinese Model	Contains functions such as Parser, NER, syntax tree, etc. Some English packages use spacy's English model. If you want to adapt to Chinese, you may need to use spacy's Chinese model.	github
Chinese character data		github
Synonyms Chinese synonyms toolkit		github
HarvestText	Domain-adaptive text mining tools (new word discovery - sentiment analysis - entity linking, etc.)	github
word2word	Convenient and easy-to-use multilingual word-word pair collection 62 languages / 3,564 multilingual pairs	github
Polyphonic character dictionary data and code		github
Chinese characters, words, and idioms query interface		github
103976 English word library packages	(sql version, csv version, Excel version)	github
List of English swear words		github
Word Pinyin Data		github
Number name library in 186 languages		github
Large-scale name database of countries around the world		github
Chinese character feature extractor (featurizer)	Extract the features of Chinese characters (pronunciation features, glyph features) for use as features for deep learning	github
char_featurizer - Chinese character feature extraction tool		github
Python interface library for the Chinese, Japanese and Korean word segmentation library mecab		github
g2pC Context-based Chinese pronunciation automatic tagging module		github
ssc, Sound Shape Code	Sound and Shape Code - A Chinese string similarity calculation method based on "Sound and Shape Code"	version 1 version 2 blog/introduction
Retrieving the multiple senses of Chinese words and disambiguating words in specific sentences, based on an encyclopedic knowledge base		github
Tokenizer is a fast and customizable text tokenization library		github
Tokenizers	The most advanced tokenizer with emphasis on performance and versatility	github
Transform text by replacing synonyms		github
token2index is a powerful and lightweight term indexing library compatible with PyTorch/Tensorflow		github
Traditional and Simplified Chinese Conversion		github
Cantonese NLP Tools		github
Domain Dictionary	Professional dictionary knowledge base covering 68 fields and a total of 9.16 million words	github
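
Several entries above (textfilter and the Chinese vocabulary, stop word, and sensitive word lists) revolve around filtering text against a word list. The sketch below shows the basic idea with exact substring matching from the standard library only. It is not the implementation used by textfilter; the `mask_sensitive` helper and the sample words are hypothetical, and `mask` is assumed to be a single character.

```python
def mask_sensitive(text, words, mask="*"):
    """Replace every exact occurrence of each listed word with mask characters."""
    chars = list(text)
    for word in words:
        if not word:
            continue  # an empty entry would match everywhere
        start = text.find(word)
        while start != -1:
            # Overwrite the matched span, keeping the text length unchanged.
            chars[start:start + len(word)] = mask * len(word)
            start = text.find(word, start + 1)
    return "".join(chars)


print(mask_sensitive("这是一个坏词测试", ["坏词"]))
print(mask_sensitive("bad bad word", ["bad"]))
```

Scanning the text once per word is fine for short lists. For the large lexicons listed in this section, a trie or Aho-Corasick automaton would be the usual choice.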

Pre-trained language models & large models

Resource Name	Description	Link
BMList	Big Model Big List	github
Chinese translation of the BERT paper		link
Slides by the original BERT authors		link
Text Classification Practice		github
bert tutorial text classification tutorial		github
BERT PyTorch Implementation		github
BERT PyTorch Implementation		github
BERT generates sentence vectors, BERT performs text classification and text similarity calculation		github
Illustration of BERT and ELMO		github
BERT Pre-trained models and downstream applications		github
Language/knowledge representation tools BERT & ERNIE		github
Using the gpt-2 language model in Kashgari		github
Facebook LAMA	Probes for analyzing facts and common sense knowledge contained in pre-trained language models. Language model analysis, providing a unified access interface for Transformer-XL/BERT/ELMo/GPT pre-trained language models	github
GPT2 training code in Chinese		github
XLM: Facebook's cross-lingual pre-trained language model		github
Massive Chinese pre-trained ALBERT model		github
Transformers 2.0	Natural language processing pre-trained language models (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet...) supporting TensorFlow 2.0 and PyTorch: 8 architectures/33 pre-trained models/102 languages	github
8 papers sort out the progress and reflections of BERT-related models		github
French RoBERTa pre-trained language model	French RoBERTa pre-trained language model trained with 138GB corpus	link
Chinese pre-trained ELECTRA model	Chinese pre-trained model based on adversarial learning	github
albert-chinese-ner	Using the pre-trained language model ALBERT for Chinese NER	github
A collection of open source pre-trained language models		github
Chinese ELECTRA pre-trained model		github
Predicting the next word with Transformers (BERT, XLNet, Bart, Electra, Roberta, XLM-Roberta) (model comparison)		github
TensorFlow Hub	New language models for 40+ languages (including Chinese)	link
UER	A repository of Chinese pre-trained models based on different corpora, encoders, and target tasks (including BERT, GPT, ELMO, etc.)	github
A collection of open source pre-trained language models		github
Multilingual sentence vector pack		github
Language Model as a Service (LMaaS)	Language Model as a Service	github
Open source language model GPT-NeoX-20B	With 20 billion parameters, it was the largest publicly accessible pre-trained general-purpose autoregressive language model at the time of its release.	github
Chinese Scientific Literature Dataset (CSL)	Contains meta information (title, abstract, keywords, discipline, category) of 396,209 Chinese core journal papers. The CSL dataset can be used as a pre-training corpus, and can also be used to construct many NLP tasks, such as text summarization (title prediction), keyword generation, and text classification.	github
Large model development tool		github

Extraction

Resource Name	Description	Link
Time extraction	Integrated into the Python package cocoNLP; feel free to try it.	java version, Python version
Neural Network Relation Extraction PyTorch	Chinese is not supported yet	github
Named Entity Recognition PyTorch based on BERT	Chinese is not supported yet	github
Keyphrase extraction package pke		github
BLINK: a state-of-the-art entity linking library		github
Named Entity Recognition with BERT/CRF		github
LatticeLSTM Chinese named entity recognition supporting batch parallelism		github
Building a model for medical entity recognition	Contains dictionary and corpus annotation, based on Python	github
Pipeline entity and relation extraction based on TensorFlow and BERT	Solution for the information extraction task of the 2019 Language and Intelligent Technology Competition (Schema based Knowledge Extraction, SKE 2019)	github
Chinese named entity recognition NeuroNER vs BertNER		github
Chinese named entity recognition based on BERT		github
Chinese Key Phrase Extraction Tool		github
bert	Tensorflow version for Chinese named entity recognition	github
bert-Kashgari	Kashgari, a keras-based encapsulated classification and annotation framework, can build a classification or sequence annotation model in a few minutes	github
cocoNLP	Extraction of information such as name, address, email address, mobile phone number, mobile phone location, etc., rake phrase extraction algorithm.	github
Microsoft Multilingual Number/Unit/Date/Time Recognition Pack		github
Baidu's open source benchmark information extraction system		github
Chinese address segmentation (address element identification and extraction), NER through sequence labeling		github
Open domain text knowledge triple extraction and knowledge base construction based on dependency syntax		github
Chinese keyword extraction method based on pre-training model		github
chinese_keyphrase_extractor (CKPE)	A tool for quickly extracting and identifying key phrases from Chinese natural language text	github
A simple resume parser to extract key information from resumes		github
BERT-NER-Pytorch BERT Chinese NER experiments in three different modes		github
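
The cocoNLP entry above mentions extracting email addresses and mobile phone numbers from text. The sketch below illustrates that kind of rule-based extraction with the standard `re` module. The patterns are deliberately simplified assumptions (an 11-digit mainland China mobile number starting with 1, and a loose email shape), and they are not the patterns cocoNLP itself uses.

```python
import re

# Simplified, assumed patterns for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 11 digits, starting with 1 and a second digit of 3-9, not embedded in a longer number.
MOBILE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")


def extract_contacts(text):
    """Return the email addresses and mobile numbers found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "mobiles": MOBILE_RE.findall(text),
    }


text = "Contact Zhang at zhang.san@example.com or 13812345678 for details."
contacts = extract_contacts(text)
print(contacts)
```

The lookarounds in `MOBILE_RE` keep the pattern from matching an 11-digit slice of a longer digit string such as an order number.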

Knowledge Graph

Resource Name	Description	Link
Tsinghua University XLORE Chinese and English cross-language encyclopedia knowledge graph	Baidu, Chinese Wiki, English Wiki	link
Automatically generate document graph		github
Question answering system based on medical knowledge graph		github This repo refers to github
Chinese Character Relationship Knowledge Graph Project		github
AmpliGraph Knowledge Graph Representation Learning (Python) Library Knowledge Graph Concept Link Prediction		github
Chinese knowledge graph materials, data and tools		github
Chinese knowledge graph based on Baidu Encyclopedia	Extract triple information and build Chinese knowledge graph	github
Zincbase Knowledge Graph Construction Toolkit		github
Question answering system based on knowledge graph		github
Knowledge graph deep learning related materials collation		github
Southeast University "Knowledge Graph" Postgraduate Course (Materials)		github
Knowledge Graph Car Audio Project		github
One Piece Knowledge Graph		github
A dataset of 132 knowledge graphs	Covers common sense, cities, finance, agriculture, geography, meteorology, social networking, Internet of Things, medical care, entertainment, life, business, travel, science and education	link
Large-scale, structured, bilingual COVID-19 knowledge graph (COKG-19)		link
Event triple extraction based on dependency syntax and semantic role labeling		github
Abstract Knowledge Graph	The current scale is 500,000, supporting abstraction of noun entities, state descriptions, and event actions.	github
Large-scale Chinese knowledge graph data with 1.4 billion entities		github
Jiagu Natural Language Processing Tool	Based on models such as BiLSTM, it provides knowledge graph relation extraction, Chinese word segmentation, part-of-speech tagging, named entity recognition, sentiment analysis, new word discovery, keyword extraction, text summarization, text clustering, etc.	github
medical_NER - Named Entity Recognition in Chinese Medical Knowledge Graph		github
A large list of learning materials/datasets/tool resources related to knowledge graphs		github
LibKGE: A knowledge graph embedding library for reproducible research		github
Military knowledge graph question answering project based on mongodb storage	The military weapons knowledge base includes 8 major categories such as aircraft and space equipment, more than 100 subcategories, and a total of 5,800 items. This project does not use a graph database for storage. It uses Jieba to parse questions and identify question entities. It completes queries for multiple types of questions based on query templates. It mainly provides an industrial question-answering idea demo.	github
JD Product Knowledge Graph		github
Chinese relation extraction based on distant supervision		github
Intelligent question answering system based on medical knowledge graph		github
BLINK: a state-of-the-art entity linking library		github
A small securities knowledge graph/knowledge base		github
dstlr unstructured text scalable knowledge graph construction platform		github
Baidu Encyclopedia Character Entry Attribute Extraction	Knowledge graphing with BERT-based fine-tuning and feature extraction	github
COVID-19 related data	Chinese medical dialogue dataset of COVID-19 and other types of pneumonia; open data sources from Tsinghua University and other institutions (COVID-19)	github github
DGL-KE graph embedding representation learning algorithm		github
Cause and effect diagram		method data
Causal event pairing based on multi-domain text datasets		link

Text Generation

Resource Name	Description	Link
Texar	Toolkit for Text Generation and Beyond	github
Professor Ehud Reiter's blog	Highly recommended by Professor Wan Xiaojun of Peking University. The blog offers in-depth discussion of and reflection on NLG technology, evaluation, and application.	link
A large list of resources related to text generation		github
Open Domain Dialogue Generation and Its Practice in Microsoft XiaoIce	Natural language generation allows machines to master the ability to automatically create	link
Text generation control		github
A large list of resources related to natural language generation		github
Evaluating Natural Language Generation with BLEURT		link
Automatic couplet data and robots		Codelink 700,000 couplet data
Automatically generate comments	Generate comments based on Hacker News article titles using the Transformer encoder-decoder model	github
Generating SQL statements from natural language (English)		github
Natural Language Generation Resources		github
Chinese Generation Task Benchmark Evaluation		github
Specific topic text generation/text augmentation based on GPT2		github
Encoding, marking and implementing a controllable and efficient text generation method		github
TextFooler: Adversarial text generation module for text classification/reasoning		github
SimBERT	A BERT model based on the UniLM approach that integrates retrieval and generation.	github
New word generation and sentence making	Non-existent words are generated from scratch using GPT-2 variants along with their definitions and examples	github
Automatically generate multiple-choice questions from text		github
Synthetic Data Generation Benchmark		github

Text Summarization

Resource Name	Description	Link
Chinese text summarization/keyword extraction		github
Automatic resume summarization based on named entity recognition		github
Text automatic summarization library TextTeaser	English only	github
Extractive summarization based on the latest language models such as BERT		github
A Comprehensive Guide to Text Summarization with Deep Learning in Python		link
(Colab) Abstractive Text Summarization Implementation Collection (Tutorial)		github

Smart Question and Answer

Resource Name	Description	Link
Chinese chatbot	Train the chatbot you want based on your own corpus, which can be used in scenarios such as intelligent customer service, online Q&A, and intelligent chat.	github
A fun chatbot, qingyun	Chinese chatbot trained on the Qingyun corpus	github
Open conversational robots, knowledge graphs, semantic understanding, natural language processing tools and data		github
QA robot	Amodel-for-Retrivalchatbot: customer service robot, Chinese retrieval chatbot	git
ConvLab open source multi-domain end-to-end dialogue system platform		github
A dialogue system built on the latest version of rasa		github
A chatbot based on the finance-judicial field (also with the nature of small talk)		github
End-to-end closed domain dialogue system		github
MiningZhiDaoQACorpus	5.8 million Baidu Zhidao Q&A data mining project, Baidu Zhidao Q&A corpus, including more than 5.8 million questions, each with a question label. Based on this Q&A corpus, it can support a variety of applications, such as logic mining	github
GPT2 model for Chinese small talk GPT2-chitchat		github
A list of resources (leaderboards, datasets, papers) on multi-turn response selection for retrieval-based chatbots		github
Microsoft Conversational Bot Framework		github
chatbot-list	Industry-wide sharing and introduction of intelligent customer service, chatbot applications, architecture, and algorithms	github
Chinese medical dialogue dataset		github
A large-scale medical conversation dataset	Contains 1.1 million medical consultations and 4 million doctor-patient conversations	github
CrossWOZ: A large-scale cross-domain Chinese task-oriented multi-turn dialogue dataset and model		paper & data
Open source conversational information search platform		github
DSTC9 2020		github
T5 question paraphrasing model trained on Quora question pairs		github
Google releases Taskmaster-2 natural language task dialogue dataset		github
Haystack is a flexible, powerful and scalable question answering (QA) framework		github
End-to-end closed domain dialogue system		github
Amazon releases knowledge-based human-human open-domain conversation dataset		github
Albert Large QA model trained based on Baidu webqa and dureader datasets		github
CommonsenseQA: Common sense English QA challenge		link
MedQuAD (English) medical question answering dataset		github
A question-answering engine based on Albert and Electra, using Wikipedia text as context		github
A question-answering attempt based on a 140,000 song knowledge base	Features include lyrics chain, finding songs with known lyrics, and questions and answers about the triangle relationship between songs, singers, and lyrics.	github

Text Correction

Resource Name	Description	Link
Chinese text error correction module code		github
English spelling check library		github
Python spell checking library		github
GitHub Typo Corpus Large-scale GitHub multi-language spelling error/grammar error dataset		github
BertPunc is a state-of-the-art punctuation repair model based on BERT		github
Chinese Writing Proofreading Tools		github
Text Correction Reference List	Chinese Spell Checking (CSC) and Grammatical Error Correction (GEC)	github
The champion solution of the Text Intelligent Proofreading Competition	Already put into practical use, from the Soochow University and DAMO Academy team	link

Multimodality

Resource Name	Description	Link
Chinese multimodal dataset "Wukong"	A large-scale dataset open-sourced by Huawei Noah's Ark Lab, containing 100 million image-text pairs	github
Chinese-CLIP: A pre-trained model for Chinese text and image representation	Chinese version of the CLIP pre-trained model, open-sourced at multiple model scales; Chinese image-text feature extraction and image-text retrieval take only a few lines of code	github

Speech Processing

Resource Name	Description	Link
ASR speech dataset + Chinese speech recognition system based on deep learning		github
Tsinghua University THCHS30 Chinese speech dataset		data_thchs30.tgz-OpenSLR domestic mirror, data_thchs30.tgz; test-noise.tgz-OpenSLR domestic mirror, test-noise.tgz; resource.tgz-OpenSLR domestic mirror, resource.tgz; Free ST Chinese Mandarin Corpus, Free ST Chinese Mandarin Corpus; AIShell-1 open source dataset-OpenSLR domestic mirror, AIShell-1 open source dataset; Primewords Chinese Corpus Set 1-OpenSLR domestic mirror, Primewords Chinese Corpus Set 1
Laughter Detector		github
New version of Common Voice speech recognition dataset	Includes over 1,400 hours of speech samples from 42,000 contributors	link
speech-aligner	A tool for generating phoneme-level time-aligned annotations from speech audio and its corresponding transcript	github
ASR pronunciation dictionary/lexicon		github
Speech Sentiment Analysis		github
masr	Chinese speech recognition, providing pre-trained models and high recognition rate	github
Chinese Text Normalization for Speech Recognition		github
Speech quality evaluation indicators (MOSNet, BSSEval, STOI, PESQ, SRMR)		github
Chinese/English pronunciation dictionary for speech recognition		github
CoVoST: a multilingual speech-to-text translation corpus released by Facebook	Includes audio, text transcription and English translation in 11 languages (French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian and Chinese)	github
Parakeet Text-to-speech Synthesis based on PaddlePaddle		github
(Java) Accurate Speech Natural Language Detection Library		github
Text-to-speech synthesis implemented in TensorFlow 2		github
Python audio feature extraction package		github
ViSQOL is an objective, full-reference metric for perceived audio quality, with two modes: audio and speech		github
zhrtvc	Easy-to-use Chinese voice cloning and Chinese speech synthesis system	github
aukit	A useful speech processing toolbox, including speech noise reduction, audio format conversion, feature spectrum generation and other modules	github
phkit	A useful phoneme processing toolbox, including Chinese phonemes, English phonemes, text-to-pinyin, text regularization and other modules	github
zhvoice	Chinese speech corpus, with clearer and more natural speech, including 8 open source data sets, 3,200 speakers, 900 hours of speech, and 13 million words	github
Audio for speech behavior detection, binarization, speaker recognition, automatic speech recognition, emotion recognition and other tasks		github
Deep Learning Emotional Text-to-Speech Synthesis		github
Python Audio Data Augmentation Library		github
Audio Enhancement Based on Large-Scale Audio Dataset		github
Voice transfer		github

Document Processing

Resource Name	Description	Link
LayoutLM-v3 document understanding model		github
PyLaia is a deep learning toolkit for handwritten document analysis		github
Single document unsupervised keyword extraction		github
DocSearch Free Document Search Engine		github
fdfgen	Able to automatically create PDF documents and fill in information	link
pdfx	Automatically extract cited references and download the corresponding pdf files	link
invoice2data	Invoice pdf information extraction	invoice2data
PDF document information extraction		github
PDFMiner	PDFMiner can get the exact location of text on a page, as well as other information such as fonts or lines. It includes a PDF converter that can convert PDF files into other text formats (such as HTML), and an extensible PDF parser that can be used for purposes other than text analysis.	link
PyPDF2	PyPDF2 is a Python PDF library that can split, merge, crop, and transform the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. It can retrieve text and metadata from PDFs, and can also merge entire files together.	link
ReportLab	ReportLab is a fast way to create PDF documents: a time-proven, easy-to-use open source project for creating complex, data-driven PDF documents and custom vector graphics. It is free, open source, and written in Python. The package is downloaded more than 50,000 times a month, is included in standard Linux distributions, is embedded in many products, and was chosen to power Wikipedia's print/export functionality.	link
SIMPdf	A simple PDF file text editor written in Python	github
pdf-diff	PDF file diff tool can display the differences between two PDF documents	github

Table Processing

Resource Name	Description	Link
Use unet to automatically detect and rebuild document tables		github
pdftabextract	Used for table information analysis after OCR recognition, very powerful	link
tabula-py	Converts tables in PDFs directly into pandas DataFrames; Java and Python versions of the code are available
Camelot	PDF table analysis	link
pdfplumber	PDF table analysis
PubLayNet	Able to divide paragraphs, recognize tables and pictures	link
Extracting tabular data from papers		github
Finding answers in tables with BERT		github
Series of articles on table question answering		Introduction, Model, Final Chapter
Generate tabular data using GAN (English only)		github
carefree-learn(PyTorch)	Tabular Dataset Automated Machine Learning (AutoML) Package	github
Closed-domain fine-tuning for table detection		github
PDF table data extraction tool		github
TaBERT: A new model for understanding queries on tabular data		paper
Table processing	Awesome-Table-Recognition	github

Text Matching

Resource Name	Description	Link
Sentence, QA Similarity Matching MatchZoo	A collection of text similarity matching algorithms, including multiple deep learning methods, which are worth trying.	github
Chinese Question Sentence Similarity Calculation Competition and Solution Summary		github
Similarity calculation toolkit	Written in java, it is used for similarity calculations related to words, phrases, sentences, lexical analysis, sentiment analysis, semantic analysis, etc.	github
Chinese word similarity calculation method	It combines the word similarity calculation methods of the extended version of Synonymous Cilin and HowNet, with wider vocabulary coverage and more accurate results.	github
Python string similarity algorithm library		github
Sentence similarity model based on a Siamese BiLSTM, with training and test datasets	Provides 100,000 training samples	github

Text Data Augmentation

Resource Name	Description	Link
Chinese NLP data augmentation (EDA) tool		github
English NLP data augmentation tools		github
One-click Chinese data augmentation tool		github
The application and effect of data augmentation in machine translation and other NLP tasks		link
NLP Data Augmentation Resource Set		github
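
EDA (Easy Data Augmentation) is usually described as four token-level operations: synonym replacement, random insertion, random swap, and random deletion. The minimal sketch below illustrates only the last two, since they need no synonym dictionary. It is not code from any of the tools listed above; the function names, parameters, and sample sentence are assumptions made for this example, and the input is assumed to be already segmented into tokens (for Chinese, for example with a word segmenter).

```python
import random


def random_swap(tokens, n_swaps=1, rng=None):
    """Return a copy of `tokens` with `n_swaps` random pairs exchanged."""
    if rng is None:
        rng = random.Random()
    tokens = list(tokens)  # work on a copy, never mutate the caller's list
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)  # two distinct positions
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p=0.1, rng=None):
    """Drop each token independently with probability `p`.

    At least one token is always kept so the sample never becomes empty.
    """
    if rng is None:
        rng = random.Random()
    tokens = list(tokens)
    if len(tokens) <= 1:
        return tokens
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


if __name__ == "__main__":
    # Hypothetical pre-segmented sentence; a fixed seed makes runs repeatable.
    sentence = ["我", "喜欢", "自然", "语言", "处理"]
    rng = random.Random(42)
    print(random_swap(sentence, n_swaps=2, rng=rng))
    print(random_deletion(sentence, p=0.3, rng=rng))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which helps when comparing models trained on augmented data.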

Common regular expressions

Resource Name	Description	Link
Regular expression for extracting email		Integrated into the Python package cocoNLP; feel free to try it.
Extract phone_number		Integrated into the Python package cocoNLP; feel free to try it.
Regular expression for extracting ID number	IDCards_pattern = r'^([1-9]\d{5}[12]\d{3}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX])$'; IDs = re.findall(IDCards_pattern, text, flags=0)
IP address regular expression	(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)
Tencent QQ number regular expression	[1-9]([0-9]{5,11})
Mainland China landline number regular expression	[0-9\-()（）]{7,18}
Username regular expression	[A-Za-z0-9_\-\u4e00-\u9fa5]+
Regular expressions for mainland China mobile numbers (three major carriers, virtual operators, etc.)		github
Regular Expression Tutorial		github
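
As a rough illustration of how the patterns in this table might be used, the following minimal sketch applies cleaned-up versions of the IP address, ID number, and QQ number expressions with Python's standard `re` module. The function names, the boundary lookarounds added for searching inside longer text, and the sample strings are assumptions made for this example; they are not taken from cocoNLP.

```python
import re

# One IPv4 octet (0-255), as in the IP address row of the table.
OCTET = r"(?:25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)"
# Four octets joined by literal dots; \b keeps matches off longer digit runs.
IP_PATTERN = re.compile(r"\b" + r"\.".join([OCTET] * 4) + r"\b")

# 18-character ID number: region, birth date (YYYYMMDD), sequence, check char.
# The lookarounds replace ^...$ so IDs can be found inside longer text.
ID_CARD_PATTERN = re.compile(
    r"(?<!\d)[1-9]\d{5}[12]\d{3}(?:0[1-9]|1[012])"
    r"(?:0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX](?![0-9xX])"
)

# QQ number: 6 to 12 digits, not starting with 0.
QQ_PATTERN = re.compile(r"(?<!\d)[1-9][0-9]{5,11}(?!\d)")


def extract_ips(text):
    """Return all IPv4-looking substrings in `text`."""
    return [m.group(0) for m in IP_PATTERN.finditer(text)]


def extract_id_cards(text):
    """Return all 18-character ID numbers in `text` (format check only)."""
    return ID_CARD_PATTERN.findall(text)


def extract_qq_numbers(text):
    """Return all QQ-number-like digit runs in `text`."""
    return QQ_PATTERN.findall(text)


if __name__ == "__main__":
    sample = "ping 192.168.1.10 and 8.8.8.8; QQ 12345678"
    print(extract_ips(sample))
    print(extract_qq_numbers("QQ 12345678"))
```

These patterns only check the shape of a string. They do not verify, for instance, the checksum character of an ID number, so a match should be treated as a candidate rather than a validated value.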

Text Retrieval

Resource Name	Description	Link
Efficient fuzzy search tool		github
A large list/search engine of BERT models for various languages/tasks		link
Deepmatch is a deep matching model library for recommendation, advertising and search		github
wwsearch is a full-text search engine developed by WeChat for Enterprise		github
aili - the fastest in-memory index in the East		github
Efficient string matching tool RapidFuzz	A fast string matching library for Python and C++ that uses the string similarity calculations from FuzzyWuzzy	github

Reading comprehension

Resource Name	Description	Link
AllenNLP reading comprehension supports a variety of data and models		github

Sentiment Analysis

Resource Name	Description	Link
Aspect Sentiment Analysis Package		github
awesome-nlp-sentiment-analysis	Sentiment analysis, emotion cause identification, evaluation object and evaluation word extraction	github
Sentiment analysis technology enables intelligent customer service to better understand human emotions		github

Event Extraction

Resource Name	Description	Link
Chinese event extraction		github
List of literature resources on NLP event extraction		github
BERT Event Extraction (ACE 2005 corpus) implemented in PyTorch		github
News event clue extraction		github

Machine Translation

Resource Name	Description	Link
Wudao Dictionary	A command line version of Youdao Dictionary, supporting English-Chinese lookup in both directions and online queries	github
NLLB	NLLB language model that supports translation between 200+ languages	link
Easy-Translate	Script for translating large text files locally, based on Facebook/Meta AI's M2M100 model and NLLB200 model, supporting 200+ languages	github

Number Conversion

Resource Name	Description	Link
The best tool for converting between Chinese numerals and Arabic numerals		github
Quickly convert between Chinese numerals and Arabic numerals		github
Parse natural language numeric strings into integers and floating point numbers		github
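
To make the task concrete, here is a minimal sketch of converting a Chinese numeral string into an integer. It is not the algorithm of any tool listed above; the function name `chinese_to_int` and the supported character set are assumptions made for this example. It handles plain integers written with 十, 百, 千, 万 and 亿, and does not cover decimals, financial (capital) numerals, or colloquial abbreviations such as 一百一.

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}


def chinese_to_int(text):
    """Convert a Chinese numeral string such as "两千零二十四" to an int."""
    total = 0    # value of the sections already closed by 万 or 亿
    section = 0  # value of the current section (below 万)
    digit = 0    # last digit read, waiting for its unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare unit such as the 十 in "十五" means one times that unit.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch == "万":
            total += (section + digit) * 10**4
            section, digit = 0, 0
        elif ch == "亿":
            # 亿 scales everything read so far, e.g. "一万亿".
            total = (total + section + digit) * 10**8
            section, digit = 0, 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


if __name__ == "__main__":
    for s in ["十五", "一百零五", "两千零二十四", "三千五百万零二十"]:
        print(s, chinese_to_int(s))
```

The section-based approach mirrors how the numerals are read aloud: digits and small units build up a value below 10,000, and 万 or 亿 then scale that section.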

Coreference Resolution

Resource Name	Description	Link
Chinese coreference resolution data		github, baidu link (code: a0qq)

Text Clustering

Resource Name	Description	Link
TextCluster: short text clustering and preprocessing module		github

Text Classification

Resource Name	Description	Link
NeuralNLP-NeuralClassifier Tencent open source deep learning text classification tool		github

Knowledge Reasoning

Resource Name	Description	Link
GraphbrainAI is an open source software library and research tool that aims to facilitate automatic meaning extraction and text understanding as well as knowledge exploration and inference.		github
(Harvard) Free book on causal inference		pdf

Explainable Natural Language Processing

Resource Name	Description	Link
A library of state-of-the-art interpreters for textual machine learning models		github

Text Attack

Resource Name	Description	Link
TextAttack: A framework for adversarial attacks on natural language processing models		github
OpenBackdoor: Text backdoor attack and defense toolkit	OpenBackdoor is developed based on Python and PyTorch, which can be used to reproduce, evaluate and develop algorithms related to text backdoor attack and defense	github

Text Visualization

Resource Name	Description	Link
Scattertext text visualization (python)		github
whatlies: interactive visualization of word vectors		spacy tools
PySS3: visualization tool for the SS3 text classifier, aimed at explainable AI		github
Rendering 3D images with Notepad		github
attnvis: interactive attention visualization for transformer language models such as GPT2 and BERT		github
Texthero text data efficient processing package	Including preprocessing, keyword extraction, named entity recognition, vector space analysis, text visualization, etc.	github

Text Annotation Tools

Resource Name	Description	Link
A review of NLP annotation platforms		github
brat rapid annotation tool: a sequence annotation tool		link
Poplar web version natural language annotation tool		github
LIDA lightweight interactive dialogue annotation tool		github
doccano is an open source collaborative multilingual text annotation tool based on the web		github
Datasaurai online data annotation workflow management tool		link

Language detection

Resource Name	Description	Link
langid	97 languages detected	https://github.com/saffsd/langid.py
langdetect	Language Detection	https://code.google.com/archive/p/language-detection/

Comprehensive Tools

Resource Name	Description	Link
jieba		jieba
hanlp		hanlp
nlp4han	Chinese natural language processing toolset (sentence segmentation/word segmentation/part-of-speech tagging/chunking/syntactic analysis/semantic analysis/NER/N-gram/HMM/pronoun resolution/sentiment analysis/spell checking)	github
Progress in Hate Speech Detection		link
Bert application based on Pytorch	Including named entity recognition, sentiment analysis, text classification, and text similarity	github
Some basic models about natural language		github
Template code for sequence labeling and text classification using BERT		github
jieba_fast accelerated version of jieba		github
StanfordNLP	Pure Python version of natural language processing package	link
Python Spoken Natural Language Processing Toolkit (English)		github
PreNLP natural language preprocessing library		github
Some papers and codes related to nlp	Including topic model, word embedding, named entity recognition (NER), text classification, text generation, text similarity calculation, etc., involving various NLP-related algorithms, based on keras and tensorflow	github
Python Text Mining/NLP Practical Examples		github
Forte is a flexible and powerful natural language processing pipeline toolkit		github
stanza Stanford team NLP tool	Can handle more than 60 languages	github
Fancy-NLP is a text knowledge mining tool for building product portraits		github
A comprehensive and easy-to-use Chinese NLP toolkit		github
Reproduction of the DSSM-based vector recall pipeline commonly used in industry		github
Texthero text data efficient processing package	Including preprocessing, keyword extraction, named entity recognition, vector space analysis, text visualization, etc.	github
NLPGNN graph neural network natural language processing toolbox		github
Macadam	A natural language processing toolkit based on Tensorflow (Keras) and bert4keras, focusing on text classification, sequence labeling and relation extraction	github
LineFlow is an efficient NLP data loader for all deep learning frameworks		github
Arabica: Python text data exploratory analysis toolkit		github
Python stress testing tool: SMSBoom		github

Funny tools

Resource Name	Description	Link
Wang Feng Lyrics Generator		phunterlau/wangfeng-rnn
Analysis of Girlfriend's Emotional Fluctuations		github
NLP is too difficult series		github
Handy variable naming tool		github, link
Image text removal, can be used for comic translation		github
CoupletAI - Couplet Generation	Automatic couplet system based on CNN+Bi-LSTM+Attention	github
Solving complex mathematical equations using neural network symbolic reasoning		github
Question-answering robot based on a knowledge base of 140,000 songs	Features include a lyrics chain game, finding songs from known lyrics, and Q&A over the three-way relationship between songs, singers, and lyrics	github
COPE - Metrical Poetry Editing Program		github
Paper2GUI	An AI desktop APP toolbox for ordinary people, which can be used immediately after installation. It supports 18+ AI models, covering speech synthesis, video frame interpolation, video super-resolution, object detection, image stylization, OCR recognition and other fields.	github
Politeness Estimator (Trained using Sina Weibo data)		github paper
Getting Started with Python	Chinese programming language	homepage gitee

Courses, Reports, Interviews, etc.

Resource Name	Description	Link
Natural Language Processing Report		link
Knowledge Graph Report		link
Data mining report		link
Autonomous Driving Report		link
Machine Translation Report		link
Blockchain Report		link
Robot Report		link
Computer Graphics Report		link
3D Printing Report		link
Face Recognition Report		link
Artificial Intelligence Chip Report		link
CS224N Deep Learning Natural Language Processing Course		link; PyTorch implementation of the models in the course: link
A hands-on tutorial on natural language processing for deep learning researchers		github
"Natural Language Processing" by Jacob Eisenstein		github
ML-NLP	Knowledge points and code implementations commonly tested in machine learning and NLP interviews	github
NLP task example project code set		github
Review of NLP highlights in 2019		download
nlp-recipes Microsoft produced - Natural Language Processing Best Practices and Examples		github
Transfer Learning in Natural Language Processing (NLP)		youtube
Machine Learning Systems Book		link github

Contest

Resource Name	Description	Link
NLPer-Arsenal	NLP competition, including current competition information, past competition plans, etc., continuously updated	github
Review the top solutions of all NLP competitions		github
Baidu's 2019 Triple Extraction Competition, "Science Space Team" source code (7th place)		github

Financial Natural Language Processing

Resource Name	Description	Link
BDCI2019 Financial Negative Information Determination		github
Open source financial investment data extraction tool		github
A large list of natural language processing research resources in the financial field		github
A chatbot for the financial and judicial domains (also supports small talk)		github
Demonstration of the process of constructing a small financial knowledge graph		github

Medical Natural Language Processing

Resource Name	Description	Link
Chinese Medical NLP Public Resources		github
spaCy Medical Text Mining and Information Extraction		github
Building a model for medical entity recognition	Contains dictionary and corpus annotation, based on Python	github
Question answering system based on medical knowledge graph		github (this repo refers to github)
Chinese medical dialogue dataset		github
A large-scale medical conversation dataset	Contains 1.1 million medical consultations and 4 million doctor-patient conversations	github
COVID-19 related data	Chinese medical dialogue dataset of COVID-19 and other types of pneumonia; open data sources from Tsinghua University and other institutions (COVID-19)	github github

Legal Natural Language Processing

Resource Name	Description	Link
Blackstone: a spaCy pipeline and NLP model for unstructured legal text		github
Legal Intelligence Literature Resource List		github
A chatbot for the financial and judicial domains (also supports small talk)		github
Crime legal terms and classification model	Contains 856 crime knowledge graphs, crime prediction based on a 2.8 million crime training database, 13 types of question classification and legal information Q&A function based on 200,000 legal Q&A pairs	github
A large list of legal NLP related resources		github

Text to Image

Resource Name	Description	Link
Dalle-mini	A mini version of DALL·E that generates images based on text prompts	github

Other

Resource Name	Description	Link
phone	Mainland China mobile phone number location lookup	ls0f/phone
phone	International mobile and landline number location lookup	AfterShip/phone
ngender	Determine gender based on name	observerss/ngender
An overview of the differences between Chinese and English natural language processing (NLP)		link
Technical documents PDF or PPT shared by experts in major companies		github
comparxiv is a command for comparing the differences between two submitted versions on arXiv		pypi
Meta-architecture of CHAMELEON deep learning news recommendation system		github
Automatic resume screening system		github
Multiple text readability evaluation indicators implemented in Python		github
