NLP: Paradise for migrant workers

The Most Powerful NLP-Weapon Arsenal

NLP Migrant Workers' Paradise: Almost the Most Complete Chinese NLP Resource Library

While getting started with NLP, I used many packages from GitHub, so I organized them and am sharing them here.

Many of the packages are fun and worth collecting, if only to satisfy the urge to collect! If you find them useful, please share and star :star:, thank you!

This list is updated irregularly over the long term. You are welcome to watch and fork! ❤️❤️❤️

🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥

🍆 🍒 🍐 🍊 🌻 🍓 🍈 🍅 🍍

Evaluation and comparison of ChatGPT-like models

Resource Name Description Link
ChatALL: chat with multiple AI bots at the same time (including products from Tsinghua University and iFlytek) A tool that sends a prompt to multiple AI chatbots in parallel (such as ChatGPT, Bing Chat, Bard, Alpaca, Vicuna, Claude, ChatGLM, MOSS, iFlytek Spark, ERNIE, etc.) to help users find the best answer github-ChatALL
Chatbot Arena Benchmarking LLMs with Elo ratings in real-world scenarios. Chatbot Arena is a benchmark platform for large language models (LLMs) that runs anonymous, randomized pairwise battles between models and ranks them with the Elo rating system widely used in competitive games such as chess. Elo ratings for 9 popular open source LLMs were released and a leaderboard was launched. The platform uses the FastChat multi-model serving system to provide an interactive interface in multiple languages, and the data comes from user votes. The authors summarize the advantages of Chatbot Arena and plan to provide better sampling algorithms, rankings, and serving systems As of May 3, 2023
ChatGPT-like model evaluation summary Large language models (LLMs) have received widespread attention. These powerful models can understand complex information and provide human-like responses to a variety of questions. Among them, GPT-3 and GPT-4 performed best, and Flan-t5 and Lit-LLaMA also performed well. However, please note that commercial use of models may require payment and data sharing blog
A review of Large Language Models (LLMs) blog
Latest Research on Large Model Evaluation Long-text modeling has always been one of ChatGPT's most impressive capabilities. The authors use paragraph translation as an experimental scenario for a comprehensive, fine-grained test of large models' paragraph modeling capabilities. paper
Chinese large model evaluation tools & rankings C-Eval is a comprehensive Chinese evaluation suite for foundation models. It contains 13,948 multiple-choice questions covering 52 different subjects and four difficulty levels. See the project website or paper for more details. github paper
OpenCompass Large Model Evaluation OpenCompass is an open-source, efficient, and comprehensive large-model evaluation system and open platform developed by Shanghai Artificial Intelligence Laboratory. It provides a complete, open-source, and reproducible evaluation framework, and supports one-stop evaluation of large language models, multimodal models, and other models. Using distributed evaluation, even models with hundreds of billions of parameters can be evaluated within a few hours. Based on multiple widely recognized datasets across different dimensions, it provides a variety of evaluation methods, including zero-shot, few-shot, and chain-of-thought evaluation, to quantify the model's capabilities in each dimension. github website
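
The Elo-based ranking mentioned in the Chatbot Arena entry can be sketched as follows. This is a minimal illustration of the standard Elo update, not Chatbot Arena's actual code: the K-factor of 32, the starting rating of 1000, and the model names are assumptions chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one battle.

    score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    k=32 is an assumed K-factor, not a value taken from the source.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def rank_models(battles, initial: float = 1000.0, k: float = 32.0):
    """Replay (model_a, model_b, winner) votes and return models sorted by rating.

    winner is "a", "b", or "tie". The initial rating of 1000 is an assumption.
    """
    ratings = {}
    for model_a, model_b, winner in battles:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical votes between placeholder model names.
    votes = [("model-x", "model-y", "a"), ("model-y", "model-z", "tie")]
    for name, rating in rank_models(votes):
        print(f"{name}: {rating:.1f}")
```

Because each battle moves the two ratings by equal and opposite amounts, the total rating in the pool stays constant; only the relative order changes as votes accumulate.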

ChatGPT-like information

Resource Name Description Link
Open LLMs: Open Large Language Models (LLMs) for commercial use A list of open LLMs available for commercial use github
LLM Zoo: A marketplace for data, models, and benchmarks for large language models LLM Zoo: democratizing ChatGPT - a project that provides data, models, and evaluation benchmark for large language models github
Large Language Model (LLM) Data Collection List of related papers, including research work on guidance, reasoning, decision making, continuous improvement, and self-improvement LLM information collection
DecryptPrompt A summary of prompt and LLM papers, open source data and models, and AIGC applications github
SmartGPT Designed to provide large language models (especially GPT-3.5 and GPT-4) with the ability to complete complex tasks by breaking them down into smaller problems and using the Internet and other external sources to collect information. Features include modular design, easy configuration, and high support for plug-ins. SmartGPT operates based on the concept of "Autos", including two types, "Runner" and "Assistant", both equipped with LLM agents that handle planning, reasoning, and task execution. In addition, SmartGPT also has a memory management system, as well as a plug-in system that can define various commands github-SmartGPT
OpenGPT A framework for creating instruction-based datasets and training conversational, domain-expert large language models (LLMs). It has been successfully applied to train the health care conversational model NHS-LLM, using data from the UK National Health Service (NHS) website to generate a large number of question-answer pairs and unique conversations. github-OpenGPT
PaLM 2 Technical Report Google has recently released PaLM 2, a new language model with better multilingual and reasoning capabilities that is also more compute-efficient than its predecessor, PaLM. PaLM 2 combines a number of research advances, including compute-optimal scaling of model and data size, more diverse and multilingual datasets, and more effective model architectures and objective functions. PaLM 2 achieves state-of-the-art performance on a variety of tasks and capabilities, including language proficiency tests, classification and question answering, reasoning, programming, translation, and natural language generation. PaLM 2 also demonstrates strong multilingual capabilities: it can handle hundreds of languages and translate and explain across them. The report also addresses responsible use, including controlling toxicity at inference time, reducing memorization, and assessing potential harm and bias. PaLM 2 Technical Report
DB-GPT An open source experimental project based on vicuna-13b and FastChat that uses langchain and llama-index for in-context learning and question answering. The project is deployed fully locally to ensure data privacy and security, and can connect directly to private databases to process private data. Its functions include SQL generation, SQL diagnosis, database knowledge question answering, etc. github-DB-GPT
A large list of Transformers related literature resources Contains a variety of Transformer models, such as BERT, GPT, Transformer-XL, etc. These models have been widely used in many natural language processing tasks. In addition, the list also provides relevant papers and code links for these models, providing a good reference resource for researchers and developers in the field of natural language processing. github
The Ultimate Guide to GPT-4 A guide to using GPT-3 and GPT-4, with more than 100 resources to help you learn to use them to be more productive. It covers ChatGPT basics, advanced ChatGPT usage, using GPT-3 for language learning and teaching, and using GPT-4. It also explains how to upgrade to the ChatGPT+ plan to use GPT-4 and how to use GPT-4 for free, and includes guides on using ChatGPT for business, productivity, benefits, money, etc. link
Efficient fine-tuning of LLM parameters based on LoRA link
Complex Reasoning: The North Star Capability of Large Language Models In the GPT-4 release blog, the authors wrote: "In a casual conversation, the difference between GPT-3.5 and GPT-4 may be subtle. When the complexity of the task reaches a sufficient threshold, the difference will become apparent." This means that complex tasks are likely to be the key differentiating factor between large and small language models. In this article, we will carefully analyze and discuss how to make large language models have powerful complex reasoning capabilities. blog
Are the emergent abilities of large language models a mirage? Emergence in large language models has always been regarded as a magical phenomenon, as if sheer scale works miracles, but this paper argues that it may just be an illusion. paper
Probabilistic Summary of Large Language Models Very detailed explanation and summary of LLM science paper
A brief history of the LLaMA model LLaMA is a language model released by Meta that uses the Transformer architecture and comes in multiple sizes, up to 65B parameters. Like GPT, it can be fine-tuned further and is suitable for a variety of tasks. Unlike GPT, LLaMA is open source and can be run locally. Existing LLaMA-based models include Alpaca, Vicuna, Koala, GPT4-x-Alpaca, and WizardLM. Each model has different training data and performance. blog
Complex Reasoning with Large Language Models This paper discusses how to train language models with powerful and complex reasoning capabilities, and explores how to effectively prompt the model to fully unleash its potential. In view of the similarities between language model and programming training, a three-stage training is proposed: continuous training, supervised fine-tuning, and reinforcement learning. A set of tasks for evaluating the reasoning capabilities of large language models is introduced. It also discusses how to perform prompt engineering to enable the model to achieve better learning results by providing various learning opportunities, ultimately achieving intelligence. link
Large language model evolution tree paper
Hung-yi Lee (Li Hongyi): How to replicate your own ChatGPT on a low budget with limited resources blog
Essential resources for training ChatGPT: A complete guide to corpus, models, and code libraries Resource link paper address
GitHub treasure library, which organizes various open source projects related to GPT github
ChatGPT Chinese Guide gitlab
The application, advantages, limitations and future development direction of ChatGPT in natural language processing are discussed. Ethical considerations and engineering tips when using this technology are highlighted. paper
List of literature resources related to large language models github
Literature Review on Large Language Models (Chinese Version) github
A large list of ChatGPT related resources github
Pre-Training to Learn in Context paper
Langchain Architecture Diagram image
Numbers every LLM developer should know github
How to build powerful complex reasoning capabilities using large language models blog
LLMs Nine-story Demon Tower Shares hands-on experience and lessons learned from working with LLMs (ChatGLM, Chinese-LLaMA-Alpaca, MiniGPT-4, FastChat, LLaMA, gpt4all, etc.) github

ChatGPT-like open source framework

Resource Name Description Link
LLM-As-Chatbot This project wraps the LLMs available on the market as chatbots that run directly on Google Colab, with no need to set anything up yourself. It is well suited to anyone who wants to try out LLMs, and in a quick test it was very simple to use. Some LLMs require more GPU memory, so a Colab Pro subscription is recommended. github
OpenBuddy A powerful open source multilingual chatbot model for global users, with a focus on conversational AI and fluent multilingual support, including English, Chinese and other languages. It is fine-tuned from Facebook's LLaMA model, with an expanded vocabulary, added common characters, and enhanced token embeddings. With these improvements and a multi-turn conversation dataset, OpenBuddy can answer questions and translate between various languages. OpenBuddy's mission is to provide a free, open and offline AI model that runs on users' devices regardless of their language or cultural background. Currently, a demo version of OpenBuddy-13B can be found on the Discord server. Its key features include multilingual conversational AI (including Chinese, English, Japanese, Korean, French, etc.), an enhanced vocabulary with support for common CJK characters, and two model versions: 7B and 13B github-OpenBuddy
Panda: Overseas Chinese open source large language model Based on Llama-7B, -13B, -33B, -65B, continuous pre-training in the Chinese domain, using nearly 15M data, and evaluating the reasoning ability on the Chinese benchmark github - PandaLM
Dromedary: An open source self-aligned language model that can be trained with minimal human supervision github-Dromedary
LaMini-LM A collection of small, efficient language models distilled from ChatGPT, trained on a large dataset of 2.58M instructions github
LLaMA-Adapter V2 LLaMA-Adapter V2 from Shanghai Artificial Intelligence Laboratory, with only 14M parameters injected, can be trained in 1 hour. The comparison results are really amazing, and it has multimodal functions (interpretation and question-answering of images) github
HuggingChat Hugging Face launched the first open source alternative to ChatGPT: HuggingChat. Based on the Open Assistant model, it accepts Chinese input and can write code, but does not yet reply in Chinese. The app is now online and can be accessed directly without a proxy. link
Open-Chinese-LLaMA A Chinese base large language model built on LLaMA-7B through incremental pre-training on Chinese datasets github
OpenLLaMA An open-source reproduction of the LLaMA model, trained on the RedPajama dataset, using the same preprocessing steps and hyperparameters, model structure, context length, training steps, learning rate schedule, and optimizer as LLaMA. PyTorch and Jax weights for OpenLLaMA are available on Huggingface Hub. OpenLLaMA shows similar performance to LLaMA and GPT-J in various tasks, and performs better in some tasks. github
replit-code-v1-3b Released under the CC BY-SA 4.0 license, which means commercial use is allowed link
MOSS MOSS is an open source conversational language model that supports Chinese and English and multiple plug-ins. The moss-moon series models have 16 billion parameters and can run on a single A100/A800 or two 3090 graphics cards at FP16 precision, and on a single 3090 graphics card at INT4/8 precision. The MOSS base language model is pre-trained on about 700 billion Chinese, English and code words, and is subsequently fine-tuned on conversational instructions, with plugin-augmented learning and human preference training, to support multi-turn conversations and the use of multiple plug-ins. github
RedPajama 1.2 Trillion Tokens Dataset link
chinese_llama_alpaca_lora extraction framework github
Scaling Transformer to 1M tokens and beyond with RMT The paper proposes a technique called RMT (Recurrent Memory Transformer), which may extend the Transformer's context length to 1 million tokens or more. github
Open Assistant Contains a large number of AI-generated and manually annotated corpora and a variety of models based on LLaMA and Pythia. The released dataset includes more than 161K high-quality, human assistant-type interactive dialogue corpora in up to 35 languages data model
ChatGLM Efficient Tuning Efficient ChatGLM fine-tuning based on PEFT github
Dolly Introduction news
Baize: An open source chat model with parameter-efficient tuning on self-chat data Baize is an open source chat model that can conduct multi-turn conversations. It was created by generating a high-quality multi-turn chat corpus from ChatGPT conversing with itself, then using parameter-efficient tuning to enhance LLaMA (an open source large language model). The Baize model shows good multi-turn conversation performance with minimal potential risks. It can run on a single GPU, making it accessible to a wider range of researchers. The Baize model and data are for research purposes only. Paper address Source code address
GPTrillion--No open source code found GPTrillion, a large model containing 1.5 trillion (1.5T) parameters, is now open source, claiming to be the world's largest open source LLM google_doc
Cerebras-GPT-13B (commercially available) hugging_face
Chinese-ChatLLaMA Chinese ChatLLaMA dialogue model with pre-training and instruction fine-tuning datasets; built on the TencentPretrain multimodal pre-training framework, it supports simplified and traditional Chinese, English, Japanese and other languages github
Lit-LLaMA A fully open source, independent LLaMA implementation under the Apache 2.0 license, built on nanoGPT. It aims to address the limitations of the original LLaMA code's GPL license to enable wider academic and commercial use github
MosaicML MPT-7B-StoryWriter, 65K tokens, you can throw the entire "The Great Gatsby" into it at once. huggingface
Langchain Large Language Models (LLMs) are becoming a transformative technology, enabling developers to build applications that were previously impossible. However, using these standalone LLMs alone is often not enough to create a truly powerful application - the real power comes from being able to combine them with other computational or knowledge sources. github
Guidance Guidance enables more effective and efficient control of modern language models than traditional prompting or chaining. Guidance programs let you interleave generation, prompting, and logic control in a single continuous flow that matches how language models actually process text. Simple output structures like "Chain of Thought" and its many variants (e.g. ART, Auto-CoT, etc.) have been shown to improve the performance of language models. The advent of more powerful language models (like GPT-4) has made richer structures possible, and Guidance makes it easier and cheaper to build them. github
WizardLM Gives large pre-trained language models the ability to follow complex instructions; the WizardLM-7B model is trained on the full set of evolved instructions (about 300k) github
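
The hardware figures quoted in the MOSS entry (16 billion parameters; one A100/A800 or two 3090s at FP16; one 3090 at INT4/8) follow from simple arithmetic on weight storage. The sketch below is a rough back-of-the-envelope check, not a deployment tool: it counts weights only (no activations, KV cache, or framework overhead), and the 24 GB figure for an RTX 3090 and the helper names are assumptions of this example.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def fits(num_params: float, precision: str, gpu_memory_gb: float, num_gpus: int = 1) -> bool:
    """True if the weights fit in the combined memory of the given GPUs.

    Ignores activations and KV cache, so real headroom is smaller.
    """
    return weight_memory_gb(num_params, precision) <= gpu_memory_gb * num_gpus


if __name__ == "__main__":
    moss_params = 16e9       # 16 billion parameters, as stated in the MOSS entry
    rtx_3090_gb = 24.0       # assumed memory of one RTX 3090
    for precision in ("fp16", "int8", "int4"):
        gb = weight_memory_gb(moss_params, precision)
        one = fits(moss_params, precision, rtx_3090_gb)
        two = fits(moss_params, precision, rtx_3090_gb, num_gpus=2)
        print(f"{precision}: {gb:.0f} GB, one 3090: {one}, two 3090s: {two}")
```

At FP16 the weights alone need about 32 GB, which exceeds one 24 GB card but fits across two; at INT8 (16 GB) or INT4 (8 GB) a single card is enough, consistent with the entry above.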

LLM training, inference, low-resource and efficient training

Resource Name Description Link
QLoRA--Guanaco An efficient fine-tuning method that can fine-tune a 65B-parameter model on a single 48GB GPU while preserving the task performance of full 16-bit fine-tuning. QLoRA back-propagates gradients through a frozen, 4-bit quantized pre-trained language model into low-rank adapters (LoRA) github
Chinese-Guanaco A Chinese low-resource quantized training/deployment solution github
DeepSpeed Chat: One-click RLHF training github
LLMTune: Fine-tuning a large 65B+ LLM on a consumer GPU Enables 4-bit fine-tuning on common consumer-grade GPUs, up to the largest 65B LLaMA model. LLMTune also implements the LoRA algorithm and the GPTQ algorithm to compress and quantize LLMs, and handles large models through data parallelism. In addition, LLMTune provides a command line interface and a Python library github
Fine-tuning based on ChatGLM-6B+LoRA on the instruction dataset Based on deepspeed, it supports multi-card fine-tuning, which is 8-9 times faster than single card. For detailed settings, see Fine-tuning 3. Lora fine-tuning based on DeepSpeed github
Microsoft releases DeepSpeed Chat, a RLHF training tool github
LlamaChat: A chatbot based on LLaMa on Mac github
ChatGPT/GPT4 open source "alternatives" github
Practical tips and tricks for training large machine learning models Helps you train large models (>1B parameters), avoid instabilities, and save failed experiments without restarting from scratch link
Instruction Tuning with GPT-4 paper
xturing A Python package for fine-tuning LLM models efficiently, quickly, and easily. It supports multiple models such as LLaMA, GPT-J, GPT-2, etc. It can be trained using single GPU and multi-GPU. It uses efficient fine-tuning techniques such as LoRA to reduce hardware costs by up to 90% and complete model training in a short time. github
GPT4All An open source project for running a GPT-style model locally on a MacBook. It is built on the LLaMa-7B large language model; the data, code and demo are all open source, and the conversation style is closer to that of an AI assistant github
Fine-tuning ChatGPT-like models with Alpaca-LoRA link
LMFlow A scalable, convenient and efficient toolbox for fine-tuning large machine learning models github
Wenda: Large language model calling platform Currently supports chatGLM-6B, chatRWKV, chatYuan, and chatPDF on top of the chatGLM-6B model (search over a self-built knowledge base) github
Micro Agent Small autonomous agent open source project, powered by LLM (OpenAI GPT-4), can write software for you, just set a "purpose" and let it work on its own github
Llama-X An open source academic research project that aims, through joint community effort, to gradually bring LLaMA up to the level of SOTA LLMs, avoid duplicated work, and deliver more improvements faster github
Chinese-LLaMA-Alpaca Chinese LLaMA & Alpaca LLMs - Open-source Chinese LLaMA model pre-trained with Chinese text data; open-source Chinese Alpaca model further fine-tuned with instructions; quickly deploy and experience the quantized version of the model locally using a laptop (personal PC) github
Efficient Alpaca An open source project built on LLaMA that aims to improve on Stanford Alpaca by fine-tuning the LLaMA-7B model so that it consumes fewer resources, runs inference faster, and is better suited to researchers github
ChatGLM-6B-Slim ChatGLM-6B with 20K image tokens removed, same performance, but smaller video memory usage github
Chinese-Vicuna A Chinese low-resource llama+lora solution github
Alpaca-LoRA Reproducing Stanford Alpaca's results on consumer hardware using LoRA github
LLM Accelerator Foundation models play an increasingly important role in many applications. Most large language models are trained autoregressively. Autoregressive generation gives good text quality, but it also leads to high inference cost and long latency. Given the huge parameter counts of large models, reducing cost and latency in large-scale deployment is a key issue. To address it, researchers at Microsoft Research Asia proposed LLM Accelerator, a method that uses reference text to losslessly accelerate large language model inference, achieving a two- to three-fold speedup in typical application scenarios. blog
Large Language Model (LLM) Fine-tuning Technical Notes github
PyLLMs A concise Python library for connecting to various LLMs (OpenAI, Anthropic, Google, AI21, Cohere, Aleph Alpha, HuggingfaceHub), with built-in model performance benchmarks. It is well suited to rapid prototyping and to evaluating different models. Features: connect to top LLMs with a small amount of code; response metadata (processed tokens, cost and latency) standardized across models; multi-model support, getting completions from different models at the same time; LLM benchmarks that evaluate the quality, speed and cost of models github
Accelerating Large Language Models with Mixed Precision By using low-precision floating-point operations, training and inference speed can be increased by up to 3 times without affecting model accuracy blog
Federated GPT: a new LLM training method Duke University and Microsoft jointly released Federated GPT, a new LLM training method. It distributes what was originally centralized training across different edge devices. After local training is complete, the sub-models are uploaded to a central server and merged. github
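
Several entries above (QLoRA, LLMTune, Alpaca-LoRA, xturing) rely on LoRA, which freezes the pre-trained weight matrix and trains only a low-rank update. The following is a minimal pure-Python sketch of that idea under stated assumptions: the class and function names are hypothetical, the initialization scale of 0.01 and default `alpha` are arbitrary choices for the example, and no quantization is performed (QLoRA additionally stores the frozen base weights in 4 bits, which is not shown here).

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters of a rank-r adapter: B is d_out x r, A is r x d_in."""
    return rank * (d_in + d_out)


class LoRALinear:
    """Frozen linear layer W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank: int, alpha: float = 1.0, seed: int = 0):
        rng = random.Random(seed)
        self.weight = weight                      # frozen, shape d_out x d_in
        d_out, d_in = len(weight), len(weight[0])
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the adapter is
        # initially a no-op and the layer reproduces the base model exactly.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)                 # W x (frozen path)
        delta = matvec(self.b, matvec(self.a, x))     # B (A x) (trainable path)
        return [h + self.scale * d for h, d in zip(base, delta)]


if __name__ == "__main__":
    layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], rank=1)
    print(layer.forward([1.0, 1.0]))              # equals the frozen layer's output
    full = 4096 * 4096
    adapter = lora_param_count(4096, 4096, rank=8)
    print(f"full: {full}, LoRA r=8: {adapter} ({100 * adapter / full:.2f}%)")
```

For a 4096 x 4096 layer, a rank-8 adapter trains 65,536 values instead of 16,777,216, which is why these methods fit on much smaller GPUs than full fine-tuning.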

Prompt Engineering

Resource Name Description Link
OpenBuprompt-engineering-note Prompt engineering notes (course summary): study notes on the ChatGPT prompt engineering course for developers, covering how language models work, prompt engineering practices, and how to apply language model APIs in applications for various tasks. The course covers summarizing, inferring, transforming, expanding, and building chatbots, and explains how to design good prompts and build custom chatbots. github - OpenBuprompt
Prompt Engineering Guide link
AIGC Prompt Engineering Learning Station Learn Prompt ChatGPT/Midjourney/Runway link
Prompts Featured - ChatGPT User Guide ChatGPT usage guide to improve the playability and usability of ChatGPT github
An unofficial list of resources for using ChatGPT. Aims to aggregate resources such as apps, web apps, browser extensions, CLI tools, bots, integrations, packages, articles, etc. that use ChatGPT github
Snack Prompt: a ChatGPT prompt-sharing community link
ChatGPT Questioning Tips How to ask ChatGPT questions to get high-quality answers: a complete guide to prompt engineering techniques github
Prompt-Engineering-Guide-Chinese Derived from the English Prompt-Engineering-Guide, with a section on AIGC prompts added github
OpenPrompt An open shared prompt community, everyone recommends useful prompts github
GPT-Prompts Teach you how to generate prompts with GPT github

ChatGPT-like document Q&A

Resource Name Description Link
privateGPT A privately deployable document question-answering platform based on GPT4All-J. It does not require an Internet connection and can 100% guarantee that the user's private data is not leaked. It provides an API that lets users run interactive question answering and text generation over their own documents. In addition, the platform supports custom training data and model parameters to meet individual needs. github-privateGPT
Auto-evaluator Automatic evaluation of document question answering github
PDF GPT An open source PDF document chat solution based on GPT, with the following main functions: one-on-one conversation with a PDF document; automatic content segmentation with embeddings generated by a powerful Deep Averaging Network encoder; semantic search over the PDF content, passing the most relevant embeddings to OpenAI; custom logic to generate more accurate responses, faster than OpenAI. github
Redis-LLM-Document-Chat Interacting with PDF Documents with LlamaIndex, Redis, and OpenAI, contains a Jupyter notebook that demonstrates how to use Redis as a vector database to store and retrieve document vectors. It also shows how to use LlamaIndex to perform semantic search in documents and how to leverage OpenAI to provide a chatbot-like experience. github
doc-chatbot A document chatbot implemented by GPT-4 + Pinecone + LangChain + MongoDB, which can chat with multiple files, multiple topics and multiple windows, and the chat history is saved by MongoDB github
document.ai A universal local knowledge base solution based on vector database and GPT3.5 github
DocsGPT DocsGPT is a cutting-edge open source solution that simplifies the process of finding information in project documentation. By integrating a powerful GPT model, developers can easily ask questions about a project and get accurate answers. github
ChatGPT Retrieval Plugin The ChatGPT retrieval plugin repository provides a flexible solution for semantic search and retrieval of personal or organizational documents using natural language queries. github
LlamaIndex LlamaIndex (GPT Index) is a data framework for your LLM application. github
chatWeb ChatWeb can crawl any web page or PDF, DOCX, TXT file and extract the text, generate an embedded summary, and answer your questions based on the text content. It is based on the chatAPI and embeddingAPI of gpt3.5, as well as the vector database implementation. github
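
Several entries above describe the same pipeline: split a document into chunks, embed them, retrieve the chunks most similar to the question, and pass those to the language model. A minimal sketch of the retrieval half follows. The bag-of-words `embed` function is a deliberately simple stand-in for a neural embedding model, and every function name, the chunk size, and the prompt wording are assumptions of this example rather than the API of any listed project.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 50):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an embedding model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_chunks(question: str, chunks, k: int = 2):
    """Return the k chunks most similar to the question, best first."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question: str, chunks, k: int = 2) -> str:
    """Assemble a prompt containing only the retrieved context (wording is illustrative)."""
    context = "\n---\n".join(top_k_chunks(question, chunks, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


if __name__ == "__main__":
    docs = [
        "Redis stores document vectors for retrieval.",
        "The cat sat on the mat.",
        "LoRA adds low-rank adapters.",
    ]
    print(build_prompt("Where are document vectors stored?", docs, k=1))
```

In a real deployment the chunk vectors would be computed once and stored in a vector database, so that only the question needs to be embedded at query time.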

ChatGPT-like industry applications

Resource Name Description Link
Sentiment analysis of news reports Using ChatGPT to perform sentiment analysis on news reports of listed companies, a 500% return was generated in the stock market (trading options) within 15 months (tested on historical data) - The potential of ChatGPT in predicting stock market returns using sentiment analysis of news headlines was explored. It was found that ChatGPT's sentiment analysis capabilities exceeded traditional methods and were positively correlated with stock market returns. It was proposed that ChatGPT has great value in the field of finance and economics, and some insights and suggestions were made for future research and application paper
Programming language generation model StarCoder BigCode is a collaboration between ServiceNow Inc. and Hugging Face Inc. StarCoder comes in multiple versions. The core version, StarCoderBase, has 15.5 billion parameters, supports more than 80 programming languages, and has an 8,192-token context window. A video demonstrates its VS Code plugin. github
CodeGen2: Lessons for Training LLMs on Programming and Natural Languages code generation paper
MedicalGPT-zh: Chinese Medical General Language Model The Chinese medical universal language model is based on the medical consensus and clinical guidelines of 28 departments to improve the model's medical field knowledge and dialogue capabilities github
MagicSlides AI self-made PPT is what many people dream of. The free version can make 3 PPTs per month and supports 2,500 words of input. link
SalesGPT Use LLM to implement a context-aware sales assistant that automates sales development rep activities, such as outbound sales calls github
HuaTuo: LLaMA fine-tuning model based on Chinese medical knowledge github
ai-code-translator Translates code from one programming language to another, something ChatGPT is really good at, especially GPT-4, which gives very high translation quality and handles longer inputs. github
ChatGenTitle A paper title generation model fine-tuned on the LLaMA model using information from millions of arXiv papers github
Regex.ai A WYSIWYG, AI-based regular expression automatic generation tool. Just select the data, it can help you write regular expressions and provide multiple ways to extract data. video
ChatDoctor A medical chat model fine-tuned from LLaMA on medical domain knowledge. The medical data covers about 700 diseases and about 5,000 doctor-patient conversation records. paper
CodeGPT The key to improving programming skills is data. CodeGPT is a code dialogue dataset for GPT generated by GPT. Now 32K Chinese data are publicly available, making the model better at programming github
LaWGPT A series of open source large language models based on Chinese legal knowledge github
LangChain-ChatGLM-Webui Inspired by langchain-ChatGLM, the WebUI made with LangChain and ChatGLM-6B series models provides large model applications based on local knowledge. Currently, it supports uploading text format files such as txt, docx, md, pdf, etc., and provides model files including ChatGLM-6B series, Belle series, and Embedding models such as GanymedeNil/text2vec-large-chinese, nghuyong/ernie-3.0-base-zh, nghuyong/ernie-3.0-nano-zh. github

ChatGPT-like course materials

Resource Name Description Link
Databricks (The author of the Dolly model) has released two free courses on edX, the second of which is about how the LLM is structured. link
Large Language Model Technology Sharing Series Natural Language Processing Laboratory, Northeastern University video
How does GPT-4 work? How can we use GPT-4 to build intelligent programs? Harvard University CS50 Open Course video
Prompt Engineering Best Practices: a summary of Andrew Ng's new prompt engineering course plus a summary of LangChain experience medium_blog
Fine-tuning the LLM model If you are interested in fine-tuning LLMs, be sure to follow this YouTuber, who has published fine-tuning methods for almost all LLMs on the market. YouTuber Sam Witteveen
Transformer Architecture Easy-to-understand introduction youtube1 youtube2 youtube3
Video on the Transformer multi-head mechanism If you want to understand every detail of the Transformer, including the mathematics behind it, watch this video; it gives a very detailed analysis. youtube
Introduction to Large Language Models Introduces the concepts, usage scenarios, prompt tuning, and Google's Gen AI development tools for large language models (LLMs).

Safety issues of LLM

Resource Name Description Link
Research on the Security of LLM Model link
Chatbot Injections & Exploit A collection of chatbot injection and exploit examples that helps people understand the potential vulnerabilities and weaknesses of chatbots. The injections and attacks covered include command injection, character encoding, social engineering, emojis, Unicode, etc. Some of the examples include a list of emojis that can be used to attack chatbots. github
GPTSecurity A community covering cutting-edge academic research and practical experience sharing, integrating knowledge on security applications such as Generative Pre-trained Transformer (GPT), Artificial Intelligence Generated Content (AIGC), and Large Language Model (LLM). Here you can find the latest research papers, blog posts, practical tools, and preset instructions (Prompts) on GPT/AIGC/LLM. github

Multimodal LLM

Resource Name Description Link
DeepFloyd IF An open source text-to-image model with a high degree of photorealism and language understanding. It consists of a frozen text encoder and three sequential pixel diffusion modules, and reports a zero-shot FID score of 6.66 on the COCO dataset, surpassing the state-of-the-art models at the time of release. github
Multi-modal GPT Trains a chatbot that can follow visual and language instructions at the same time. Built on the OpenFlamingo multimodal model, it uses various open datasets to create visual instruction data, and trains on visual and language instructions jointly to improve model performance. github
AudioGPT "AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head" by AIGC-Audio github
text2image-prompt-generator A small model trained with 250,000 Midjourney prompts based on GPT-2 can generate high-quality Midjourney prompts link data
Here are 6 free text-to-image services other than Midjourney: Bing Image Creator Playground AI DreamStudio Pixlr Leonardo AI Craiyon
BARK A very capable TTS (text-to-speech) project. You can add cue words such as "laugh" to the text; the cue is rendered as laughter and synthesized into the speech. It can also mix male and female voices, so no separate splicing step is needed. github
whisper Whisper is the best and fastest library I have ever used for speech-to-text (STT, also known as ASR). I didn't expect that such a fast model could be optimized 70x. I plan to deploy this model and make it available to everyone for transcription of large speech files and translation. This model is multilingual and can automatically identify the language, which is really powerful. github
OFA-Chinese: Chinese Multimodal Unified Pre-training Model Chinese OFA model with transformers structure github
Text-to-image open source model playground Generates images from input text with stable-diffusion 1.5, stable-diffusion 2.1, DALL-E, kandinsky-2, and other models, which is convenient for testing and comparison link
LLMScore LLMScore is a new framework that provides evaluation scores with multi-granular compositionality. It uses a large language model (LLM) to evaluate text-to-image generation models. First, the image is converted into image-level and object-level visual descriptions, and then the evaluation instructions are fed into the LLM to measure the alignment of the synthesized image with the text, and finally a score and explanation are generated. Our extensive analysis shows that LLMScore has the highest correlation with human judgment on a wide range of datasets, significantly outperforming the commonly used text-image matching metrics CLIP and BLIP. paper github
VisualGLM-6B VisualGLM-6B is an open source, multimodal conversational language model that supports images, Chinese, and English. The language model is based on ChatGLM-6B and has 6.2 billion parameters. The image part builds a bridge between the visual model and the language model by training BLIP2-Qformer. The overall model has a total of 7.8 billion parameters. github

LLM Dataset

Resource Name Description Link
Ambiguous Dataset Resolving ambiguity correctly is an important indicator of large language model quality, but there has been no standardized way to measure it. This paper proposes a dataset of 1,645 examples covering different types of ambiguity, together with a corresponding evaluation method. github paper
thu instruction training data We designed a process to automatically generate diverse, high-quality multi-turn instruction dialogue data (UltraChat), followed by careful manual post-processing. All English data is now open source, totaling more than 1.5 million records, making it one of the largest high-quality instruction datasets in the open source community. github
Multimodal dataset MMC4 580 million images, 100 million documents, 40 billion tokens github
EleutherAI Data An 800 GB text corpus, integrated and free to download. I don't know the quality of models trained on it, but I plan to try it: pile data paper
UltraChat Large-scale, information-rich, and diverse multi-turn conversation data github
ConvFinQA Financial Data Question Answering github
The botbots dataset A dataset containing conversations from two ChatGPT instances (gpt-3.5-turbo), CLT commands and dialogue prompts from GPT-4, covering a variety of contexts and tasks, with a generation cost of about $35, which can be used for research and training smaller dialogue models (such as Alpaca) github
alpaca_chinese_dataset - A manually tuned Chinese conversation dataset github
CodeGPT-data Data is the key to improving a model's programming ability. CodeGPT is a code dialogue dataset generated by GPT for training GPT-style models. 32K Chinese samples are now public, to make models better at programming. github

Corpus

Resource Name Description Link
Name Corpus wainshine/Chinese-Names-Corpus
Chinese-Word-Vectors Various Chinese word vectors github repo
Chinese chat corpus The database collects Douban multi-round, PTT gossip corpus, Qingyun corpus, TV drama dialogue corpus, Tieba forum reply corpus, Weibo corpus, Xiaohuangji corpus link
Chinese rumor data In this data file, each line is a rumor data in json format. github
Chinese Question Answering Dataset Link extraction code 2dva
WeChat public account corpus 3G corpus, including some WeChat official account articles captured from the web, with HTML removed and only plain text. Each article is in JSON format, with name being the WeChat official account name, account being the WeChat official account ID, title being the title, and content being the text. github
Chinese natural language processing corpus and datasets github
Task-based dialogue English dataset 【The Most Complete Task-based Dialogue Dataset】 mainly introduces a complete set of task-based dialogue datasets, which covers the main information of all commonly used datasets in the field of task-based dialogue. In addition, in order to help researchers better grasp the context of the progress of the field, we provide the state-of-the-art experimental results on several datasets in the form of Leaderboard. github
Speech recognition corpus generation tool Creating Automatic Speech Recognition (ASR) corpora from online videos with audio/captions github
LitBankNLP dataset A corpus of 100 labeled English novels to support natural language processing and computational humanities tasks github
Chinese ULMFiT Sentiment Analysis Text Classification Corpus and Model github
Administrative division data of provinces, cities, districts and towns with pinyin annotations github
Education Industry News Automatic Summarization Corpus github
Chinese Natural Language Processing Dataset github
Wikipedia Massively Parallel Text Corpus 85 languages, 1620 language pairs, 135M contrastive sentences github
Ancient Poetry Library github repo
More complete ancient poetry library
Low memory loading of Wikipedia data Loading the 17GB+ English Wikipedia corpus with the new version of the nlp library takes only 9MB of memory, with a traversal speed of 2-3 Gbit/s github
Couplet data 700,000 couplets github
Color Dictionary Dataset github
42GB of JD Customer Service Dialogue Data (CSDD) github
700,000 couplet data link
Username blacklist github
Dependency parsing corpus 40,000 sentences of high-quality annotated data Homepage
People's Daily Corpus Processing Toolset github
Fake news dataset fake news corpus github
Poetry Quality Evaluation/Fine-Grained Emotional Poetry Corpus github
Open tasks related to Chinese natural language processing Datasets and current best results github
Chinese Abbreviation Dataset github
Chinese Task Benchmark Assessment Representative datasets - Benchmark (pre-trained) models - Corpus - Baseline - Toolkit - Leaderboard github
Chinese Rumor Database github
CLUEDatasetSearch Chinese and English NLP datasets Search all Chinese NLP datasets, with commonly used English NLP datasets github
Multi-document summarization dataset github
Politeness transfer task (make everyone "polite") Converts impolite sentences into polite ones while preserving meaning; provides a dataset of 139M+ instances Paper and code
Cantonese/English Conversation Bilingual Corpus github
List of Chinese NLP datasets github
Named entity recognition dataset of person names, place names, and organization names github
Chinese Language Comprehension Assessment Benchmark Including representative datasets & benchmark models & corpora & rankings github
OpenCLaP multi-domain open source Chinese pre-trained language model warehouse Civil documents, criminal documents, Baidu Encyclopedia github
Chinese whole word masking BERT and two reading comprehension datasets DRCD dataset: released by Delta Research Institute in Taiwan, China; an extractive reading comprehension dataset in Traditional Chinese with the same format as SQuAD. CMRC 2018 dataset: Chinese machine reading comprehension data released by the Harbin Institute of Technology and iFlytek Joint Laboratory; given a question, the system extracts a span from the passage as the answer, in the same format as SQuAD. github
Dakshina Dataset Latin/native script parallel dataset for twelve South Asian languages github
OPUS-100 Multilingual (100 languages) parallel corpus centered on English github
Chinese reading comprehension dataset github
Chinese Natural Language Processing Vector Collection github
Chinese Language Comprehension Assessment Benchmark Includes representative datasets, benchmark (pre-trained) models, corpora, and leaderboards github
A large list of NLP datasets/benchmark tasks github
700,000 couplet data github
Classical Chinese (ancient Chinese) - Modern Chinese Parallel Corpus The short chapters include short ancient books such as "The Analects of Confucius", "Mencius" and "Zuo Zhuan", which have been merged with "Zizhi Tongjian" github
COLDataset, Chinese offensive language detection dataset Covers topics such as race, gender, and region. Data will be released after the paper is published. paper
GAOKAO-bench: Using Chinese college entrance examination questions as a dataset An evaluation framework that uses Chinese college entrance examination (Gaokao) questions to assess the language comprehension and logical reasoning ability of large language models. It includes 1,781 multiple-choice questions, 218 fill-in-the-blank questions, and 812 open-ended questions. github
Zero to NLP - Chinese NLP application data, models, training, inference github
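
Several corpora in the table above (for example the Chinese rumor data and the WeChat public account corpus) are described as JSON records. A minimal sketch of reading such a corpus follows. It assumes one JSON object per line, which the table states for the rumor data but which is only an assumption for the WeChat corpus. The field names `name`, `account`, `title`, and `content` come from the WeChat row; the sample records and the helper `iter_articles` are hypothetical.

```python
import io
import json


def iter_articles(lines):
    """Yield one dict per non-empty JSON line, skipping malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Crawled corpora often contain a few broken lines; skip them.
            continue
        if isinstance(record, dict):
            yield record


# Hypothetical sample standing in for a corpus file opened with open(path, encoding="utf-8").
sample = io.StringIO(
    '{"name": "Example Account", "account": "example_id", "title": "Hello", "content": "Plain text body."}\n'
    "\n"
    "not json\n"
    '{"name": "Another", "account": "another_id", "title": "Second", "content": "More text."}\n'
)

articles = list(iter_articles(sample))
titles = [article["title"] for article in articles]
print(titles)
```

Streaming line by line keeps memory use flat, which matters for a multi-gigabyte corpus.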

Thesaurus and lexical tools

Resource Name Description Link
textfilter Chinese and English sensitive word filtering observerss/textfilter
Name extraction function Chinese (modern, ancient) names, Japanese names, Chinese surnames and given names, titles (aunt, etc.), English->Chinese names (John Lee), idiom dictionary cocoNLP
Chinese abbreviations database NPC: National People's Congress; China: People's Republic of China; Women's Tennis: Women's/n Tennis/n Match/vn github
Chinese character decomposition dictionary Gives up to three ways to split each Chinese character into components (split methods I, II, and III), e.g. 手诲 扌诲 才诲 kfcd/chaizi
Vocabulary sentiment value Spring water: 0.400704566541; Abundant: 0.37006739587 rainarch/SentiBridge
Chinese vocabulary, stop words, sensitive words dongxiexidian/Chinese
python-pinyin Convert Chinese characters to Pinyin mozillazg/python-pinyin
zhtools Convert between Traditional and Simplified Chinese skydark/nstools
English-like Chinese pronunciation engine say wo i ni #say: I love you tinyfool/ChineseWithEnglish
chinese_dictionary Synonyms, antonyms, negation thesaurus guotong1988/chinese_dictionary
wordninja Split and extract words from English strings without spaces wordninja
Car brands, car parts related words data
THU's vocabulary IT thesaurus, financial thesaurus, idiom thesaurus, place name thesaurus, historical celebrity thesaurus, poetry thesaurus, medical thesaurus, diet thesaurus, legal thesaurus, automobile thesaurus, animal thesaurus link
Crime legal terms and classification model Contains 856 crime knowledge graphs, crime prediction based on a 2.8 million crime training database, 13 types of question classification and legal information Q&A function based on 200,000 legal Q&A pairs github
Word segmentation corpus + code Baidu Netdisk link - Extraction code pea6
Chinese word segmentation + part-of-speech tagging based on Bi-LSTM + CRF Keras implementation link
Chinese word segmentation and part-of-speech tagging based on Universal Transformer + CRF link
Fast Neural Network Segmentation Package java version
chinese-xinhua Chinese Xinhua Dictionary database and API, including commonly used allegorical sayings, idioms, words and Chinese characters github
SpaCy Chinese Model Contains functions such as Parser, NER, syntax tree, etc. Some English packages use spacy's English model. If you want to adapt to Chinese, you may need to use spacy's Chinese model. github
Chinese character data github
Synonyms Chinese synonyms toolkit github
HarvestText Domain-adaptive text mining tools (new word discovery - sentiment analysis - entity linking, etc.) github
word2word Convenient and easy-to-use multilingual word-word pair collection 62 languages / 3,564 multilingual pairs github
Polyphonetic dictionary data and codes github
Chinese characters, words, and idioms query interface github
103976 English word library packages (sql version, csv version, Excel version) github
List of English swear words github
Word Pinyin Data github
Number name library in 186 languages github
Large-scale name database of countries around the world github
Chinese character feature extractor (featurizer) Extract the features of Chinese characters (pronunciation features, glyph features) for use as features for deep learning github
char_featurizer - Chinese character feature extraction tool github
Python interface library for the Chinese, Japanese and Korean word segmentation library mecab github
g2pC Context-based Chinese pronunciation automatic tagging module github
ssc (Sound Shape Code) A Chinese string similarity calculation method based on sound-shape codes version 1, version 2, blog/introduction
Acquisition of multiple meanings/meanings of Chinese words and semantic disambiguation of words in specific sentences based on encyclopedic knowledge base github
Tokenizer is a fast and customizable text tokenization library github
Tokenizers The most advanced tokenizer with emphasis on performance and versatility github
Transform text by replacing synonyms github
token2index is a powerful and lightweight term indexing library compatible with PyTorch/Tensorflow github
Traditional and Simplified Chinese Conversion github
Cantonese NLP Tools github
Domain Dictionary Professional dictionary knowledge base covering 68 fields and a total of 9.16 million words github
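
The wordninja row above describes splitting English strings that contain no spaces. The idea can be sketched with dynamic programming over a frequency-ranked word list. This is not wordninja's API or its real dictionary: the vocabulary `VOCAB`, the cost function, and `split_words` are hypothetical, and the fallback cost for unknown characters is an arbitrary assumption.

```python
import math

# Hypothetical toy vocabulary, ordered from most to least frequent.
VOCAB = ["the", "a", "is", "language", "natural", "processing", "fun", "process", "ing"]
# Rank-based cost: rarer (later) words cost more. All costs are positive.
COST = {word: math.log(rank + 2) for rank, word in enumerate(VOCAB)}
MAX_LEN = max(len(word) for word in VOCAB)
UNKNOWN_CHAR_COST = 50.0  # assumption: high cost so known words are preferred


def split_words(text):
    """Split text into the lowest-cost sequence of vocabulary words."""
    n = len(text)
    best = [0.0] + [math.inf] * n  # best[i] = minimal cost of text[:i]
    back = [0] * (n + 1)           # back[i] = start index of the last piece
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_LEN), end):
            piece = text[start:end]
            if piece in COST:
                cost = COST[piece]
            elif len(piece) == 1:
                cost = UNKNOWN_CHAR_COST  # lets any input be segmented
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Walk the back-pointers from the end to recover the pieces.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(split_words("naturallanguageprocessingisfun"))
```

Because one frequent long word costs less than two rarer short ones, "processing" is kept whole instead of being split into "process" and "ing".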

Pre-trained language models & large models

Resource Name Description Link
BMList A comprehensive list of big models github
BERT paper Chinese translation link
Slides by the original BERT authors link
Text Classification Practice github
bert tutorial text classification tutorial github
BERT PyTorch Implementation github
BERT PyTorch Implementation github
BERT generates sentence vectors, BERT performs text classification and text similarity calculation github
Illustration of BERT and ELMO github
BERT Pre-trained models and downstream applications github
Language/knowledge representation tools BERT & ERNIE github
Using the gpt-2 language model in Kashgari github
Facebook LAMA Probes for analyzing facts and common sense knowledge contained in pre-trained language models. Language model analysis, providing a unified access interface for Transformer-XL/BERT/ELMo/GPT pre-trained language models github
GPT2 training code in Chinese github
XLM Facebook's cross-lingual pre-trained language model github
Massive Chinese pre-trained ALBERT model github
Transformers 2.0 Natural language processing pre-trained language models for TensorFlow 2.0 and PyTorch (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet...) 8 architectures/33 pre-trained models/102 languages github
8 papers sort out the progress and reflections of BERT-related models github
French RoBERTa pre-trained language model French RoBERTa pre-trained language model trained with 138GB corpus link
Chinese pre-trained ELECTRA model A pre-trained Chinese model based on adversarial learning github
albert-chinese-ner Using the pre-trained language model ALBERT for Chinese NER github
A collection of open source pre-trained language models github
Chinese ELECTRA pre-trained model github
Predicting the next word with Transformers (BERT, XLNet, Bart, Electra, Roberta, XLM-Roberta) (model comparison) github
TensorFlow Hub New language models for 40+ languages (including Chinese) link
UER A repository of Chinese pre-trained models based on different corpora, encoders, and target tasks (including BERT, GPT, ELMO, etc.) github
A collection of open source pre-trained language models github
Multilingual sentence vector pack github
Language Model as a Service (LMaaS) Language Model as a Service github
Open source language model GPT-NeoX-20B With 20 billion parameters, it was the largest publicly accessible pre-trained general-purpose autoregressive language model at the time of its release. github
Chinese Scientific Literature Dataset (CSL) Contains meta information (title, abstract, keywords, discipline, category) of 396,209 Chinese core journal papers. The CSL dataset can be used as a pre-training corpus, and can also be used to construct many NLP tasks, such as text summarization (title prediction), keyword generation, and text classification. github
Large model development tool github

Extraction

Resource Name Description Link
Time extraction Already integrated into the Python package cocoNLP; feel free to try it. java version, Python version
Neural Network Relation Extraction PyTorch Chinese is not supported yet github
Named Entity Recognition PyTorch based on BERT Chinese is not supported yet github
Keyphrase extraction package pke github
BLINK A state-of-the-art entity linking library github
Named Entity Recognition with BERT/CRF github
LatticeLSTM Chinese named entity recognition supporting batch parallelism github
Building a model for medical entity recognition Contains dictionary and corpus annotation, based on Python github
Entity and Relation Extraction Based on TensorFlow and BERT Pipeline-style entity and relation extraction; a solution for the information extraction task of the 2019 Language and Intelligent Technology Competition (Schema based Knowledge Extraction, SKE 2019) github
Chinese named entity recognition NeuroNER vs BertNER github
Chinese named entity recognition based on BERT github
Chinese Key Phrase Extraction Tool github
bert Tensorflow version for Chinese named entity recognition github
bert-Kashgari Kashgari, a keras-based encapsulated classification and annotation framework, can build a classification or sequence annotation model in a few minutes github
cocoNLP Extraction of information such as name, address, email address, mobile phone number, mobile phone location, etc., rake phrase extraction algorithm. github
Microsoft Multilingual Number/Unit/Date/Time Recognition Pack github
Baidu's open source benchmark information extraction system github
Chinese address segmentation (address element identification and extraction), NER through sequence labeling github
Open domain text knowledge triple extraction and knowledge base construction based on dependency syntax github
Chinese keyword extraction method based on pre-training model github
chinese_keyphrase_extractor (CKPE) A tool for quickly extracting and identifying key phrases from Chinese natural language text github
A simple resume parser to extract key information from resumes github
BERT-NER-Pytorch BERT Chinese NER experiments in three different modes github
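
The cocoNLP row above lists extraction of email addresses and mobile phone numbers. A minimal rule-based sketch of that kind of extraction follows. It does not use cocoNLP's API: the two regular expressions are simplified assumptions (a loose email pattern and an 11-digit mainland China mobile pattern starting with 13 to 19), and `extract_contacts` is a hypothetical helper.

```python
import re

# Simplified patterns (assumptions, not a complete specification).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 11 digits starting with 13-19, not embedded in a longer digit run.
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")


def extract_contacts(text):
    """Return the email addresses and mobile numbers found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


sample = (
    "Contact Zhang San at zhangsan@example.com or 13812345678. "
    "Order id 2138123456789 is not a phone number."
)
result = extract_contacts(sample)
print(result)
```

The lookbehind and lookahead in `PHONE_RE` keep long identifiers such as order numbers from being reported as phone numbers.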

Knowledge Graph

Resource Name Description Link
Tsinghua University XLORE Chinese and English cross-language encyclopedia knowledge graph Baidu, Chinese Wiki, English Wiki link
Automatically generate document graph github
Question answering system based on medical knowledge graph github
This repo refers to github
Chinese Character Relationship Knowledge Graph Project github
AmpliGraph Knowledge Graph Representation Learning (Python) Library Knowledge Graph Concept Link Prediction github
Chinese knowledge graph materials, data and tools github
Chinese knowledge graph based on Baidu Encyclopedia Extract triple information and build Chinese knowledge graph github
Zincbase Knowledge Graph Construction Toolkit github
Question answering system based on knowledge graph github
Knowledge graph deep learning related materials collation github
Southeast University "Knowledge Graph" Postgraduate Course (Materials) github
Knowledge Graph Car Audio Project github
One Piece Knowledge Graph github
A dataset of 132 knowledge graphs Covers common sense, cities, finance, agriculture, geography, meteorology, social networking, Internet of Things, medical care, entertainment, life, business, travel, science and education link
Large-scale, structured, bilingual COVID-19 knowledge graph (COKG-19) link
Event triple extraction based on dependency syntax and semantic role labeling github
Abstract Knowledge Graph The current scale is 500,000, supporting abstraction of noun entities, state descriptions, and event actions. github
Large-scale Chinese knowledge graph data with 1.4 billion entities github
Jiagu Natural Language Processing Tool Based on models such as BiLSTM, it provides functions such as knowledge graph, relationship extraction, Chinese word segmentation, part-of-speech tagging, named entity recognition, sentiment analysis, new word discovery, keyword text summarization, text clustering, etc. github
medical_NER - Named Entity Recognition in Chinese Medical Knowledge Graph github
A large list of learning materials/datasets/tool resources related to knowledge graphs github
LibKGE: A knowledge graph embedding library for reproducible research github
Military knowledge graph question answering project based on mongodb storage The military weapons knowledge base includes 8 major categories such as aircraft and space equipment, more than 100 subcategories, and a total of 5,800 items. This project does not use a graph database for storage. It uses Jieba to parse questions and identify question entities. It completes queries for multiple types of questions based on query templates. It mainly provides an industrial question-answering idea demo. github
JD Product Knowledge Graph github
Chinese relation extraction based on distant supervision github
Intelligent question answering system based on medical knowledge graph github
BLINK A state-of-the-art entity linking library github
A small securities knowledge graph/knowledge base github
dstlr unstructured text scalable knowledge graph construction platform github
Baidu Encyclopedia Character Entry Attribute Extraction Knowledge graphing with BERT-based fine-tuning and feature extraction github
COVID-19 related data Chinese medical dialogue dataset of COVID-19 and other types of pneumonia; open data sources from Tsinghua University and other institutions (COVID-19) github
github
DGL-KE graph embedding representation learning algorithm github
Cause and effect diagram method data
Causal event pairing based on multi-domain text datasets link
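
The military knowledge graph question answering row above describes a pipeline without a graph database: identify the entity in the question, then answer through query templates. A minimal sketch of that flow follows. The knowledge base entries, the templates, and the function names are all hypothetical, and plain substring matching stands in for the Jieba-based parsing the project is said to use.

```python
import re

# Hypothetical toy knowledge base: entity -> attribute -> value.
KB = {
    "Falcon-1": {"category": "aircraft", "range_km": "1200"},
    "Orca-2": {"category": "ship", "range_km": "8000"},
}

# Hypothetical query templates: a pattern over the question -> attribute to look up.
TEMPLATES = [
    (re.compile(r"\bwhat (type|category|kind)\b", re.I), "category"),
    (re.compile(r"\b(range|how far)\b", re.I), "range_km"),
]


def find_entity(question):
    """Return the first knowledge base entity mentioned in the question."""
    lowered = question.lower()
    for name in KB:
        if name.lower() in lowered:
            return name
    return None


def answer(question):
    """Answer by matching the question against the templates, or return None."""
    entity = find_entity(question)
    if entity is None:
        return None
    for pattern, attribute in TEMPLATES:
        if pattern.search(question):
            return KB[entity].get(attribute)
    return None


print(answer("What category is Falcon-1?"))
print(answer("How far can Orca-2 travel?"))
```

Adding a new question type only requires a new template entry, which is the main appeal of this design for a demo.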

Text Generation

Resource Name Description Link
Texar Toolkit for Text Generation and Beyond github
Professor Ehud Reiter's blog Highly recommended by Professor Wan Xiaojun of Peking University; this blog provides in-depth discussion and reflection on NLG technology, evaluation, and application. link
A large list of resources related to text generation github
Open Domain Dialogue Generation and Its Practice in Microsoft XiaoIce Natural language generation allows machines to master the ability to automatically create link
Text generation control github
A large list of resources related to natural language generation github
Evaluating Natural Language Generation with BLEURT link
Automatic couplet data and robot code link, 700,000 couplet data
Automatically generate comments Generate comments based on Hacker News article titles using the Transformer encoder-decoder model github
Natural language generation SQL statements (English) github
Natural Language Generation Resources github
Chinese Generation Task Benchmark Evaluation github
Specific topic text generation/text augmentation based on GPT2 github
Encode, Tag, Realize: a controllable and efficient text generation method github
TextFooler: Adversarial text generation module for text classification/reasoning github
SimBERT A BERT model based on the UniLM idea that combines retrieval and generation. github
New word generation and sentence making Non-existent words are generated from scratch using GPT-2 variants along with their definitions and examples github
Automatically generate multiple-choice questions from text github
Synthetic Data Generation Benchmark github

Text Summarization

Resource Name Description Link
Chinese text summarization/keyword extraction github
Automatic resume summarization based on named entity recognition github
Text automatic summarization library TextTeaser English only github
Extractive summarization based on the latest language models such as BERT github
A Comprehensive Guide to Text Summarization with Deep Learning in Python link
(Colab) Abstractive Text Summarization Implementation Collection (Tutorial) github

Smart Question and Answer

Resource Name Description Link
Chinese chatbot Train the chatbot you want based on your own corpus, which can be used in scenarios such as intelligent customer service, online Q&A, and intelligent chat. github
Fun chatbot A Chinese chatbot trained on the qingyun corpus github
Open conversational robots, knowledge graphs, semantic understanding, natural language processing tools and data github
QA robot Amodel-for-Retrivalchatbot - Customer service robot, a Chinese retrieval-based chatbot git
ConvLab open source multi-domain end-to-end dialogue system platform github
A dialogue system built on the latest version of rasa github
A chatbot based on the finance-judicial field (also with the nature of small talk) github
End-to-end closed domain dialogue system github
MiningZhiDaoQACorpus A Baidu Zhidao Q&A data mining project. The corpus contains more than 5.8 million questions, each with a question label, and can support a variety of applications such as logic mining github
GPT2 model for Chinese small talk GPT2-chitchat github
A list of resources (leaderboards, datasets, papers) on multi-turn response selection for retrieval-based chatbots github
Microsoft Conversational Bot Framework github
chatbot-list Industry-wide sharing and introduction of intelligent customer service, chatbot applications, architecture, and algorithms github
Chinese medical dialogue data Chinese medical dialogue data set github
A large-scale medical conversation dataset Contains 1.1 million medical consultations and 4 million doctor-patient conversations github
CrossWOZ: A large-scale cross-domain Chinese task-oriented multi-turn dialogue dataset and model paper & data
Open source conversational information search platform github
DSTC9 2020 github
T5 question paraphrasing model trained on Quora question pairs github
Google releases Taskmaster-2 natural language task dialogue dataset github
Haystack is a flexible, powerful and scalable question answering (QA) framework github
End-to-end closed domain dialogue system github
Amazon releases knowledge-based human-human open-domain conversation dataset github
Albert Large QA model trained based on Baidu webqa and dureader datasets github
CommonsenseQA: Common sense English QA challenge link
MedQuAD (English) medical question answering dataset github
A question-answering engine based on Albert and Electra, using Wikipedia text as context github
A question-answering attempt based on a knowledge base of 140,000 songs Features include lyric chains, finding a song from known lyrics, and Q&A over the three-way relationship between songs, singers, and lyrics. github

Text Correction

Resource Name Description Link
Chinese text error correction module code github
English spelling check library github
Python spell checking library github
GitHub Typo Corpus Large-scale GitHub multi-language spelling error/grammar error dataset github
BertPunc A state-of-the-art punctuation restoration model based on BERT github
Chinese Writing Proofreading Tools github
Text Correction Reference List Chinese Spell Checking (CSC) and Grammatical Error Correction (GEC) github
The champion solution of the Text Intelligent Proofreading Competition Already deployed in practice; from the Soochow University and DAMO Academy team link

Multimodality

Resource Name Description Link
Chinese multimodal dataset "Wukong" Huawei Noah's Ark Lab has released a large-scale open dataset containing 100 million image-text pairs github
Chinese-CLIP: A pre-trained model for Chinese text and image representation A Chinese version of the CLIP pre-trained model, open sourced at multiple model scales; a few lines of code are enough for Chinese image-text representation extraction and image-text retrieval github

Speech Processing

Resource Name Description Link
ASR speech dataset + Chinese speech recognition system based on deep learning github
Tsinghua University THCHS30 Chinese speech dataset data_thchs30.tgz (OpenSLR domestic mirror)
data_thchs30.tgz
test-noise.tgz (OpenSLR domestic mirror) test-noise.tgz
resource.tgz (OpenSLR domestic mirror)
resource.tgz
Free ST Chinese Mandarin Corpus
Free ST Chinese Mandarin Corpus
AIShell-1 open source dataset-OpenSLR domestic mirror
AIShell-1 open source dataset
Primewords Chinese Corpus Set 1-OpenSLR domestic mirror
Primewords Chinese Corpus Set 1
Laughter Detector github
New version of the Common Voice speech recognition dataset Includes over 1,400 hours of speech samples from 42,000 contributors link
speech-aligner A tool for generating phoneme-level time-aligned annotations from speech audio and its transcript github
ASR pronunciation dictionary/lexicon github
Speech Sentiment Analysis github
masr Chinese speech recognition, providing pre-trained models and high recognition rate github
Chinese Text Normalization for Speech Recognition github
Speech quality evaluation indicators (MOSNet, BSSEval, STOI, PESQ, SRMR) github
Chinese/English pronunciation dictionary for speech recognition github
CoVoST A multilingual speech-to-text translation corpus released by Facebook Includes audio, text transcription and English translation in 11 languages (French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian and Chinese) github
Parakeet Text-to-speech Synthesis based on PaddlePaddle github
(Java) Accurate Speech Natural Language Detection Library github
Text-to-speech synthesis implemented in TensorFlow 2 github
Python audio feature extraction package github
ViSQOL An objective, full-reference metric for perceived audio quality, with two modes: audio and speech. github
zhrtvc Easy-to-use Chinese voice cloning and Chinese speech synthesis system github
aukit A useful speech processing toolbox, including speech noise reduction, audio format conversion, feature spectrum generation and other modules github
phkit A useful phoneme processing toolbox, including Chinese phonemes, English phonemes, text-to-pinyin, text regularization and other modules github
zhvoice Chinese speech corpus, with clearer and more natural speech, including 8 open source data sets, 3,200 speakers, 900 hours of speech, and 13 million words github
Audio for speech behavior detection, binarization, speaker recognition, automatic speech recognition, emotion recognition and other tasks github
Deep Learning Emotional Text-to-Speech Synthesis github
Python Audio Data Augmentation Library github
Audio Enhancement Based on Large-Scale Audio Dataset github
Voice transfer github

Document Processing

Resource Name Description Link
LayoutLM-v3 document understanding model github
PyLaia is a deep learning toolkit for handwritten document analysis github
Single document unsupervised keyword extraction github
DocSearch Free Document Search Engine github
fdfgen Able to automatically create PDF documents and fill in information link
pdfx Automatically extract cited references and download the corresponding pdf files link
invoice2data Invoice pdf information extraction invoice2data
PDF document information extraction github
PDFMiner PDFMiner can get the exact location of text on a page, as well as other information such as fonts or lines. It also has a PDF converter that can convert PDF files into other text formats (such as HTML), and an extensible PDF parser that can be used for purposes other than text analysis. link
PyPDF2 PyPDF2 is a Python PDF library that can split, merge, crop, and transform the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, retrieve text and metadata from PDFs, and merge entire files together. link
ReportLab A fast way to create PDF documents: a time-proven, easy-to-use open source project for creating complex, data-driven PDF documents and custom vector graphics. It is free, open source, and written in Python. The package is downloaded more than 50,000 times a month, ships with standard Linux distributions, is embedded in many products, and was chosen to power Wikipedia's print/export functionality. link
SIMPdf A simple PDF file text editor written in Python github
pdf-diff PDF file diff tool can display the differences between two PDF documents github

Table Processing

Resource Name Description Link
Use unet to automatically detect and rebuild document tables github
pdftabextract Used for table information analysis after OCR recognition, very powerful link
tabula-py Converts tables in PDF files directly into pandas DataFrames; available in both Java and Python versions
Camelot PDF table analysis link
pdfplumber PDF table analysis
PubLayNet Able to divide paragraphs, recognize tables and pictures link
Extracting tabular data from papers github
Finding answers in tables with BERT github
Series of articles on table question answering Introduction
Model
Final Chapter
Generate tabular data using GAN (English only) github
carefree-learn(PyTorch) Tabular Dataset Automated Machine Learning (AutoML) Package github
Closed field fine-tuning table detection github
PDF table data extraction tool github
TaBERT: A new model for understanding queries on tabular data paper
Table processing Awesome-Table-Recognition github

Text Matching

Resource Name Description Link
Sentence, QA Similarity Matching MatchZoo A collection of text similarity matching algorithms, including multiple deep learning methods, which are worth trying. github
Chinese Question Sentence Similarity Calculation Competition and Solution Summary github
Similarity calculation toolkit Written in java, it is used for similarity calculations related to words, phrases, sentences, lexical analysis, sentiment analysis, semantic analysis, etc. github
Chinese word similarity calculation method Combines the word similarity methods of the extended Tongyici Cilin (synonym thesaurus) and HowNet, giving wider vocabulary coverage and more accurate results. github
Python string similarity algorithm library github
Similar-sentence detection model based on a Siamese BiLSTM Provides training and test datasets, including 100,000 training samples github
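
To give a feel for what the string-similarity libraries listed above compute, here is a minimal sketch of one classic measure, Levenshtein edit distance, normalized into a similarity score. It is not taken from any of the listed projects; the function names `levenshtein` and `similarity` are hypothetical, and real libraries offer many more measures and far faster implementations.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))        # 3
print(levenshtein("今天天气很好", "今天天气不错"))  # 2
```

Because the distance is computed per character, the same code works for Chinese text without word segmentation, although word-level or embedding-based measures are usually better at capturing meaning.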

Text Data Augmentation

Resource Name Description Link
Chinese NLP Data Augmentation (EDA) Tool github
English NLP Data Augmentation Tools github
One-click Chinese data augmentation tool github
The application and effect of data augmentation in machine translation and other NLP tasks link
NLP Data Augmentation Resource Set github
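
EDA (Easy Data Augmentation) combines four simple operations: synonym replacement, random insertion, random swap, and random deletion. The sketch below illustrates only the two that need no synonym dictionary. It assumes the sentence is already tokenized (for Chinese, for example, by a word segmenter); the function names and parameters are hypothetical and are not the API of any tool listed above.

```python
import random


def random_swap(tokens, n_swaps=1, rng=None):
    """Return a copy of tokens with n_swaps random pairs of positions swapped."""
    rng = rng or random.Random()
    tokens = list(tokens)  # never mutate the caller's list
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)  # two distinct positions
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p=0.1, rng=None):
    """Drop each token independently with probability p, keeping at least one."""
    rng = rng or random.Random()
    tokens = list(tokens)
    if len(tokens) <= 1:
        return tokens
    kept = [t for t in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]  # avoid returning an empty sentence


tokens = ["我", "喜欢", "自然", "语言", "处理"]
rng = random.Random(42)  # fixed seed so runs are repeatable
print(random_swap(tokens, n_swaps=1, rng=rng))
print(random_deletion(tokens, p=0.2, rng=rng))
```

Passing an explicit `random.Random` instance keeps augmentation reproducible, which helps when comparing models trained on augmented data.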

Common regular expressions

Resource Name Description Link
Regular expression for extracting email Integrated into the Python package cocoNLP; feel free to try it.
Extract phone_number Integrated into the Python package cocoNLP; feel free to try it.
Regular expression for extracting ID number IDCards_pattern = r'^([1-9]\d{5}[12]\d{3}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX])$'
IDs = re.findall(IDCards_pattern, text, flags=0)
IP address regular expression (25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)
Tencent QQ number regular expression [1-9]([0-9]{5,11})
Domestic landline number regular expression [0-9\-()()]{7,18}
Username regular expression [A-Za-z0-9_\-\u4e00-\u9fa5]+
Domestic phone number regular expression matching (three major operators + virtual, etc.) github
Regular Expression Tutorial github
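
As a rough illustration of how the patterns in this table might be used with Python's standard `re` module, the sketch below compiles cleaned-up variants of the ID number, IPv4, QQ, and username expressions. The names `find_id_numbers` and `is_ipv4` are hypothetical and are not part of cocoNLP. Two changes are assumptions of this sketch rather than part of the table: non-capturing groups are used so that `findall` returns whole matches, and lookarounds replace `^`/`$` so that ID numbers embedded in longer text can be found.

```python
import re

# 18-character mainland China ID number: 6-digit region code, 8-digit birth
# date (year starting with 1 or 2), 3-digit sequence, and a check character.
ID_CARD = re.compile(
    r"(?<!\d)[1-9]\d{5}[12]\d{3}(?:0[1-9]|1[012])"
    r"(?:0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX](?![0-9xX])"
)

# One IPv4 octet (0-255), then four octets joined by literal dots.
OCTET = r"(?:25[0-5]|2[0-4]\d|[01]\d{2}|[1-9]?\d)"
IPV4 = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")

QQ = re.compile(r"[1-9][0-9]{5,11}")
# The hyphen is escaped so it is a literal character, not a range operator.
USERNAME = re.compile(r"[A-Za-z0-9_\-\u4e00-\u9fa5]+")


def find_id_numbers(text):
    """Return every ID-number-shaped substring found in text."""
    return ID_CARD.findall(text)


def is_ipv4(value):
    """True if the whole string is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(value) is not None


print(find_id_numbers("ID: 11010519491231002X, tel 13800000000"))
# ['11010519491231002X']
print(is_ipv4("192.168.1.1"), is_ipv4("256.1.1.1"))
# True False
```

These patterns check only the shape of a value. They do not verify the ID check digit or whether a region code or birth date is real, so a match is best treated as a candidate rather than a validated identifier.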

Text Retrieval

Resource Name Description Link
Efficient fuzzy search tool github
A large list/search engine of BERT models for various languages/tasks link
Deepmatch is a deep matching model library for recommendation, advertising and search github
wwsearch is a full-text search engine developed by WeChat for Enterprise github
aili - the fastest in-memory index in the East github
Efficient string matching tool RapidFuzz A fast string matching library for Python and C++ that uses the string similarity calculations from FuzzyWuzzy github

Reading comprehension

Resource Name Description Link
AllenNLP reading comprehension supports a variety of data and models github

Sentiment Analysis

Resource Name Description Link
Aspect Sentiment Analysis Package github
awesome-nlp-sentiment-analysis Sentiment analysis, emotion cause identification, evaluation object and evaluation word extraction github
Sentiment analysis technology enables intelligent customer service to better understand human emotions github

Event Extraction

Resource Name Description Link
Chinese event extraction github
List of literature resources on NLP event extraction github
BERT Event Extraction (ACE 2005 corpus) implemented in PyTorch github
News event clue extraction github

Machine Translation

Resource Name Description Link
Wudao Dictionary The command line version of Youdao Dictionary, supporting English-Chinese and online search github
NLLB NLLB language model that supports translation between 200+ languages link
Easy-Translate Script for translating large text files locally, based on Facebook/Meta AI's M2M100 model and NLLB200 model, supporting 200+ languages github

Number Conversion

Resource Name Description Link
The best tool for converting between Chinese numerals and Arabic numerals github
Quickly convert "Chinese numbers" and "Arabic numbers" github
Parse natural language numeric strings into integers and floating point numbers github
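
The conversion these tools perform can be sketched in a few lines. The function below is a minimal illustration, not the algorithm of any listed project: it handles only well-formed positive integers written with the common characters 零 to 九, 两, 十, 百, 千, 万 and 亿, and the name `chinese_to_int` is hypothetical. Real tools also cover decimals, negative numbers, financial (capital) numerals, and mixed text.

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}


def chinese_to_int(text):
    """Convert a well-formed Chinese numeral string to an int."""
    total = 0    # value of the sections already closed by 万 or 亿
    section = 0  # value accumulated inside the current section (below 万)
    digit = 0    # most recent digit, waiting for its unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare leading 十 (as in 十五) means 1 x 10.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch == "万":
            total += (section + digit) * 10**4
            section, digit = 0, 0
        elif ch == "亿":
            # 亿 scales everything read so far, so 一万亿 becomes 10**12.
            total = (total + section + digit) * 10**8
            section, digit = 0, 0
        else:
            raise ValueError("unsupported character: %r" % ch)
    return total + section + digit


print(chinese_to_int("一百零五"))    # 105
print(chinese_to_int("两万三千"))    # 23000
print(chinese_to_int("一亿二千万"))  # 120000000
```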

Coreference Resolution

Resource Name Description Link
Chinese coreference resolution data github
Baidu link (code: a0qq)

Text Clustering

Resource Name Description Link
TextCluster Short text clustering and preprocessing module github

Text Classification

Resource Name Description Link
NeuralNLP-NeuralClassifier Tencent open source deep learning text classification tool github

Knowledge Reasoning

Resource Name Description Link
Graphbrain An open source AI software library and research tool that aims to facilitate automatic meaning extraction and text understanding, as well as knowledge exploration and inference. github
(Harvard) Free book on causal reasoning pdf

Explainable Natural Language Processing

Resource Name Description Link
A library of state-of-the-art interpreters for textual machine learning models github

Text Attack

Resource Name Description Link
TextAttack: A framework for adversarial attacks on natural language processing models github
OpenBackdoor: Text backdoor attack and defense toolkit OpenBackdoor is developed based on Python and PyTorch, which can be used to reproduce, evaluate and develop algorithms related to text backdoor attack and defense github

Text Visualization

Resource Name Description Link
Scattertext text visualization (python) github
whatlies Interactive visualization of word vectors spacy tools
PySS3 SS3 text classifier machine visualization tool for explainable AI github
Rendering 3D images with Notepad github
attnvis Visualization of attention interactions of transformer language models such as GPT2 and BERT github
Texthero text data efficient processing package Including preprocessing, keyword extraction, named entity recognition, vector space analysis, text visualization, etc. github

Text Annotation Tools

Resource Name Description Link
A review of NLP annotation platforms github
brat rapid annotation tool sequence annotation tool link
Poplar web version natural language annotation tool github
LIDA lightweight interactive dialogue annotation tool github
doccano is an open source collaborative multilingual text annotation tool based on the web github
Datasaurai online data annotation workflow management tool link

Language detection

Resource Name Description Link
langid 97 languages detected https://github.com/saffsd/langid.py
langdetect Language Detection https://code.google.com/archive/p/language-detection/

Comprehensive Tools

Resource Name Description Link
jieba jieba
hanlp hanlp
nlp4han Chinese natural language processing toolset Sentence segmentation, word segmentation, part-of-speech tagging, chunking, syntactic analysis, semantic analysis, NER, n-gram, HMM, pronoun resolution, sentiment analysis, and spelling check github
Progress in Hate Speech Detection link
Bert application based on Pytorch Including named entity recognition, sentiment analysis, text classification, and text similarity github
Some basic models about natural language github
Template code for sequence labeling and text classification using BERT github
jieba_fast accelerated version of jieba github
StanfordNLP Pure Python version of natural language processing package link
Python Spoken Natural Language Processing Toolkit (English) github
PreNLP natural language preprocessing library github
Some papers and codes related to nlp Including topic model, word embedding, named entity recognition (NER), text classification, text generation, text similarity calculation, etc., involving various NLP-related algorithms, based on keras and tensorflow github
Python Text Mining/NLP Practical Examples github
Forte is a flexible and powerful natural language processing pipeline toolkit github
stanza Stanford team NLP tool Can handle more than 60 languages github
Fancy-NLP is a text knowledge mining tool for building product portraits github
A comprehensive and easy-to-use Chinese NLP toolkit github
Reproduction of the DSSM-based vector recall pipeline commonly used in industry github
Texthero text data efficient processing package Including preprocessing, keyword extraction, named entity recognition, vector space analysis, text visualization, etc. github
NLPGNN graph neural network natural language processing toolbox github
Macadam A natural language processing toolkit based on Tensorflow (Keras) and bert4keras, focusing on text classification, sequence labeling and relation extraction github
LineFlow is an efficient NLP data loader for all deep learning frameworks github
Arabica: Python text data exploratory analysis toolkit github
Python stress testing tool: SMSBoom github

Funny tools

Resource Name Description Link
Wang Feng Lyrics Generator phunterlau/wangfeng-rnn
Analysis of Girlfriend's Emotional Fluctuations github
NLP is too difficult series github
Variable naming artifact github link
Image text removal, can be used for comic translation github
CoupletAI - Couplet Generation Automatic couplet system based on CNN+Bi-LSTM+Attention github
Solving complex mathematical equations using neural network symbolic reasoning github
Question-answering robot based on a knowledge base of 140,000 songs Features include lyric chaining, finding a song from known lyrics, and Q&A over the three-way relationships among songs, singers, and lyrics. github
COPE - Metrical Poetry Editing Program github
Paper2GUI An AI desktop APP toolbox for ordinary people, which can be used immediately after installation. It supports 18+ AI models, covering speech synthesis, video frame interpolation, video super-resolution, object detection, image stylization, OCR recognition and other fields. github
Politeness Estimator (Trained using Sina Weibo data) github paper
Getting Started with Python Chinese programming language homepage gitee

Courses, Reports, Interviews, etc.

Resource Name Description Link
Natural Language Processing Report link
Knowledge Graph Report link
Data mining report link
Autonomous Driving Report link
Machine Translation Report link
Blockchain Report link
Robot Report link
Computer Graphics Report link
3D Printing Report link
Face Recognition Report link
Artificial Intelligence Chip Report link
CS224N Natural Language Processing with Deep Learning course link; PyTorch implementation of the models in the course link
A hands-on tutorial on natural language processing for deep learning researchers github
"Natural Language Processing" by Jacob Eisenstein github
ML-NLP Knowledge points and code implementations commonly tested in machine learning and NLP interviews github
NLP task example project code set github
Review of NLP highlights in 2019 download
nlp-recipes Natural language processing best practices and examples from Microsoft github
Transfer Learning in Natural Language Processing (NLP) youtube
Machine Learning Systems Book link github

Competitions

Resource Name Description Link
NLPer-Arsenal NLP competitions, including information on current competitions and solutions from past competitions; continuously updated github
Review the top solutions of all NLP competitions github
Baidu's 2019 Triple Extraction Competition, "Science Space Team" source code (7th place) github

Financial Natural Language Processing

Resource Name Description Link
BDCI2019 Financial Negative Information Determination github
Open source financial investment data extraction tool github
A large list of natural language processing research resources in the financial field github
A chatbot based on the finance-judicial field (also for small talk) github
Demonstration of the process of constructing a small financial knowledge graph github

Medical Natural Language Processing

Resource Name Description Link
Chinese Medical NLP Public Resources github
spaCy Medical Text Mining and Information Extraction github
Building a model for medical entity recognition Contains dictionary and corpus annotation, based on Python github
Question answering system based on medical knowledge graph github; this repo refers to github
Chinese medical dialogue data Chinese medical dialogue data set github
A large-scale medical conversation dataset Contains 1.1 million medical consultations and 4 million doctor-patient conversations github
COVID-19 related data Chinese medical dialogue dataset of COVID-19 and other types of pneumonia; open data sources from Tsinghua University and other institutions (COVID-19) github
github

Legal Natural Language Processing

Resource Name Description Link
Blackstone A spaCy pipeline and NLP models for unstructured legal text github
Legal Intelligence Literature Resource List github
A chatbot based on the finance-judicial field (also with the nature of small talk) github
Crime legal terms and classification model Contains 856 crime knowledge graphs, crime prediction based on a 2.8 million crime training database, 13 types of question classification and legal information Q&A function based on 200,000 legal Q&A pairs github
A large list of legal NLP related resources github

Text to Image

Resource Name Description Link
Dalle-mini A mini version of DALL·E that generates images based on text prompts github

Other

Resource Name Description Link
phone Location lookup for mobile phone numbers in China ls0f/phone
phone Location lookup for international mobile and landline numbers AfterShip/phone
ngender Determine gender based on name observerss/ngender
An overview of the differences between Chinese and English natural language processing (NLP) link
Technical documents PDF or PPT shared by experts in major companies github
comparxiv is a command for comparing the differences between two submitted versions on arXiv pypi
Meta-architecture of CHAMELEON deep learning news recommendation system github
Automatic resume screening system github
Multiple text readability evaluation indicators implemented in Python github