NLP: Paradise for migrant workers
While getting started and becoming familiar with NLP, I used a lot of packages from GitHub, so I have organized them and share them here.
Many of these packages are very interesting and well worth collecting. If you find them useful, please share and star :star:, thank you!
The list will be updated irregularly over the long term. Welcome to watch and fork! ❤️❤️❤️
* ChatGPT-like model evaluation and comparison
* ChatGPT-like information
* Open source ChatGPT-like frameworks
* LLM training / inference / low-resource / efficient training
* Prompt engineering
* ChatGPT-like document Q&A
* ChatGPT-like industry applications
* ChatGPT-like course materials
* LLM security issues
* Multimodal LLM
* LLM datasets
* Corpus
* Thesaurus and lexical tools
* Pre-trained language models
* Information extraction
* Knowledge graph
* Text generation
* Text summarization
* Intelligent question answering
* Text correction
* Document processing
* Form processing
* Text matching
* Text data augmentation
* Text search
* Reading comprehension
* Sentiment analysis
* Common regular expressions
* Speech processing
* Event extraction
* Machine translation
* Number conversion
* Coreference resolution
* Text clustering
* Text classification
* Knowledge reasoning
* Explainable NLP
* Text adversarial attacks
* Text visualization
* Text annotation tools
* Comprehensive tools
* Fun and interesting tools
* Courses, reports, interviews, etc.
* Competitions
* Financial NLP
* Medical NLP
* Legal NLP
* Text-to-image
* Other
Resource Name | Description | Link |
---|---|---|
ChatALL: chat with multiple AI bots at the same time (including products from Tsinghua University and iFlytek) | A tool that can talk to multiple AI chatbots simultaneously (such as ChatGPT, Bing Chat, Bard, Alpaca, Vicuna, Claude, ChatGLM, MOSS, iFlytek Spark, ERNIE, etc.). It sends prompts to the different AI bots in parallel to help users find the best answer | github-ChatALL |
Chatbot Arena | Benchmarking LLMs with Elo ratings in real-world scenarios. Chatbot Arena is a benchmark platform for large language models (LLMs) that uses anonymous, randomized head-to-head evaluation based on the Elo rating system widely used in competitive games such as chess. Elo ratings for 9 popular open-source LLMs were released together with a leaderboard. The platform uses the FastChat multi-model serving system to provide an interactive interface in multiple languages, and the data comes from user votes. The post summarizes Chatbot Arena's advantages and plans for better sampling algorithms, rankings, and serving systems | Ends May 3, 2023 |
ChatGPT-like model evaluation summary | Large language models (LLMs) have received widespread attention. These powerful models can understand complex information and provide human-like responses to a variety of questions. Among them, GPT-3 and GPT-4 performed best, and Flan-t5 and Lit-LLaMA also performed well. However, please note that commercial use of models may require payment and data sharing | blog |
A review of Large Language Models (LLMs) | blog | |
Latest Research on Large Model Evaluation | Long text modeling has always been one of ChatGPT's most impressive capabilities. We use [paragraph translation] as an experimental scenario to conduct a comprehensive and fine-grained test of large models' paragraph modeling capabilities. | paper |
Chinese large model evaluation tools & rankings | C-Eval is a comprehensive Chinese assessment suite for base models. It contains 13,948 multiple-choice questions covering 52 different subjects and four difficulty levels, as shown below. Please visit our website or consult our paper for more details. | github paper |
OpenCompass Large Model Review | OpenCompass is an open-source, efficient, and comprehensive large-model evaluation system and open platform developed by Shanghai Artificial Intelligence Laboratory. It provides a complete, open-source, and reproducible evaluation framework, and supports one-stop evaluation of large language models, multimodal models, and other models. Using distributed technology, even models with hundreds of billions of parameters can be evaluated within a few hours. Based on multiple highly recognized data sets in different dimensions, it provides a variety of evaluation methods, including zero-sample evaluation, small-sample evaluation, and thought chain evaluation, to fully quantify the capabilities of each dimension of the model. | github website |
Resource Name | Description | Link |
---|---|---|
Open LLMs: Open Large Language Models (LLMs) for commercial use | A list of open LLMs available for commercial use | github |
LLM Zoo: A marketplace for data, models, and benchmarks for large language models | LLM Zoo: democratizing ChatGPT - a project that provides data, models, and evaluation benchmark for large language models | github |
Large Language Model (LLM) Data Collection | List of related papers, including research work on guidance, reasoning, decision making, continuous improvement, and self-improvement | LLM information collection |
DecryptPrompt | Summary Prompt & LLM papers, open source data & models, AIGC applications | github |
SmartGPT | Designed to give large language models (especially GPT-3.5 and GPT-4) the ability to complete complex tasks by breaking them down into smaller problems and collecting information from the Internet and other external sources. Features include a modular design, easy configuration, and strong plug-in support. SmartGPT operates on the concept of "Autos", with two types, "Runner" and "Assistant", both equipped with LLM agents that handle planning, reasoning, and task execution. It also has a memory management system and a plug-in system for defining various commands | github-SmartGPT |
OpenGPT | A framework for creating instruction-based datasets and training large language models (LLMs) of experts in the conversational domain. It has been successfully applied to train the health care conversational model NHS-LLM, using data from the UK National Health Service (NHS) website to generate a large number of question-answer pairs and unique conversations. | github-OpenGPT |
PaLM 2 Technical Report | Google has recently released PaLM 2, a new language model with better multilingual and reasoning capabilities while being more computationally efficient than its predecessor, PaLM. PaLM 2 combines a number of research advances, including computationally optimal model and data scale, more diverse and multilingual datasets, and more effective model architectures and objective functions. PaLM 2 achieves state-of-the-art performance on a variety of tasks and capabilities, including language proficiency tests, classification and question answering, reasoning, programming, translation, and natural language generation. PaLM 2 also demonstrates strong multilingual capabilities, able to handle hundreds of languages, and translate and interpret between different languages. PaLM 2 also considers issues of responsible use, including controlling toxicity during inference, reducing memorization, and assessing potential harm and bias. | PaLM 2 Technical Report |
DB-GPT | An open source experimental project based on vicuna-13b and FastChat, it uses langchain and llama-index technologies for contextual learning and question-answering. The project is fully locally deployed to ensure data privacy and security, and can directly connect to private databases to process private data. Its functions include SQL generation, SQL diagnosis, database knowledge question-answering, etc. | github-DB-GPT |
A large list of Transformers related literature resources | Contains a variety of Transformer models, such as BERT, GPT, Transformer-XL, etc. These models have been widely used in many natural language processing tasks. In addition, the list also provides relevant papers and code links for these models, providing a good reference resource for researchers and developers in the field of natural language processing. | github |
The Ultimate Guide to GPT-4 | A guide on how to use GPT3 and GPT4, including more than 100 resources to help learn how to use it to improve your life efficiency. Including how to learn the basics of ChatGPT, how to learn advanced knowledge of ChatGPT, how to use GPT-3 in language learning, how to use GPT-3 in teaching, how to use GPT-4, etc. It also provides how to upgrade to the ChatGPT+ plan to use GPT-4 and how to use GPT-4 for free. At the same time, it also provides a guide on how to use ChatGPT in business, productivity, benefits, money, etc. | link |
Efficient fine-tuning of LLM parameters based on LoRA | link | |
Complex Reasoning: The North Star Capability of Large Language Models | In the GPT-4 release blog, the authors wrote: "In a casual conversation, the difference between GPT-3.5 and GPT-4 may be subtle. When the complexity of the task reaches a sufficient threshold, the difference will become apparent." This means that complex tasks are likely to be the key differentiating factor between large and small language models. In this article, we will carefully analyze and discuss how to make large language models have powerful complex reasoning capabilities. | blog |
Is the emergent power of large language models a mirage? | The emergence of large language models has always been regarded as a magical phenomenon, as if it were a miracle caused by great effort, but this paper argues that this may just be an illusion. | paper |
Probabilistic Summary of Large Language Models | Very detailed explanation and summary of LLM science | paper |
A brief history of the LLaMA model | LLaMA is a language model released by Meta, which uses the Transformer architecture and has multiple versions with a maximum of 65B parameters. Similar to GPT, it can be used for further fine-tuning and is suitable for a variety of tasks. Unlike GPT, LLaMA is open source and can be run locally. Existing LLaMA models include: Alpaca, Vicuna, Koala, GPT4-x-Alpaca, and WizardLM. Each model has different training data and performance. | blog |
Complex Reasoning with Large Language Models | This paper discusses how to train language models with powerful and complex reasoning capabilities, and explores how to effectively prompt the model to fully unleash its potential. In view of the similarities between language model and programming training, a three-stage training is proposed: continuous training, supervised fine-tuning, and reinforcement learning. A set of tasks for evaluating the reasoning capabilities of large language models is introduced. It also discusses how to perform prompt engineering to enable the model to achieve better learning results by providing various learning opportunities, ultimately achieving intelligence. | link |
Large language model evolution tree | paper | |
Li Hongyi: How poor people can replicate their own ChatGPT with low resources | blog | |
Essential resources for training ChatGPT: A complete guide to corpus, models, and code libraries | Resource link paper address | |
GitHub treasure library, which organizes various open source projects related to GPT | github | |
ChatGPT Chinese Guide | gitlab | |
Discussion of ChatGPT's applications, advantages, limitations, and future directions in natural language processing | Highlights ethical considerations and engineering tips when using this technology | paper
List of literature resources related to large language models | github | |
Literature Review on Large Language Models (Chinese Version) | github | |
A large list of ChatGPT related resources | github | |
Pre-Training to Learn in Context | paper | |
Langchain Architecture Diagram | image | |
Numbers every LLM developer should know | github | |
How to build powerful complex reasoning capabilities using large language models | blog | |
LLMs Nine-story Demon Tower | Share practical experience and experience in fighting monsters (ChatGLM, Chinese-LLaMA-Alpaca, MiniGPT-4, FastChat, LLaMA, gpt4all, etc.) | github |
Resource Name | Description | Link |
---|---|---|
LLM-As-Chatbot | This project turns the LLMs available on the market into chatbots that can be run directly on Google Colab without building anything yourself. It is very suitable for anyone who wants to try out LLMs; I just tried it and it really is very simple. Some LLMs require more GPU memory, so a Colab Pro subscription is helpful. | github
OpenBuddy | A powerful open source multilingual chatbot model, targeting global users, with a focus on conversational AI and fluent multilingual support, including English, Chinese and other languages. Based on Facebook's LLAMA model, it has been fine-tuned, including expanding the vocabulary, adding common characters, and enhancing token embeddings. With these improvements and a multi-round conversation dataset, OpenBuddy provides a powerful model that can answer questions and perform translation tasks between various languages. OpenBuddy's mission is to provide a free, open and offline AI model that can run on users' devices regardless of their language or cultural background. Currently, a demo version of OpenBuddy-13B can be found on the Discord server. Its key features include multilingual conversational AI (including Chinese, English, Japanese, Korean, French, etc.), enhanced vocabulary and support for common CJK characters, and two model versions: 7B and 13B | github-OpenBuddy |
Panda: Overseas Chinese open source large language model | Based on Llama-7B, -13B, -33B, -65B, continuous pre-training in the Chinese domain, using nearly 15M data, and evaluating the reasoning ability on the Chinese benchmark | github - PandaLM |
Dromedary: An open source self-aligned language model that can be trained with minimal human supervision | github-Dromedary | |
LaMini-LM is a collection of small and efficient language models for distillation | A collection of small, efficient language models distilled from ChatGPT, trained on a large dataset of 2.58M instructions | github |
LLaMA-Adapter V2 | LLaMA-Adapter V2 from Shanghai Artificial Intelligence Laboratory, with only 14M parameters injected, can be trained in 1 hour. The comparison results are really amazing, and it has multimodal functions (interpretation and question-answering of images) | github |
HuggingChat | Hugging Face has launched the first open source alternative to ChatGPT: HuggingChat. Based on the Open Assistant model, it can hold conversations and write code; it accepts Chinese input but does not yet reply in Chinese. The app is online and can be accessed directly without a proxy. | link
Open-Chinese-LLaMA | Based on LLaMA-7B, a Chinese large language model base generated by incremental pre-training of Chinese datasets | github |
OpenLLaMA | An open-source reproduction of the LLaMA model, trained on the RedPajama dataset, using the same preprocessing steps, hyperparameters, model structure, context length, training steps, learning rate schedule, and optimizer as LLaMA. PyTorch and JAX weights for OpenLLaMA are available on the Hugging Face Hub. OpenLLaMA shows performance similar to LLaMA and GPT-J across a variety of tasks, and does better on some of them (see the loading sketch after this table). | github
replit-code-v1-3b | Released under BY-SA 4.0 license, which means commercial use is allowed | link |
MOSS | MOSS is an open source conversational language model that supports Chinese and English and multiple plug-ins. The moss-moon series has 16 billion parameters and can run on a single A100/A800 or two 3090 graphics cards at FP16 precision, and on a single 3090 at INT4/8 precision. The MOSS base model is pre-trained on about 700 billion Chinese, English, and code tokens, and is then fine-tuned with conversational instructions, plug-in-augmented learning, and human preference training, enabling multi-turn conversation and the use of multiple plug-ins. | github
RedPajama | 1.2 Trillion Tokens Dataset | link |
chinese_llama_alpaca_lora extraction framework | github | |
Scaling Transformer to 1M tokens and beyond with RMT | The paper proposes a technique called RMT (Recurrent Memory Transformer) that may extend the Transformer's usable context to 1 million tokens or even more. | github
Open Assistant | Contains a large number of AI-generated and manually annotated corpora and a variety of models based on LLaMA and Pythia. The released dataset includes more than 161K high-quality, human assistant-type interactive dialogue corpora in up to 35 languages | data model |
ChatGLM Efficient Tuning | Efficient ChatGLM fine-tuning based on PEFT | github |
Dolly Introduction | news | |
Baize: An open source chat model for efficient parameter tuning of self-chat data | Baize is an open source chat model that can conduct multi-turn conversations. It was created by generating a high-quality multi-turn chat corpus using ChatGPT self-conversation and enhancing LLaMA (an open source large language model) with efficient parameter tuning. The Baize model shows good multi-turn conversation performance with minimal potential risks. It can run on a single GPU, making it accessible to a wider range of researchers. The Baize model and data are for research purposes only. | Paper address Source code address |
GPTrillion--No open source code found | GPTrillion, a large model containing 1.5 trillion (1.5T) parameters, is now open source, claiming to be the world's largest open source LLM | google_doc |
Cerebras-GPT-13B (commercially available) | hugging_face | |
Chinese-ChatLLaMA | Chinese ChatLLaMA dialogue model; pre-training/command fine-tuning dataset, built on TencentPretrain multimodal pre-training framework, supports simplified and traditional Chinese, English, Japanese and other languages | github |
Lit-LLaMA | A fully open source independent LLaMA implementation based on the Apache 2.0 license, built on nanoGPT, aims to address the limitations of the original LLaMA code under the GPL license to enable wider academic and commercial applications | github |
MosaicML | MPT-7B-StoryWriter, 65K tokens, you can throw the entire "The Great Gatsby" into it at once. | huggingface |
Langchain | Large Language Models (LLMs) are becoming a transformative technology, enabling developers to build applications that were previously impossible. However, using these standalone LLMs alone is often not enough to create a truly powerful application - the real power comes from being able to combine them with other computational or knowledge sources. | github |
Guidance | Guidance enables more effective and efficient control of modern language models than traditional prompting or chaining. It lets you interleave generation, prompting, and logical control in a single continuous stream, matching the way language models actually process text. Simple output structures like Chain of Thought and its many variants (e.g. ART, Auto-CoT) have been shown to improve language model performance. The advent of more powerful models such as GPT-4 makes richer structures possible, and Guidance makes such structures easier and cheaper to build. | github
WizardLM | Gives large pre-trained language models the ability to follow complex instructions, using the WizardLM-7B model trained with the full set of evolutionary instructions (about 300k) | github |
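
Many of the open checkpoints listed above (OpenLLaMA, MOSS, the various LLaMA derivatives) can be loaded through the Hugging Face transformers API. Below is a minimal, hedged sketch of that loading-and-generation pattern; the checkpoint id, dtype, and generation settings are illustrative assumptions, not instructions from any of the linked repositories.

```python
# Minimal sketch: load an open LLaMA-style checkpoint with transformers and generate.
# The model id below is an assumption; substitute whichever open weights you use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openlm-research/open_llama_7b"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,   # half precision to reduce GPU memory
    device_map="auto",           # requires accelerate; spreads layers across devices
)

prompt = "Q: What is the capital of France?\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
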
Resource Name | Description | Link |
---|---|---|
QLoRA--Guanaco | An efficient fine-tuning method that can fine-tune a 65B-parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning performance. QLoRA back-propagates gradients through a frozen, 4-bit quantized pre-trained language model into low-rank adapters (LoRA); see the sketch after this table. | github
Chinese-Guanaco | A Chinese low-resource quantized training/deployment solution | github
DeepSpeed Chat: One-click RLHF training | github | |
LLMTune: Fine-tuning a large 65B+ LLM on a consumer GPU | Performs 4-bit fine-tuning on common consumer-grade GPUs, even for models as large as the 65B LLaMA. LLMTune also implements the LoRA and GPTQ algorithms to compress and quantize LLMs, and handles large models through data parallelism. In addition, it provides a command-line interface and a Python library. | github
Fine-tuning based on ChatGLM-6B+LoRA on the instruction dataset | Based on deepspeed, it supports multi-card fine-tuning, which is 8-9 times faster than single card. For detailed settings, see Fine-tuning 3. Lora fine-tuning based on DeepSpeed | github |
Microsoft releases DeepSpeed Chat, a RLHF training tool | github | |
LlamaChat: A chatbot based on LLaMa on Mac | github | |
ChatGPT/GPT4 open source "alternatives" | github | |
Practical tips and tricks for training large machine learning models | Helps you train large models (>1B parameters), avoid instabilities, and save failed experiments without restarting from scratch | link |
Instruction Tuning with GPT-4 | paper | |
xturing | A Python package for fine-tuning LLM models efficiently, quickly, and easily. It supports multiple models such as LLaMA, GPT-J, GPT-2, etc. It can be trained using single GPU and multi-GPU. It uses efficient fine-tuning techniques such as LoRA to reduce hardware costs by up to 90% and complete model training in a short time. | github |
GPT4All | An open source project that allows running GPT locally on Macbook. Built on the LLaMa-7B large language model, including data, code and demo are all open source, and the conversation style is more like an AI assistant | github |
Fine-tuning ChatGPT-like models with Alpaca-LoRA | link | |
LMFlow | A scalable, convenient and efficient toolbox for fine-tuning large machine learning models | github |
Wenda: Large language model calling platform | Currently supports chatGLM-6B, chatRWKV, chatYuan, and chatPDF-style self-built knowledge-base search on top of the chatGLM-6B model | github
Micro Agent | Small autonomous agent open source project, powered by LLM (OpenAI GPT-4), can write software for you, just set a "purpose" and let it work on its own | github |
Llama-X | Open source academic research project, through the joint efforts of the community, gradually improve the performance of LLaMA to the level of SOTA LLM, save duplication of work, and jointly create more and faster increments | github |
Chinese-LLaMA-Alpaca | Chinese LLaMA & Alpaca LLMs - Open-source Chinese LLaMA model pre-trained with Chinese text data; open-source Chinese Alpaca model further fine-tuned with instructions; quickly deploy and experience the quantized version of the model locally using a laptop (personal PC) | github |
Efficient Alpaca | An open source project based on LLaMA, aiming to improve on Stanford Alpaca by fine-tuning the LLaMA-7B model to consume fewer resources, run inference faster, and be more suitable for researchers | github
ChatGLM-6B-Slim | ChatGLM-6B with 20K image tokens removed: same performance, but lower GPU memory usage | github
Chinese-Vicuna | A Chinese low-resource llama+lora solution | github |
Alpaca-LoRA | Reproducing Stanford Alpaca's results on consumer hardware using LoRA | github |
LLM Accelerator | LLM Accelerator is here to make basic large models smarter! Basic large models are playing an increasingly important role in many applications. Most large language models are trained in an autoregressive manner. Although the quality of text generated by the autoregressive model is guaranteed, it leads to high inference costs and long delays. Due to the huge number of parameters and high inference costs of large models, how to reduce costs and delays in the process of large-scale deployment of large models is a key issue. To address this issue, researchers at Microsoft Research Asia proposed a method called LLM Accelerator that uses reference text to losslessly accelerate the inference of large language models, which can achieve two to three times the acceleration in typical application scenarios of large models. | blog |
Large Language Model (LLM) Fine-tuning Technical Notes | github | |
PyLLMs | A concise Python library for connecting to various LLMs (OpenAI, Anthropic, Google, AI21, Cohere, Aleph Alpha, HuggingfaceHub), with built-in model performance benchmarks. Very suitable for rapid prototyping and evaluation of different models, with the following features: Connect to top LLMs with a small amount of code; Response metadata including processed tokens, costs and latencies, standardize each model; Support multiple models: get completions from different models at the same time; LLM benchmarks: evaluate the quality, speed and cost of models | github |
Accelerating Large Language Models with Mixed Precision | By using low-precision floating-point operations, training and inference speed can be increased by up to 3 times without affecting model accuracy | blog |
New LLM training method: Federated GPT | Duke University and Microsoft jointly released a new LLM training method, Federated GPT. It distributes the originally centralized training across different edge devices; after training, the sub-models are uploaded to the center and merged. | github
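
The QLoRA entry above combines a frozen 4-bit quantized base model with trainable low-rank adapters. The sketch below shows roughly what that setup looks like with recent versions of transformers, bitsandbytes, and peft; the checkpoint id, target modules, and hyperparameters are illustrative assumptions, not the Guanaco recipe itself.

```python
# Minimal sketch of a QLoRA-style setup: frozen 4-bit base model + trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "huggyllama/llama-7b"  # assumed checkpoint id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantized weights (kept frozen)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections, a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
# From here, train with transformers.Trainer (or a similar trainer) as usual.
```
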
Resource Name | Description | Link |
---|---|---|
OpenBuprompt-engineering-note | Prompt engineering notes (course summary) for the ChatGPT Prompt Engineering for Developers course, covering how language models work and prompt engineering practice, and showing how to apply the language model API to various tasks in applications. The course includes summarizing, inferring, transforming, expanding, and building chatbots, and explains how to design good prompts and build custom chatbots (see the example after this table). | github - OpenBuprompt
Prompt Engineering Guide | link |
AIGC Prompt Engineering Learning Station Learn Prompt | ChatGPT/Midjourney/Runway | link |
Prompts Featured - ChatGPT User Guide | ChatGPT usage guide to improve the playability and usability of ChatGPT | github |
An unofficial list of resources for using ChatGPT. | Aims to aggregate resources such as apps, web apps, browser extensions, CLI tools, bots, integrations, packages, articles, etc. that use ChatGPT | github |
Snack Prompt: ChatGPT Prompt prompt sharing community | link | |
ChatGPT Questioning Tips | How to ask ChatGPT questions to get high-quality answers: A complete guide to tips and tricks engineering | github |
Prompt-Engineering-Guide-Chinese | Chinese version of the Prompt-Engineering-Guide, derived from the English version but with AIGC prompts added | github
OpenPrompt | An open shared prompt community, everyone recommends useful prompts | github |
GPT-Prompts | Teach you how to generate prompts with GPT | github |
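
A recurring lesson in the prompt-engineering resources above is to give the model an explicit instruction, clearly delimit the input text, and constrain the output format. A small illustration of that pattern follows; the OpenAI client call uses the openai>=1.0 Python SDK and the model name is an assumption, not something prescribed by the linked guides.

```python
# Illustration of a common prompting pattern: explicit instruction, delimited input,
# constrained output format. Model name and SDK usage are assumptions of this sketch.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

article = "..."  # the text you want summarized

prompt = f"""Summarize the text delimited by <article> tags in at most 30 words.
Then list 3 keywords as a JSON array under the key "keywords".

<article>{article}</article>"""

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # near-deterministic output, useful when iterating on prompts
)
print(resp.choices[0].message.content)
```
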
Resource Name | Description | Link |
---|---|---|
privateGPT | A privately deployed document question-answering platform based on GPT4All-J. It requires no Internet connection and guarantees that user data never leaves the machine. It provides an API that lets users run interactive question answering and text generation over their own documents, and supports custom training data and model parameters to meet personalized needs. | github-privateGPT
Auto-evaluator | Automatic evaluation of document question answering; | github |
PDF GPT | An open source PDF document chat solution based on GPT, which mainly implements the following functions: one-on-one conversation with PDF documents; automatic content segmentation and a powerful deep averaging network encoder to generate embeddings; semantic search over the PDF content, passing the most relevant embeddings to OpenAI; custom logic to generate more accurate responses, faster than OpenAI. | github
Redis-LLM-Document-Chat | Interacting with PDF Documents with LlamaIndex, Redis, and OpenAI, contains a Jupyter notebook that demonstrates how to use Redis as a vector database to store and retrieve document vectors. It also shows how to use LlamaIndex to perform semantic search in documents and how to leverage OpenAI to provide a chatbot-like experience. | github |
doc-chatbot | A document chatbot implemented by GPT-4 + Pinecone + LangChain + MongoDB, which can chat with multiple files, multiple topics and multiple windows, and the chat history is saved by MongoDB | github |
document.ai | A universal local knowledge base solution based on vector database and GPT3.5 | github |
DocsGPT | DocsGPT is a cutting-edge open source solution that simplifies the process of finding information in project documentation. By integrating a powerful GPT model, developers can easily ask questions about a project and get accurate answers. | github |
ChatGPT Retrieval Plugin | The ChatGPT retrieval plugin repository provides a flexible solution for semantic search and retrieval of personal or organizational documents using natural language queries. | github |
LlamaIndex | LlamaIndex (GPT Index) is a data framework for your LLM application. | github
chatWeb | ChatWeb can crawl any web page or extract text from PDF, DOCX, and TXT files, generate an embedded summary, and answer your questions based on the text content. It is built on the gpt-3.5 chat API and embedding API together with a vector database (see the retrieval sketch after this table). | github
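
Most of the document Q&A tools above (privateGPT, PDF GPT, DocsGPT, chatWeb) follow the same retrieval pattern: chunk the documents, embed the chunks, retrieve the chunks most similar to the question, and stuff them into the LLM prompt. The sketch below outlines that pattern with a placeholder embed() function; it is a simplified illustration, not code from any of the listed projects.

```python
# Retrieval sketch: chunk -> embed -> cosine-similarity search -> build prompt.
# embed() is a placeholder; plug in any embedding API or sentence-transformers model.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder: return one embedding vector per text."""
    raise NotImplementedError

def chunk(text: str, size: int = 500) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def top_k_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    doc_vecs = embed(chunks)          # shape (n_chunks, dim)
    q_vec = embed([question])[0]      # shape (dim,)
    # cosine similarity between the question and every chunk
    sims = doc_vecs @ q_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9
    )
    return [chunks[i] for i in np.argsort(-sims)[:k]]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    context = "\n---\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```
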
Resource Name | Description | Link |
---|---|---|
Sentiment analysis of news reports | Using ChatGPT to perform sentiment analysis on news reports of listed companies, a 500% return was generated in the stock market (trading options) within 15 months (tested on historical data) - The potential of ChatGPT in predicting stock market returns using sentiment analysis of news headlines was explored. It was found that ChatGPT's sentiment analysis capabilities exceeded traditional methods and were positively correlated with stock market returns. It was proposed that ChatGPT has great value in the field of finance and economics, and some insights and suggestions were made for future research and application | paper |
Programming language generation model StarCoder | BigCode is a collaboration between ServiceNow Inc. and Hugging Face Inc. StarCoder has multiple versions. The core version StarCoderBase has 15.5 billion parameters, supports more than 80 programming languages, and has 8,192 token contexts. The video shows the effect of its vscode plugin. | github |
CodeGen2: Lessons for Training LLMs on Programming and Natural Languages | code generation | paper |
MedicalGPT-zh: Chinese Medical General Language Model | The Chinese medical universal language model is based on the medical consensus and clinical guidelines of 28 departments to improve the model's medical field knowledge and dialogue capabilities | github |
MagicSlides | AI self-made PPT is what many people dream of. The free version can make 3 PPTs per month and supports 2,500 words of input. | link |
SalesGPT | Use LLM to implement a context-aware sales assistant that automates sales development rep activities, such as outbound sales calls | github |
HuaTuo: LLaMA fine-tuning model based on Chinese medical knowledge | github | |
ai-code-translator | Helping you translate code from one language to another is something that ChatGPT is really good at, especially GPT-4, which has a very high translation quality and can have longer tokens. | github |
ChatGenTitle | A paper title generation model fine-tuned on the LLaMA model using information from millions of arXiv papers | github |
Regex.ai | A WYSIWYG, AI-based regular expression automatic generation tool. Just select the data, it can help you write regular expressions and provide multiple ways to extract data. | video |
ChatDoctor | A medical chat model based on fine-tuning LLaMA based on medical domain knowledge. The medical data includes data on about 700 diseases and about 5,000 conversation records between doctors and patients. | paper |
CodeGPT | The key to improving programming skills is data. CodeGPT is a code dialogue dataset for GPT generated by GPT. Now 32K Chinese data are publicly available, making the model better at programming | github |
LaWGPT | A series of open source large language models based on Chinese legal knowledge | github |
LangChain-ChatGLM-Webui | Inspired by langchain-ChatGLM, the WebUI made with LangChain and ChatGLM-6B series models provides large model applications based on local knowledge. Currently, it supports uploading text format files such as txt, docx, md, pdf, etc., and provides model files including ChatGLM-6B series, Belle series, and Embedding models such as GanymedeNil/text2vec-large-chinese, nghuyong/ernie-3.0-base-zh, nghuyong/ernie-3.0-nano-zh. | github |
Resource Name | Description | Link |
---|---|---|
Databricks | (The author of the Dolly model) has released two free courses on edX, the second of which is about how the LLM is structured. | link |
Large Language Model Technology Sharing Series | Natural Language Processing Laboratory, Northeastern University | video |
How does GPT-4 work? How can we use GPT-4 to build intelligent programs? | Harvard University CS50 Open Course | video |
Prompt Engineering Best Practices: Summary of Andrew Ng's new prompt engineering course + LangChain experience summary | medium_blog |
Fine-tuning the LLM model | If you are interested in fine-tuning the LLM model, be sure to follow this YouTube blogger, who has made public the fine-tuning methods for almost all LLM models on the market. | YouTuber Sam Witteveen |
Transformer Architecture | Easy-to-understand introduction | youtube1 youtube2 youtube3 |
Video on the Transformer multi-head mechanism | If you want to really understand every detail of the Transformer, including the mathematical principles behind it, watch this video; it is a very detailed analysis. | youtube
Introduction to Large Language Models | Introduces the concepts, use cases, and prompt tuning of large language models (LLMs), along with Google's Gen AI development tools | |
Resource Name | Description | Link |
---|---|---|
Research on the Security of LLM Model | link | |
Chatbot Injections & Exploit | A collection of chatbot injection and exploit examples to help people understand the potential vulnerabilities and weaknesses of chatbots. Injection and attack vectors include command injection, character encoding, social engineering, emojis, Unicode, etc. The repository provides several examples, including a list of emojis that can be used to attack chatbots. | github
GPTSecurity | A community covering cutting-edge academic research and practical experience sharing, integrating knowledge on security applications such as Generative Pre-trained Transformer (GPT), Artificial Intelligence Generated Content (AIGC), and Large Language Model (LLM). Here you can find the latest research papers, blog posts, practical tools, and preset instructions (Prompts) on GPT/AIGC/LLM. | github |
Resource Name | Description | Link |
---|---|---|
DeepFloyd IF | The latest open source text-to-image model with high realism and language understanding capabilities, consisting of a frozen text encoder and three sequential pixel diffusion modules, is an efficient model that surpasses the current state-of-the-art models and achieves a zero-shot FID score of 6.66 on the COCO dataset. | github |
Multi-modal GPT | Use multimodal GPT to train a chatbot that can receive visual and language instructions at the same time. Based on the OpenFlamingo multimodal model, various open data sets are used to create various visual guidance data, and visual and language guidance are jointly trained to effectively improve model performance | github |
AudioGPT | "Understanding and Generating Speech, Music, Sound, and Talking Head", by AIGC-Audio | github
text2image-prompt-generator | A small model trained with 250,000 Midjourney prompts based on GPT-2 can generate high-quality Midjourney prompts | link data |
6 free text-to-image services other than Midjourney | Bing Image Creator, Playground AI, DreamStudio, Pixlr, Leonardo AI, Craiyon | |
BARK | A very powerful TTS (text-to-speech) project. The feature of this project is that it can add prompt words to the text, such as "laugh". This prompt word will become the sound of laughter and then synthesize it into the speech. It can also mix "male voice" and "female voice", so that you don't need to do the splicing operation again. | github |
whisper | Whisper is the best and fastest library I have used for speech-to-text (STT, also known as ASR). I did not expect that such a fast model could still be optimized 70x. I plan to deploy this model and make it available to everyone for transcribing large speech files and for translation. The model is multilingual and can automatically identify the language, which is really powerful (see the sketch after this table). | github
OFA-Chinese: Chinese Multimodal Unified Pre-training Model | Chinese OFA model with transformers structure | github |
Text-to-image open source model playground | Generate images from input text using models such as stable-diffusion 1.5, stable-diffusion 2.1, DALL-E, and kandinsky-2, which makes testing and comparison convenient | link
LLMScore | LLMScore is a new framework that provides evaluation scores with multi-granular compositionality. It uses a large language model (LLM) to evaluate text-to-image generation models. First, the image is converted into image-level and object-level visual descriptions, and then the evaluation instructions are fed into the LLM to measure the alignment of the synthesized image with the text, and finally a score and explanation are generated. Our extensive analysis shows that LLMScore has the highest correlation with human judgment on a wide range of datasets, significantly outperforming the commonly used text-image matching metrics CLIP and BLIP. | paper github |
VisualGLM-6B | VisualGLM-6B is an open source, multimodal conversational language model that supports images, Chinese, and English. The language model is based on ChatGLM-6B and has 6.2 billion parameters. The image part builds a bridge between the visual model and the language model by training BLIP2-Qformer. The overall model has a total of 7.8 billion parameters. | github |
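
As a quick illustration of the whisper entry above, the snippet below shows the basic transcription workflow with the open-source whisper package; the audio file name and model size are placeholders.

```python
# Minimal whisper usage: load a model, transcribe a file, print timestamped segments.
import whisper

model = whisper.load_model("base")        # tiny / base / small / medium / large
result = model.transcribe("meeting.mp3")  # language is auto-detected
print(result["language"])                 # detected language code
print(result["text"])                     # full transcript
for seg in result["segments"]:            # timestamped segments
    print(f'[{seg["start"]:.1f}s -> {seg["end"]:.1f}s] {seg["text"]}')
```
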
Resource Name | Description | Link |
---|---|---|
Ambiguity Dataset | Whether a model can correctly resolve ambiguity is an important indicator for evaluating large language models, but there has been no standardized way to measure it. This paper proposes a dataset of 1,645 examples covering different types of ambiguity, along with a corresponding evaluation method. | github paper
THU instruction training data (UltraChat) | We designed a pipeline to automatically generate UltraChat, a diverse, high-quality multi-turn instruction dialogue dataset, with careful manual post-processing. All of the English data, more than 1.5 million records, has now been open sourced, making it one of the largest high-quality instruction datasets in the open source community. | github
Multimodal dataset MMC4 | 580 million images, 100 million documents, 40 billion tokens | github |
EleutherAI Data | 800GB of text corpus integrated for free download. I do not know the quality of models trained on it, but I plan to try it. | pile data paper
UltraChat | Large-scale, information-rich, and diverse multi-turn conversation data | github |
ConvFinQA Financial Data Question Answering | github | |
The botbots dataset | A dataset containing conversations from two ChatGPT instances (gpt-3.5-turbo), CLT commands and dialogue prompts from GPT-4, covering a variety of contexts and tasks, with a generation cost of about $35, which can be used for research and training smaller dialogue models (such as Alpaca) | github |
alpaca_chinese_dataset - A manually tuned Chinese conversation dataset | github | |
CodeGPT-data | The key to improving programming skills is data. CodeGPT is a code dialogue dataset for GPT generated by GPT. Now 32K Chinese data are publicly available, making the model better at programming | github |
Resource Name | Description | Link |
---|---|---|
Name Corpus | wainshine/Chinese-Names-Corpus | |
Chinese-Word-Vectors | Various Chinese word vectors | github repo |
Chinese chat corpus | Collects the Douban multi-turn dialogue corpus, PTT gossip corpus, Qingyun corpus, TV drama dialogue corpus, Tieba forum reply corpus, Weibo corpus, and Xiaohuangji corpus | link
Chinese rumor data | In this data file, each line is a rumor data in json format. | github |
Chinese Question Answering Dataset | Link extraction code 2dva | |
WeChat public account corpus | 3G corpus, including some WeChat official account articles captured from the web, with HTML removed and only plain text. Each article is in JSON format, with name being the WeChat official account name, account being the WeChat official account ID, title being the title, and content being the text. | github |
Chinese natural language processing corpus and datasets | github | |
Task-based dialogue English dataset | 【The Most Complete Task-based Dialogue Dataset】 mainly introduces a complete set of task-based dialogue datasets, which covers the main information of all commonly used datasets in the field of task-based dialogue. In addition, in order to help researchers better grasp the context of the progress of the field, we provide the state-of-the-art experimental results on several datasets in the form of Leaderboard. | github |
Speech recognition corpus generation tool | Creating Automatic Speech Recognition (ASR) corpora from online videos with audio/captions | github |
LitBankNLP dataset | A corpus of 100 labeled English novels to support natural language processing and computational humanities tasks | github |
Chinese ULMFiT | Sentiment Analysis Text Classification Corpus and Model | github |
Administrative division data of provinces, cities, districts and towns with pinyin annotations | github | |
Education Industry News Automatic Summarization Corpus | github | |
Chinese Natural Language Processing Dataset | github | |
Wikipedia Massively Parallel Text Corpus | 85 languages, 1620 language pairs, 135M contrastive sentences | github |
Ancient Poetry Library | github repo (more complete ancient poetry library) | |
Low memory loading of Wikipedia data | Loading 17GB+ English Wikipedia corpus with the new version of nlp library only takes up 9MB of memory and the traversal speed is 2-3 Gbit/s | github |
Couplet data | 700,000 couplets | github |
Color Dictionary Dataset | github | |
42GB of JD Customer Service Dialogue Data (CSDD) | github | |
700,000 couplet data | link | |
Username blacklist | github | |
Dependency parsing corpus | 40,000 sentences of high-quality annotated data | Homepage |
People's Daily Corpus Processing Toolset | github | |
Fake news dataset fake news corpus | github | |
Poetry Quality Evaluation/Fine-Grained Emotional Poetry Corpus | github | |
Open tasks related to Chinese natural language processing | Datasets and current best results | github |
Chinese Abbreviation Dataset | github | |
Chinese Task Benchmark Assessment | Representative datasets - Benchmark (pre-trained) models - Corpus - Baseline - Toolkit - Leaderboard | github |
Chinese Rumor Database | github | |
CLUEDatasetSearch | Chinese and English NLP datasets Search all Chinese NLP datasets, with commonly used English NLP datasets | github |
Multi-document summarization dataset | github | |
Make everyone "polite" courtesy transfer task | Convert impolite sentences to polite sentences while preserving meaning, providing a dataset of 139M+ instances | Paper and code |
Cantonese/English Conversation Bilingual Corpus | github | |
List of Chinese NLP datasets | github | |
Name recognition dataset of person names, place names, and organization names | github | |
Chinese Language Comprehension Assessment Benchmark | Including representative datasets & benchmark models & corpora & rankings | github |
OpenCLaP multi-domain open source Chinese pre-trained language model warehouse | Civil documents, criminal documents, Baidu Encyclopedia | github |
Chinese whole-word-masking BERT and two reading comprehension datasets | DRCD dataset: released by Delta Research Institute in Taiwan, China; an extractive reading comprehension dataset in traditional Chinese with the same format as SQuAD. CMRC 2018 dataset: Chinese machine reading comprehension data released by the Harbin Institute of Technology-iFlytek Joint Laboratory; given a question, the system must extract a span from the passage as the answer, in the same format as SQuAD. | github |
Dakshina Dataset | Latin/native script parallel dataset for twelve South Asian languages | github |
OPUS-100 | Multilingual (100 languages) parallel corpus centered on English | github |
Chinese reading comprehension dataset | github | |
Chinese Natural Language Processing Vector Collection | github | |
Chinese Language Comprehension Assessment Benchmark | Includes representative datasets, benchmark (pre-trained) models, corpora, and leaderboards | github |
A large list of NLP datasets/benchmark tasks | github | |
LitBankNLP dataset | A corpus of 100 labeled English novels to support natural language processing and computational humanities tasks | github |
700,000 couplet data | github | |
Classical Chinese (ancient Chinese) - Modern Chinese Parallel Corpus | The short chapters include short ancient books such as "The Analects of Confucius", "Mencius" and "Zuo Zhuan", which have been merged with "Zizhi Tongjian" | github |
COLDDateset, Chinese offensive language detection dataset | Covers topics such as race, gender, and region. Data will be released after the paper is published. | paper |
GAOKAO-bench: Using Chinese college entrance examination questions as a dataset | Using the Chinese college entrance examination questions as a data set, the evaluation framework for evaluating the language comprehension and logical reasoning ability of large language models includes 1,781 multiple-choice questions, 218 fill-in-the-blank questions, and 812 answer questions. | github |
Zero to NLP - Chinese NLP application data, models, training, reasoning | github |
Resource Name | Description | Link |
---|---|---|
textfilter | Chinese and English sensitive word filtering | observerss/textfilter |
Name extraction functions | Chinese (modern and ancient) names, Japanese names, Chinese surnames and given names, kinship titles (different terms for aunt, etc.), English-to-Chinese names (John Lee), idiom dictionary | cocoNLP
Chinese abbreviations database | NPC: National People's Congress; China: People's Republic of China; Women's Tennis: Women's/n Tennis/n Match/vn | github |
Chinese Character Dictionary | Chinese character split method (I) split method (II) split method (III) split 手诲 扌诲 才诲 | kfcd/chaizi |
Vocabulary sentiment values | Spring water: 0.400704566541, Abundant: 0.37006739587 | rainarch/SentiBridge
Chinese vocabulary, stop words, sensitive words | dongxiexidian/Chinese | |
python-pinyin | Convert Chinese characters to pinyin (see the usage sketch after this table) | mozillazg/python-pinyin
zhtools | Convert between Traditional and Simplified Chinese | skydark/nstools |
English-like Chinese pronunciation engine | say wo i ni #say: I love you | tinyfool/ChineseWithEnglish |
chinese_dictionary | Synonyms, antonyms, negation thesaurus | guotong1988/chinese_dictionary |
wordninja | Split and extract words from English strings without spaces | wordninja |
Car brands, car parts related words | data | |
THU's vocabulary | IT thesaurus, financial thesaurus, idiom thesaurus, place name thesaurus, historical celebrity thesaurus, poetry thesaurus, medical thesaurus, diet thesaurus, legal thesaurus, automobile thesaurus, animal thesaurus | link |
Crime legal terms and classification model | Contains 856 crime knowledge graphs, crime prediction based on a 2.8 million crime training database, 13 types of question classification and legal information Q&A function based on 200,000 legal Q&A pairs | github |
Word segmentation corpus + code | Baidu Netdisk link - Extraction code pea6 | |
Chinese word segmentation + part-of-speech tagging based on Bi-LSTM + CRF | Keras implementation | link |
Chinese word segmentation and part-of-speech tagging based on Universal Transformer + CRF | link | |
Fast Neural Network Segmentation Package | java version | |
chinese-xinhua | Chinese Xinhua Dictionary database and API, including commonly used allegorical sayings, idioms, words and Chinese characters | github |
SpaCy Chinese Model | Contains functions such as Parser, NER, syntax tree, etc. Some English packages use spacy's English model. If you want to adapt to Chinese, you may need to use spacy's Chinese model. | github |
Chinese character data | github | |
Synonyms Chinese synonyms toolkit | github | |
HarvestText | Domain-adaptive text mining tools (new word discovery - sentiment analysis - entity linking, etc.) | github |
word2word | Convenient and easy-to-use multilingual word-word pair collection 62 languages / 3,564 multilingual pairs | github |
Polyphonetic dictionary data and codes | github | |
Chinese characters, words, and idioms query interface | github | |
103,976-word English lexicon package | (SQL, CSV, and Excel versions) | github
List of English swear words | github | |
Word Pinyin Data | github | |
Number name library in 186 languages | github | |
Large-scale name database of countries around the world | github | |
Chinese character feature extractor (featurizer) | Extract the features of Chinese characters (pronunciation features, glyph features) for use as features for deep learning | github |
char_featurizer - Chinese character feature extraction tool | github | |
Python interface library for the Chinese, Japanese and Korean word segmentation library mecab | github | |
g2pC Context-based Chinese pronunciation automatic tagging module | github | |
ssc (Sound Shape Code) | A Chinese string similarity calculation method based on "sound-shape codes" | version 1 version 2 blog/introduction
Acquisition of multiple meanings/meanings of Chinese words and semantic disambiguation of words in specific sentences based on encyclopedic knowledge base | github | |
Tokenizer is a fast and customizable text tokenization library | github | |
Tokenizers | The most advanced tokenizer with emphasis on performance and versatility | github |
Transform text by replacing synonyms | github | |
token2index is a powerful and lightweight term indexing library compatible with PyTorch/Tensorflow | github | |
Traditional and Simplified Chinese Conversion | github | |
Cantonese NLP Tools | github | |
Domain Dictionary | Professional dictionary knowledge base covering 68 fields and a total of 9.16 million words | github |
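
As a quick usage sketch for two of the lexical tools above (python-pinyin and wordninja), the snippet below shows their basic calls; the example outputs in the comments are illustrative.

```python
# Quick usage of two lexical tools from the table above.
from pypinyin import lazy_pinyin, pinyin, Style  # pip install pypinyin
import wordninja                                  # pip install wordninja

# python-pinyin: convert Chinese characters to pinyin
print(lazy_pinyin("中文分词"))               # e.g. ['zhong', 'wen', 'fen', 'ci']
print(pinyin("中文", style=Style.TONE3))     # pinyin with tone numbers

# wordninja: split an English string that has no spaces
print(wordninja.split("thequickbrownfox"))   # e.g. ['the', 'quick', 'brown', 'fox']
```
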
Resource Name | Description | Link |
---|---|---|
BMList | Big Model Big List | github |
Chinese translation of the BERT paper | link |
Slides by the original BERT authors | link |
Text Classification Practice | github | |
bert tutorial text classification tutorial | github | |
BERT PyTorch Implementation | github | |
BERT PyTorch Implementation | github | |
BERT generates sentence vectors, BERT performs text classification and text similarity calculation | github | |
Illustration of BERT and ELMO | github | |
BERT Pre-trained models and downstream applications | github | |
Language/knowledge representation tools BERT & ERNIE | github | |
Using the gpt-2 language model in Kashgari | github | |
Facebook LAMA | Probes for analyzing facts and common sense knowledge contained in pre-trained language models. Language model analysis, providing a unified access interface for Transformer-XL/BERT/ELMo/GPT pre-trained language models | github |
GPT2 training code in Chinese | github | |
XLM | Facebook's cross-lingual pre-trained language model | github
Massive Chinese pre-trained ALBERT model | github | |
Transformers 2.0 | Pre-trained NLP models supporting TensorFlow 2.0 and PyTorch (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet...): 8 architectures / 33 pre-trained models / 102 languages | github
8 papers sort out the progress and reflections of BERT-related models | github | |
French RoBERTa pre-trained language model | French RoBERTa pre-trained language model trained with 138GB corpus | link |
Chinese pre-trained ELECTRA model | Pre-trained Chinese model based on adversarial learning | github
albert-chinese-ner | Using the pre-trained language model ALBERT for Chinese NER | github |
A collection of open source pre-trained language models | github | |
Chinese ELECTRA pre-trained model | github | |
Predicting the next word with Transformers (BERT, XLNet, Bart, Electra, Roberta, XLM-Roberta) (model comparison; see the sketch after this table) | github |
TensorFlow Hub | New language models for 40+ languages (including Chinese) | link |
UER | A repository of Chinese pre-trained models based on different corpora, encoders, and target tasks (including BERT, GPT, ELMO, etc.) | github |
A collection of open source pre-trained language models | github | |
Multilingual sentence vector pack | github | |
Language Model as a Service (LMaaS) | Language Model as a Service | github |
Open source language model GPT-NeoX-20B | With 20 billion parameters, it is the largest publicly accessible pre-trained general autoregressive language model. | github |
Chinese Scientific Literature Dataset (CSL) | Contains meta information (title, abstract, keywords, discipline, category) of 396,209 Chinese core journal papers. The CSL dataset can be used as a pre-training corpus, and can also be used to construct many NLP tasks, such as text summarization (title prediction), keyword generation, and text classification. | github |
Large model development tool | github |
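
As a small illustration of the "predicting the next word / masked word" entry above, the snippet below runs a fill-mask pipeline with a Chinese BERT checkpoint from the Hugging Face Hub; the checkpoint choice is an assumption, and any fill-mask model would work.

```python
# Masked-word prediction with a Chinese BERT checkpoint via the transformers pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-chinese")  # assumed checkpoint
for pred in fill_mask("今天天气真[MASK]。"):
    print(f'{pred["token_str"]}\t{pred["score"]:.3f}')  # candidate token and its score
```
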
Resource Name | Description | Link |
---|---|---|
Time extraction | Already integrated into the Python package cocoNLP, welcome to try it | java version Python version
Neural Network Relation Extraction PyTorch | Chinese is not supported yet | github |
Named Entity Recognition PyTorch based on BERT | Chinese is not supported yet | github |
Keyphrase extraction package pke | github | |
BLINK is the most advanced entity link library | github | |
Named Entity Recognition with BERT/CRF | github | |
LatticeLSTM Chinese named entity recognition supporting batch parallelism | github | |
Building a model for medical entity recognition | Contains dictionary and corpus annotation, based on Python | github |
Pipeline entity and relation extraction based on TensorFlow and BERT | Entity and relation extraction based on a TensorFlow + BERT pipeline; solution for the information extraction task of the 2019 Language and Intelligence Technology Competition (Schema-based Knowledge Extraction, SKE 2019) | github
Chinese named entity recognition NeuroNER vs BertNER | github | |
Chinese named entity recognition based on BERT (see the sketch after this table) | github |
Chinese Key Phrase Extraction Tool | github | |
bert | Tensorflow version for Chinese named entity recognition | github |
bert-Kashgari | Kashgari, a keras-based encapsulated classification and annotation framework, can build a classification or sequence annotation model in a few minutes | github |
cocoNLP | Extraction of information such as name, address, email address, mobile phone number, mobile phone location, etc., rake phrase extraction algorithm. | github |
Microsoft Multilingual Number/Unit/Date/Time Recognition Pack | github | |
Baidu's open source benchmark information extraction system | github | |
Chinese address segmentation (address element identification and extraction), NER through sequence labeling | github | |
Open domain text knowledge triple extraction and knowledge base construction based on dependency syntax | github | |
Chinese keyword extraction method based on pre-training model | github | |
chinese_keyphrase_extractor (CKPE) | A tool for Chinese keyphrase extraction A tool for quickly extracting and identifying key phrases from natural language text | github |
A simple resume parser to extract key information from resumes | github | |
BERT-NER-Pytorch | BERT Chinese NER experiments in three different modes | github
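
The BERT-based NER entries above all address the same task; a minimal sketch with the transformers token-classification pipeline follows. The model id is a hypothetical placeholder, so substitute any Chinese NER checkpoint from the Hugging Face Hub.

```python
# Chinese NER with a token-classification pipeline; the model id is a placeholder.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="some-org/chinese-bert-ner",   # hypothetical model id -- replace it
    aggregation_strategy="simple",       # merge word pieces into full entity spans
)
for ent in ner("李华在北京的清华大学读书。"):
    print(ent["entity_group"], ent["word"], round(ent["score"], 3))
```
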
Resource Name | Description | Link |
---|---|---|
Tsinghua University XLORE Chinese and English cross-language encyclopedia knowledge graph | Baidu, Chinese Wiki, English Wiki | link |
Automatically generate document graph | github | |
Question answering system based on medical knowledge graph | github (this repo refers to github) | |
Chinese Character Relationship Knowledge Graph Project | github | |
AmpliGraph | Knowledge graph representation learning (Python) library for link prediction over knowledge graph concepts | github |
Chinese knowledge graph materials, data and tools | github | |
Chinese knowledge graph based on Baidu Encyclopedia | Extract triple information and build Chinese knowledge graph | github |
Zincbase Knowledge Graph Construction Toolkit | github | |
Question answering system based on knowledge graph | github | |
Knowledge graph deep learning related materials collation | github | |
Southeast University "Knowledge Graph" Postgraduate Course (Materials) | github | |
Knowledge Graph Car Audio Project | github | |
One Piece Knowledge Graph | github | |
A dataset of 132 knowledge graphs | Covers common sense, cities, finance, agriculture, geography, meteorology, social networking, Internet of Things, medical care, entertainment, life, business, travel, science and education | link |
Large-scale, structured, bilingual COVID-19 knowledge graph (COKG-19) | link | |
Event triple extraction based on dependency syntax and semantic role labeling | github | |
Abstract Knowledge Graph | The current scale is 500,000, supporting abstraction of noun entities, state descriptions, and event actions. | github |
Large-scale Chinese knowledge graph data with 1.4 billion entities | github | |
Jiagu Natural Language Processing Tool | Based on models such as BiLSTM, it provides functions such as knowledge graph, relationship extraction, Chinese word segmentation, part-of-speech tagging, named entity recognition, sentiment analysis, new word discovery, keyword text summarization, text clustering, etc. | github |
medical_NER - Named Entity Recognition in Chinese Medical Knowledge Graph | github | |
A large list of learning materials/datasets/tool resources related to knowledge graphs | github | |
LibKGE: A knowledge graph embedding library for reproducible research | github | |
Military knowledge graph question answering project based on mongodb storage | The military weapons knowledge base includes 8 major categories such as aircraft and space equipment, more than 100 subcategories, and a total of 5,800 items. This project does not use a graph database for storage. It uses Jieba to parse questions and identify question entities. It completes queries for multiple types of questions based on query templates. It mainly provides an industrial question-answering idea demo. | github |
JD Product Knowledge Graph | github | |
Chinese relation extraction based on distant supervision | github | |
Intelligent question answering system based on medical knowledge graph | github | |
BLINK is the most advanced entity link library | github | |
A small securities knowledge graph/knowledge base | github | |
dstlr unstructured text scalable knowledge graph construction platform | github | |
Baidu Encyclopedia Character Entry Attribute Extraction | Knowledge graphing with BERT-based fine-tuning and feature extraction | github |
COVID-19 related data | Chinese medical dialogue dataset of COVID-19 and other types of pneumonia; open data sources from Tsinghua University and other institutions (COVID-19) | github github |
DGL-KE graph embedding representation learning algorithm | github | |
Cause and effect diagram | method data | |
Causal event pairing based on multi-domain text datasets | link |
Resource Name | Description | Link |
---|---|---|
Texar | Toolkit for Text Generation and Beyond | github |
Professor Ehud Reiter's blog | Highly recommended by Professor Wan Xiaojun of Peking University; the blog offers in-depth discussion and reflection on NLG technology, evaluation and application | link |
A large list of resources related to text generation | github | |
Open Domain Dialogue Generation and Its Practice in Microsoft XiaoIce | Natural language generation gives machines the ability to create content automatically | link |
Text generation control | github | |
A large list of resources related to natural language generation | github | |
Evaluating Natural Language Generation with BLEURT | link | |
Automatic couplet data and robot | 700,000 couplet data entries | Code link |
Automatically generate comments | Generate comments based on Hacker News article titles using the Transformer encoder-decoder model | github |
Natural language generation SQL statements (English) | github | |
Natural Language Generation Resources | github | |
Chinese Generation Task Benchmark Evaluation | github | |
Specific topic text generation/text augmentation based on GPT2 | github | |
Encode, tag, realize: a controllable and efficient text generation method | github |
TextFooler: Adversarial text generation module for text classification/reasoning | github | |
SimBERT | A BERT model based on the UniLM idea that integrates retrieval and generation | github |
New word generation and sentence making | Non-existent words are generated from scratch using GPT-2 variants along with their definitions and examples | github |
Automatically generate multiple-choice questions from text | github | |
Synthetic Data Generation Benchmark | github | |
Resource Name | Description | Link |
---|---|---|
Chinese text summarization/keyword extraction | github | |
Automatic resume summarization based on named entity recognition | github | |
Text automatic summarization library TextTeaser | English only | github |
Extractive summarization based on the latest language models such as BERT | github | |
A Comprehensive Guide to Text Summarization with Deep Learning in Python | link | |
(Colab) Abstract Text Summarization Implementation Collection (Tutorial) | github |
Resource Name | Description | Link |
---|---|---|
Chinese chatbot | Train the chatbot you want based on your own corpus, which can be used in scenarios such as intelligent customer service, online Q&A, and intelligent chat. | github |
Interesting fun robot qingyun | qingyun trained Chinese chatbot | github |
Open conversational robots, knowledge graphs, semantic understanding, natural language processing tools and data | github | |
QA robot | Amodel-for-Retrivalchatbot - customer service robot, a Chinese retrieval-based chatbot | git |
ConvLab open source multi-domain end-to-end dialogue system platform | github | |
A dialogue system built on the latest version of rasa | github | |
A chatbot based on the finance-judicial field (also with the nature of small talk) | github | |
End-to-end closed domain dialogue system | github | |
MiningZhiDaoQACorpus | 5.8 million Baidu Zhidao Q&A data mining project, Baidu Zhidao Q&A corpus, including more than 5.8 million questions, each with a question label. Based on this Q&A corpus, it can support a variety of applications, such as logic mining | github |
GPT2 model for Chinese small talk GPT2-chitchat | github | |
Curated list of resources (leaderboards, datasets, papers) for multi-turn response selection in retrieval-based chatbots | github |
Microsoft Conversational Bot Framework | github | |
chatbot-list | Industry-wide sharing and introduction of intelligent customer service, chatbot applications, architecture, and algorithms | github |
Chinese medical dialogue data Chinese medical dialogue data set | github | |
A large-scale medical conversation dataset | Contains 1.1 million medical consultations and 4 million doctor-patient conversations | github |
CrossWOZ: A large-scale cross-domain Chinese task-oriented multi-turn dialogue dataset and model | paper & data | |
Open source conversational information search platform | github | |
DSTC9 2020 | github | |
T5 question paraphrasing (Paraphrase) model trained on Quora question pairs | github |
Google releases Taskmaster-2 natural language task dialogue dataset | github | |
Haystack is a flexible, powerful and scalable question answering (QA) framework | github | |
Amazon releases knowledge-based human-human open-domain conversation dataset | github | |
Albert Large QA model trained based on Baidu webqa and dureader datasets | github | |
CommonsenseQA: Common sense English QA challenge | link | |
MedQuAD (English) medical question answering dataset | github | |
A question-answering engine based on Albert and Electra, using Wikipedia text as context | github | |
A question-answering attempt based on a 140,000 song knowledge base | Features include lyrics chain, finding songs with known lyrics, and questions and answers about the triangle relationship between songs, singers, and lyrics. | github |
Resource Name | Description | Link |
---|---|---|
Chinese text error correction module code | github | |
English spelling check library | github | |
Python spell checking library | github | |
GitHub Typo Corpus Large-scale GitHub multi-language spelling error/grammar error dataset | github | |
BertPunc is a state-of-the-art punctuation repair model based on BERT | github | |
Chinese Writing Proofreading Tools | github | |
Text Correction Reference List | Chinese Spell Checking (CSC) and Grammatical Error Correction (GEC) | github |
The champion solution of the Text Intelligent Proofreading Competition | Open-source implementation from the Soochow University and DAMO Academy team | link |
Resource Name | Description | Link |
---|---|---|
Chinese multimodal dataset "Wukong" | Huawei Noah's Ark Lab opens a large-scale open-source database containing 100 million image and text pairs | github |
Chinese-CLIP: A pre-trained model for Chinese text and image representation | Chinese version of CLIP pre-trained model, open source multiple model scales, a few lines of code to handle Chinese image and text representation extraction & image and text retrieval | github |
Resource Name | Description | Link |
---|---|---|
ASR speech dataset + Chinese speech recognition system based on deep learning | github | |
Tsinghua University THCHS30 Chinese speech dataset | | data_thchs30.tgz, test-noise.tgz, resource.tgz, Free ST Chinese Mandarin Corpus, AIShell-1 open source dataset, Primewords Chinese Corpus Set 1 (each with an OpenSLR domestic mirror) |
Laughter Detector | github | |
New version of Common Voice speech recognition dataset | Includes over 1,400 hours of speech samples from 42,000 contributors | link |
speech-aligner | A tool for generating phoneme-level time-aligned annotations from "human voice" and its "language text" | github |
ASR Phonetic Dictionary/Dictionary | github | |
Speech Sentiment Analysis | github | |
masr | Chinese speech recognition, providing pre-trained models and high recognition rate | github |
Chinese Text Normalization for Speech Recognition | github | |
Speech quality evaluation indicators (MOSNet, BSSEval, STOI, PESQ, SRMR) | github | |
Chinese/English pronunciation dictionary for speech recognition | github | |
CoVoST Facebook released a multilingual speech-to-text translation corpus | Includes audio, text transcription and English translation in 11 languages (French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian and Chinese) | github |
Parakeet Text-to-speech Synthesis based on PaddlePaddle | github | |
(Java) Accurate Speech Natural Language Detection Library | github | |
CoVoST Facebook released a multilingual speech-to-text translation corpus | github | |
Text-to-speech synthesis implemented in TensorFlow 2 | github | |
Python audio feature extraction package | github | |
ViSQOL | An objective, full-reference metric for perceived audio quality, with two modes: audio and speech | github |
zhrtvc | Easy-to-use Chinese voice cloning and Chinese speech synthesis system | github |
aukit | A useful speech processing toolbox, including speech noise reduction, audio format conversion, feature spectrum generation and other modules | github |
phkit | A useful phoneme processing toolbox, including Chinese phonemes, English phonemes, text-to-pinyin, text regularization and other modules | github |
zhvoice | Chinese speech corpus, with clearer and more natural speech, including 8 open source data sets, 3,200 speakers, 900 hours of speech, and 13 million words | github |
Audio processing toolkit | For speech activity detection, diarization, speaker recognition, automatic speech recognition, emotion recognition and other tasks | github |
Deep Learning Emotional Text-to-Speech Synthesis | github | |
Python Audio Data Augmentation Library | github | |
Audio Enhancement Based on Large-Scale Audio Dataset | github | |
Voice transfer | github |
Resource Name | Description | Link |
---|---|---|
LayoutLM-v3 document understanding model | github | |
PyLaia is a deep learning toolkit for handwritten document analysis | github | |
Single document unsupervised keyword extraction | github | |
DocSearch Free Document Search Engine | github | |
fdfgen | Able to automatically create PDF documents and fill in information | link |
pdfx | Automatically extract cited references and download the corresponding pdf files | link |
invoice2data | Invoice pdf information extraction | invoice2data |
PDF document information extraction | github | |
PDFMiner | PDFMiner can get the exact location of text in the page, as well as other information such as fonts or lines. It also has a PDF converter that can convert PDF files into other text formats (such as HTML), and an extensible PDF parser that can be used for purposes other than text analysis. | link |
PyPDF2 | A Python PDF library that can split, merge, crop and transform the pages of PDF files; it can also add custom data, viewing options and passwords, retrieve text and metadata, and merge whole files together (see the sketch after this table) | link |
ReportLab | ReportLab is a fast way to create PDF documents. A time-proven, super easy-to-use open source project for creating complex, data-driven PDF documents and custom vector graphics. It's free, open source, and written in Python. The package is downloaded more than 50,000 times a month, is part of the standard Linux distribution, embedded in many products, and was chosen to power Wikipedia's print/export functionality. | link |
SIMPdf | A simple PDF text editor written in Python | github |
pdf-diff | PDF file diff tool can display the differences between two PDF documents | github |
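As a quick illustration of the PyPDF2 entry above, here is a minimal sketch. It assumes a recent PyPDF2 (3.x) release, where PdfReader/PdfWriter are the public classes, and a local input.pdf used as a placeholder file name.

```python
# Minimal PyPDF2 sketch: read text from the first page and write a
# single-page copy. Assumes PyPDF2 >= 3.x and a local "input.pdf".
from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
print(len(reader.pages))                  # number of pages
print(reader.pages[0].extract_text())     # text of the first page

writer = PdfWriter()
writer.add_page(reader.pages[0])          # keep only the first page
with open("first_page.pdf", "wb") as f:
    writer.write(f)
```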
Resource Name | Description | Link |
---|---|---|
Use unet to automatically detect and rebuild document tables | github | |
pdftabextract | Used for table information analysis after OCR recognition, very powerful | link |
tabula-py | Converts tables in PDFs directly into pandas DataFrames; available in both Java and Python versions | |
Camelot | PDF table analysis | link |
pdfplumber | PDF table analysis | |
PubLayNet | Able to divide paragraphs, recognize tables and pictures | link |
Extracting tabular data from papers | github | |
Finding answers in tables with BERT | github | |
Series of articles on table question answering | | Introduction Model Final Chapter |
Generate tabular data using GAN (English only) | github | |
carefree-learn(PyTorch) | Tabular Dataset Automated Machine Learning (AutoML) Package | github |
Closed field fine-tuning table detection | github | |
PDF table data extraction tool | github | |
TaBERT: A new model for understanding queries on tabular data | paper | |
Form Processing | Awesome-Table-Recognition | github |
Resource Name | Description | Link |
---|---|---|
Sentence, QA Similarity Matching MatchZoo | A collection of text similarity matching algorithms, including multiple deep learning methods, which are worth trying. | github |
Chinese Question Sentence Similarity Calculation Competition and Solution Summary | github | |
Similarity calculation toolkit | Written in java, it is used for similarity calculations related to words, phrases, sentences, lexical analysis, sentiment analysis, semantic analysis, etc. | github |
Chinese word similarity calculation method | Combines the word similarity calculation methods of the extended Synonym Cilin and HowNet, with wider vocabulary coverage and more accurate results | github |
Python string similarity algorithm library | github | |
Similar-sentence judgment model based on a Siamese BiLSTM, with training and test datasets | Provides 100,000 training samples | github |
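Before reaching for the dedicated libraries above, a standard-library baseline is often enough for a sanity check. The sketch below uses Python's built-in difflib and is not the API of any listed package.

```python
# Standard-library string similarity baseline (difflib), useful as a
# reference point next to the dedicated matching libraries listed above.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a ratio in [0.0, 1.0] based on longest matching blocks."""
    return SequenceMatcher(None, a, b).ratio()

print(similarity("How do I reset my password?", "How can I reset the password?"))
```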
Resource Name | Description | Link |
---|---|---|
Chinese NLP Data Augmentation (EDA) Tool | github | |
English NLP Data Augmentation Tools | github | |
One-click Chinese data augmentation tool | github | |
The application and effect of data augmentation in machine translation and other NLP tasks | link | |
NLP Data Augmentation Resource Set | github |
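To show the flavor of EDA-style augmentation implemented by the tools above, here is an illustrative pure-Python sketch of two simple operations (random swap and random deletion). It is not the interface of any listed tool.

```python
# Illustrative EDA-style operations (random swap / random deletion);
# not the API of any augmentation tool listed above.
import random

def random_swap(tokens, n_swaps=1, seed=None):
    """Swap two random token positions n_swaps times."""
    rng = random.Random(seed)
    tokens = list(tokens)
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1, seed=None):
    """Drop each token with probability p, keeping at least one token."""
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(list(tokens))]

sentence = ["the", "quick", "brown", "fox", "jumps"]
print(random_swap(sentence, seed=0))
print(random_deletion(sentence, p=0.3, seed=0))
```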
Resource Name | Description | Link |
---|---|---|
Regular expression for extracting email | Already integrated into the Python package cocoNLP; welcome to try it | |
Extract phone_number | Already integrated into the Python package cocoNLP; welcome to try it | |
Regular expression for extracting ID number | IDCards_pattern = r'^([1-9]\d{5}[12]\d{3}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX])$'; IDs = re.findall(IDCards_pattern, text, flags=0) (see the usage sketch after this table) | |
IP address regular expression | (25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d).(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d).(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d).(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d) | |
Tencent QQ number regular expression | [1-9]([0-9]{5,11}) | |
Domestic landline number regular expression | [0-9-()()]{7,18} | |
Username regular expression | [A-Za-z0-9_\-\u4e00-\u9fa5]+ | |
Domestic phone number regular expression matching (three major operators + virtual, etc.) | github | |
Regular Expression Tutorial | github |
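A short sketch applying two of the patterns above with Python's built-in re module. The sample text is invented for illustration, and the ID-card pattern is rewritten with non-capturing groups so that re.findall returns whole matches rather than sub-groups.

```python
# Applying the ID-card and IP-address patterns from the table above.
# The sample string is made up; non-capturing groups keep findall output
# as full matches.
import re

text = "server 192.168.1.10, ID 11010519491231002X"

ip_pattern = r"(?:25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)(?:\.(?:25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)){3}"
id_pattern = r"[1-9]\d{5}[12]\d{3}(?:0[1-9]|1[012])(?:0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX]"

print(re.search(ip_pattern, text).group(0))  # 192.168.1.10
print(re.findall(id_pattern, text))          # ['11010519491231002X']
```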
Resource Name | Description | Link |
---|---|---|
Efficient fuzzy search tool | github | |
A large list/search engine of BERT models for various languages/tasks | link | |
Deepmatch is a deep matching model library for recommendation, advertising and search | github | |
wwsearch is a full-text search engine developed by WeChat for Enterprise | github | |
aili - the fastest in-memory index in the East | github | |
Efficient string matching tool RapidFuzz | A fast string matching library for Python and C++ that uses the string similarity calculations from FuzzyWuzzy (see the sketch after this table) | github |
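A minimal usage sketch for the RapidFuzz entry above, assuming the rapidfuzz package is installed; fuzz.ratio and process.extract are its core documented calls.

```python
# Fuzzy matching with RapidFuzz: a pairwise score plus a best-match search
# over a small candidate list.
from rapidfuzz import fuzz, process

print(fuzz.ratio("natural language", "natural langauge"))   # 0-100 similarity score

choices = ["text search", "text matching", "image search"]
print(process.extract("txt search", choices, limit=2))      # top-2 fuzzy matches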
Resource Name | Description | Link |
---|---|---|
AllenNLP reading comprehension supports a variety of data and models | github |
Resource Name | Description | Link |
---|---|---|
Aspect Sentiment Analysis Package | github | |
awesome-nlp-sentiment-analysis | Sentiment analysis, emotion cause identification, evaluation object and evaluation word extraction | github |
Sentiment analysis technology enables intelligent customer service to better understand human emotions | github |
Resource Name | Description | Link |
---|---|---|
Chinese event extraction | github | |
List of literature resources on NLP event extraction | github | |
BERT Event Extraction (ACE 2005 corpus) implemented in PyTorch | github | |
News event clue extraction | github |
Resource Name | Description | Link |
---|---|---|
Wudao Dictionary | The command line version of Youdao Dictionary, supporting English-Chinese and online search | github |
NLLB | NLLB language model that supports translation between 200+ languages | link |
Easy-Translate | Script for translating large text files locally, based on Facebook/Meta AI's M2M100 and NLLB200 models, supporting 200+ languages (see the sketch after this table) | github |
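One way to try the underlying NLLB model locally is through Hugging Face transformers. The sketch below is hedged: it is not the Easy-Translate CLI itself, and it assumes the transformers package, enough memory to load the public facebook/nllb-200-distilled-600M checkpoint, and NLLB's FLORES-style language codes.

```python
# Hedged sketch: Chinese-to-English translation with an NLLB checkpoint via
# the transformers translation pipeline. Model name and language codes
# assume the public facebook/nllb-200-distilled-600M release.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="zho_Hans",
    tgt_lang="eng_Latn",
)
print(translator("自然语言处理让机器理解人类语言。")[0]["translation_text"])
```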
Resource Name | Description | Link |
---|---|---|
The best tool for converting between Chinese numerals (Chinese character numbers) and Arabic numerals | github |
Quickly convert "Chinese numbers" and "Arabic numbers" | github | |
Parse natural language numeric strings into integers and floating point numbers | github |
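To illustrate what these converters do, here is a toy sketch. It is not any listed library's API and only covers simple numerals up to the thousands.

```python
# Toy converter for simple Chinese numerals such as 三百二十一 -> 321.
# Illustrative only; real libraries handle 万/亿, mixed forms, floats, etc.
DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}

def cn_to_int(s: str) -> int:
    total, current = 0, 0
    for ch in s:
        if ch in DIGITS:
            current = DIGITS[ch]
        elif ch in UNITS:
            # A bare unit implies 1, e.g. 十三 -> 13
            total += max(current, 1) * UNITS[ch]
            current = 0
        else:
            raise ValueError(f"unsupported character: {ch}")
    return total + current

assert cn_to_int("三百二十一") == 321
assert cn_to_int("十三") == 13
```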
Resource Name | Description | Link |
---|---|---|
Chinese coreference resolution data | Baidu Netdisk extraction code: a0qq | github baidu |
Resource Name | Description | Link |
---|---|---|
TextCluster | Short-text clustering and preprocessing module (see the sketch after this table) | github |
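As a generic reference point for short-text clustering (not the TextCluster API), a TF-IDF + KMeans baseline with scikit-learn looks like this; the toy sentences are made up for illustration.

```python
# Generic short-text clustering baseline: TF-IDF features + KMeans.
# Assumes scikit-learn is installed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the weather is nice today",
    "it will rain tomorrow",
    "the stock market fell",
    "shares dropped sharply",
]

X = TfidfVectorizer().fit_transform(docs)                             # sparse TF-IDF matrix
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(list(zip(docs, labels)))
```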
Resource Name | Description | Link |
---|---|---|
NeuralNLP-NeuralClassifier Tencent open source deep learning text classification tool | github |
Resource Name | Description | Link |
---|---|---|
Graphbrain | An open source software library and research tool that aims to facilitate automatic meaning extraction and text understanding, as well as knowledge exploration and inference | github |
(Harvard) Free book on causal reasoning | | |
Resource Name | Description | Link |
---|---|---|
A library of state-of-the-art interpreters for textual machine learning models | github |
Resource Name | Description | Link |
---|---|---|
TextAttack: A framework for adversarial attacks on natural language processing models | github | |
OpenBackdoor: Text backdoor attack and defense toolkit | OpenBackdoor is developed based on Python and PyTorch, which can be used to reproduce, evaluate and develop algorithms related to text backdoor attack and defense | github |
Resource Name | Description | Link |
---|---|---|
Scattertext text visualization (python) | github | |
interactive visualization of whatlies word vectors | spacy tools | |
PySS3 SS3 text classifier machine visualization tool for explainable AI | github | |
Rendering 3D images with Notepad | github | |
attnvis Visualization of attention interactions of transformer language models such as GPT2 and BERT | github | |
Texthero text data efficient processing package | Including preprocessing, keyword extraction, named entity recognition, vector space analysis, text visualization, etc. | github |
Resource Name | Description | Link |
---|---|---|
A review of NLP annotation platforms | github | |
brat rapid annotation tool sequence annotation tool | link | |
Poplar web version natural language annotation tool | github | |
LIDA lightweight interactive dialogue annotation tool | github | |
doccano is an open source collaborative multilingual text annotation tool based on the web | github | |
Datasaurai online data annotation workflow management tool | link |
Resource Name | Description | Link |
---|---|---|
langid | Detects 97 languages (see the usage sketch after this table) | https://github.com/saffsd/langid.py |
langdetect | Language Detection | https://code.google.com/archive/p/language-detection/ |
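Both libraries above expose one-call detection; a minimal usage sketch, assuming `pip install langid langdetect`:

```python
# One-line language identification with langid and langdetect.
import langid                   # classify() returns (language_code, score)
from langdetect import detect   # detect() returns a language code string

print(langid.classify("这是一句中文。"))
print(detect("This is an English sentence."))
```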
Resource Name | Description | Link |
---|---|---|
jieba | Popular Chinese word segmentation library (see the usage sketch after this table) | jieba |
hanlp | HanLP multilingual natural language processing toolkit | hanlp |
nlp4han | Chinese natural language processing toolset (sentence segmentation / word segmentation / part-of-speech tagging / chunking / syntactic analysis / semantic analysis / NER / N-gram / HMM / pronoun resolution / sentiment analysis / spelling check) | github |
Progress in Hate Speech Detection | link | |
Bert application based on Pytorch | Including named entity recognition, sentiment analysis, text classification, and text similarity | github |
Some basic models about natural language | github | |
Template code for sequence labeling and text classification using BERT | github | |
jieba_fast accelerated version of jieba | github | |
StanfordNLP | Pure Python version of natural language processing package | link |
Python Spoken Natural Language Processing Toolkit (English) | github | |
PreNLP natural language preprocessing library | github | |
Some papers and codes related to nlp | Including topic model, word embedding, named entity recognition (NER), text classification, text generation, text similarity calculation, etc., involving various NLP-related algorithms, based on keras and tensorflow | github |
Python Text Mining/NLP Practical Examples | github | |
Forte is a flexible and powerful natural language processing pipeline toolkit | github | |
stanza Stanford team NLP tool | Can handle more than 60 languages | github |
Fancy-NLP is a text knowledge mining tool for building product portraits | github | |
A comprehensive and easy-to-use Chinese NLP toolkit | github | |
Reproduction of the DSSM-based vectorized recall pipeline commonly used in industry | github |
Texthero text data efficient processing package | Including preprocessing, keyword extraction, named entity recognition, vector space analysis, text visualization, etc. | github |
NLPGNN graph neural network natural language processing toolbox | github | |
Macadam | A natural language processing toolkit based on Tensorflow (Keras) and bert4keras, focusing on text classification, sequence labeling and relation extraction | github |
LineFlow is an efficient NLP data loader for all deep learning frameworks | github | |
Arabica: Python text data exploratory analysis toolkit | github | |
Python stress testing tool: SMSBoom | github |
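As a quick illustration of the jieba entry near the top of this table, segmentation is a one-liner; lcut and cut are its core documented calls.

```python
# Chinese word segmentation with jieba: precise mode and full mode.
import jieba

print(jieba.lcut("自然语言处理很有趣"))                      # precise mode -> list of tokens
print(list(jieba.cut("自然语言处理很有趣", cut_all=True)))   # full mode
```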
Resource Name | Description | Link |
---|---|---|
Wang Feng Lyrics Generator | phunterlau/wangfeng-rnn | |
Analysis of Girlfriend's Emotional Fluctuations | github | |
NLP is too difficult series | github | |
Variable naming helper tool | github link |
Image text removal, can be used for comic translation | github | |
CoupletAI - Couplet Generation | Automatic couplet system based on CNN+Bi-LSTM+Attention | github |
Solving complex mathematical equations using neural network symbolic reasoning | github | |
Question-answering robot based on 140,000 song knowledge base | Features include lyrics chain, finding songs with known lyrics, and questions and answers about the triangle relationship between songs, singers, and lyrics. | github |
COPE - Metrical Poetry Editing Program | github | |
Paper2GUI | An AI desktop APP toolbox for ordinary people, which can be used immediately after installation. It supports 18+ AI models, covering speech synthesis, video frame interpolation, video super-resolution, object detection, image stylization, OCR recognition and other fields. | github |
Politeness Estimator (Trained using Sina Weibo data) | github paper | |
Getting Started with Python | Chinese programming language | homepage gitee |
Resource Name | Description | Link |
---|---|---|
Natural Language Processing Report | link | |
Knowledge Graph Report | link | |
Data mining report | link | |
Autonomous Driving Report | link | |
Machine Translation Report | link | |
Blockchain Report | link | |
Robot Report | link | |
Computer Graphics Report | link | |
3D Printing Report | link | |
Face Recognition Report | link | |
Artificial Intelligence Chip Report | link | |
CS224N Deep Learning Natural Language Processing Course | Course materials plus a PyTorch implementation of the models in the course | link link |
A hands-on tutorial on natural language processing for deep learning researchers | github | |
"Natural Language Processing" by Jacob Eisenstein | github | |
ML-NLP | Knowledge points and code implementations commonly tested in machine learning and NLP interviews | github |
NLP task example project code set | github | |
Review of NLP highlights in 2019 | download | |
nlp-recipes Microsoft produced - Natural Language Processing Best Practices and Examples | github | |
Transfer Learning in Natural Language Processing (NLP) | youtube | |
Machine Learning Systems Book | link github |
Resource Name | Description | Link |
---|---|---|
NLPer-Arsenal | NLP competition, including current competition information, past competition plans, etc., continuously updated | github |
Review the top solutions of all NLP competitions | github | |
Baidu's 2019 Triple Extraction Competition, "Science Space Team" source code (7th place) | github |
Resource Name | Description | Link |
---|---|---|
BDCI2019 Financial Negative Information Determination | github | |
Open source financial investment data extraction tool | github | |
A large list of natural language processing research resources in the financial field | github | |
A chatbot based on the finance-judicial field (also for small talk) | github | |
Demonstration of the process of constructing a small financial knowledge graph | github |
Resource Name | Description | Link |
---|---|---|
Chinese Medical NLP Public Resources | github | |
spaCy Medical Text Mining and Information Extraction | github | |
Building a model for medical entity recognition | Contains dictionary and corpus annotation, based on Python | github |
Question answering system based on medical knowledge graph | This repo refers to github | github |
Chinese medical dialogue data Chinese medical dialogue data set | github | |
A large-scale medical conversation dataset | Contains 1.1 million medical consultations and 4 million doctor-patient conversations | github |
COVID-19 related data | Chinese medical dialogue dataset of COVID-19 and other types of pneumonia; open data sources from Tsinghua University and other institutions (COVID-19) | github github |
Resource Name | Description | Link |
---|---|---|
Blackstone’s spaCy pipeline and NLP models for unstructured legal text | github | |
Legal Intelligence Literature Resource List | github | |
A chatbot based on the finance-judicial field (also with the nature of small talk) | github | |
Crime legal terms and classification model | Contains 856 crime knowledge graphs, crime prediction based on a 2.8 million crime training database, 13 types of question classification and legal information Q&A function based on 200,000 legal Q&A pairs | github |
A large list of legal NLP related resources | github |
Resource Name | Description | Link |
---|---|---|
Dalle-mini | A mini version of DALL·E that generates images based on text prompts | github |
Resource Name | Description | Link |
---|---|---|
phone | Chinese mobile phone number location lookup | ls0f/phone |
phone | International mobile phone and phone location query | AfterShip/phone |
ngender | Determine gender based on name | observerss/ngender |
An overview of the differences between Chinese and English natural language processing (NLP) | link | |
Technical documents PDF or PPT shared by experts in major companies | github | |
comparxiv | A command-line tool for comparing the differences between two submitted versions of an arXiv paper | pypi |
Meta-architecture of CHAMELEON deep learning news recommendation system | github | |
Automatic resume screening system | github | |
Multiple text readability evaluation indicators implemented in Python | github |