Lawbda Co.
- PRC::LiaoNing
- pages.lawbda.org
- @LaWbda
🍭Data
GH Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.
A large dataset of 4.2M Java source code samples paired with their natural-language descriptions, for code search and code summarization studies.
newspaper3k is a library for news, full-text, and article metadata extraction in Python 3.
Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activelo…
Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA): https://circse.github.io/LT4HALA/
SikuBERT: a pre-trained language model for the Siku Quanshu (四库全书)
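Auto-generated Dash docset feed for .gitlab-ci.yml
An AI tool that automatically recommends commit messages.
Toutiao (今日头条) Chinese news text classification dataset
A dataset of 3,000,000+ examples for semantic understanding and matching. It can be used with unsupervised contrastive learning, semi-supervised learning, and similar methods to build the best-performing pre-trained models for the Chinese domain.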
Preprocessed Python functions and docstrings for automated code documentation (code2doc) and automated code generation (doc2code) tasks.
Code and data for "Impact of Evaluation Methodologies on Code Summarization" in ACL 2022.
A list and count of keywords in programming languages.
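Administrative divisions of the People's Republic of China: province level (provinces), prefecture level (cities), county level (districts and counties), township level (townships and subdistricts), and village level (village committees and residents' committees). Cascading address data for China's provinces, cities, districts, towns, and villages at two, three, four, and five levels.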
Semantic Code Search Tool based on Machine Translation
Cross-Domain Deep Code Search with Few-Shot Learning
A dataset based on all arXiv publications, pre-processed for NLP, including structured full text and a citation network
A catalog of more than 400 design patterns collected from multiple sources
Supplementary materials for the paper "Mining Attributes of Design Patterns: A Case Study on Online Posts"