forked from tencent-ailab/persona-hub
Project Persona committed Jun 28, 2024 (0 parents, commit e12241e)
Showing 10 changed files with 375,042 additions and 0 deletions.
@@ -0,0 +1,42 @@
# Scaling Synthetic Data Creation with 1,000,000,000 Personas
<a href="PersonaHub.pdf"><img src="https://img.shields.io/badge/Paper-arXiv-red?style=for-the-badge" height=22.5></a> <a href="https://huggingface.co/datasets/proj-persona/PersonaHub"><img src="https://img.shields.io/badge/Hugging-Face-yellow?style=for-the-badge" height=22.5></a>

## Introduction
We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce **PERSONA HUB**, a collection of **1 billion diverse personas** automatically curated from web data. These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. We showcase PERSONA HUB's use cases in synthesizing high-quality **mathematical and logical reasoning** problems, **instructions** (i.e., user prompts), **knowledge-rich texts**, **game NPCs** and **tools** (functions) at scale. These use cases demonstrate that persona-driven data synthesis is versatile, scalable, flexible, and easy to use. It could drive a paradigm shift in synthetic data creation and its practical applications, which may have a profound impact on LLM research and development.

<div align="center">
<img src="./assets/persona_overview.png" width="90%">
</div>

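The persona-driven idea described in the introduction can be sketched as pairing a persona description with a task-specific request before sending the result to an LLM. The snippet below is a minimal illustration only: the template wording, the `TEMPLATES` table, and the function names `build_prompt` and `build_prompts` are hypothetical and are not taken from the paper or the released data. No model is called; the code only assembles prompt strings.

```python
# A minimal sketch of persona-driven prompt construction.
# Assumption: the template wording and all names below are hypothetical,
# chosen for illustration; they are not the authors' actual prompts.

TEMPLATES = {
    "math": "Create a challenging math problem that the following persona might encounter.",
    "instruction": "Guess a prompt that the following persona might ask an assistant.",
    "npc": "Create a game NPC inspired by the following persona.",
}


def build_prompt(persona: str, task: str) -> str:
    """Combine one persona description with one task template."""
    if task not in TEMPLATES:
        raise ValueError(f"unknown task: {task!r}")
    persona = persona.strip()
    if not persona:
        raise ValueError("persona description must not be empty")
    return f"{TEMPLATES[task]}\n\nPersona: {persona}"


def build_prompts(personas: list[str], task: str) -> list[str]:
    """Build one prompt per persona; different personas yield different prompts."""
    return [build_prompt(p, task) for p in personas]


if __name__ == "__main__":
    for prompt in build_prompts(["a pediatric nurse", "a freight train dispatcher"], "math"):
        print(prompt)
        print("---")
```

Because the persona is the only part that varies, scaling the number of personas scales the number of distinct prompts, which is the property the introduction relies on.
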
## Data Release
### Synthetic Data Samples
To facilitate research in persona-driven data synthesis, we are initially releasing the following synthetic data samples, which we created with various personas:
* **50,000 math problems**
* **50,000 logical reasoning problems**
* **50,000 instructions**
* **10,000 knowledge-rich texts**
* **10,000 game NPCs**
* **5,000 tools (functions)**

### Persona Hub
We also release a subset of our PERSONA HUB, including:
* **200,000 personas**

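As a quick sanity check on the release sizes listed above, the counts can be tallied in a few lines. The sketch below also compares the total with the 375,042 additions and the 42-line README hunk reported for this commit. That comparison assumes each released record occupies exactly one line in its data file, which the README does not state, so treat the match as suggestive rather than confirmed.

```python
# Tally the released sample counts stated in the README.
RELEASED_SAMPLES = {
    "math problems": 50_000,
    "logical reasoning problems": 50_000,
    "instructions": 50_000,
    "knowledge-rich texts": 10_000,
    "game NPCs": 10_000,
    "tools (functions)": 5_000,
}
RELEASED_PERSONAS = 200_000

# Figures reported on the commit page itself.
COMMIT_ADDITIONS = 375_042
README_LINES = 42  # from the hunk header "@@ -0,0 +1,42 @@"

synthetic_total = sum(RELEASED_SAMPLES.values())
record_total = synthetic_total + RELEASED_PERSONAS

# Assumption: one record per line in each data file (not stated in the README).
lines_if_one_record_per_line = record_total + README_LINES

if __name__ == "__main__":
    print(f"synthetic samples: {synthetic_total:,}")
    print(f"samples + personas: {record_total:,}")
    print(f"matches commit additions: {lines_if_one_record_per_line == COMMIT_ADDITIONS}")
```
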
## Citation
If you find our work useful, please consider citing our paper:
```
@article{tencent2024persona,
  title={Scaling Synthetic Data Creation with 1,000,000,000 Personas},
  author={{Tencent AI Lab Seattle}},
  year={2024}
}
```

## Contact
Please email `[email protected]` or `[email protected]`

## Disclaimer
PERSONA HUB can facilitate synthetic data creation at billion scale to simulate diverse inputs (i.e., use cases) from a wide variety of real-world users. If this data is used as input to query a target LLM to obtain its outputs at scale, there is a high risk that the LLM's knowledge, intelligence and capabilities will be dumped and easily replicated, thereby challenging the leading position of the most powerful LLMs. Our released data is for research purposes only. It is crucial to avoid misuse and to ensure ethical and responsible application, so as to prevent privacy violations and other ethical concerns.

The released data is all generated by publicly available models (GPT-4, Llama-3 and Qwen), and is provided free of charge. Accordingly, Tencent and its licensors provide the data AS-IS, without warranty of any kind, express or implied. The views and opinions expressed in the data do not necessarily reflect those of Tencent.