CrowdData is an open repository that aggregates crowdsourced datasets that include individual crowd votes. We aim to provide the available datasets in a standard format (explained in the Download section below) so that they can be used directly in experiments without any preprocessing workload. The datasets in this repo serve classification tasks (mainly text classification, except the Emotion dataset). CrowdData can benefit researchers investigating the hybrid use of machines and humans in the loop for classification tasks (the repo includes 5 datasets having the actual content of the tasks), human judgment in classification and ranking tasks, truth discovery based on crowdsourced data, estimation of crowd bias, and active learning. If you use any of the datasets in this repository, please make sure that you have read and followed the usage consent explained at the bottom of this page.
We categorized the datasets into two folders: `binary-classification` and `multi-class-classification`. Within each folder, each dataset is kept in a separate folder that links to the original source. The table below gives an overview of the datasets. Its columns are as follows:
- Dataset: Name of the dataset, including a link to the original source.
- Description: A brief description of the dataset.
- Number of tasks: The number of tasks asked to the crowd.
- Number of workers: The number of crowd workers who completed the tasks.
- Number of total votes: The number of votes collected across all tasks.
- Ground Truth: Whether the ground truths of the corresponding tasks are available in the dataset (yes, no, or partially).
- Task Type: Type of the task asked to the crowd, either a binary or a multi-class question. For a multi-class question, we specify whether it is categorical (with the number of categories), interval (with the range), or ordinal (with the number of classes).
- Task Content: The kind of content shown to the crowd (text, image, etc.) and whether that content is available in the dataset (available, unavailable, or partially available).
- I don't know option: Whether the crowd workers had an "I don't know" option while completing the tasks.
- Time spent on the task: Whether the dataset includes any information about the time spent on the tasks.

Dataset | Description | Number of tasks | Number of workers | Number of total votes | Ground Truth | Task Type | Task Content | I don't know option | Time spent on the task |
---|---|---|---|---|---|---|---|---|---|
Blue Birds | The task is to identify whether the image contains a blue bird or not. The dataset contains both the individual votes and the ground truths. | 108 | 39 | 4212 | Yes | binary | image, unavailable | No | No |
Crowdsourced Amazon Sentiment | The task is to perform sentiment analysis on Amazon product reviews. There are two predicates: "is_book", "is_negative". | 1011 | 284 | 7803 | Yes | binary | text, available | No | Unavailable |
Crowdsourced loneliness-slr | Each paper is assessed by three questions: (i) Is it related to the use of technology? (ii) Is it related to older adults? (iii) Is it related to the intervention? | 319 | 34 | 797 | Yes | binary | text, unavailable | Yes | Unavailable |
HITspam-UsingCrowdflower | The dataset contains individual worker judgments and the related ground truths about whether a HIT (from Crowdflower data) should be considered as a "spam" task. | 5380 | 153 | 42762 | Partially | binary | text, unavailable | No | Unavailable |
HITspam-UsingMTurk | The dataset contains individual worker judgments and the related ground truths about whether a HIT (from MTurk data) should be considered as a "spam" task. | 5840 | 135 | 28354 | Partially | binary | text, unavailable | No | Unavailable |
Recognizing Textual Entailment | Recognizing Textual Entailment dataset contains the individual worker judgments and the related ground truths about identifying whether a given Hypothesis sentence is implied by the information in the given text. | 800 | 164 | 8000 | Yes | binary | text, available | No | Unavailable |
Sentiment popularity - AMT | This dataset contains positive or negative judgments of workers for 500 sentences extracted from movie reviews, with gold labels assigned by the website. | 500 | 143 | 10000 | Yes | binary | text, unavailable | No | Yes |
Temporal Ordering | Temporal Ordering dataset contains the individual worker votes and the corresponding ground truths for the task of identifying whether one event happens before another event in a given context. | 462 | 76 | 4620 | Yes | binary | text, partially available | No | Unavailable |
Text Highlighting | This dataset contains two kinds of tasks: (i) classification tasks with highlighting support, and (ii) highlighting tasks, where the workers highlight evidence. | 685 | 1851 | 27711 | Yes | binary | text, available | Maybe option | Available |
Toloka Aggregation Relevance 2 | This dataset contains approximately 0.5 million anonymized individual votes that were collected in the "Relevance 2 Gradations" project in 2016. | 99319 | 7139 | 475536 | Partially | binary | text, unavailable | No | Unavailable |
2010 Crowdsourced Web Relevance Judgments Data | The dataset contains judgments about the relevance of English Web pages from the ClueWeb09 collection (http://lemurproject.org/clueweb09/). The judgments use a 3-level scale: highly relevant, relevant, and non-relevant. A fourth judgment option indicated a broken link that could not be judged. | 20232 | 766 | 98453 | Yes | multi, 3 classes | text, unavailable | No | Unavailable |
AdultContent2 | This dataset contains approximately 100K individual worker judgments and the related ground truths for classification of websites into 5 categories. | 11040 | 269 | 92721 | Partially | multi, 5 categories | text, unavailable | No | Unavailable |
AdultContent3 | This dataset contains approximately 50K individual worker judgments and the related ground truths for classification of websites into 4 categories. | 500 | 100 | 50000 | No | multi, 4 categories | text, unavailable | No | Unavailable |
Emotion | This dataset contains individual worker votes that rate the emotion of a given text, based on the following: anger, disgust, fear, joy, sadness, surprise, valence. Each rating contains a value from -100 to 100 for each emotion about the text. | 700 | 10 | 7000 | Yes | multi, interval (-100,100) | text, available | No | Unavailable |
Toloka Aggregation Relevance 5 | This dataset contains the judgments on the relevance of a document for a query on a 5-graded scale. | 363814 | 1274 | 1091918 | Partially | multi, 5 classes | text, unavailable | No | Unavailable |
Weather Sentiment - AMT | This dataset contains the sentiment judgments of 300 tweets. The classification task is based on the following categories: negative (0), neutral (1), positive (2), tweet not related to weather (3) and can't tell (4). | 300 | 110 | 6000 | Yes | multi, 5 classes | text, unavailable | Yes | Yes |
Word Pair Similarity | This dataset contains the individual worker votes that assign a numerical similarity score between 0 and 10 to a given word pair. | 30 | 10 | 300 | Yes | multi, interval (0,10) | text, unavailable | No | Unavailable |

We provide two Python scripts that download all the datasets and then transform them to a standard format. Run `download_datasets.py` first, and then `transform_datasets.py`. The required Python version is 3.7, and the following modules must be available on your system: `os`, `pandas`, `wget`, `zipfile`, `tarfile`, `re`, `platform`, and `shutil`. Of these, only `pandas` and `wget` are third-party packages; the rest are part of the Python standard library.
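
Before running the scripts, it can help to confirm that the listed modules are importable in your environment. The following is a minimal sketch, not part of this repository: the helper name `missing_modules` is hypothetical, and the module list is copied from the requirements above.

```python
import importlib.util

# Module names as listed in this README. Only pandas and wget are
# third-party; the others ship with the Python standard library.
REQUIRED_MODULES = [
    "os", "pandas", "wget", "zipfile", "tarfile", "re", "platform", "shutil",
]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    # Hypothetical helper for illustration; find_spec returns None when a
    # top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

If the check reports `pandas` or `wget` as missing, install them with your usual package manager before running `download_datasets.py`.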
Running the two scripts in the given order creates one CSV file within each dataset folder. These CSV files share a standard format with the following columns, in order:
- `workerID`: ID of the crowd worker.
- `taskID`: ID of the task answered by the corresponding worker.
- `response`: Response of the corresponding worker to the task identified by `taskID`.
- `goldLabel`: Gold label of the corresponding task (if available).
- `taskContent`: Content of the task answered by the worker (if available).
Only the Sentiment popularity - AMT and Weather Sentiment - AMT datasets have an additional column:
- `timeSpent`: The time the corresponding worker spent on the task.
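
To illustrate how the standard columns described above can be used, here is a minimal sketch that aggregates worker responses by majority vote and scores the result against the gold labels. The sample rows are made up for illustration, the function names are hypothetical, and the sketch assumes that a missing gold label appears as an empty field. It uses only the standard library; `pandas.read_csv` would work equally well on the real files.

```python
import csv
import io
from collections import Counter, defaultdict

# Made-up rows in the standard column layout. Real files are produced by
# transform_datasets.py and would be opened from disk instead.
SAMPLE_CSV = """workerID,taskID,response,goldLabel,taskContent
w1,t1,1,1,first example text
w2,t1,1,1,first example text
w3,t1,0,1,first example text
w1,t2,0,0,second example text
w2,t2,0,0,second example text
w3,t2,1,0,second example text
"""


def majority_vote(rows):
    """Map each taskID to its most frequent response."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[row["taskID"]][row["response"]] += 1
    return {task: counts.most_common(1)[0][0] for task, counts in votes.items()}


def accuracy(predictions, rows):
    """Fraction of tasks whose aggregated label matches the gold label.

    Assumption: tasks without a gold label have an empty goldLabel field
    and are skipped. Returns None if no task has a gold label.
    """
    gold = {r["taskID"]: r["goldLabel"] for r in rows if r["goldLabel"] != ""}
    scored = [task for task in predictions if task in gold]
    if not scored:
        return None
    return sum(predictions[t] == gold[t] for t in scored) / len(scored)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
predictions = majority_vote(rows)
print(predictions)
print(accuracy(predictions, rows))
```

Majority voting is only one simple aggregation baseline; the same per-vote layout also supports worker-level analyses, since every row keeps its `workerID`.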
P.S. If the original dataset includes multiple predicates for a task, we create one CSV file per predicate in the transformed version of the dataset.
(To obtain the resulting CSV files accurately, do not modify any of the directory names or dataset files you downloaded from this repo.)
By using this tool you agree to acknowledge the original datasets and to check their terms and conditions. Some data providers may require authentication, filling in forms, etc. For convenience, we include a link to the original source both in the table above and in the individual repository folders.