CrowdData

CrowdData is an open repository that aggregates crowdsourced datasets containing individual crowd votes. We aim to provide the available datasets in a standard format (explained in the Download section below) so that they can be used directly in experiments, without any preprocessing effort. The datasets in this repo are for classification tasks (mainly text classification, except the Emotion dataset). CrowdData can benefit researchers investigating hybrid machine and human-in-the-loop approaches to classification (the repo includes 5 datasets that contain the actual content of the tasks), the role of humans in classification and ranking tasks, truth discovery from crowdsourced data, estimation of crowd bias, and active learning. If you use any of the datasets in this repository, please make sure that you have read and followed the usage consent explained at the bottom of this page.

Datasets

We categorized the datasets into two folders: binary-classification and multi-class-classification. Within each folder, each dataset is kept in a separate folder containing a link to the original source. The table below gives an overview of the datasets. The columns of the table are as follows:

Dataset: Name of the dataset including a link to the original source.
Description: A brief description of the dataset.
Number of tasks: The number of tasks asked to the crowd.
Number of workers: Number of crowd workers completing the tasks.
Number of total votes: Number of votes collected for all tasks.
Ground Truth: Whether the ground truths of the corresponding tasks are available in the dataset (yes, no, or partially).
Task Type: Type of the task asked to the crowd, either a binary or a multi-class question. For a multi-class question, we specify whether it is categorical (with the number of categories), interval (with the range), or ordinal (with the number of classes).
Task Content: Kind of content shown to the crowd (text, image, etc.), and whether that content is available in the dataset (available, unavailable, or partially available).
I don't know option: Whether the crowd workers had an "I don't know" option while completing the tasks.
Time spent on the task: Whether the dataset includes any information about the time spent on the tasks.

Dataset	Description	Number of tasks	Number of workers	Number of total votes	Ground Truth	Task Type	Task Content	I don't know option	Time spent on the task
Blue Birds	The task is to identify whether the image contains a blue bird or not. The dataset contains both the individual votes and the ground truths.	108	39	4212	Yes	binary	image, unavailable	No	No
Crowdsourced Amazon Sentiment	The task is to perform sentiment analysis on Amazon product reviews. There are two predicates: "is_book", "is_negative".	1011	284	7803	Yes	binary	text, available	No	Unavailable
Crowdsourced loneliness-slr	Each paper is assessed with three questions: (i) Is it related to the use of technology? (ii) Is it related to older adults? (iii) Is it related to the intervention?	319	34	797	Yes	binary	text, unavailable	Yes	Unavailable
HITspam-UsingCrowdflower	The dataset contains individual worker judgments and the related ground truths about whether a HIT (from Crowdflower data) should be considered a "spam" task.	5380	153	42762	Partially	binary	text, unavailable	No	Unavailable
HITspam-UsingMTurk	The dataset contains individual worker judgments and the related ground truths about whether a HIT (from MTurk data) should be considered a "spam" task.	5840	135	28354	Partially	binary	text, unavailable	No	Unavailable
Recognizing Textual Entailment	The Recognizing Textual Entailment dataset contains the individual worker judgments and the related ground truths about identifying whether a given Hypothesis sentence is implied by the information in the given text.	800	164	8000	Yes	binary	text, available	No	Unavailable
Sentiment popularity - AMT	This dataset contains positive or negative judgments of workers for 500 sentences extracted from movie reviews, with gold labels assigned by the website.	500	143	10000	Yes	binary	text, unavailable	No	Yes
Temporal Ordering	The Temporal Ordering dataset contains the individual worker votes and the corresponding ground truths for the task of identifying whether one event happens before another event in a given context.	462	76	4620	Yes	binary	text, partially available	No	Unavailable
Text Highlighting	This dataset contains two kinds of tasks: (i) classification tasks with highlighting support, and (ii) highlighting tasks, where the workers highlight evidence.	685	1851	27711	Yes	binary	text, available	Maybe option	Available
Toloka Aggregation Relevance 2	This dataset contains approximately 0.5 million anonymized individual votes that were collected in the "Relevance 2 Gradations" project in 2016.	99319	7139	475536	Partially	binary	text, unavailable	No	Unavailable
2010 Crowdsourced Web Relevance Judgments Data	The dataset contains the judgments about the relevance of English Web pages from the ClueWeb09 collection (http://lemurproject.org/clueweb09/). The judgments are based on 3 scales: highly relevant, relevant, and non-relevant. A fourth judgment option indicated a broken link which could not be judged.	20232	766	98453	Yes	multi, 3 classes	text, unavailable	No	Unavailable
AdultContent2	This dataset contains approximately 100K individual worker judgments and the related ground truths for classification of websites into 5 categories.	11040	269	92721	Partially	multi, 5 categories	text, unavailable	No	Unavailable
AdultContent3	This dataset contains approximately 50K individual worker judgments and the related ground truths for classification of websites into 4 categories.	500	100	50000	No	multi, 4 categories	text, unavailable	No	Unavailable
Emotion	This dataset contains individual worker votes that rate the emotion of a given text, based on the following: anger, disgust, fear, joy, sadness, surprise, valence. Each rating contains a value from -100 to 100 for each emotion about the text.	700	10	7000	Yes	multi, interval (-100,100)	text, available	No	Unavailable
Toloka Aggregation Relevance 5	This dataset contains the judgments on the relevance of a document for a query on a 5-graded scale.	363814	1274	1091918	Partially	multi, 5 classes	text, unavailable	No	Unavailable
Weather Sentiment - AMT	This dataset contains the sentiment judgments of 300 tweets. The classification task is based on the following categories: negative (0), neutral (1), positive (2), tweet not related to weather (3) and can't tell (4).	300	110	6000	Yes	multi, 5 classes	text, unavailable	Yes	Yes
Word Pair Similarity	This dataset contains the individual worker votes that assign a numerical similarity score between 0 and 10 to a given text.	30	10	300	Yes	multi, interval (0,10)	text, unavailable	No	Unavailable
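
The task and vote counts in the table can be combined into a rough redundancy figure, the average number of votes per task. The following is a minimal sketch in Python; the dictionary name OVERVIEW and the helper votes_per_task are illustrative and not part of the repository, and the numbers are copied by hand from four rows of the table.

```python
# (number of tasks, number of total votes) copied from the overview table.
OVERVIEW = {
    "Blue Birds": (108, 4212),
    "Recognizing Textual Entailment": (800, 8000),
    "Sentiment popularity - AMT": (500, 10000),
    "Weather Sentiment - AMT": (300, 6000),
}


def votes_per_task(tasks, votes):
    """Average number of votes collected per task."""
    return votes / tasks


redundancy = {name: votes_per_task(t, v) for name, (t, v) in OVERVIEW.items()}

for name, value in redundancy.items():
    print(f"{name}: {value:.1f} votes per task")
```

For Blue Birds the average equals the number of workers (39), which is consistent with every worker having labeled every image. This is an average only; the table does not say how votes are distributed across individual tasks.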

Download

We provide two Python scripts that download all the datasets and then transform them into a standard format. First run download_datasets.py, and then transform_datasets.py. The required Python version is 3.7. The scripts use the following modules: os, pandas, wget, zipfile, tarfile, re, platform, and shutil. Of these, pandas and wget are third-party packages that must be installed on your system; the others are part of the Python standard library.
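
Before running the scripts, it can help to check that the third-party modules are importable. The sketch below is an assumption about how one might do this, not something shipped with the repository: missing_modules and run_in_order are hypothetical helper names, and run_in_order assumes it is called from the repository root where both scripts live.

```python
import importlib.util
import subprocess
import sys

# pandas and wget are the only modules in the list that are not in the standard library.
THIRD_PARTY = ["pandas", "wget"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def run_in_order(scripts=("download_datasets.py", "transform_datasets.py")):
    """Run each script with the current interpreter, stopping at the first failure."""
    for script in scripts:
        subprocess.run([sys.executable, script], check=True)


# Typical use from the repository root (not executed here):
#   if not missing_modules(THIRD_PARTY):
#       run_in_order()
```

Using sys.executable keeps both scripts on the same interpreter that passed the module check.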

Running the two scripts in the given order creates one CSV file in each dataset folder. These CSV files follow a standard format with the following columns, in order:

workerID: ID of the crowd worker.
taskID: ID of the task answered by the corresponding worker.
response: Response of the corresponding worker on the task identified by taskID.
goldLabel: Gold label of the corresponding task (if available).
taskContent: Content of the task answered by the worker (if available).

Only the Sentiment popularity - AMT and Weather Sentiment - AMT datasets have an additional column:

timeSpent: Time the corresponding worker spent on the task.
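
As an illustration of how the standard columns can be used, the sketch below aggregates votes by majority and scores each worker against goldLabel. It uses only the standard library and a few made-up rows. The sample data, the function names, and the assumption that a missing gold label appears as an empty field are all hypothetical; check the actual transformed files before relying on them.

```python
import csv
import io
from collections import Counter, defaultdict

# Made-up rows in the standard column layout; real files come from transform_datasets.py.
SAMPLE = """workerID,taskID,response,goldLabel,taskContent
w1,t1,1,1,first text
w2,t1,1,1,first text
w3,t1,0,1,first text
w1,t2,0,0,second text
w2,t2,1,0,second text
w3,t2,0,0,second text
"""


def majority_vote(rows):
    """Map each taskID to its most frequent response (ties go to the response seen first)."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[row["taskID"]][row["response"]] += 1
    return {task: counts.most_common(1)[0][0] for task, counts in votes.items()}


def worker_accuracy(rows):
    """Fraction of each worker's responses matching goldLabel, skipping rows without one."""
    hits, totals = Counter(), Counter()
    for row in rows:
        if row["goldLabel"] == "":  # assumed encoding of a missing gold label
            continue
        totals[row["workerID"]] += 1
        hits[row["workerID"]] += row["response"] == row["goldLabel"]
    return {worker: hits[worker] / totals[worker] for worker in totals}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
labels = majority_vote(rows)
accuracy = worker_accuracy(rows)
print(labels)
print(accuracy)
```

To read a real file, replace io.StringIO(SAMPLE) with an open file handle. Responses are compared as strings here, so interval-valued datasets such as Emotion would need numeric parsing and a different aggregation (for example a mean) instead of a majority vote.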

P.S. If the original dataset includes multiple predicates for a task, we create one CSV file per predicate in the transformed version of the dataset.

(To obtain the resulting CSV files correctly, do not modify any of the directory names or dataset files downloaded from this repo.)

Usage consent

By using this tool, you agree to acknowledge the original datasets and to check their terms and conditions. Some data providers may require authentication, filling in forms, etc. For convenience, we include a link to the original source both in the table above and in the individual repository folders.

TrentoCrowdAI/crowdsourced-datasets