This folder contains data that we use in the course and that you can use to practise the skills you have learnt. It also contains some of the scripts that were used to collect the data.
Overview
- The authorship folder contains the C50 corpus, which can be used to train and test automatic authorship detection systems. It can be downloaded here.
- The baby_names folder contains baby names from Social Security applications in the USA (names downloaded from here, names_by_state downloaded from here).
- Charlie contains one simple text file with a snippet from Roald Dahl's 'Charlie and the Chocolate Factory'.
- The concreteness folder contains concreteness ratings downloaded from here.
- The Dodds2014 folder contains sentiment scores for 100,000 words across 10 languages. It was downloaded from here.
- The dreams folder contains 10 text files describing dreams of Vickie, a 10-year-old girl. These texts were downloaded from DreamBank.
- linguistlist is a collection of messages from the Linguist List. They were downloaded from here using get_linguist_data.py. All data is gzipped, except for this example.
- MSCOCO contains image annotations provided by Microsoft Research. These were downloaded from here.
- presidential_debate_2016 contains a CSV file with transcripts of the 2016 (vice-)presidential debates from 26 September to 9 October. They were downloaded from here.
- RedCircle contains a text file with the ebook "The Adventure of the Red Circle" by Arthur Conan Doyle, downloaded from here.
- Trump-Facebook contains a TSV file with Facebook statuses posted by Donald Trump. The dataset was downloaded from here. It was created by Max Woolf, using this script.
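
Since the folders above mix plain text, gzipped text, CSV and TSV files, a small loading helper may be convenient. The following is a minimal sketch, not part of the course scripts: the function names `open_text` and `read_table` are hypothetical, and it assumes the files are UTF-8 encoded, that gzipped files end in `.gz`, and that the CSV/TSV files have a header row.

```python
import csv
import gzip
from pathlib import Path


def open_text(path, encoding="utf-8", newline=None):
    """Open a text file for reading, decompressing it if the name ends in .gz.

    Hypothetical helper: assumes UTF-8 text and a .gz suffix for gzipped files.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    return opener(path, mode="rt", encoding=encoding, newline=newline)


def read_table(path, delimiter=","):
    """Read a delimited file with a header row into a list of dicts.

    Use delimiter="," for CSV and delimiter="\t" for TSV.
    """
    # newline="" lets the csv module handle line endings inside quoted fields.
    with open_text(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))
```

To use it, pass the path of a file in one of the folders, for example `read_table(path_to_tsv, delimiter="\t")` for the Trump-Facebook data or `open_text(path_to_gz).read()` for a gzipped Linguist List message. The actual file names are not listed here, so check the folder contents first.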