Data

This folder contains data that we use in the course and/or that you can use to play around and test some of the skills that you have learnt. It also contains some of the scripts that were used to get the data.

Overview

  • The authorship folder contains the C50 corpus that can be used to train and test automatic authorship detection systems. It can be downloaded here.

  • The baby_names folder contains baby names from Social Security applications in the USA (names downloaded from here, names_by_state downloaded from here).

  • The Charlie folder contains a single plain-text file with a snippet from Roald Dahl's 'Charlie and the Chocolate Factory'.

  • The concreteness folder contains concreteness ratings downloaded from here.

  • The Dodds2014 folder contains sentiment scores for 100,000 words across 10 languages. It was downloaded from here.

  • The dreams folder contains 10 text files describing dreams of Vickie, a 10-year-old girl. These texts were downloaded from DreamBank.

  • linguistlist is a collection of messages from the Linguist List. They were downloaded from here using get_linguist_data.py. All data is gzipped, except for this example.

  • MSCOCO contains image annotations, provided by Microsoft Research. These were downloaded from here.

  • presidential_debate_2016 contains a CSV file with transcripts of the 2016 (vice-)presidential debates held between 26 September and 9 October. They were downloaded from here.

  • RedCircle contains a text file with the ebook "The Adventure of the Red Circle" by Arthur Conan Doyle downloaded from here.

  • Trump-Facebook contains a TSV file with Facebook statuses posted by Donald Trump. The dataset was downloaded from here. It was created by Max Woolf, using this script.
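
Several of the folders above hold gzipped text (linguistlist) or delimited tables (the TSV in Trump-Facebook, the CSV in presidential_debate_2016). As a minimal sketch of how such files might be read with the Python standard library, the two helpers below may be a useful starting point. The function names are hypothetical, and the sketch assumes UTF-8 encoding and a header row in the delimited files; neither assumption is confirmed by this README, so check the actual files first.

```python
import csv
import gzip


def read_gzipped_text(path, encoding="utf-8"):
    """Return the decompressed contents of a gzipped text file as one string.

    Assumption: the file is UTF-8 encoded text.
    """
    # mode="rt" makes gzip.open decode the bytes to text while decompressing.
    with gzip.open(path, mode="rt", encoding=encoding) as handle:
        return handle.read()


def read_delimited_rows(path, delimiter="\t", encoding="utf-8"):
    """Return the rows of a delimited file as a list of dicts.

    Assumption: the first line of the file is a header row, which
    csv.DictReader uses as the keys of each dict.
    """
    # newline="" lets the csv module handle line endings inside quoted fields.
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))
```

For a tab-separated file, call `read_delimited_rows(path)` with the path to the file; for a comma-separated file, pass `delimiter=","`. Both helpers load the whole file into memory, which is convenient for exploration but may not suit the larger datasets.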