Analyse text from UK Parliamentary committee evidence transcripts

alions7000/parliament-text

parliament-text

Download and analyse text from UK Parliamentary committee evidence transcripts

parliament-text screenshots

Project goals

  • Get new data for practicing NLP techniques
  • Access a new high-quality text data source
    • spoken English
    • diverse speakers and topics: professional, political and personal
    • professional transcripts
  • Capture key metadata
    • committee inquiries
    • names of public figures

Project status: early days...

  • Downloading: Working fairly well. 2290 transcripts have been downloaded, and the downloader handles many of the different web formats used for different inquiries. The number of missing transcripts still needs to be checked.

  • Parsing: The basic setup works in most cases. Only HTML files are parsed, not PDF (the PDF files mostly duplicate the text of the HTML files). 9200 witness/session cases have been identified. Identifying names and witnesses' designations and affiliations is the greatest challenge here. The current version does a reasonable job, but there is still significant inaccuracy that needs to be addressed. The JSON output file makes it easy to access the text programmatically, although there is room for improvement in the format.

  • Analysing: Not much completed here, just a quick demo using readability analysis (see the example chart below).

  • Priorities: The top priority is to improve the named entity recognition. Help would be very welcome!
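As a rough illustration of the name-matching problem described above, here is a minimal sketch that links a speaker label from a transcript to a name in a witness list. It is not the project's own method: it uses the standard library's difflib as a stand-in for fuzzywuzzy, and the honorific list, function names, threshold and example names are all assumptions made for this example.

```python
from difflib import SequenceMatcher

# Illustrative list only; the project's real set of titles is not documented here.
HONORIFICS = {"mr", "mrs", "ms", "dr", "sir", "dame", "lord", "baroness", "professor"}


def strip_honorific(name):
    """Lowercase a name and drop any leading honorific titles."""
    parts = name.replace(".", "").split()
    while parts and parts[0].lower() in HONORIFICS:
        parts = parts[1:]
    return " ".join(parts).lower()


def best_match(label, witnesses, threshold=0.5):
    """Return the witness name most similar to the label, or None if nothing is close enough."""
    target = strip_honorific(label)
    scored = [
        (SequenceMatcher(None, target, strip_honorific(w)).ratio(), w)
        for w in witnesses
    ]
    score, name = max(scored, default=(0.0, None))
    return name if score >= threshold else None


# Hypothetical names, not taken from any real transcript.
print(best_match("Mr Smith", ["Jane Doe", "John Smith"]))
```

A plain similarity ratio like this can still produce false positives for short surnames, which is one reason the matching step remains hard.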

Background

For an introduction to the work of the committees see here.

The downloads contain Parliamentary information licensed under the Open Parliament Licence v3.0.

A few newsworthy witnesses and evidence sessions from recent years:

Acknowledgements

Thanks to the Parliament website team for making the transcript documents available online.

Thanks also to the many great libraries that made this project possible, including selenium, spacy, fuzzywuzzy, html2text and textstat.

Usage

Typical usage: the command below downloads all documents, parses them into JSON storage documents, then runs a simple analysis of the transcripts' text.

Run from the project folder:

python parliament-text --storage=/tmp/my_storage_folder --download --parse --analyse
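For readers wondering how the flags in the command above might fit together, here is a minimal argparse sketch. Only the flag names come from the command; the help texts, the helper names and the assumption that steps always run in the order download, parse, analyse are illustrative guesses, not the project's actual code.

```python
import argparse


def build_parser():
    # Flag names mirror the usage line; everything else here is an assumption.
    parser = argparse.ArgumentParser(prog="parliament-text")
    parser.add_argument("--storage", required=True,
                        help="folder for downloaded files and JSON output")
    parser.add_argument("--download", action="store_true",
                        help="download transcripts")
    parser.add_argument("--parse", action="store_true",
                        help="parse downloaded HTML files into JSON")
    parser.add_argument("--analyse", action="store_true",
                        help="run the analysis step")
    return parser


def selected_steps(args):
    # Keep a fixed pipeline order regardless of the order of flags on the command line.
    return [step for step in ("download", "parse", "analyse") if getattr(args, step)]


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--storage=/tmp/my_storage_folder", "--parse", "--analyse"])
print(args.storage, selected_steps(args))
```

Under this sketch, each step can be run on its own, for example re-running only the parse step against an existing storage folder.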

Analysis Example

Here's a chart showing the distribution of Gunning Fog readability values across all the speakers captured. The speakers are grouped according to certain honorific titles. Most speakers have no such title (shown as 'other'). Apparently, military witnesses have the most difficult-to-understand speech!

selected witnesses statistics
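For reference, the Gunning Fog index is 0.4 × (average words per sentence + percentage of words with three or more syllables). The sketch below computes it with a crude vowel-group syllable counter. It is a simplified illustration rather than the textstat implementation the project relies on, so its values will differ somewhat from those in the chart; the function names are assumptions.

```python
import re


def count_syllables(word):
    # Rough heuristic: count groups of consecutive vowels, discounting a silent trailing "e".
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def gunning_fog(text):
    # Split on sentence-ending punctuation and ignore empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    # "Complex" words are approximated as those with three or more syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences) + 100 * len(complex_words) / len(words))


print(gunning_fog("The cat sat on the mat."))
```

Higher values indicate text that is harder to read, so a speaker who uses long sentences with many polysyllabic words scores higher.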
