parliament-text

Download and analyse text from UK Parliamentary committee evidence transcripts

Project goals

Get new data for practicing NLP techniques
Access a new high-quality text data source
- spoken English
- diverse speakers and topics: professional, political and personal
- professional transcripts
Capture key metadata
- committee inquiries
- names of public figures

Project status: early days...

Downloading: Working fairly well. 2290 transcripts downloaded so far, and the downloader handles many of the different web formats used by different inquiries. Still need to check how many transcripts are missing, though.
Parsing: Basic setup working OK in most cases. Only HTML files are parsed, not PDF. (The PDF files mostly duplicate the text of the HTML files.) 9200 witness/session cases identified. Identifying names, and witnesses' designations and affiliations, is the greatest challenge here. The current version does a reasonable job, but there is still significant inaccuracy which needs to be addressed. The JSON output file makes it easy to access the text programmatically, though there's room for improvement in the format.
Analysing: Not much completed here, just a quick demo using readability analysis - see example chart below.
Priorities: The top priority is to improve the named entity recognition. Help would be very welcome!

Background

For an introduction to the work of the committees see here.

The downloads contain Parliamentary information licensed under the Open Parliament Licence v3.0.

A few newsworthy witnesses and evidence sessions from recent years:

Acknowledgements

Thanks to the Parliament website team for making the transcript documents available online.

Thanks also to the many great NLP and supporting libraries that made this project possible, including selenium, spacy, fuzzywuzzy, html2text and textstat.

Usage

Typical usage: this will download all documents, parse them into JSON storage documents, then run a simple analysis of the transcripts' text.

Run from the project folder:

python parliament-text --storage=/tmp/my_storage_folder --download --parse --analyse
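The flags in the command above could be handled along the following lines. This is only a sketch using the standard-library argparse module, not the project's actual entry point: the helper names build_parser and selected_stages are hypothetical, and treating each stage flag as an optional on/off switch is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the usage command (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="parliament-text")
    # Folder where downloaded files and JSON output are kept.
    parser.add_argument("--storage", help="path to the storage folder")
    # Each stage is assumed to be an independent on/off switch.
    parser.add_argument("--download", action="store_true", help="download transcripts")
    parser.add_argument("--parse", action="store_true", help="parse transcripts into JSON")
    parser.add_argument("--analyse", action="store_true", help="run the text analysis")
    return parser


def selected_stages(args: argparse.Namespace) -> list:
    """Return the requested stages in pipeline order (hypothetical helper)."""
    return [stage for stage in ("download", "parse", "analyse") if getattr(args, stage)]


# Example: parse an explicit argument list instead of sys.argv.
example_args = build_parser().parse_args(["--storage=/tmp/my_storage_folder", "--parse"])
example_stages = selected_stages(example_args)
```

If the stages really are independent switches, a single stage could be re-run on its own, for example re-parsing already downloaded files by passing only --storage and --parse.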

Analysis Example

Here's a chart showing the distribution of Gunning Fog readability values across all the speakers captured. The speakers are grouped according to certain honorific titles. Most speakers have no such title (shown as 'other'). Apparently, military witnesses have the most difficult-to-understand speech!
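For readers unfamiliar with the measure, the Gunning Fog index is 0.4 × (average words per sentence + percentage of words with three or more syllables). The sketch below is a simplified, dependency-free illustration of that formula and of grouping speakers by title. It is not the project's code: the syllable counter is a rough vowel-group heuristic, the usual exclusions (proper nouns, common suffixes) are skipped, and the TITLES tuple and function names are hypothetical. Scores from a library such as textstat will generally differ.

```python
import re

# Hypothetical list of honorific titles; the titles used for the chart are not given here.
TITLES = ("Sir", "Dame", "Lord", "Baroness", "Professor", "Dr", "General")


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count groups of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent (e.g. "debate"), except in "-le" endings.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def gunning_fog(text: str) -> float:
    """Simplified Gunning Fog index: 0.4 * (words/sentence + 100 * complex/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    # "Complex" words are those estimated to have three or more syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences) + 100 * len(complex_words) / len(words))


def title_group(name: str) -> str:
    """Return the speaker's leading honorific title, or 'other' if none matches."""
    parts = name.split()
    first = parts[0].rstrip(".") if parts else ""
    return first if first in TITLES else "other"
```

Computing gunning_fog over each speaker's text and collecting the values by title_group(name) would give the per-group distributions that a chart like the one described above plots.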
