Parsers for plenary records of state parliaments covering two legislative periods

panoptikum/plenary_record_parser

plenary record parser

Parsers for the plenary records of German state parliaments, covering two legislative periods between 2008 and 2018.

Pipeline

  1. Layout scan: scripts such as pdf_layoutscanner.py determine where the first and second columns of each page start.
  2. PDF-to-XML conversion: parser_wrapper_xml converts hundreds of files using pdfminer, with individual options for each state parliament.
  3. Text ordering: the files parse_transcript_xml_*.py use the coordinates of each text block, sentence or even letter to concatenate the text in the correct order and save it as a plain text file. Furthermore, I use characteristics such as boldness and font size to clearly mark speakers' names, interjections and changes of speaker.
  4. Speech parsing: Plenary_record_parser_xml_*.py parses the custom text file to detect each speech, interjection etc. A speech is split into several rows if the speaker is interrupted by an interjection or by the chair. The pipeline stores the resulting sequence, together with some metadata, in a SQLite database.
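
The ideas behind steps 3 and 4 can be illustrated with a minimal sketch. It is not the code from this repository: `Block`, `reading_order`, `split_speech`, `store`, the `speeches` table and its columns are all hypothetical names, and the leading "(" used to recognise an interjection is an assumed marker. The sketch also assumes `y` is measured from the top of the page; pdfminer reports coordinates from the bottom-left corner, so a real implementation would probably need to invert the vertical sort.

```python
import sqlite3
from typing import NamedTuple


class Block(NamedTuple):
    """A text block with the top-left corner of its bounding box (hypothetical)."""
    page: int
    x: float
    y: float  # assumed: distance from the top of the page
    text: str


def reading_order(blocks, column_split):
    """Sort blocks page by page, left column before right, top to bottom.

    column_split is the x position where the second column starts,
    i.e. the kind of value a layout scan would provide.
    """
    return sorted(blocks, key=lambda b: (b.page, b.x >= column_split, b.y))


def split_speech(speaker, lines):
    """Split one speech into rows, starting a new row at each interjection.

    Assumes the previous step marked interjections with a leading "(".
    """
    rows, current = [], []
    for line in lines:
        if line.startswith("("):
            if current:
                rows.append((speaker, "speech", " ".join(current)))
                current = []
            rows.append((None, "interjection", line.strip("()")))
        else:
            current.append(line)
    if current:
        rows.append((speaker, "speech", " ".join(current)))
    return rows


def store(rows, path=":memory:"):
    """Write the rows to a SQLite table, keeping their sequence in seq."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS speeches ("
        "seq INTEGER PRIMARY KEY, speaker TEXT, kind TEXT, text TEXT)"
    )
    conn.executemany(
        "INSERT INTO speeches (speaker, kind, text) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return conn
```

For example, `split_speech("Speaker A", ["First part.", "(Applause)", "Second part."])` yields three rows: a speech row, an interjection row and a second speech row for the same speaker, which `store` then writes in that order.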
