Public domain data collectors for the work of Congress, including legislation, amendments, and votes.


LuhengStat/congress


Scraping THOMAS.gov

An experimental public domain scraper for the THOMAS legislative information system managed by the Library of Congress.

Currently:

  • Fetches and parses HTML for almost all bill information (everything except related committees).
  • Parses the semantics of Congressional action into metadata and vote outcomes.
  • Tracks the past and present state of a bill.
  • Includes a suite of unit tests for action semantics and state tracking.

Using

It's recommended you first create and activate a virtualenv with:

virtualenv virt
source virt/bin/activate

You don't have to call it "virt", but the project's gitignore is set up to ignore it already if you do.

Whether or not you use virtualenv:

pip install -r requirements.txt

To start grabbing everything:

python runner.py bills

You can supply a few kinds of flags. To limit it to 10 House simple resolutions in the 111th Congress:

python runner.py bills limit=10 bill_type=hres congress=111
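The flags above use a `key=value` form rather than the usual `--option` syntax. As a rough illustration only, arguments of that shape could be split into an options dictionary as sketched below. The `parse_flags` helper and its type-guessing rules are assumptions made for this example, not necessarily how runner.py handles them.

```python
def parse_flags(args):
    """Split 'key=value' strings into a dict (hypothetical helper)."""
    options = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep:
            # A bare word with no '=' is treated as a boolean switch.
            options[key] = True
        elif value.lower() in ("true", "false"):
            options[key] = value.lower() == "true"
        elif value.isdigit():
            options[key] = int(value)
        else:
            options[key] = value
    return options


flags = parse_flags(["limit=10", "bill_type=hres", "congress=111"])
print(flags)
```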

To get only a specific bill, pass in the ID for that bill. For example, S. 968 in the 112th Congress:

python runner.py bills bill_id=s968-112
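Judging from the example above, a bill ID appears to be the bill type, the bill number, a hyphen, and the Congress. A minimal sketch of splitting such an ID, assuming that pattern holds for every bill type (the `split_bill_id` name and the regular expression are illustrative, not taken from the project):

```python
import re

# Assumed shape: lowercase type letters, bill number, '-', Congress number.
BILL_ID = re.compile(r"^([a-z]+)(\d+)-(\d+)$")


def split_bill_id(bill_id):
    """Return (bill_type, number, congress) for an ID like 's968-112'."""
    match = BILL_ID.match(bill_id)
    if match is None:
        raise ValueError("unrecognized bill ID: %r" % bill_id)
    bill_type, number, congress = match.groups()
    return bill_type, int(number), int(congress)


print(split_bill_id("s968-112"))  # ('s', 968, 112)
```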

The script will cache all downloaded pages, and will not re-fetch them from the network unless a force flag is passed:

python runner.py bills force=True
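The caching behavior described above follows a common pattern: read from disk when a cached copy exists, and go to the network only when it doesn't or when forced. Here is a small sketch of that pattern. The `fetch_cached` function, its parameters, and the cache path are all assumptions for illustration, and the download step is passed in as a function so the example runs without network access.

```python
import os
import tempfile


def fetch_cached(url, cache_path, download, force=False):
    """Return the page body, using the cached file unless force is set."""
    if not force and os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return f.read()
    body = download(url)
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(body)
    return body


# Stand-in for a real HTTP request; it records each call it receives.
calls = []


def fake_download(url):
    calls.append(url)
    return "<html>page</html>"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache", "page.html")
    fetch_cached("http://example.com/bill", path, fake_download)
    fetch_cached("http://example.com/bill", path, fake_download)  # served from cache
    fetch_cached("http://example.com/bill", path, fake_download, force=True)

print(len(calls))  # 2
```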

Running Tests

To run this project's unit tests:

python test/runner.py

Bulk Data

The script will cache downloaded pages in a top-level cache directory, and output bulk data in a top-level data directory.

Two bulk data output files will be generated for each bill: a JSON version (data.json) and an XML version (data.xml). The XML version attempts to maintain backwards compatibility with the XML bulk data that GovTrack.us has provided for years.
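To consume the JSON output, one option is to walk the data directory and load every `data.json` found. The sketch below assumes only what is stated above (a top-level `data` directory containing files named `data.json`). The directory layout beneath `data` and the fields inside each file are not assumed, so the loop just prints each file's top-level keys. The `iter_bill_json` name is made up for this example.

```python
import json
import os


def iter_bill_json(data_dir="data"):
    """Yield (path, parsed JSON) for every data.json under data_dir."""
    for root, _dirs, files in os.walk(data_dir):
        if "data.json" in files:
            path = os.path.join(root, "data.json")
            with open(path, encoding="utf-8") as f:
                yield path, json.load(f)


# Prints nothing if the data directory doesn't exist yet.
for path, bill in iter_bill_json():
    print(path, sorted(bill))
```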

TODO

Just about everything. More data:

  • Complete bill information
  • Amendment information
  • Name normalization of members of Congress
  • Name normalization for Congressional committees and subcommittees

And general improvements:

  • Hasn't yet been run over THOMAS' entire contents
  • Data quality checks
  • Use a proper command-line flag parser and syntax
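For the last item, Python's standard `argparse` module is one candidate. The following is a sketch of what the existing flags might look like under it. The option names are hypothetical renderings of the current `key=value` flags, not a decided interface.

```python
import argparse


def build_parser():
    """Build a parser mirroring the current flags (names are assumptions)."""
    parser = argparse.ArgumentParser(prog="runner.py")
    parser.add_argument("task", help="what to collect, e.g. 'bills'")
    parser.add_argument("--limit", type=int, help="maximum number of bills")
    parser.add_argument("--bill-type", help="bill type, e.g. 'hres'")
    parser.add_argument("--congress", type=int, help="Congress number")
    parser.add_argument("--bill-id", help="single bill ID, e.g. 's968-112'")
    parser.add_argument("--force", action="store_true",
                        help="re-fetch pages even if cached")
    return parser


args = build_parser().parse_args(
    ["bills", "--limit", "10", "--bill-type", "hres", "--congress", "111"]
)
print(args.task, args.limit, args.bill_type, args.congress, args.force)
```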
