This CLI program downloads the PDF financial disclosure reports filed by members of the House of Representatives. It can also upload the PDFs to S3 storage and convert them to PNG or JPG image files, so that an OCR program can extract text and tables from them. The tool supports the following operations:
- Initialize Environment: Set up necessary directories and configuration files.
- Configure S3 Settings: Update Amazon S3 configuration for storing the PDFs.
- Download Disclosure URLs: Retrieve the latest list of disclosure URLs via XML files.
- Download PDFs: Download the PDF transaction reports.
- Update Bucket Items: Maintain an updated list of items in the S3 bucket.
- Upload PDFs to S3: Upload new PDFs to Amazon S3, ensuring the latest reports are stored.
- Convert PDFs to Images: Convert downloaded PDFs to PNG or JPG formats.
- Cleanup Images: Remove empty or failed image directories.
To use Disclosure Download CLI, you must have Go installed on your machine. You can download and install Go from the official Go website (https://go.dev/dl/).
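Once Go is installed, you can install the Disclosure Download CLI by running:
go get github.com/paulschick/disclosureupdater
Since Go 1.18, `go get` no longer builds or installs binaries. On current Go versions the equivalent is `go install` with a version suffix, which would be the following if the `main` package sits at the module root (an assumption):
go install github.com/paulschick/disclosureupdater@latest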
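If the shell cannot find `disclosurecli` after installation, the Go binary directory may be missing from `PATH`. Go places installed binaries in `$GOBIN`, or in `$GOPATH/bin` when `GOBIN` is unset. The following is a minimal sketch for adding that directory; `add_to_path_once` is a hypothetical helper name, not part of the tool, and the usage line assumes a single-entry `GOPATH`.

```shell
#!/bin/sh
# Hypothetical helper: append a directory to PATH unless it is already listed.
add_to_path_once() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$PATH:$1" ;;        # append to the end of PATH
    esac
    export PATH
}

# Usage (requires Go to be installed):
#   add_to_path_once "$(go env GOPATH)/bin"
#   command -v disclosurecli
```

Putting the call in a shell profile keeps it idempotent, since repeated calls do not grow `PATH`.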
To initialize the environment for the first time, use:
disclosurecli initialize
To configure Amazon S3 settings, you can either use the CLI or place the credentials in the ~/.aws directory. To use the CLI for configuration, run:
disclosurecli configure --s3-bucket [bucket_name] --s3-region [region] --s3-hostname [hostname] --s3-api-key [api_key] --s3-secret-key [secret_key]
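So that the keys need not be hard-coded in scripts, the configure command can be wrapped in a function that reads its values from environment variables. This is only a sketch: the variable names (`S3_BUCKET`, `S3_REGION`, `S3_HOSTNAME`, `S3_API_KEY`, `S3_SECRET_KEY`), the `DISCLOSURE_CLI` override, and the function name are assumptions made for illustration, and it treats all five flags as required, which the tool itself may not.

```shell
#!/bin/sh
# Binary to invoke; DISCLOSURE_CLI is a hypothetical override for this sketch.
DISCLOSURE_CLI="${DISCLOSURE_CLI:-disclosurecli}"

# Hypothetical wrapper: run `disclosurecli configure` with values taken from
# environment variables. Only the flags come from the command shown above.
configure_from_env() {
    # Stop early if any expected variable is unset or empty.
    for name in S3_BUCKET S3_REGION S3_HOSTNAME S3_API_KEY S3_SECRET_KEY; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing environment variable: $name" >&2
            return 1
        fi
    done
    "$DISCLOSURE_CLI" configure \
        --s3-bucket "$S3_BUCKET" \
        --s3-region "$S3_REGION" \
        --s3-hostname "$S3_HOSTNAME" \
        --s3-api-key "$S3_API_KEY" \
        --s3-secret-key "$S3_SECRET_KEY"
}

# Usage: export the five variables, then call
#   configure_from_env
```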
If you prefer to use the ~/.aws credentials file, ensure that you have the AWS SDK configured to load these credentials by setting the environment variable:
export AWS_SDK_LOAD_CONFIG=true
Then, create a credentials file at ~/.aws/credentials with the following format:
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
Replace YOUR_ACCESS_KEY and YOUR_SECRET_KEY with your actual AWS credentials.
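The credentials file can also be created from the shell. The sketch below writes the format shown above with owner-only permissions and refuses to overwrite an existing file; `write_aws_credentials` is a hypothetical helper name, not something the tool or the AWS SDK provides.

```shell
#!/bin/sh
# Hypothetical helper: write an AWS shared credentials file with a [default]
# profile. Arguments: target path, access key, secret key.
write_aws_credentials() {
    file="$1"; access_key="$2"; secret_key="$3"
    # Never clobber credentials that are already in place.
    if [ -e "$file" ]; then
        echo "refusing to overwrite existing file: $file" >&2
        return 1
    fi
    mkdir -p "$(dirname "$file")" || return 1
    # The umask applies only inside the subshell, so the new file is created
    # readable and writable by the owner only.
    ( umask 077 && printf '[default]\naws_access_key_id = %s\naws_secret_access_key = %s\n' \
        "$access_key" "$secret_key" > "$file" )
}

# Usage:
#   write_aws_credentials "$HOME/.aws/credentials" YOUR_ACCESS_KEY YOUR_SECRET_KEY
```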
To download the latest disclosure URLs, run:
disclosurecli update-urls
To download the transaction report PDFs, use:
disclosurecli download-pdfs
To upload the PDFs to S3, use:
disclosurecli upload-s3
If you download more PDF files after your initial upload to S3, update the local S3 index before uploading again so that only files not already in S3 are uploaded:
disclosurecli update-bucket-items
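For repeated runs, the download and upload commands can be chained so that the bucket index is refreshed before each upload. The sketch below assumes the CLI exits with a non-zero status when a step fails, which this README does not confirm; `sync_new_reports` and the `DISCLOSURE_CLI` override are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Binary to invoke; DISCLOSURE_CLI is a hypothetical override for this sketch.
DISCLOSURE_CLI="${DISCLOSURE_CLI:-disclosurecli}"

# Hypothetical helper: fetch new reports and upload only what S3 lacks.
# Each step runs only if the previous one succeeded.
sync_new_reports() {
    "$DISCLOSURE_CLI" update-urls &&
    "$DISCLOSURE_CLI" download-pdfs &&
    "$DISCLOSURE_CLI" update-bucket-items &&
    "$DISCLOSURE_CLI" upload-s3
}

# Usage:
#   sync_new_reports || echo "sync failed" >&2
```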
To convert the PDFs to images, use:
disclosurecli convert-pdfs
# For JPG conversion
disclosurecli convert-pdfs --jpg
To remove empty directories and failed image conversions, use:
disclosurecli cleanup-images
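Conversion and cleanup are often run together. The sketch below picks the output format from an argument and always runs the cleanup step afterwards, since cleanup is what removes failed conversions. It assumes that `convert-pdfs` without `--jpg` produces PNG, as the examples above suggest, and that the CLI reports failure through its exit status; `convert_and_cleanup` and `DISCLOSURE_CLI` are hypothetical names.

```shell
#!/bin/sh
# Binary to invoke; DISCLOSURE_CLI is a hypothetical override for this sketch.
DISCLOSURE_CLI="${DISCLOSURE_CLI:-disclosurecli}"

# Hypothetical helper: convert PDFs to "png" (default) or "jpg", then run the
# cleanup step regardless, and return the conversion's exit status.
convert_and_cleanup() {
    format="${1:-png}"
    case "$format" in
        png) "$DISCLOSURE_CLI" convert-pdfs ;;
        jpg) "$DISCLOSURE_CLI" convert-pdfs --jpg ;;
        *) echo "unsupported format: $format" >&2; return 2 ;;
    esac
    status=$?
    "$DISCLOSURE_CLI" cleanup-images || return 1
    return "$status"
}

# Usage:
#   convert_and_cleanup jpg
```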
This project is licensed under the MIT License - see the LICENSE file for details.
Note: This README gives a basic overview of the project. The tool works correctly only when the AWS S3 authentication and environment settings described above are configured accurately. More detailed documentation, particularly contributing guidelines and license details, may be needed as the project grows.