This project converts PDF listings of real estate properties from Centris into Excel files with standardized address information. It offers three main options for processing, with varying levels of address verification, as well as a user-friendly GUI application.
- Extracts data from multiple PDF files using pdfplumber
- Standardizes city names for Montreal and Laval areas
- Offers three processing options:
  - PostGrid API (Recommended for highest accuracy with Canadian addresses):
    - Validates and geocodes addresses using PostGrid API (SERP-certified by Canada Post)
    - Implements a scoring system to evaluate address suggestion relevance
  - Google Maps API (Good for international addresses):
    - Uses Google Maps API for address validation and geocoding
    - Retrieves full city names and postal codes from geocoding results
  - Simple Processing (Fastest, relies on Centris data accuracy):
    - Extracts and cleans address data without external API calls
    - Assumes Centris data, including postal codes, is accurate
- Outputs data in a standardized format (FNAM, LNAM, ADD1, CITY, PROV, PC)
- Uses multithreading for improved performance
- Implements caching to reduce API calls (for API options)
- Provides detailed logging for API interactions and address processing (for API options)
- GUI application for easy use of the simple processing option
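The multithreading and caching features listed above can be combined roughly as follows. This is a minimal sketch, not the project's actual code: `validate_address` is a hypothetical stand-in for a real API call, and a plain in-memory dictionary stands in for the `requests_cache` library that the project lists in its requirements.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

_cache = {}                    # normalized address -> validated result
_lock = threading.Lock()       # protects the call counter across threads
stats = {"api_calls": 0}       # counts simulated API calls


def validate_address(key):
    """Hypothetical stand-in for one network call to an address API."""
    with _lock:
        stats["api_calls"] += 1
    return key.upper()


def normalize(address):
    """Collapse whitespace and lowercase so equivalent inputs share a cache key."""
    return " ".join(address.split()).lower()


def validate_all(addresses, max_workers=4):
    """Validate addresses in parallel, calling the API once per uncached address."""
    keys = [normalize(a) for a in addresses]
    # dict.fromkeys removes duplicates while preserving order.
    missing = [k for k in dict.fromkeys(keys) if k not in _cache]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for key, result in zip(missing, pool.map(validate_address, missing)):
            _cache[key] = result
    return [_cache[k] for k in keys]
```

Deduplicating before submitting work to the pool means two threads never request the same address at once, so the number of simulated API calls stays equal to the number of distinct uncached addresses.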
PostGrid is recognized by Canada Post's Software Evaluation and Recognition Program (SERP), ensuring high-quality address validation for Canadian addresses. SERP-certified software must meet strict accuracy requirements, including:
- 98% accuracy in categorizing valid/invalid addresses
- 99% rejection rate for non-correctable addresses
- 99% correction rate for fixable addresses
By using PostGrid, we ensure our address data meets Canada Post's stringent standards, which is crucial for accurate real estate listings.
Learn more about SERP-recognized software
While Google Maps API is available as an alternative and provides good accuracy for international addresses, it is not SERP-certified and may not provide the same level of accuracy for Canadian addresses.
To ensure the highest quality of address validation, we've implemented a scoring system that evaluates the relevance of address suggestions returned by the PostGrid API. This system considers factors such as:
- Exact matches for street numbers
- Similarity of street names
- City name matches
- Apartment number accuracy
The scoring system helps in selecting the most appropriate suggestion when multiple options are available, improving the overall accuracy of the address validation process.
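The factors above could be combined into a single relevance score along these lines. This is an illustrative sketch only: the field names (`number`, `street`, `city`, `unit`), the weights, and both function names are assumptions, and the scoring in `pdf2excel_postgrid.py` may differ.

```python
from difflib import SequenceMatcher


def score_suggestion(parsed, suggestion):
    """Return a relevance score from 0 to 100 (weights are illustrative)."""
    score = 0.0
    # Exact match on street number.
    if parsed["number"] == suggestion.get("number"):
        score += 40
    # Fuzzy similarity of street names, case-insensitive.
    similarity = SequenceMatcher(
        None, parsed["street"].lower(), suggestion.get("street", "").lower()
    ).ratio()
    score += 30 * similarity
    # City name match, case-insensitive.
    if parsed["city"].lower() == suggestion.get("city", "").lower():
        score += 20
    # Apartment (unit) number match; None on both sides also counts as a match.
    if parsed.get("unit") == suggestion.get("unit"):
        score += 10
    return score


def best_suggestion(parsed, suggestions):
    """Pick the highest-scoring suggestion, or None if there are none."""
    return max(suggestions, key=lambda s: score_suggestion(parsed, s), default=None)
```

Weighting the street number most heavily reflects the idea that a wrong number almost always means a wrong address, while street names often differ only in abbreviation or case.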
- Python 3.x
- pdfplumber
- pandas
- requests
- requests_cache
- python-dotenv
- tkinter (usually comes with Python)
- openpyxl
- PyQt5 (for GUI application)
- Clone the repository
- Install required packages:
  pip install pdfplumber pandas requests requests_cache python-dotenv openpyxl PyQt5
- Create a .env file in the project root and add your API key(s):
  POSTGRID_API_KEY=your_postgrid_api_key_here
  GOOGLE_MAPS_API_KEY=your_google_maps_api_key_here
- Sign up for a PostGrid account at https://www.postgrid.com/
- Navigate to the API section in your dashboard
- Generate a new API key
- Copy the generated API key and add it to your .env file
Note: Ensure you set up billing and review the pricing for the PostGrid API usage.
- Go to the Google Cloud Console
- Create a new project or select an existing one
- Enable the Geocoding API
- Create credentials (API Key)
- Copy the generated API key and add it to your .env file
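Once the keys are in the .env file, a script can read them at startup and fail early if one is missing. The sketch below assumes the variable names shown earlier; `get_api_key` and `KEY_NAMES` are hypothetical names, and the project's scripts may load their keys differently. `load_dotenv()` comes from the python-dotenv package listed in the requirements.

```python
import os

try:
    # python-dotenv: load_dotenv() reads the .env file into os.environ.
    from dotenv import load_dotenv
except ImportError:  # keeps this sketch importable without the package
    load_dotenv = None

# Hypothetical mapping from processing option to environment variable name.
KEY_NAMES = {"postgrid": "POSTGRID_API_KEY", "googlemaps": "GOOGLE_MAPS_API_KEY"}


def get_api_key(option, env=None):
    """Return the API key for the given option, or raise if it is unusable."""
    if env is None:
        if load_dotenv is not None:
            load_dotenv()
        env = os.environ
    name = KEY_NAMES[option]
    key = env.get(name, "").strip()
    # Reject empty values and the unedited placeholder from the example above.
    if not key or key.endswith("_here"):
        raise RuntimeError(f"{name} is missing or still a placeholder in .env")
    return key
```

Passing a dictionary as `env` makes the check easy to exercise without touching the real environment.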
We now offer a graphical user interface (GUI) for easier use of the PDF to Excel converter.
- File Import: Select multiple PDF files or use drag-and-drop functionality
- File Management: Add, remove, or mass delete files from the list
- Export Options: Choose destination folder for output Excel file(s)
- Conversion Settings: Merge PDFs or keep them separate, custom naming for output files
- Progress Tracking: Visual progress bar during conversion
- Error Handling: User-friendly error messages
- Internationalization: Support for French (default) and English languages
- About Section: Project information and link to GitHub repository
- Enhanced User Experience: Keyboard shortcuts and rubber band selection
- Launch the application by running:
python pdf2excel_gui.py
- Use the "Add Files" button or drag and drop PDF files into the list.
- Select your desired output folder.
- (Optional) Choose a custom filename for the output.
- Click "Convert" to process the files.
- The application will show progress and notify you when conversion is complete.
Note: The GUI application currently uses the simple processing method without API calls.
For more advanced options, including API-based processing, use the following command-line scripts:
- For PostGrid API (recommended for highest accuracy with Canadian addresses):
python pdf2excel_postgrid.py
- For Google Maps API (good for international addresses):
python pdf2excel_googlemaps.py
- For simple processing (fastest, relies on Centris data):
python pdf2excel.py
Follow the prompts to select PDF file(s) and specify the output location.
If you want to build the executable yourself:
- Ensure you have PyInstaller installed:
  pip install pyinstaller
- Use the provided spec file to build the executable:
  pyinstaller pdf2excel_gui.spec
  If the above doesn't work, try the following:
  python -m PyInstaller pdf2excel_gui.spec
This will create an executable named PDF2Excel_GUI_v1.0.0.exe in the dist folder.
For convenience, we provide a pre-built executable for Windows users:
- Go to the Releases page of this repository.
- Download the latest PDF2Excel_GUI_v1.0.0.exe file.
- Run the executable on your Windows machine.
Note: The executable version uses the simple processing method without API calls. For API-based processing, please use the command-line scripts as described in the Usage section above.
- The simple processing option (pdf2excel.py and GUI) is fastest and doesn't require an API key, relying on the accuracy of Centris data
- The PostGrid option (pdf2excel_postgrid.py) provides additional verification for Canadian addresses but requires an API key and may incur costs
- The Google Maps option (pdf2excel_googlemaps.py) provides good accuracy for both Canadian and international addresses but requires an API key and may incur costs
- Both API options implement caching to reduce API calls and improve performance
- City name standardization is currently set up for Montreal and Laval areas. Modify the city_mappings dictionary in city_mappings.py to add more mappings if needed
- The API options provide more robust address parsing, including separation of apartment numbers
- The Google Maps option retrieves full city names and postal codes from the geocoding results
- All options output the data in the same standardized format (FNAM, LNAM, ADD1, CITY, PROV, PC)
- The scripts provide detailed logging in the logs directory for troubleshooting and monitoring API interactions (for API options)
- Manual verification may still be necessary for complex cases or addresses not found by the APIs
- For ultimate verification, users can cross-reference with Canada Post's database
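As a rough illustration of the city name standardization mentioned in the notes above, a mapping dictionary might be applied like this. The entries and the `standardize_city` helper are assumptions made for the example; the actual structure of the `city_mappings` dictionary in city_mappings.py may differ.

```python
# Illustrative entries only; keys are lowercase variants, values are standardized names.
city_mappings = {
    "mtl": "Montreal",
    "montréal": "Montreal",
    "ville de laval": "Laval",
    "chomedey": "Laval",
}


def standardize_city(raw_city, mappings=city_mappings):
    """Return the standardized city name, or the cleaned input if no mapping exists."""
    cleaned = " ".join(raw_city.split())  # trim and collapse repeated spaces
    return mappings.get(cleaned.lower(), cleaned)
```

Under this layout, supporting another area only requires adding lowercase keys to the dictionary; unmapped cities pass through unchanged apart from whitespace cleanup.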
Refer to the LICENSE file for more details.
Made with ChatGPT + Canvas & Cursor, including this README file.