Pandas Excel Cleaning Example

Example module to parse over-complicated and poorly formatted progress tracking spreadsheet into a clean CSV. Meant to be used as a showcase of how Python can be very powerful when used in conjunction with Excel.

Premise

Quantity progress tracking is captured in a large and convoluted workbook with several different formats and different crews on different pages. This workbook is not very intuitive for anyone but its creator to look at and is very difficult to use for data analysis. Additionally, the Excel file is quite heavy and consumes resources while open. We will convert this workbook into a homogenous, normalized, lightweight CSV file.

For the most part, the original data consists of tables plotting construction status against KP ranges on the Y-axis and crew types on the X-axis. We will follow that format but clean it up by using integer cost codes as the column names and keep the KP range as the index.

File Management

As this progress report is distributed weekly, we will always want to use the latest version of the workbook. We can create a directory to drop the files in every week called /raw.

We will write a function called latest_file to find the most recent distribution in the folder based on common patterns in the filename. As you can see by looking in the /raw folder, the file naming scheme is not overly strict. It's alright though, because each filename has an internal pattern of "...Report_YYYY MM DD_Macro..." which can be used to extract the date.

Progress File

The progress quantity tracking spreadsheet has several worksheets of interest showing status of KP chunks. They are in the format of KP chunk vs. crew type but have way too much extraneous information.

The worksheets of interest are as below and we will make a separate function for each.

2002-LineSweep
2003 - Clearing
2100-Pipeline DETAILS RoC

Code

Have a look through the code in this order:

latest_file.py
main.py
import_progress/__init__.py
import_progress/import_sweep.py
import_progress/import_clearing.py
import_progress/import_main.py
sterilize_KP.py

Output

The cleaned and combined CSV file is saved to parsed.csv.

How to Run

Ensure Pandas is installed
Open and run main.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Pandas Excel Cleaning Example

Premise

File Management

Progress File

Code

Output

How to Run

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
import_progress		import_progress
raw		raw
.gitignore		.gitignore
README.md		README.md
latest_file.py		latest_file.py
main.py		main.py
parsed.csv		parsed.csv
sterilize_KP.py		sterilize_KP.py
test.csv		test.csv

riley-paul/pandas-example

Folders and files

Latest commit

History

Repository files navigation

Pandas Excel Cleaning Example

Premise

File Management

Progress File

Code

Output

How to Run

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages