Skip to content

noahgift/python-data-engineering-cookbook

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Python application test with Github Actions

Python Data Engineering Cookbook

Some recipes for data engineering with Python

data-engineering-tool

Data Engineering CLI csvutil.py

This is a "teaching" tool that shows how a library like Pandas, or potentially Spark can be combined to do operations on a data set. Different columns can be selected for "grouping" and different columns for "applying" and the "apply" itself can be any function you write.

  • Create a source a Python virtualenv python3 -m venv ~/.pyde && source ~/.pyde/bin/activate

How to interact with Commandline tool (Click Framework):

Check Version:

 ./csvutil.py --version
csvutil.py, version 0.1

Check Help:

./csvutil.py --help   
Usage: csvutil.py [OPTIONS] COMMAND [ARGS]...

  CSV Operations Tool



Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Aggregate CSV

./csvcli.py cvsagg --file ext/input.csv --column last_name
Processing csvfile: ext/input.csv and column name: last_name
{"count":{"mcgregor":34,"lee":3,"norris":27}}

Note, a different files leads to different conclusions. Here is an NBA dataset where the AGE of players is grouped and then the sum of all three-pointers per game are summed by age.

 ./csvcli.py cvsops --file ext/nba-2017.csv --groupby AGE --applyname 3P --func npsum   
Processing csvfile: ext/nba-2017.csv and groupby name: AGE and applyname: 3P
2021-03-22 12:51:50,628 - nlib.utils - INFO - Loading appliable functions/plugins: npmedian
2021-03-22 12:51:50,628 - nlib.utils - INFO - Loading appliable functions/plugins: npsum
2021-03-22 12:51:50,628 - nlib.utils - INFO - Loading appliable functions/plugins: numpy
2021-03-22 12:51:50,628 - nlib.utils - INFO - Loading appliable functions/plugins: tanimoto
AGE
19     4.2
20     9.6
21    13.7
22    11.1
23    18.4
24    17.3
25    18.5
26    28.0
27    13.1
28    26.7
29    16.8
30    11.1
31    14.1
32     8.3
33     3.7
34     1.7
35     3.7
36     2.8
38     1.5
39     1.9
40     1.5
Name: 3P, dtype: float64

Seperately, the AGE of the players can be used to generate a median wikipedia pageview by AGE.

 ./csvcli.py cvsops --file ext/nba-2017.csv --groupby AGE --applyname PAGEVIEWS --func npmedian 
Processing csvfile: ext/nba-2017.csv and groupby name: AGE and applyname: PAGEVIEWS
2021-03-22 12:50:24,365 - nlib.utils - INFO - Loading appliable functions/plugins: npmedian
2021-03-22 12:50:24,365 - nlib.utils - INFO - Loading appliable functions/plugins: npsum
2021-03-22 12:50:24,365 - nlib.utils - INFO - Loading appliable functions/plugins: numpy
2021-03-22 12:50:24,365 - nlib.utils - INFO - Loading appliable functions/plugins: tanimoto
AGE
19     453.00
20     456.50
21     334.50
22     187.50
23     271.25
24     368.50
25     182.75
26     547.50
27     189.00
28     368.25
29     169.50
30     131.25
31     427.00
32     315.00
33     267.50
34     489.50
35     685.50
36     416.00
38    2960.00
39     862.00
40    2891.50
Name: PAGEVIEWS, dtype: float64

About

Some recipes for data engineering with Python

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published