Python Data Engineering Cookbook

Some recipes for data engineering with Python

Data Engineering CLI csvutil.py

This is a "teaching" tool that shows how a library like Pandas, or potentially Spark, can be used to perform operations on a data set. One column is selected for "grouping" and another for "applying", and the "apply" itself can be any function you write.
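The group-then-apply idea described above can be sketched with plain Pandas. This is a minimal illustration, not the tool's actual implementation: the helper `group_and_apply`, the function `spread`, and the toy data are all hypothetical names and values introduced here.

```python
import pandas as pd


def spread(values: pd.Series) -> float:
    """Hypothetical user-written function: max minus min within one group."""
    return float(values.max() - values.min())


def group_and_apply(df: pd.DataFrame, groupby: str, applyname: str, func):
    """Group df by one column, then apply func to another column per group."""
    return df.groupby(groupby)[applyname].apply(func)


# Toy data made up for this sketch (not taken from the repository).
df = pd.DataFrame(
    {
        "team": ["a", "a", "b", "b", "b"],
        "points": [10, 14, 3, 8, 5],
    }
)

result = group_and_apply(df, "team", "points", spread)
print(result)
```

Any callable that takes a Series of group values and returns a single value can be passed in place of `spread`.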

Create and source a Python virtualenv:

python3 -m venv ~/.pyde && source ~/.pyde/bin/activate

How to interact with the command-line tool (Click framework):

Check Version:

./csvutil.py --version
csvutil.py, version 0.1

Check Help:

./csvutil.py --help   
Usage: csvutil.py [OPTIONS] COMMAND [ARGS]...

  CSV Operations Tool



Options:
  --version  Show the version and exit.
  --help     Show this message and exit.
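For readers new to Click, the sketch below shows roughly how a command group with `--version` and `--help` options can be declared. It is an assumption about the general shape of such a tool, not the repository's code: the command body is a placeholder that only echoes its arguments, and `CliRunner` is used so the example runs without a shell.

```python
import click
from click.testing import CliRunner


@click.group()
@click.version_option(version="0.1")
def cli():
    """CSV Operations Tool"""


@cli.command("cvsagg")
@click.option("--file", help="Path to the CSV file")
@click.option("--column", help="Column to aggregate on")
def cvsagg(file, column):
    """Placeholder subcommand: only echoes the arguments it received."""
    click.echo(f"Processing csvfile: {file} and column name: {column}")


# Invoke the CLI in-process instead of from a shell.
runner = CliRunner()
version_result = runner.invoke(cli, ["--version"])
agg_result = runner.invoke(
    cli, ["cvsagg", "--file", "ext/input.csv", "--column", "last_name"]
)
print(version_result.output)
print(agg_result.output)
```

In a standalone script, calling `cli()` under an `if __name__ == "__main__":` guard would expose the same commands on the command line.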

Aggregate CSV

./csvcli.py cvsagg --file ext/input.csv --column last_name
Processing csvfile: ext/input.csv and column name: last_name
{"count":{"mcgregor":34,"lee":3,"norris":27}}
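A count-per-value aggregation with JSON output of this shape can be approximated in Pandas as follows. This is a sketch under assumptions: `count_by_column` is a hypothetical helper, the in-memory CSV is made-up toy data, and the real command may compute its result differently.

```python
import io
import json

import pandas as pd


def count_by_column(csv_source, column: str) -> str:
    """Count rows per distinct value of column and return the result as JSON."""
    df = pd.read_csv(csv_source)
    counts = df.groupby(column).size().rename("count")
    return counts.to_frame().to_json()


# Toy CSV made up for this sketch; a file path works in place of StringIO.
toy_csv = io.StringIO(
    "first_name,last_name\n"
    "chuck,norris\n"
    "bruce,lee\n"
    "kristen,norris\n"
)

output = count_by_column(toy_csv, "last_name")
parsed = json.loads(output)
print(output)
```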

Note that a different file leads to different conclusions. Here is an NBA dataset where players are grouped by AGE and the three-pointers per game (3P) are then summed for each age.

./csvcli.py cvsops --file ext/nba-2017.csv --groupby AGE --applyname 3P --func npsum
Processing csvfile: ext/nba-2017.csv and groupby name: AGE and applyname: 3P
2021-03-22 12:51:50,628 - nlib.utils - INFO - Loading appliable functions/plugins: npmedian
2021-03-22 12:51:50,628 - nlib.utils - INFO - Loading appliable functions/plugins: npsum
2021-03-22 12:51:50,628 - nlib.utils - INFO - Loading appliable functions/plugins: numpy
2021-03-22 12:51:50,628 - nlib.utils - INFO - Loading appliable functions/plugins: tanimoto
AGE
19     4.2
20     9.6
21    13.7
22    11.1
23    18.4
24    17.3
25    18.5
26    28.0
27    13.1
28    26.7
29    16.8
30    11.1
31    14.1
32     8.3
33     3.7
34     1.7
35     3.7
36     2.8
38     1.5
39     1.9
40     1.5
Name: 3P, dtype: float64
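The log lines above suggest that functions such as `npsum` and `npmedian` are looked up by name before being applied. One simple way to do that is a dictionary from names to callables, sketched below. The registry `APPLIABLE`, the helper `apply_by_name`, and the toy data are hypothetical; how `nlib` actually discovers its plugins is not shown in this README.

```python
import numpy as np
import pandas as pd


def npsum(values: pd.Series) -> float:
    """Sum of one group's values, computed with NumPy."""
    return float(np.sum(values.to_numpy()))


def npmedian(values: pd.Series) -> float:
    """Median of one group's values, computed with NumPy."""
    return float(np.median(values.to_numpy()))


# Hypothetical registry: maps the name given to --func onto a callable.
APPLIABLE = {"npsum": npsum, "npmedian": npmedian}


def apply_by_name(df: pd.DataFrame, groupby: str, applyname: str, func_name: str):
    """Look up func_name in the registry and apply it per group."""
    func = APPLIABLE[func_name]
    return df.groupby(groupby)[applyname].apply(func)


# Toy data made up for this sketch (not the NBA dataset).
df = pd.DataFrame(
    {
        "AGE": [19, 19, 20, 20, 20],
        "3P": [1.0, 2.5, 0.5, 1.5, 3.0],
    }
)

sums = apply_by_name(df, "AGE", "3P", "npsum")
medians = apply_by_name(df, "AGE", "3P", "npmedian")
print(sums)
print(medians)
```

Keeping the lookup in one dictionary means a new function only needs to be written and registered under a name to become selectable.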

Separately, grouping by AGE can be used to compute the median Wikipedia pageviews (PAGEVIEWS) for each age.

./csvcli.py cvsops --file ext/nba-2017.csv --groupby AGE --applyname PAGEVIEWS --func npmedian
Processing csvfile: ext/nba-2017.csv and groupby name: AGE and applyname: PAGEVIEWS
2021-03-22 12:50:24,365 - nlib.utils - INFO - Loading appliable functions/plugins: npmedian
2021-03-22 12:50:24,365 - nlib.utils - INFO - Loading appliable functions/plugins: npsum
2021-03-22 12:50:24,365 - nlib.utils - INFO - Loading appliable functions/plugins: numpy
2021-03-22 12:50:24,365 - nlib.utils - INFO - Loading appliable functions/plugins: tanimoto
AGE
19     453.00
20     456.50
21     334.50
22     187.50
23     271.25
24     368.50
25     182.75
26     547.50
27     189.00
28     368.25
29     169.50
30     131.25
31     427.00
32     315.00
33     267.50
34     489.50
35     685.50
36     416.00
38    2960.00
39     862.00
40    2891.50
Name: PAGEVIEWS, dtype: float64

noahgift/python-data-engineering-cookbook
