Skip to content

ganguvamshi/commit_etl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

commit_etl

Assessment

For a particular repository, example Airflow (https://github.com/apache/airflow)

  • pull in the commits over the last 6 months. (plus points if this window of time can be varied)

With the data ingested, address the follow queries:

  • For the ingested commits, determine the top 5 committers ranked by count of commits and their number of commits.
  • For the ingested commits, determine the committer with the longest commit streak.
  • For the ingested commits, generate a heatmap of number of commits count by all users by day of the week and by 3 hour blocks.

Solution

Here I am using Python for ETL process. This will authenticate and extract the commits from Github REST API, Transform or filter the required fields and load them to the local postgres database which is running inside a docker containter.

GitHub REST API investigation

Though there was hint to use the committer object instead of author. upon investigating the result data, the committer object seems not relevant as it has the same value across different authors. Below we can see the committer login is web-flow for all commits which is misleading. From gitlab docs, i learnt that this web-flow is the Git committer for all web commits (merge/revert/edit/etc...) made on GitHub.com. Hence, I used author object instead of the committer.

>>> rec_df.shape
(1569, 58)
>>> rec_df.head()
                                        sha                                            node_id                                                url  ... committer.type committer.site_admin author
0  9d5327806fac61cd62abd30a6339b0cb26ad1ebf  C_kwDOAgUK29oAKDlkNTMyNzgwNmZhYzYxY2Q2MmFiZDMw...  https://api.github.com/repos/apache/airflow/co...  ...           User                False    NaN
1  7faa72795185c1af4b21f207ce7e0735c4365d46  C_kwDOAgUK29oAKDdmYWE3Mjc5NTE4NWMxYWY0YjIxZjIw...  https://api.github.com/repos/apache/airflow/co...  ...           User                False    NaN
2  76c2ade2c63abc3677b8fcd59af6f8779b613be7  C_kwDOAgUK29oAKDc2YzJhZGUyYzYzYWJjMzY3N2I4ZmNk...  https://api.github.com/repos/apache/airflow/co...  ...           User                False    NaN
3  015fef31bb7f848b321d5287edeb6ff6ac63ab10  C_kwDOAgUK29oAKDAxNWZlZjMxYmI3Zjg0OGIzMjFkNTI4...  https://api.github.com/repos/apache/airflow/co...  ...           User                False    NaN
4  379bd80573e2c02810367ee28288e59a7af2ac10  C_kwDOAgUK29oAKDM3OWJkODA1NzNlMmMwMjgxMDM2N2Vl...  https://api.github.com/repos/apache/airflow/co...  ...           User                False    NaN

[5 rows x 58 columns]
>>> rec_df[['committer.login', 'committer.id']].drop_duplicates()
  committer.login  committer.id
0        web-flow      19864447

>>> rec_df[['commit.committer.name', 'commit.committer.email']].drop_duplicates()
  commit.committer.name commit.committer.email
0                GitHub     [email protected]

Prerequisites

  • Mamba A faster conda like environment manager used to create the python environment.
  • Docker for building the postgres database.

Installation

Create the conda environment

you can create the conda environment using env.yml file

mamba env create -f env.yml -n gitapienv
mamba activate gitapienv

Build Postgres docker image

Postgres docker image is built using commitsdb.Dockerfile which initialise the table and view creation that are stored in setup.sql. The sql creates two views within the database called commits_view and commits_extended. This commits_extended view contains the Day, Hour and hour_block columns which are calculated automatically when the data is populated into the main commits table and the view is invoked.

make sure to set/export the environmental variables declared in .env file

cd DB
export ../.env
docker build -t githubdb -f commitsdb.Dockerfile . 

Create Github personal token

The GitHub REST API uses rate limiting to control API traffic. This is not mandatory but the requests with authenticated connection have higher rate limits compared to the general requests. Check this token-github link for more information. Once the token is created, save it as a environmental variable name GITHUB_TOKEN

export GITHUB_TOKEN='adceafaxxxxxx......'
# can save it to ~/.bashrc for more consistent usage

Scripts

ETL:

  • Github.py used to create a GithubCommitExtractor class for connecting to Github via REST API and pull the commits
  • etl.py The main script for Extracting the Commits, Transforming the records and Loading them to the PostgresDB

Database:

Environment:

  • .env dot file containing the environmental variables used for DB connection.

Queries:

  • sql.py used for creating database class to connect and read from the postgres.
  • analysis.py contains the solutions for the three anlysis questions

Implementation

1. Start the Postgres DB

# run the built docker image in detached mode
docker run -d --name gitpostgres --env-file ../.env -p 5432:5432 githubdb 

Check whether the container is running. If it is running, we can ssh into the running container to see the tables and records

docker ps 

should list the running container as below

CONTAINER ID   IMAGE      COMMAND                  CREATED         STATUS         PORTS                    NAMES
ad814f168826   githubdb   "docker-entrypoint.s…"   4 seconds ago   Up 4 seconds   0.0.0.0:5432->5432/tcp   gitpostgres

the tables and records can be listed as well. we should not see any records as we just created the table alone.

docker exec -it gitpostgres bash 

#once login inside, connect to psql 
psql -d github -U vamshi

#show the table records
select * from commits;

desired output:

root@ad814f168826:/# psql -d github -U vamshi
psql (15.4 (Debian 15.4-1.pgdg120+1))
Type "help" for help.

github=# \dt
         List of relations
 Schema |  Name   | Type  | Owner
--------+---------+-------+--------
 public | commits | table | vamshi
(1 row)

github=# select * from commits;
 commit_id | committer_name | committer_email | committer_date | created_at 
-----------+----------------+-----------------+----------------+------------
(0 rows)

github=#

2. Extract the Commits from Github API using python

mamba activate gitapienv
python ETL/etl.py -u 'https://github.com/apache/airflow' -d 180 2>&1 |tee ETL/log_etl.txt

-d can be changed according to the desired time window. This will extract the commits from Github repo, select the required feilds and load them into the githubdb database.

output log file: log_etl.txt

3. Analysis

I am using Python for the analysis. This will read data stored in the postgres database via sqlalchemy and psycopg2. I used pandas to analyse the data and return the required answers.

main script: analysis.py

cd queries
python analysis.py 2>&1 |tee log_analysis.txt

output:

postgres_sql-INFO-2023-08-28 17:14:34,552-Conncection successful to github.commits
Top 5 Commiters 
     committer_name  num_counts
0     Jarek Potiuk         405
1    Hussein Awala         142
2          eladkal          93
3   Tzu-ping Chung          69
4  Miroslav Šedivý          69

Commiter with longest streak:
     committer_name   streak_duration
126   Jarek Potiuk 178 days 07:06:48

count heatmap by Day and Hour Block:
 hour_block  00-03  03-06  06-09  09-12  12-15  15-18  18-21  21-00
day
MON             8     28     61     34     35     54     68     17
TUE             9     26     32     47     39     46     75     34
WED            21     33     60     51     40     55     55     33
THU            20     26     54     43     58     66     63     43
FRI             9     33     51     45     39     96     65     26
SAT             3     14     51     29     30     40     46     21
SUN             4     21     13     23     26     37     40     28

heatmap plot is saved to heatmap_plot.png

output file: log_analysis.txt This will connect to the database, extract the records and prints the output to the console for

  • top 5 committers
  • commiter with the longest commit streak
  • heatmap count grouped by day and hour_block

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published