-
Notifications
You must be signed in to change notification settings - Fork 1
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
0 parents
commit 400ecec
Showing
6 changed files
with
2,702 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,269 @@ | ||
|
||
# Created by https://www.toptal.com/developers/gitignore/api/jupyternotebooks,macos,pycharm+all,python | ||
# Edit at https://www.toptal.com/developers/gitignore?templates=jupyternotebooks,macos,pycharm+all,python | ||
|
||
credentials/*.json | ||
|
||
### JupyterNotebooks ### | ||
# gitignore template for Jupyter Notebooks | ||
# website: http://jupyter.org/ | ||
|
||
.ipynb_checkpoints | ||
*/.ipynb_checkpoints/* | ||
|
||
# IPython | ||
profile_default/ | ||
ipython_config.py | ||
|
||
# Remove previous ipynb_checkpoints | ||
# git rm -r .ipynb_checkpoints/ | ||
|
||
### macOS ### | ||
# General | ||
.DS_Store | ||
.AppleDouble | ||
.LSOverride | ||
|
||
# Icon must end with two \r | ||
Icon | ||
|
||
# Thumbnails | ||
._* | ||
|
||
# Files that might appear in the root of a volume | ||
.DocumentRevisions-V100 | ||
.fseventsd | ||
.Spotlight-V100 | ||
.TemporaryItems | ||
.Trashes | ||
.VolumeIcon.icns | ||
.com.apple.timemachine.donotpresent | ||
|
||
# Directories potentially created on remote AFP share | ||
.AppleDB | ||
.AppleDesktop | ||
Network Trash Folder | ||
Temporary Items | ||
.apdisk | ||
|
||
### PyCharm+all ### | ||
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio, WebStorm and Rider | ||
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839 | ||
|
||
# User-specific stuff | ||
.idea/**/workspace.xml | ||
.idea/**/tasks.xml | ||
.idea/**/usage.statistics.xml | ||
.idea/**/dictionaries | ||
.idea/**/shelf | ||
|
||
# Generated files | ||
.idea/**/contentModel.xml | ||
|
||
# Sensitive or high-churn files | ||
.idea/**/dataSources/ | ||
.idea/**/dataSources.ids | ||
.idea/**/dataSources.local.xml | ||
.idea/**/sqlDataSources.xml | ||
.idea/**/dynamic.xml | ||
.idea/**/uiDesigner.xml | ||
.idea/**/dbnavigator.xml | ||
|
||
# Gradle | ||
.idea/**/gradle.xml | ||
.idea/**/libraries | ||
|
||
# Gradle and Maven with auto-import | ||
# When using Gradle or Maven with auto-import, you should exclude module files, | ||
# since they will be recreated, and may cause churn. Uncomment if using | ||
# auto-import. | ||
# .idea/artifacts | ||
# .idea/compiler.xml | ||
# .idea/jarRepositories.xml | ||
# .idea/modules.xml | ||
# .idea/*.iml | ||
# .idea/modules | ||
# *.iml | ||
# *.ipr | ||
|
||
# CMake | ||
cmake-build-*/ | ||
|
||
# Mongo Explorer plugin | ||
.idea/**/mongoSettings.xml | ||
|
||
# File-based project format | ||
*.iws | ||
|
||
# IntelliJ | ||
out/ | ||
|
||
# mpeltonen/sbt-idea plugin | ||
.idea_modules/ | ||
|
||
# JIRA plugin | ||
atlassian-ide-plugin.xml | ||
|
||
# Cursive Clojure plugin | ||
.idea/replstate.xml | ||
|
||
# Crashlytics plugin (for Android Studio and IntelliJ) | ||
com_crashlytics_export_strings.xml | ||
crashlytics.properties | ||
crashlytics-build.properties | ||
fabric.properties | ||
|
||
# Editor-based Rest Client | ||
.idea/httpRequests | ||
|
||
# Android studio 3.1+ serialized cache file | ||
.idea/caches/build_file_checksums.ser | ||
|
||
### PyCharm+all Patch ### | ||
# Ignores the whole .idea folder and all .iml files | ||
# See https://github.com/joeblau/gitignore.io/issues/186 and https://github.com/joeblau/gitignore.io/issues/360 | ||
|
||
.idea/ | ||
|
||
# Reason: https://github.com/joeblau/gitignore.io/issues/186#issuecomment-249601023 | ||
|
||
*.iml | ||
modules.xml | ||
.idea/misc.xml | ||
*.ipr | ||
|
||
# Sonarlint plugin | ||
.idea/sonarlint | ||
|
||
### Python ### | ||
# Byte-compiled / optimized / DLL files | ||
__pycache__/ | ||
*.py[cod] | ||
*$py.class | ||
|
||
# C extensions | ||
*.so | ||
|
||
# Distribution / packaging | ||
.Python | ||
build/ | ||
develop-eggs/ | ||
dist/ | ||
downloads/ | ||
eggs/ | ||
.eggs/ | ||
lib/ | ||
lib64/ | ||
parts/ | ||
sdist/ | ||
var/ | ||
wheels/ | ||
pip-wheel-metadata/ | ||
share/python-wheels/ | ||
*.egg-info/ | ||
.installed.cfg | ||
*.egg | ||
MANIFEST | ||
|
||
# PyInstaller | ||
# Usually these files are written by a python script from a template | ||
# before PyInstaller builds the exe, so as to inject date/other infos into it. | ||
*.manifest | ||
*.spec | ||
|
||
# Installer logs | ||
pip-log.txt | ||
pip-delete-this-directory.txt | ||
|
||
# Unit test / coverage reports | ||
htmlcov/ | ||
.tox/ | ||
.nox/ | ||
.coverage | ||
.coverage.* | ||
.cache | ||
nosetests.xml | ||
coverage.xml | ||
*.cover | ||
*.py,cover | ||
.hypothesis/ | ||
.pytest_cache/ | ||
|
||
# Translations | ||
*.mo | ||
*.pot | ||
|
||
# Django stuff: | ||
*.log | ||
local_settings.py | ||
db.sqlite3 | ||
db.sqlite3-journal | ||
|
||
# Flask stuff: | ||
instance/ | ||
.webassets-cache | ||
|
||
# Scrapy stuff: | ||
.scrapy | ||
|
||
# Sphinx documentation | ||
docs/_build/ | ||
|
||
# PyBuilder | ||
target/ | ||
|
||
# Jupyter Notebook | ||
|
||
# IPython | ||
|
||
# pyenv | ||
.python-version | ||
|
||
# pipenv | ||
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. | ||
# However, in case of collaboration, if having platform-specific dependencies or dependencies | ||
# having no cross-platform support, pipenv may install dependencies that don't work, or not | ||
# install all needed dependencies. | ||
#Pipfile.lock | ||
|
||
# PEP 582; used by e.g. github.com/David-OConnor/pyflow | ||
__pypackages__/ | ||
|
||
# Celery stuff | ||
celerybeat-schedule | ||
celerybeat.pid | ||
|
||
# SageMath parsed files | ||
*.sage.py | ||
|
||
# Environments | ||
.env | ||
.venv | ||
env/ | ||
venv/ | ||
ENV/ | ||
env.bak/ | ||
venv.bak/ | ||
|
||
# Spyder project settings | ||
.spyderproject | ||
.spyproject | ||
|
||
# Rope project settings | ||
.ropeproject | ||
|
||
# mkdocs documentation | ||
/site | ||
|
||
# mypy | ||
.mypy_cache/ | ||
.dmypy.json | ||
dmypy.json | ||
|
||
# Pyre type checker | ||
.pyre/ | ||
|
||
# pytype static type analyzer | ||
.pytype/ | ||
|
||
# End of https://www.toptal.com/developers/gitignore/api/jupyternotebooks,macos,pycharm+all,python |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
### Introduction | ||
|
||
Have you ever worked on data processing billions of records? Unless you're working as a data engineer, you won't have the experience of processing billions of records. [**The github activity dataset**](https://www.gharchive.org/), which contains records of developers' activities on **GitHub** sing 2012, contains billions of log records. | ||
|
||
<img src="https://imgur.com/WzgzWuu.png" width="800"> | ||
To handle this, we will use [Google BigQuery](https://cloud.google.com/bigquery), powerful and effective data warehouse tool. Even without spending a lot of money, you can experience the experience of analyzing Github records over the years through **Google BigQuery**. | ||
|
||
|
||
|
||
### requirements | ||
|
||
##### 1. install python packages | ||
|
||
````shell | ||
pip install --upgrade google-cloud-bigquery | ||
pip install --upgrade pandas-gpq | ||
pip install --upgrade six | ||
pip install --upgrade pyarrow | ||
```` | ||
|
||
##### 2. Get Bigquery Credentials | ||
|
||
If you wanna run the scripts in the repository, Download the credential json file according to [the link](https://cloud.google.com/docs/authentication/getting-started) and save it in the `credentials/` folder | ||
|
||
|
||
|
||
### Reading List | ||
|
||
#### 1. Read Data From Google Bigquery in jupyter | ||
|
||
###### It is so easy to handle Google BigQuery in the Jupyter. Let's learn 3 ways to handle Google BigQuery in Jupyter. | ||
|
||
#### 2. Understand the Schema of github archive | ||
|
||
###### Before we start analyzing the data, let's see how the github archive is structured. | ||
|
||
|
||
|
||
---- | ||
|
||
##### CopyRight [![CC BY-SA 4.0][cc-by-sa-shield]][cc-by-sa] | ||
|
||
This repository is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License][cc-by-sa]. | ||
|
||
[![CC BY-SA 4.0][cc-by-sa-image]][cc-by-sa] | ||
|
||
[cc-by-sa]: http://creativecommons.org/licenses/by-sa/4.0/ | ||
[cc-by-sa-image]: https://licensebuttons.net/l/by-sa/4.0/88x31.png | ||
[cc-by-sa-shield]: https://img.shields.io/badge/License-CC%20BY--SA%204.0-lightgrey.svg |
Empty file.
Binary file not shown.
Oops, something went wrong.