Data Kit

Real-life data science situations, each with multiple programmatic solutions:

Simple cases with one or more algorithms
Tuning and analysis of how to improve the results
Possible big data usages with AWS services (SageMaker, Data Pipeline, Glue, ...)

Most of the scripts are written in Python, in Jupyter Notebooks.

Algorithms

K-means, PCA, Linear, XGBoost

Plots

Types of plots:

Box plot

words
Correlation
Regression curve
Marker plot (2D with circles instead of points)

Concepts

Standard deviation

Captures the spread of your data around the mean. It is computed from the data you already have (the observations themselves), not from model predictions.
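As a minimal sketch of the definition above, the standard library can compute both the population and the sample standard deviation. The `values` list is made-up illustration data, not taken from any notebook in this repository.

```python
import statistics

# Hypothetical observations, used only for illustration
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.fmean(values)

# Population standard deviation: divides the squared deviations by n
population_std = statistics.pstdev(values)

# Sample standard deviation: divides by n - 1 (Bessel's correction),
# usually preferred when the data is a sample of a larger population
sample_std = statistics.stdev(values)

print(f"mean={mean:.2f} population_std={population_std:.3f} sample_std={sample_std:.3f}")
```

The two values differ noticeably on small datasets and converge as the number of observations grows.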

RMS(L)E (Root Mean Squared (Logarithmic) Error)

Cost function used to measure how well your model performs when making predictions or finding estimates. The closer this value is to 0, the better your model fits. Unlike the standard deviation, RMSE is calculated on the estimated/predicted data by comparing it with the true values. RMSLE applies the same computation to the logarithms of the values, so it penalizes relative errors rather than absolute ones.
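The two metrics can be sketched in a few lines of plain Python. The function names `rmse` and `rmsle` and the sample values are assumptions made for this example; RMSLE is written here with `log(1 + x)`, a common convention that keeps zero values valid and assumes non-negative inputs.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error.

    Uses log(1 + x), so inputs must be greater than -1
    (in practice: non-negative targets and predictions).
    """
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Hypothetical true values and predictions
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]

print(f"RMSE:  {rmse(y_true, y_pred):.4f}")
print(f"RMSLE: {rmsle(y_true, y_pred):.4f}")
```

RMSE is expressed in the unit of the target, while RMSLE is unitless, which makes it easier to compare across targets that span several orders of magnitude.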

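Oversampling and undersampling

In supervised training, some classes can be under- or over-represented in the training data. We might want to rebalance them by oversampling the under-represented classes or undersampling the over-represented ones.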
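A minimal sketch of random over- and undersampling, assuming each row carries its class label. The helper `resample`, its parameters, and the toy `rows` are hypothetical names introduced for this example; dedicated libraries offer more elaborate strategies.

```python
import random
from collections import Counter, defaultdict

def resample(rows, label_of, target_size, seed=0):
    """Return a new list in which every class has exactly target_size rows."""
    rng = random.Random(seed)

    # Group the rows by class label
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    balanced = []
    for members in by_class.values():
        if len(members) >= target_size:
            # Undersampling: draw target_size rows without replacement
            balanced.extend(rng.sample(members, target_size))
        else:
            # Oversampling: keep every row, then draw the missing ones with replacement
            extra = rng.choices(members, k=target_size - len(members))
            balanced.extend(members + extra)
    return balanced

# Toy dataset: 2 "spam" rows versus 10 "ham" rows
rows = [("spam", i) for i in range(2)] + [("ham", i) for i in range(10)]
balanced = resample(rows, label_of=lambda row: row[0], target_size=5)

print(Counter(label for label, _ in balanced))
```

Resampling is normally applied to the training split only, so that the evaluation data keeps the real class distribution.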

More about datascience

Statistical learning vs. symbolic methods

When do big data and architecture come into play?

Data Pipelines, ELT, Catalogs, Data lakes, ...
Large datasets and models => MapReduce
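To make the MapReduce idea concrete, here is a single-process word count sketch. The function names `map_phase`, `shuffle`, and `reduce_phase` and the sample documents are assumptions for illustration; a real framework would run the map and reduce steps in parallel across machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input: each string stands for one chunk of a large dataset
documents = ["big data big models", "data pipelines"]

pairs = [pair for document in documents for pair in map_phase(document)]
word_counts = reduce_phase(shuffle(pairs))

print(word_counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can be distributed, which is what makes the pattern suitable for datasets that do not fit on one machine.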

Solutions:

AWS solutions and diagrams
Dataiku, MLflow, ...

Python tips

http://jonathansoma.com/lede/foundations/classes/pandas%20columns%20and%20functions/apply-a-function-to-every-row-in-a-pandas-dataframe/
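The link above covers applying a function to every row of a pandas DataFrame. A minimal sketch of that pattern follows, assuming pandas is installed; the DataFrame, its columns, and the `row_total` helper are made up for this example.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({"price": [10.0, 20.0], "quantity": [3, 2]})

def row_total(row):
    """Compute a value from several columns of a single row."""
    return row["price"] * row["quantity"]

# axis=1 passes each row (as a Series) to the function
df["total"] = df.apply(row_total, axis=1)

# For simple arithmetic, the vectorized form gives the same result and is faster
df["total_vectorized"] = df["price"] * df["quantity"]

print(df)
```

Row-wise `apply` is handy when the logic involves branching or several columns at once; for plain arithmetic, the vectorized column expression is usually the better choice on large DataFrames.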

