Duwab/datakit

Data Kit

Real-life data science situations, each with several programmatic solutions:

  • Simple cases with one or multiple algorithms
  • Tuning and analysis on how to improve the results
  • Possible big data usages with AWS services (SageMaker, Data Pipeline, Glue, ...)

Most of the scripts are written in Python, using Jupyter Notebooks.

Algorithms

  • K-means, PCA, Linear, XGBoost

Plots

Types of plots:

  • Box plot

box plot chart

  • words

  • Correlation

  • Regression curve

  • Marker plot (2D with circles instead of points)

Concepts

  • Standard deviation

Captures the spread of your data around the mean. It is computed on the data you already have (the observations), not on predictions.
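As a minimal sketch of the definition above, the standard deviation can be computed by hand and checked against Python's `statistics` module. The `values` list is made up for illustration and does not come from this repository.

```python
import math
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.fmean(values)

# Population standard deviation: square root of the mean squared deviation.
squared_deviations = [(x - mean) ** 2 for x in values]
population_std = math.sqrt(sum(squared_deviations) / len(values))

# Sample standard deviation divides by n - 1 instead of n,
# so it is slightly larger on the same data.
sample_std = statistics.stdev(values)

print(population_std)
print(sample_std)
```

Use the population form when the data is the whole population, and the sample form when it is a sample used to estimate the spread of a larger population.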

  • RMS(L)E (Root Mean Squared (Logarithmic) Error)

A cost function used to measure how well your model makes predictions or estimates. The closer this value is to 0, the better the model. RMSE is computed on the predicted values by comparing them with the true values. RMSLE applies the same computation to the logarithms of the predicted and true values, which penalizes relative rather than absolute differences.
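The two metrics can be sketched in plain Python as follows. The function names `rmse` and `rmsle` are illustrative, not part of this repository, and the RMSLE variant shown assumes the common `log(1 + x)` form, which requires values greater than -1.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x)."""
    squared_log_errors = [
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Hypothetical true values and predictions, for illustration only.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

print(rmse(y_true, y_pred))
print(rmsle(y_true, y_pred))
```

Both functions return 0 for perfect predictions. RMSLE is usually preferred when targets span several orders of magnitude.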

  • Oversampling and undersampling

In supervised training, we may want to rebalance classes that are under- or over-represented: oversample the rare classes or undersample the frequent ones.
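A minimal sketch of random oversampling and undersampling, using only the standard library. The helper names and the toy `rows` dataset are assumptions made for illustration; they do not come from the notebooks in this repository, and more elaborate resampling techniques exist.

```python
import random
from collections import Counter, defaultdict


def group_by_label(rows):
    """Group (features, label) rows by their label."""
    groups = defaultdict(list)
    for features, label in rows:
        groups[label].append((features, label))
    return groups


def random_oversample(rows, seed=0):
    """Duplicate random rows of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    groups = group_by_label(rows)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def random_undersample(rows, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = group_by_label(rows)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


# Hypothetical imbalanced dataset: 8 "majority" rows and 2 "minority" rows.
rows = [([i], "majority") for i in range(8)] + [([i], "minority") for i in range(2)]

oversampled = random_oversample(rows)
undersampled = random_undersample(rows)
print(Counter(label for _, label in oversampled))
print(Counter(label for _, label in undersampled))
```

Resampling should be applied to the training split only, so that the evaluation data keeps the original class distribution.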

More about data science

Statistical learning vs. symbolic methods

When do big data and architecture come into play

  • Data Pipelines, ELT, Catalogs, Data lakes, ...
  • Large datasets and models => MapReduce
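To make the MapReduce idea concrete, here is a single-process word count sketch. It only illustrates the map, shuffle and reduce phases; a real framework distributes these phases across machines. All function names and the `documents` list are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values of each key into a single count."""
    return {
        key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()
    }


# Hypothetical input documents, for illustration only.
documents = ["big data big models", "data pipelines"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can run in parallel on separate workers.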

Solutions:

  • AWS solutions and diagrams
  • Dataiku, MLflow, ...

Python tips

Apply a function to every row in a pandas DataFrame: http://jonathansoma.com/lede/foundations/classes/pandas%20columns%20and%20functions/apply-a-function-to-every-row-in-a-pandas-dataframe/
