Real-life data science situations and multiple programmatic solutions:
- Simple cases with one or multiple algorithms
- Tuning and analysis on how to improve the results
- Possible big data usages with AWS services (SageMaker, Data Pipeline, Glue, ...)
Most of the scripts are written in Python, using Jupyter Notebooks
- K-means, PCA, Linear, XGBoost
Types of plots:
- Box plot
- Words
- Correlation
- Regression curve
- Marker plot (2D with circles instead of points)
- Standard deviation
Captures the spread of your data around the mean. It is calculated on the data you already have (the observed values), not on predictions.
- RMS(L)E (Root Mean Square (Logarithmic) Error)
Cost function used to measure how well your model makes predictions or estimates. The closer this value is to 0, the better the model. RMSE is calculated on the estimated/predicted data by comparing it with the true values; RMSLE applies the same calculation to the logarithms of the values.
- Oversampling and undersampling
In supervised training, we might want to rebalance classes that are under- or over-represented: oversampling adds samples to the rare classes, undersampling removes samples from the frequent ones.
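A minimal sketch of the three notions above, using only the Python standard library. The helper names (`rmse`, `rmsle`, `random_oversample`) and all sample values are made up for illustration; the oversampling shown is plain random duplication, one simple option among several.

```python
import math
import random
import statistics
from collections import Counter


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def rmsle(y_true, y_pred):
    """RMSE computed on log(1 + y); values must be non-negative."""
    squared = [(math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def random_oversample(rows, labels, seed=0):
    """Duplicate random rows of the smaller classes until every class
    has as many rows as the largest one (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Made-up values, for illustration only.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Population standard deviation of the observed data (no model involved).
spread = statistics.pstdev(y_true)
# Errors of the predictions against the true values.
error = rmse(y_true, y_pred)
log_error = rmsle(y_true, y_pred)

# Class "b" is under-represented: 1 row against 3 for class "a".
rows = [[0.1], [0.2], [0.3], [0.9]]
labels = ["a", "a", "a", "b"]
balanced_rows, balanced_labels = random_oversample(rows, labels)
```

Note that the standard deviation only needs the observed data, while RMSE and RMSLE need both the observed values and the predictions. Oversampling should normally be applied to the training split only, so that duplicated rows do not leak into the evaluation data.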
Statistical learning vs. symbolic methods
When do big data and architecture come into play?
- Data Pipelines, ELT, Catalogs, Data lakes, ...
- Large datasets and models => MapReduce
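To make the MapReduce idea concrete, here is a small single-process sketch in plain Python (a word count, the usual teaching example). The function names and the sample documents are hypothetical; a real framework would run the map and reduce phases in parallel across many machines, which this sketch does not attempt.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


# Made-up input; each string stands for one chunk of a large dataset.
documents = ["big data big models", "data pipelines and data lakes"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(pairs))
```

Because each document is mapped independently and each key is reduced independently, both phases can be distributed, which is what makes the pattern suitable for datasets that do not fit on one machine.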
Solutions:
- AWS solutions and diagrams
- Dataiku, MLflow, ...