Semester-long IST707 - Data Analytics project
- The objective of the project is to use the main skills taught in this class to solve a real data mining problem.
- For this project, you must choose your own dataset. It can be one that you created yourself or found in other sources, such as the Kaggle competitions and the UCI repository.
- Define a problem on the dataset and describe it in terms of its real-world organizational or business application. The complexity level of the problem should be comparable to homework assignments.
Data provided by UCI.
See full_pipeline_model.ipynb for the following:
- Data pipeline transformation
- Splitting training and test files
- Build Transformer model using PyTorch
- Train model
- Save model
- Forecast entire test set
- Run cost projection on air pollution fines
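
The train/test split step in the list above can be illustrated with a small standalone sketch. It is not the notebook's code: the function names `chronological_split` and `sliding_windows`, the 20% test fraction, and the window sizes are all assumptions chosen for illustration. It shows the general idea of splitting a time-ordered series without shuffling and then cutting it into input/target windows for a forecasting model.

```python
from typing import List, Sequence, Tuple


def chronological_split(
    series: Sequence[float], test_fraction: float = 0.2
) -> Tuple[List[float], List[float]]:
    """Split a time-ordered series into train and test parts without shuffling."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    n_test = int(round(len(series) * test_fraction))
    split_at = len(series) - n_test
    # Everything before split_at is training data; the most recent points are held out.
    return list(series[:split_at]), list(series[split_at:])


def sliding_windows(
    series: Sequence[float], input_len: int, horizon: int = 1
) -> List[Tuple[List[float], List[float]]]:
    """Build (input window, target window) pairs for supervised forecasting."""
    pairs = []
    last_start = len(series) - input_len - horizon
    for start in range(last_start + 1):
        inputs = list(series[start:start + input_len])
        targets = list(series[start + input_len:start + input_len + horizon])
        pairs.append((inputs, targets))
    return pairs


if __name__ == "__main__":
    readings = list(range(10))  # hypothetical stand-in for hourly pollution readings
    train, test = chronological_split(readings, test_fraction=0.2)
    windows = sliding_windows(train, input_len=3, horizon=1)
    print(len(train), len(test), len(windows))
```

Keeping the split chronological means the model is never trained on observations that come after the ones it is asked to forecast.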
- ts_transformer.py: Time Series Transformer neural network architecture
- torch_utils.py: Custom module of PyTorch helper functions, a PyTorch class for the pollution dataset, and a wrapper class that allows ts_transformer to be used in a Scikit-learn pipeline
- sklearn_utils.py: Custom Transformer steps for a Scikit-learn pipeline and a pipeline creation function (bejing_pipeline)
- preprocessor.py: Kalman filtering class (Preprocessor) that imputes missing data in the datasets during data processing
- process_data.py: command line script to impute missing data and clean the data
- tseries.R: Time Series plotting and analysis of sample data
- tseries_eda.Rmd: extensive EDA and time series plotting and forecasting
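
To give a feel for the Kalman-filter imputation mentioned for preprocessor.py, here is a minimal hand-written sketch. It is an assumption-laden illustration, not the Preprocessor class: it does not use PyKalman, it models a single series as a local-level (random walk plus noise) process, and the function name `kalman_impute` and the variance defaults are hypothetical. Missing entries (`None` or NaN) are replaced with the filter's one-step prediction; observed values are kept as they are.

```python
import math
from typing import List, Optional, Sequence


def kalman_impute(
    observations: Sequence[Optional[float]],
    process_var: float = 1.0,
    obs_var: float = 1.0,
) -> List[Optional[float]]:
    """Fill gaps using the forward Kalman filter of a local-level model."""
    state_mean: Optional[float] = None
    state_var = 0.0
    filled: List[Optional[float]] = []
    for obs in observations:
        missing = obs is None or (isinstance(obs, float) and math.isnan(obs))
        if state_mean is None:
            if missing:
                filled.append(obs)  # no earlier observation to predict from
            else:
                # Initialise the state from the first observed value.
                state_mean, state_var = float(obs), obs_var
                filled.append(float(obs))
            continue
        # Predict step: a random walk keeps the mean and grows the variance.
        pred_mean = state_mean
        pred_var = state_var + process_var
        if missing:
            # No measurement, so carry the prediction forward as the estimate.
            state_mean, state_var = pred_mean, pred_var
            filled.append(pred_mean)
        else:
            # Update step: blend prediction and measurement via the Kalman gain.
            gain = pred_var / (pred_var + obs_var)
            state_mean = pred_mean + gain * (float(obs) - pred_mean)
            state_var = (1.0 - gain) * pred_var
            filled.append(float(obs))
    return filled


if __name__ == "__main__":
    print(kalman_impute([1.0, None, 3.0, float("nan"), 4.0]))
```

This sketch only runs the filter forward in time, so each gap is filled from past data alone. A smoother, which also uses later observations, would usually give better estimates for gaps in the middle of a series.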
For all Python files (.ipynb and .py extensions):
- Custom modules:
- torch_utils
- ts_transformer
- preprocessor
- PyTorch
- pathlib
- NumPy
- Pandas
- PyKalman
- Scikit-learn
- Matplotlib
- multiprocessing
- os
- glob
- click
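
One quick way to check that the third-party Python packages are installed is sketched below. The helper name `missing_modules` is hypothetical, and the import names in the list (for example `sklearn` for Scikit-learn and `torch` for PyTorch) are the usual ones for these packages, stated here as an assumption since the repository does not list them explicitly.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names assumed for the third-party packages listed above.
THIRD_PARTY_IMPORTS = [
    "torch", "numpy", "pandas", "pykalman", "sklearn", "matplotlib", "click",
]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(THIRD_PARTY_IMPORTS)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```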
For all R scripts:
- tseries
- TSstudio
- forecast
- xts
- tidyverse