Analyzed and predicted stock prices from two datasets using both supervised and unsupervised machine learning techniques in PySpark
- XGBoost was used as the regression model for price prediction
- Its regularization parameters help avoid overfitting
- It handles missing and NaN values natively
- Predicted the closing price for the next 14 days on the validation data during training.
- Made predictions on holdout testing data.
- We used the Root Mean Squared Error (RMSE) to measure prediction error (see the sketch after this list). Across all stocks the average RMSE was 30, with a maximum of 77 for an individual stock.
- Validation results sometimes differed from the testing results by a large margin: TITAN, one stock in the dataset, showed a difference of 47 between its validation and testing RMSE.
- For good predictions, RMSE is typically in the range 0.2-0.5 (RMSE is scale-dependent, so it should be judged relative to the scale of the prices).
- RMSE was much higher on training than on testing, which indicates that the model did not learn the training distribution well.
- A more complex model could reduce underfitting
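A minimal sketch of this workflow (not our exact pipeline: the file, column names, lag features, and hyperparameters below are assumptions) that fits an XGBoost regressor on lagged closing prices, holds out the last 14 days as a validation window, and reports the RMSE:

```python
import numpy as np
import pandas as pd
import xgboost as xgb

def make_lag_features(prices: pd.Series, n_lags: int = 14) -> pd.DataFrame:
    """Build a supervised table: each row predicts a close from the previous n_lags closes."""
    df = pd.DataFrame({f"lag_{i}": prices.shift(i) for i in range(1, n_lags + 1)})
    df["target"] = prices
    return df.dropna()

# Hypothetical input: a CSV with Date and Close columns for one stock.
prices = pd.read_csv("stock.csv", parse_dates=["Date"]).set_index("Date")["Close"]
data = make_lag_features(prices)

# The last 14 rows act as the validation window; everything before is training data.
train, valid = data.iloc[:-14], data.iloc[-14:]
X_train, y_train = train.drop(columns="target"), train["target"]
X_valid, y_valid = valid.drop(columns="target"), valid["target"]

model = xgb.XGBRegressor(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.05,
    reg_alpha=0.1,   # L1 regularization, helps limit overfitting
    reg_lambda=1.0,  # L2 regularization
)
model.fit(X_train, y_train)

pred = model.predict(X_valid)
rmse = float(np.sqrt(np.mean((pred - y_valid.to_numpy()) ** 2)))
print(f"14-day validation RMSE: {rmse:.2f}")
```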
- Time series k-means with Dynamic Time Warping
- Why use DTW instead of Euclidean?
- Euclidean distance doesn’t work well when the time series lengths are mismatched; DTW uses the nearest neighbour in the comparison sequence at each index, so time-shifted but similarly shaped series are scored as close
- Most stocks increase, some are stagnant, and very few decrease, so we used 3 clusters
- Two models were trained: one on normalized data, one on standardized data (see the sketch after this list)
- Standardized model
- Cluster 0: stocks are rising
- Cluster 1: stocks are stagnant
- Cluster 2: stagnant at first, then a spike in late 2019 (which is interesting, since the pandemic starts around then)
- Normalized model
- Cluster 0: messy - may not have trained well. Some rising stocks, but also some that don’t rise, and some that are volatile
- Cluster 1: rising stocks
- Cluster 2: a single decreasing stock
The standardized model appears to be the more reliable model
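A minimal sketch of the clustering step, assuming tslearn is used for DTW k-means; `closing_prices.npy` is a hypothetical file holding one row of closing prices per stock. The first few lines also illustrate why DTW treats time-shifted but similar series as close, where Euclidean distance does not:

```python
import numpy as np
from tslearn.clustering import TimeSeriesKMeans
from tslearn.metrics import dtw
from tslearn.preprocessing import TimeSeriesScalerMeanVariance, TimeSeriesScalerMinMax

# DTW vs Euclidean on two identical shapes shifted in time: the DTW distance is far smaller.
t = np.linspace(0, 6, 100)
a, b = np.sin(t), np.sin(t - 0.5)
print("euclidean:", np.linalg.norm(a - b), "dtw:", dtw(a, b))

series = np.load("closing_prices.npy")  # hypothetical file: shape (n_stocks, n_days)

# "Normalized" model: each series rescaled to [0, 1];
# "standardized" model: each series rescaled to zero mean and unit variance.
variants = {
    "normalized": TimeSeriesScalerMinMax().fit_transform(series),
    "standardized": TimeSeriesScalerMeanVariance().fit_transform(series),
}

for name, X in variants.items():
    km = TimeSeriesKMeans(n_clusters=3, metric="dtw", random_state=0)
    labels = km.fit_predict(X)
    print(name, "cluster sizes:", np.bincount(labels, minlength=3))
```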
Lastly, we summarize the insights from our results.
- XGBoost
- The model was not the best fit for stock market data; stock prices are inherently difficult to predict.
- K-Means Clustering
- Using the clusters to pick stocks that generally increase for a portfolio may be more reliable than training a model to predict exact prices.
- Updating the regression model in PySpark would give a better overview of the model's performance
- Hyperparameter tuning could improve the accuracy of the regression model
- Stacking several different models can produce better, more accurate predictions (see the sketch below)
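A minimal sketch of those last two suggestions using scikit-learn utilities; the parameter grid, base models, and placeholder training data below are assumptions rather than our actual setup (in practice `X_train`/`y_train` would be a lag-feature table like the one in the XGBoost sketch above):

```python
import numpy as np
import xgboost as xgb
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Placeholder training data; replace with real lag features and closing prices.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 14))
y_train = X_train[:, 0] + rng.normal(scale=0.1, size=200)

# Hyperparameter tuning: search a small grid with time-ordered cross-validation.
grid = GridSearchCV(
    estimator=xgb.XGBRegressor(),
    param_grid={"max_depth": [3, 5], "learning_rate": [0.05, 0.1], "n_estimators": [200, 400]},
    scoring="neg_root_mean_squared_error",
    cv=TimeSeriesSplit(n_splits=5),
)
grid.fit(X_train, y_train)

# Stacking: blend the tuned XGBoost model with a simple linear model,
# letting a Ridge meta-learner combine their predictions.
stack = StackingRegressor(
    estimators=[("xgb", grid.best_estimator_), ("ridge", Ridge(alpha=1.0))],
    final_estimator=Ridge(),
)
stack.fit(X_train, y_train)
```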