This is a repo for our STAT 656 project in Spring 2024
Team Name: Machine Learners
Team members: Elizabeth Chun, Minhyuk (Joseph) Kim, Sophia Lazcano
There are four main files for running the analysis and reproducing outputs. You should only need to interact with two of them: config.R and models.R. The other two fetch data (no editing needed unless you want a new data pull) and process data (sourced/imported by models.R).
- fetch_data.R pulls stock price data using the tidyquant package and writes stock_data.csv. Should not need to be touched.
- config.R sets key processing and modeling parameters, such as input and output lengths. Edit here to change params as needed.
- process_data.R loads stock_data.csv and performs processing, including data splitting, scaling, and encoding. Sources config.R for params. Should not need to be touched.
- models.R builds, trains, and evaluates models; currently contains a simple LSTM example. Sources process_data.R for data. This is the main working file.
Old exploratory files have been moved to exploratory/*.
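As a rough illustration, config.R could look something like the sketch below. The parameter names here are hypothetical, not necessarily the repo's actual ones:

```r
# Hypothetical config.R sketch -- actual parameter names in the repo may differ.
# Central place for processing and modeling parameters.
input_length  <- 30   # number of past days fed to the model
output_length <- 5    # number of future days to predict
train_frac    <- 0.8  # fraction of each series used for training
batch_size    <- 32
epochs        <- 20
```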
To add a new model, copy models.R, define your own keras model architecture, and compile it. In other words: model <- keras_model_sequential() %>% [insert layers here]
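As a concrete sketch of that recipe (layer sizes are illustrative, and the variables input_length, n_features, and output_length are assumed to come from config.R / process_data.R):

```r
# Hedged sketch of a new model definition using the keras R package.
# Layer choices and all dimension variables below are illustrative.
library(keras)

model <- keras_model_sequential() %>%
  layer_lstm(units = 32, input_shape = c(input_length, n_features)) %>%
  layer_dense(units = output_length)

model %>% compile(
  loss = "mse",
  optimizer = optimizer_adam(),
  metrics = c("mae")
)
```

Swap the layer_lstm() call for whatever architecture you are responsible for; the compile step stays largely the same across models.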
Data pulled from here: https://topforeignstocks.com/indices/components-of-the-sp-500-index/ (CSV file as of February 4th, 2024).
Modified to fix three tickers: BRKB (corrected to BRK-B), BFB (corrected to BF-B), and CDAYS (corrected to DAYS).
Caret does not appear to support multi-sample model fitting: it splits the entire time series into rolling windows, assuming it is a single series. We could fit separate models for each stock, but that is not exactly our goal, so we have currently chosen to use keras with data generators to fit multi-sample models.
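A minimal base-R sketch of what such a multi-sample generator could look like, assuming a long-format data frame with hypothetical columns ticker and price_scaled (the repo's actual generator may differ):

```r
# Hedged sketch: yields batches of (input window, output window) pairs
# drawn across ALL stocks, rather than treating the data as one series.
# Column names `ticker` and `price_scaled` are illustrative assumptions.
make_generator <- function(df, input_length, output_length, batch_size) {
  # Precompute every valid (ticker, start index) window across stocks
  windows <- do.call(rbind, lapply(split(df, df$ticker), function(d) {
    n <- nrow(d) - input_length - output_length + 1
    if (n < 1) return(NULL)
    data.frame(ticker = d$ticker[1], start = seq_len(n))
  }))
  i <- 0
  function() {
    idx <- (i %% nrow(windows)) + seq_len(batch_size) - 1
    idx <- (idx %% nrow(windows)) + 1  # wrap around at the end
    i <<- i + batch_size
    x <- array(0, dim = c(batch_size, input_length, 1))
    y <- array(0, dim = c(batch_size, output_length))
    for (b in seq_len(batch_size)) {
      w <- windows[idx[b], ]
      s <- df$price_scaled[df$ticker == w$ticker]
      x[b, , 1] <- s[w$start:(w$start + input_length - 1)]
      y[b, ]    <- s[(w$start + input_length):
                     (w$start + input_length + output_length - 1)]
    }
    list(x, y)
  }
}
```

Each call to the returned function produces one batch in the (samples, timesteps, features) shape keras expects.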
- Pick 3 models. Suggested: ARIMA (simple, linear-type regression), XGBoost (tree based), transformer (deep learning).
  - Maybe each of us is responsible for one model?
- Splitting: suggested to split OD/ID and check within-sector vs. between-sector comparisons.
- Scaling: min-max, normalization, etc.
- Encoding: not sure whether tickers and/or sectors need to be encoded as numeric.
  - Not terribly interesting from a theoretical perspective.
  - But might be important from a practical perspective.
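For reference, the two scaling options mentioned above can be written as base-R one-liners:

```r
# Two common scaling choices for price series (base R, no packages needed).
min_max <- function(x) (x - min(x)) / (max(x) - min(x))  # maps to [0, 1]
z_score <- function(x) (x - mean(x)) / sd(x)             # standardization

prices <- c(100, 105, 110, 120)
min_max(prices)  # 0.00 0.25 0.50 1.00
```

Note that to avoid leakage, the min/max (or mean/sd) should be computed on the training split only and then applied to the test split.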
Feynman method:
- How do data processing choices (in particular splitting, scaling, encoding, etc.) affect prediction accuracy?
- Define prediction accuracy, splitting, encoding, etc.
- Need to define the test set (can discuss OD vs. ID), cross-validation, or some other method. Also define scaling (and why it is needed) and encoding (optional; not sure we need it). Lastly, define prediction accuracy (mean squared error? MAE? etc.).
- "Solidify concepts"
- Iterate
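The candidate accuracy metrics mentioned above can be made concrete with a couple of base-R definitions:

```r
# Base-R definitions of the accuracy metrics under discussion.
mse <- function(actual, pred) mean((actual - pred)^2)  # mean squared error
mae <- function(actual, pred) mean(abs(actual - pred)) # mean absolute error

actual <- c(10, 12, 11)
pred   <- c(11, 11, 11)
mse(actual, pred)  # (1 + 1 + 0) / 3, about 0.667
mae(actual, pred)  # same here, since every error is 0 or 1
```

MSE penalizes large errors more heavily, while MAE is in the original price units, which may matter for how we present results.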