Authors: Brooks Li, Kush Patel, Albert Nguyen, Deeksha Koonadi, Ashley Soto
This repository contains the code used for the MIS 382N Advanced Machine Learning final project. It includes the files used to scrape the data and the notebooks used to run the model analysis on the extracted data.
Overview of File Names:
- Company 10-K 1.csv, Company HTML Urls 1.csv, Company Index Urls.csv, Company Names.csv, and edgar_scraper.py were all used in the data scraping process.
- Earning Surprise Boosting.ipynb, Earning Surprise Neural Nets.ipynb, Earning Surprise Random Forests.ipynb, Earning Surprise SVM.ipynb, and Earning Surprise Decision Trees.ipynb were the main files used to build the models for classifying earning surprises.
- Sentiment Contrastive Learning files 1, 2, and 3 were prototypes of contrastive learning techniques for developing stronger sentiment scores between documents and their respective base positive and negative classes. Sentiment Contrastive Learning 3.ipynb is the most recent version of the contrastive learning technique.
- Earning Surprise Data Preprocessing.ipynb is the file used for most of the data cleaning, feature extraction, and data engineering.
- Earning Surprise ML Classification.ipynb was used to test the effect of adding TF-IDF word vectorizer features to the dataset.
- Negative 10K, Positive 10K, Negative Earning Call Transcript, Positive Earning Call Transcript: the base documents generated by LLMs. The actual documents are compared against these base documents to determine sentiment.
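
To illustrate the idea of scoring a document against base positive and negative documents, here is a minimal sketch. It is not the method used in the notebooks: it assumes a plain bag-of-words cosine similarity, and the function names and toy texts are hypothetical stand-ins for the LLM-generated base documents.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def sentiment_score(document, positive_base, negative_base):
    """Similarity to the positive base minus similarity to the negative base.

    The result lies in [-1, 1]; values above 0 mean the document is
    closer to the positive base than to the negative base.
    """
    doc = bag_of_words(document)
    return (cosine_similarity(doc, bag_of_words(positive_base))
            - cosine_similarity(doc, bag_of_words(negative_base)))


# Hypothetical toy texts, assumed for illustration only.
positive_base = "revenue growth exceeded expectations with strong record profit"
negative_base = "revenue decline missed expectations with weak demand and losses"

score = sentiment_score("record profit and strong growth this quarter",
                        positive_base, negative_base)
```

In this sketch a positive `score` suggests the document reads more like the positive base document. A learned representation, such as the contrastive approach prototyped in the notebooks, would replace the bag-of-words vectors.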
A summary of the project can be found on our blog posted on Medium: https://medium.com/@19lizezhou/predicting-earning-surprises-a-deep-dive-into-machine-learning-techniques-3c16b35f019f