For our final project, we developed models to predict the underpricing of IPOs (Initial Public Offerings). By collecting historical IPO data, we were able to train multiple machine learning models to classify the data. Our best model, a Random Forest classifier, achieved an accuracy of about 76%. All of our development was done in Python using Jupyter notebooks.
The libraries we used include:
- scikit-learn: used for training and classifying with our selected models, as well as for analyzing the results.
- pandas: used for general data handling, especially importing and combining data from .csv files.
- numpy: used for general data formatting and handling.
- pytorch: used to create a binary classification neural network.
- seaborn: visualization.
- ipython: development environment.
To install all of our project dependencies, run:

```shell
pip install -r requirements.txt
```
For data collection and formatting we used pandas and .csv files. All of our data can be found in the data folder, and each source that needed to be cleaned has an independent IPython notebook in the cleaning_scripts folder. We sourced our data from the following websites and applications.
- IPOScoop Data:
  - Source: IPOScoop.com
  - Notebook: scoop-data.ipynb
  - We converted the .xls file into a .csv file in the cleaning notebook.
  - IPOs from 2000 to 2020.
- Ticker Data:
  - Source: Ticker.com
  - Notebook: ticker-data.ipynb
  - Scraped directly from the website using the pandas read_html function.
  - Concatenated data from 2012 to 2022.
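The scraping step can be sketched as follows. The table markup and column names here are invented for illustration (the real page layout may differ), and the actual notebook reads the live page rather than a literal HTML string:

```python
from io import StringIO

import pandas as pd

# Tiny stand-in for one year's IPO table on the source page;
# the real column names and markup may differ.
html = """
<table>
  <tr><th>Symbol</th><th>Offer Price</th><th>First Day Close</th></tr>
  <tr><td>ABCD</td><td>15.00</td><td>18.20</td></tr>
  <tr><td>WXYZ</td><td>22.00</td><td>21.10</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> found
tables = pd.read_html(StringIO(html))
ipo_df = tables[0]

# Concatenating one DataFrame per year mirrors combining 2012-2022
combined = pd.concat([ipo_df, ipo_df], ignore_index=True)
print(combined.shape)  # (4, 3)
```

In practice each year's page is read into its own DataFrame and the results are stacked with `pd.concat` before cleaning.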
- Bloomberg Data:
  - Source: Bloomberg Terminal
  - Notebook: bloomberg-sector-cleaning.ipynb
  - Downloaded data from a terminal in the business library.
- FRED Data:
  - Source: St. Louis Federal Reserve Economic Data (FRED) database.
  - Used to source macroeconomic data on a specific industry and sector.
We trained all of our models using the following features:
- Sales - 1 Yr Growth
- Profit Margin
- Return on Assets
- Offer Size (M)
- Shares Outstanding (M)
- Offer Price
- Market Cap at Offer (M)
- Cash Flow per Share
- Instit Owner (% Shares Out)
- Instit Owner (Shares Held)
- Real GDP Per Capita
- OECD Composite Leading Indicator
- Interest Rate
- Seasonally Adjusted Unemployment Rate
- CPI Growth Rate
- Industry Sector
- Industry Group
- Industry Subgroup
- Underpriced (Classifying Feature)
These features were selected based on the features used in previous research, along with the data that was publicly available to us. Please reference our research paper for a definition of each feature.
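A rough sketch of how a feature matrix and label vector might be split out of the merged dataset with pandas; the rows and the abbreviated column set here are made up for illustration (the real data lives in the data folder):

```python
import pandas as pd

# Toy stand-in for the merged IPO dataset; real rows come from data/*.csv
df = pd.DataFrame({
    "Offer Price": [15.0, 22.0, 9.5],
    "Offer Size (M)": [120.0, 340.0, 55.0],
    "Profit Margin": [0.12, -0.03, 0.08],
    "Industry Sector": ["Tech", "Health", "Tech"],
    "Underpriced": [1, 0, 1],  # classifying feature
})

# One-hot encode the categorical industry column so numeric models can use it
features = pd.get_dummies(df.drop(columns=["Underpriced"]),
                          columns=["Industry Sector"])
labels = df["Underpriced"]
print(features.shape)  # (3, 5)
```

One-hot encoding is one common way to feed the categorical Industry Sector/Group/Subgroup columns to scikit-learn models.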
We implemented four machine-learning models that were identified by previous research done in the field. We utilized the sklearn library to implement the random forest, gradient boosting classifier, and support vector machine, and the pytorch library to implement a neural network. All of our models can be found in the models folder.
- Random Forest:
  - Notebook: random_forest_scikit.ipynb
  - Overall Accuracy: 76%
  - Underpriced Accuracy: 92.8%
  - Overpriced Accuracy: 16.6%
  - Implemented using the sklearn class RandomForestClassifier
- Gradient Boosting Classifier:
  - Notebook: gradient_boosting.ipynb
  - Overall Accuracy: 75.3%
  - Underpriced Accuracy: 94.9%
  - Overpriced Accuracy: 12.1%
  - Implemented using the sklearn class GradientBoostingClassifier
- Support Vector Machine:
  - Notebook: svm.ipynb
  - Overall Accuracy: 73.9%
  - Underpriced Accuracy: 100%
  - Overpriced Accuracy: 0%
  - Implemented using the sklearn svm module
- Neural Network:
  - Notebook: neural_network.ipynb
  - Overall Accuracy: 70.2%
  - Underpriced Accuracy: 88.8%
  - Overpriced Accuracy: 16.2%
  - Implemented using the pytorch library
To achieve 76% accuracy with the random forest model, we first analyzed several of the model's parameters. Specifically, we examined the results of every combination of the parameters listed below:
- n_estimators - The number of trees in the forest
- criterion - The function used to measure the quality of a split
- max_depth - The maximum allowed depth of each tree
- max_features - The number of features to consider when looking for the best split
After this analysis, the max_depth of the trees turned out to be the determining factor in the model's accuracy. The full process can be found in test_random_forest_model_config.ipynb.
For more information view our project paper. It goes into much greater detail about our problem space, algorithms, methods, and results.
- "Predicting IPO underperformance using machine learning" by Rachit Agrawal
- "Textual Information and IPO Underpricing: A Machine Learning Approach" by Apostolos Katsafados et al.
- "A Neural Network Model to Predict Initial Return of Chinese SMEs Stock Market Initial Public Offerings" by Dan Meng
- "Proceedings of the 2021 3rd International Conference on Economic Management and Cultural Industry" by Kelai Wang