This project comes from CST2101 - Python Programming, in the Algonquin BISI program. In this project, we are asked to study a dataset called 'pima', which contains 9 features and 1000 observations. The features are 'Pregnancies', 'Glucose', 'Blood pressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age' and 'Outcome'. The last feature, 'Outcome', is the class variable: '0' means the person is not diabetic and '1' means the person is diabetic.
The aim of this project is to use the first 8 features in the dataset to predict 'Outcome'. Two machine learning models, logistic regression and random forest, were adopted in this study, and the accuracy of the two models was calculated and compared.
- Exploratory data analysis (seaborn, matplotlib)
- Machine learning models (logistic regression and random forest, using sklearn and its functions)
The accuracy results indicate that the random forest model performs slightly better than the logistic regression model.
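
The model comparison described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: it uses synthetic stand-in data from `make_classification` with the same shape as the dataset (1000 observations, 8 predictors, a binary outcome), and the 80/20 split, `random_state` values, and hyperparameters are all assumptions. The accuracies it prints say nothing about the real 'pima' results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in data shaped like the described dataset
# (1000 observations, 8 predictors, binary outcome). With the real data,
# X would hold the first 8 features and y the 'Outcome' column.
X, y = make_classification(
    n_samples=1000,
    n_features=8,
    n_informative=5,
    n_redundant=0,
    random_state=42,
)

# Assumption: hold out 20% of the rows for testing, stratified on the outcome.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The two models named in this README; hyperparameters are illustrative.
models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model on the training split and score it on the held-out split.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {accuracies[name]:.3f}")
```

Scoring both models on the same held-out split keeps the comparison fair; a difference as small as "slightly better" may also be worth confirming with cross-validation (for example `sklearn.model_selection.cross_val_score`).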