This repository is a collection of my projects focused on applied statistics.
Click on the link to view the full project PDF.
Objective & Motivation: The city tax assessor is interested in predicting residential home sales prices in a midwestern city, given a dataset of various characteristics of each home and its surrounding property. The features we use to predict the sales price of a home are the number of bedrooms, the number of bathrooms, and garage size.
Role: Statistician
Data: Our dataset consists of 522 total transactions from home sales in a midwestern city during the year 2002.
Models and statistical techniques: Used R as the main programming language to conduct the statistical analysis and R Markdown to knit and export the results into a PDF.
- Model Estimation and Interpretation
- Fitting a Multiple Linear Regression model
- Adjusted R-Squared
- Prediction
- 95% Confidence Interval
- 95% Prediction Interval
- Hypothesis Testing
- T-test for partial slope significance
- F-Test for Overall model significance
- Partial F-Test to assess reduced and full model significance
- Multicollinearity
- Scatterplots
- Correlation Matrices
- Keeping the bathroom and garage size features in our model is statistically justified.
- Based on the results, we should remove either the bedroom or the bathroom feature from our model: the two are highly correlated with each other, which makes it difficult for the model to attribute significance to either predictor.
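The correlation check behind this conclusion can be sketched in a few lines. The original analysis was done in R; the Python below is only an illustration using the standard library. The helper names `pearson_r` and `correlation_matrix` are hypothetical, and the numbers are made up for the example rather than taken from the 522-transaction dataset.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping column name -> values."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Made-up illustrative values, NOT the real home sales data.
predictors = {
    "bedrooms": [2, 3, 3, 4, 4, 5],
    "bathrooms": [1, 2, 2, 3, 3, 4],
    "garage_size": [1, 2, 1, 2, 3, 2],
}

matrix = correlation_matrix(predictors)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

A pairwise correlation close to 1 or -1 between two predictors is the kind of signal that motivates dropping one of them.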
Objective & Motivation: We continue our project of predicting residential home sales prices using the same dataset, but now with a different set of explanatory variables: the area of the residence in square feet, and the presence or absence of a swimming pool and of air conditioning on the property.
Additional models and statistical techniques: We continue to use R as our main statistical programming tool.
- Regression using a Dummy Variable
- Hypothesis test on the significance of the slope coefficient
- Fitting a Multiple Linear Regression model with Interaction Term of a Dummy Variable and Continuous Variable
- Plotting Fitted Regression Lines Between Dummy Variables
- Testing for Parallel Lines Between Two Regression Lines
- Fitting into a regression only with Interaction of Dummy Variables
- Calculating estimated mean sales prices for 4 kinds of properties
Based on our two-sided hypothesis test, we concluded that the slope coefficient was in fact significant. This tells us that the mean sales price of properties with a swimming pool differs significantly from that of properties without one.
In the second part of our analytical study, we fit a multiple linear regression model that contains the interaction term between the dummy variable and the continuous variable. The goal was to establish whether the fitted regression line for homes with a pool differs from the line for homes without a pool, and how that difference depends on the square footage of the property. The model revealed that our regression lines are not parallel, meaning the effect of square footage on sales price depends on whether the home has a pool.
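A minimal sketch of the interaction model may help here. It is written in Python for illustration only (the project itself used R), and it relies on a standard property of least squares: when the model includes the dummy, the continuous variable, and their interaction, the coefficient estimates match those obtained by fitting a separate line within each group. The function names and data are hypothetical, and the sketch returns estimates only; it does not carry out the significance test for parallel lines.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_interaction_model(x, d, y):
    """Estimates for y = b0 + b1*x + b2*d + b3*(x*d), with d a 0/1 dummy.

    Group d == 0 gives (b0, b1); group d == 1 gives (b0 + b2, b1 + b3).
    """
    group0 = [(a, c) for a, flag, c in zip(x, d, y) if flag == 0]
    group1 = [(a, c) for a, flag, c in zip(x, d, y) if flag == 1]
    int0, slope0 = fit_line(*zip(*group0))
    int1, slope1 = fit_line(*zip(*group1))
    return {
        "intercept": int0,
        "x": slope0,
        "d": int1 - int0,
        "x:d": slope1 - slope0,  # zero would mean the two lines are parallel
    }


# Made-up illustrative values (square feet in thousands, price in $1000s).
sqft = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
pool = [0, 0, 0, 0, 1, 1, 1, 1]
price = [12.0, 14.0, 16.0, 18.0, 23.0, 26.0, 29.0, 32.0]

coefs = fit_interaction_model(sqft, pool, price)
print(coefs)
```

The `x:d` estimate is the difference in slopes between the two groups, which is the quantity a test for parallel lines examines.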
Based on our regression analysis, a property with a swimming pool and air conditioning ($356,752.3) costs significantly more than a property without ($189,578.2). In this case, if I were looking to buy a home in the future, perhaps a home with just air conditioning would suffice.
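To show how estimated means for the four kinds of properties relate to a regression on two dummy variables and their interaction, here is a small Python sketch (again, the project used R, and the names and numbers below are hypothetical). It assumes all four pool/air-conditioning combinations appear in the data; in that case the fitted means equal the group averages, and the coefficients can be read off from them.

```python
def cell_means(pool, ac, price):
    """Average price for each (pool, ac) combination of 0/1 dummies."""
    totals = {}
    for p, a, y in zip(pool, ac, price):
        running_sum, count = totals.get((p, a), (0.0, 0))
        totals[(p, a)] = (running_sum + y, count + 1)
    return {cell: s / n for cell, (s, n) in totals.items()}


def dummy_interaction_coefficients(means):
    """Coefficients of price = b0 + b1*pool + b2*ac + b3*(pool*ac).

    Requires all four cells; keys are (pool, ac) pairs.
    """
    m00, m10 = means[(0, 0)], means[(1, 0)]
    m01, m11 = means[(0, 1)], means[(1, 1)]
    return {
        "intercept": m00,                   # neither feature
        "pool": m10 - m00,                  # pool effect without AC
        "ac": m01 - m00,                    # AC effect without pool
        "pool:ac": m11 - m10 - m01 + m00,   # extra effect of having both
    }


# Made-up illustrative values (price in $1000s), NOT the real dataset.
pool = [0, 0, 1, 1, 0, 0, 1, 1]
ac = [0, 0, 0, 0, 1, 1, 1, 1]
price = [100, 120, 150, 170, 130, 150, 200, 220]

means = cell_means(pool, ac, price)
coefs_2x2 = dummy_interaction_coefficients(means)
print(means)
print(coefs_2x2)
```

Adding the relevant coefficients back together recovers each group's estimated mean, for example intercept + pool + ac + pool:ac for a property with both features.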
For future reference, we note that our dataset is unbalanced: only about 7% of observations (36 out of 522) have a swimming pool and about 17% (88 out of 522) have air conditioning. Moving forward, one way to correct this would be to collect more data from houses with these features. Also, since the interaction term between having a swimming pool and having air conditioning is not significant, we would remove it from the model.
Objective & Motivation: A health insurance company wants to analyze the average length of stay of inpatients at a certain hospital and see whether it is related to the average estimated probability of inpatients acquiring an infection while in the hospital. Therefore, our independent variable (feature) is the infection risk probability and our response variable is the length of stay.
Role: Statistician
Data: We fit a simple linear regression model using the SENIC hospital dataset, which contains 113 observations.
Models and statistical techniques: Used R as the main programming language to conduct the statistical analysis and R Markdown to knit and export the results into a PDF.
- Interpretation and Parameter Inference
- Scatterplot
- Fitting a Linear Regression model
- R-Squared
- T-Testing
- Point and Interval Estimation
- 95% Confidence Interval for mean length of stay
- 95% Prediction Interval for length of stay when infection risk is 5 percent
- Diagnostics & Checking Model Assumptions
- Testing Normality: Plotting to Check for Normality and Equal Variance Assumptions
- Boxplot & Histogram Interpretations
- Partial F-Test to assess reduced and full model significance
- Possible omitted predictors
Because our model violates the normality and homoskedasticity assumptions, remedial measures should be considered before a linear regression model can be treated as appropriate.
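The point and interval estimates listed above follow from the standard simple linear regression formulas, sketched below in Python for illustration (the project itself used R). The function names and data are hypothetical, and the t critical value is passed in by the caller because the Python standard library has no t-distribution quantile function; 3.182 is the two-sided 95% value for 3 degrees of freedom, matching the five made-up points here, not the 113-observation SENIC data.

```python
import math


def simple_ols(x, y):
    """Fit y = intercept + slope * x and return summary quantities."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - mean_y) ** 2 for b in y)
    mse = sse / (n - 2)  # residual variance estimate, n - 2 degrees of freedom
    return {
        "n": n, "mean_x": mean_x, "sxx": sxx, "mse": mse,
        "intercept": intercept, "slope": slope,
        "r_squared": 1 - sse / sst,
        "slope_t": slope / math.sqrt(mse / sxx),  # t statistic for H0: slope = 0
    }


def intervals_at(fit, x0, t_crit):
    """Confidence interval for the mean response and prediction interval at x0."""
    y_hat = fit["intercept"] + fit["slope"] * x0
    leverage = 1 / fit["n"] + (x0 - fit["mean_x"]) ** 2 / fit["sxx"]
    se_mean = math.sqrt(fit["mse"] * leverage)
    se_pred = math.sqrt(fit["mse"] * (1 + leverage))
    return {
        "fit": y_hat,
        "confidence": (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean),
        "prediction": (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred),
    }


# Made-up illustrative values, NOT the SENIC data.
infection_risk = [1.0, 2.0, 3.0, 4.0, 5.0]
length_of_stay = [2.1, 3.9, 6.2, 7.8, 10.1]

fit = simple_ols(infection_risk, length_of_stay)
result = intervals_at(fit, x0=5.0, t_crit=3.182)
print(fit["slope"], fit["r_squared"], fit["slope_t"])
print(result)
```

The prediction interval is always wider than the confidence interval at the same point, because it also accounts for the variability of an individual observation around the mean.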