---
title: "STAT645 - Homework 6"
author: "Salih Kilicli"
date: "10/1/2019"
output:
  html_document:
    df_print: kable
    highlight: zenburn
  pdf_document:
    latex_engine: lualatex
  word_document: default
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
```
**Please obtain the heart disease data (from Course Content > Data). This database from the Cleveland Clinic (through Kaggle) contains 14 attributes. The "target" field refers to the presence of heart disease in the patient. It is an integer: 0 (absent) or 1 (present). A good description of the attributes can be found here:
https://lucdemortier.github.io/projects/3_mcnulty.**
**Problem 1: There are four categorical variables: $cp$, $restecg$, $slope$ and $thal$. Categorize $thal$ into two groups, 0 ($thal = 3$) and 1 ($thal$ other than 3).**
```{r}
setwd("/Users/younique/Desktop/STAT645/Data")
heart <- read.csv("heart.csv", header = TRUE)
attach(heart)
# Recode thal as stated: 0 if thal == 3, 1 otherwise.
thal1 <- ifelse(thal == 3, 0, 1)
head(thal1)
```
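As a quick sanity check (a minimal sketch; `heart` and `thal1` come from the chunk above), cross-tabulating the original and recoded variable confirms the grouping:
```{r}
# Rows: original thal codes; columns: recoded 0/1 groups.
table(thal = heart$thal, thal1 = thal1)
```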
**Problem 2: Scale all numeric variables. Do not scale the binary and categorical variables.**
```{r}
# Keep the binary/categorical variables as-is; scale() standardizes the numeric ones.
mydata <- data.frame(target = target, sex = sex, fbs = fbs, exang = exang, thal = thal1,
                     cp = as.factor(cp), slope = as.factor(slope), restecg = as.factor(restecg),
                     age = as.vector(scale(age)),
                     trestbps = as.vector(scale(trestbps)),
                     chol = as.vector(scale(chol)),
                     thalach = as.vector(scale(thalach)),
                     oldpeak = as.vector(scale(oldpeak)),
                     ca = as.vector(scale(ca)))
```
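To verify the transformation (a minimal check using the `mydata` frame just built), the scaled columns should have mean approximately $0$ and standard deviation $1$:
```{r}
num_vars <- c("age", "trestbps", "chol", "thalach", "oldpeak", "ca")
round(sapply(mydata[, num_vars], function(x) c(mean = mean(x), sd = sd(x))), 3)
```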
**Problem 3: Fit a logistic regression model of target on the $13$ explanatory variables.**
```{r}
logmodel <- glm(target ~ age + sex + cp + trestbps + chol + fbs + restecg +
                  thalach + exang + oldpeak + slope + ca + thal,
                data = mydata, family = "binomial")
summary(logmodel)
```
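Since the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret. A minimal sketch using the `logmodel` object above (`confint.default` gives Wald intervals):
```{r}
# Odds ratios with 95% Wald confidence intervals.
round(exp(cbind(OR = coef(logmodel), confint.default(logmodel))), 3)
```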
**Problem 4: Use this fitted model to estimate the probability of the disease (target = 1) for the following set of values of the explanatory variables. For these cases, also obtain the $95\%$ confidence interval for the chance of the disease. Note that before the prediction, don't forget to apply the same transformation to the explanatory variables as you did before fitting the logistic model in the previous question.**
age | sex | cp |trestbps| chol | fbs |restecg|thalach| exang |oldpeak| slope | ca | thal |
----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
68 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 3 |
75 | 0 | 3 | 145 | 150 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 |
78 | 1 | 0 | 144 | 193 | 1 | 1 | 90 | 0 | 3.4 | 1 | 2 | 3 |
```{r}
# Standardize the new observations with the full-data means/SDs, matching Problem 2.
new <- data.frame(
  age      = (c(68, 75, 78) - mean(heart$age)) / sd(heart$age),
  trestbps = (c(145, 145, 144) - mean(heart$trestbps)) / sd(heart$trestbps),
  chol     = (c(233, 150, 193) - mean(heart$chol)) / sd(heart$chol),
  thalach  = (c(150, 150, 90) - mean(heart$thalach)) / sd(heart$thalach),
  oldpeak  = (c(2.3, 2.3, 3.4) - mean(heart$oldpeak)) / sd(heart$oldpeak),
  ca       = (c(0, 0, 2) - mean(heart$ca)) / sd(heart$ca),
  sex      = c(1, 0, 1),
  fbs      = c(1, 1, 1),
  exang    = c(0, 0, 0),
  cp       = as.factor(c(3, 3, 0)),
  restecg  = as.factor(c(0, 0, 1)),
  slope    = as.factor(c(0, 0, 1)),
  thal     = ifelse(c(3, 1, 3) == 3, 0, 1)  # same recoding as Problem 1
)
myout <- predict.glm(logmodel, newdata = new, se.fit = TRUE)
P <- 1 / (1 + exp(-as.numeric(myout$fit)))  # inverse logit of the linear predictor
cat("Probability estimates for the given data set are:\n", P)
# 95% CI: invert the logit of fit +/- 1.96 * se, formed on the link scale.
lower <- 1 / (1 + exp(-(as.numeric(myout$fit) - 1.96 * as.numeric(myout$se.fit))))
upper <- 1 / (1 + exp(-(as.numeric(myout$fit) + 1.96 * as.numeric(myout$se.fit))))
CI <- cbind(lower, upper)
print(CI)
```
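The interval is formed on the link (log-odds) scale and then back-transformed, since the sampling distribution of $\hat{\eta}$ is closer to normal than that of $\hat{p}$:
$$
\hat{p} = \frac{1}{1 + e^{-\hat{\eta}}}, \qquad
\text{CI}_{95\%} = \left( \frac{1}{1 + e^{-\left(\hat{\eta} - 1.96\,\widehat{\mathrm{se}}(\hat{\eta})\right)}},\;
\frac{1}{1 + e^{-\left(\hat{\eta} + 1.96\,\widehat{\mathrm{se}}(\hat{\eta})\right)}} \right),
$$
where $\hat{\eta} = x^\top \hat{\beta}$ is the estimated linear predictor for a new observation $x$.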
**Problem 5: Check the adequacy of the model using the Hosmer-Lemeshow test. Clearly write out the hypotheses, test statistic and $p$-value, and conclusion.**
```{r}
library(generalhoslem)
logitgof(heart$target, fitted(logmodel))
```
For the logistic model, the null and alternative hypotheses are given below:

$H_0:$ The model fits the data well. $H_a:$ The model is not adequate for the data.

Since the $p$-value (about $0.69$) from the Hosmer-Lemeshow test is large, we fail to reject the null hypothesis; i.e., we do not have sufficient evidence to conclude that the model is inadequate for the data.
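For reference, with the observations partitioned into $G$ groups (deciles of fitted risk by default), the Hosmer-Lemeshow statistic is
$$
X^2_{HL} = \sum_{g=1}^{G} \frac{\left(O_g - n_g \bar{\pi}_g\right)^2}{n_g \bar{\pi}_g \left(1 - \bar{\pi}_g\right)} \;\overset{\text{approx.}}{\sim}\; \chi^2_{G-2},
$$
where $O_g$ is the observed number of events, $n_g$ the group size, and $\bar{\pi}_g$ the average fitted probability in group $g$.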
**Problem 6: Consider the first $100$ and the last $100$ subjects of the data and fit the logistic regression based on these data only. You don't need to re-scale the data again; just take the corresponding subset of the data you created previously.**
```{r}
# Training set: the first 100 and last 100 subjects (rows 101-203 are held out).
train <- mydata[-(101:203), ]
logmodeltrain <- glm(target ~ age + sex + cp + trestbps + chol + fbs + restecg +
                       thalach + exang + oldpeak + slope + ca + thal,
                     data = train, family = "binomial")
summary(logmodeltrain)
```
**Problem 7: Next, apply this fitted model to predict the target variable for the remaining set of observations (test data). Show the confusion matrix for the predictions when you use $0.5$, $0.6$ and $0.7$ as the cutoff value, declaring target equal to one if the estimated probability exceeds the cutoff. Comment on the results.**
```{r}
library(e1071)
library(caret)
# Test set: the held-out rows 101-203.
test <- mydata[101:203, ]
# Predict on the probability scale so the cutoffs apply directly.
P2 <- predict(logmodeltrain, newdata = test, type = "response")
cat("Probability estimates for the test data are:\n", P2)
confusionMatrix(data = as.factor(as.numeric(P2 > 0.5)), reference = as.factor(test$target))
confusionMatrix(data = as.factor(as.numeric(P2 > 0.6)), reference = as.factor(test$target))
confusionMatrix(data = as.factor(as.numeric(P2 > 0.7)), reference = as.factor(test$target))
```
One measure of goodness of prediction is a higher value of sensitivity + specificity. Comparing the different cutoff values, the cutoff of $0.6$ gives the highest sensitivity + specificity, whereas the cutoff of $0.5$ gives the smallest sum. However, all three cutoffs yield fairly close sensitivity and specificity values.
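To make that comparison explicit, a small sketch (reusing `P2` and `test` from the chunk above) computes sensitivity, specificity, and Youden's $J$ = sensitivity + specificity $- 1$ at each cutoff, treating target = 1 as the positive class:
```{r}
cutoffs <- c(0.5, 0.6, 0.7)
sapply(cutoffs, function(ct) {
  pred <- as.numeric(P2 > ct)
  sens <- mean(pred[test$target == 1] == 1)  # true positive rate for target = 1
  spec <- mean(pred[test$target == 0] == 0)  # true negative rate for target = 0
  c(cutoff = ct, sensitivity = sens, specificity = spec, youden = sens + spec - 1)
})
```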
**Problem 8: Draw an $ROC$ curve for the test data mentioned in the previous question and then comment on the discriminatory power of the model.**
```{r}
library(pROC)
# Score the test subset with the model fitted on the training subset (Problem 6).
out <- predict(logmodeltrain, newdata = test, type = "response")
ROC <- roc(test$target ~ out)
ROC
plot(ROC)
```
Looking at the ROC curve, the discriminatory power of the model looks good: the curve stays close to the top-left corner and well above the 45-degree line.
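The printed AUC summarizes this: it is the probability that a randomly chosen diseased subject receives a higher predicted risk than a randomly chosen disease-free subject,
$$
\mathrm{AUC} = P\left(\hat{p}_i > \hat{p}_j \mid \text{target}_i = 1,\ \text{target}_j = 0\right),
$$
so values near $1$ indicate strong discrimination, while $0.5$ corresponds to the 45-degree (no-information) line.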
**Problem 9: Re-do the analysis stated in questions $6$ and $8$ without $ca$, $cp$, and $thal$. Comment on the discriminatory power of this model.**
```{r}
logmodel2 <- glm(target ~ age + sex + trestbps + chol + fbs + restecg +
                   thalach + exang + oldpeak + slope,
                 data = train, family = "binomial")
summary(logmodel2)
# Again score the test subset with the train-fitted (reduced) model.
out2 <- predict(logmodel2, newdata = test, type = "response")
ROC2 <- roc(test$target ~ out2)
ROC2
plot(ROC2)
```
Looking at the new model's ROC curve, it does a poorer job than the previous model: the curve is closer to the 45-degree line and farther from the top-left corner. Additionally, the new model clearly has lower sensitivity at comparable specificity values, so dropping $ca$, $cp$, and $thal$ reduces the discriminatory power.
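As a follow-up sketch (assuming `ROC` and `ROC2` were both built on the same test responses, as above), pROC's `roc.test` can formally compare the two curves' AUCs using DeLong's test:
```{r}
# Paired DeLong test for the difference in AUC between the full and reduced models.
roc.test(ROC, ROC2)
```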