---
title: "STAT645 - Homework 6"
author: "Salih Kilicli"
date: "10/1/2019"
output:
  html_document:
    df_print: kable
    highlight: zenburn
  pdf_document:
    latex_engine: lualatex
  word_document: default
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
```
**Please obtain the heart disease data (from Course Content > Data). This database from the Cleveland Clinic (through Kaggle) contains 14 attributes. The "target" field refers to the presence of heart disease in the patient. It is an integer: 0 (absent) or 1 (present). A good description of the attributes can be found here:
https://lucdemortier.github.io/projects/3_mcnulty.**
**Problem 1: There are four categorical variables: $cp$, $restecg$, $slope$ and $thal$. Categorize $thal$ into two groups, 0 ($thal = 3$) and 1 ($thal$ other than 3).**
```{r}
setwd("/Users/younique/Desktop/STAT645/Data")
heart <- read.csv("heart.csv", header = TRUE)
attach(heart)
# Recode thal as stated: 0 if thal == 3, 1 otherwise.
thal1 <- ifelse(thal == 3, 0, 1)
head(thal1)
```
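As a quick sanity check (a minimal sketch; `heart` and `thal1` come from the chunk above), cross-tabulating the original and recoded variable confirms the grouping:
```{r}
# Rows: original thal codes; columns: recoded 0/1 groups.
table(thal = heart$thal, thal1 = thal1)
```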
**Problem 2: Scale all numeric variables. Do not scale the binary and categorical variables.**
```{r}
# Keep the binary/categorical variables as-is; scale() standardizes the numeric ones.
mydata <- data.frame(target = target, sex = sex, fbs = fbs, exang = exang, thal = thal1,
                     cp = as.factor(cp), slope = as.factor(slope), restecg = as.factor(restecg),
                     age = as.vector(scale(age)),
                     trestbps = as.vector(scale(trestbps)),
                     chol = as.vector(scale(chol)),
                     thalach = as.vector(scale(thalach)),
                     oldpeak = as.vector(scale(oldpeak)),
                     ca = as.vector(scale(ca)))
```
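To verify the transformation (a minimal check using the `mydata` frame just built), the scaled columns should have mean approximately $0$ and standard deviation $1$:
```{r}
num_vars <- c("age", "trestbps", "chol", "thalach", "oldpeak", "ca")
round(sapply(mydata[, num_vars], function(x) c(mean = mean(x), sd = sd(x))), 3)
```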
**Problem 3: Fit a logistic regression model of target on the $13$ explanatory variables.**
```{r}
logmodel <- glm(target ~ age + sex + cp + trestbps + chol + fbs + restecg +
                  thalach + exang + oldpeak + slope + ca + thal,
                data = mydata, family = "binomial")
summary(logmodel)
```
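Since the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret. A minimal sketch using the `logmodel` object above (`confint.default` gives Wald intervals):
```{r}
# Odds ratios with 95% Wald confidence intervals.
round(exp(cbind(OR = coef(logmodel), confint.default(logmodel))), 3)
```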
**Problem 4: Use this fitted model to estimate the probability of the disease (target = 1) for the following set of values of the explanatory variables. For these cases, also obtain the $95\%$ confidence interval for the chance of the disease. Note that before the prediction, don't forget to apply the same transformation to the explanatory variables as you did before fitting the logistic model in the previous question.**
age | sex | cp |trestbps| chol | fbs |restecg|thalach| exang |oldpeak| slope | ca | thal |
----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
68 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 3 |
75 | 0 | 3 | 145 | 150 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 |
78 | 1 | 0 | 144 | 193 | 1 | 1 | 90 | 0 | 3.4 | 1 | 2 | 3 |
```{r}
# Standardize the new observations with the full-data means/SDs, matching Problem 2.
new <- data.frame(
  age      = (c(68, 75, 78) - mean(heart$age)) / sd(heart$age),
  trestbps = (c(145, 145, 144) - mean(heart$trestbps)) / sd(heart$trestbps),
  chol     = (c(233, 150, 193) - mean(heart$chol)) / sd(heart$chol),
  thalach  = (c(150, 150, 90) - mean(heart$thalach)) / sd(heart$thalach),
  oldpeak  = (c(2.3, 2.3, 3.4) - mean(heart$oldpeak)) / sd(heart$oldpeak),
  ca       = (c(0, 0, 2) - mean(heart$ca)) / sd(heart$ca),
  sex      = c(1, 0, 1),
  fbs      = c(1, 1, 1),
  exang    = c(0, 0, 0),
  cp       = as.factor(c(3, 3, 0)),
  restecg  = as.factor(c(0, 0, 1)),
  slope    = as.factor(c(0, 0, 1)),
  thal     = ifelse(c(3, 1, 3) == 3, 0, 1)  # same recoding as Problem 1
)
myout <- predict.glm(logmodel, newdata = new, se.fit = TRUE)
P <- 1 / (1 + exp(-as.numeric(myout$fit)))  # inverse logit of the linear predictor
cat("Probability estimates for the given data set are:\n", P)
# 95% CI: invert the logit of fit +/- 1.96 * se, formed on the link scale.
lower <- 1 / (1 + exp(-(as.numeric(myout$fit) - 1.96 * as.numeric(myout$se.fit))))
upper <- 1 / (1 + exp(-(as.numeric(myout$fit) + 1.96 * as.numeric(myout$se.fit))))
CI <- cbind(lower, upper)
print(CI)
```
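The interval is formed on the link (log-odds) scale and then back-transformed, since the sampling distribution of $\hat{\eta}$ is closer to normal than that of $\hat{p}$:
$$
\hat{p} = \frac{1}{1 + e^{-\hat{\eta}}}, \qquad
\text{CI}_{95\%} = \left( \frac{1}{1 + e^{-\left(\hat{\eta} - 1.96\,\widehat{\mathrm{se}}(\hat{\eta})\right)}},\;
\frac{1}{1 + e^{-\left(\hat{\eta} + 1.96\,\widehat{\mathrm{se}}(\hat{\eta})\right)}} \right),
$$
where $\hat{\eta} = x^\top \hat{\beta}$ is the estimated linear predictor for a new observation $x$.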
**Problem 5: Check the adequacy of the model using the Hosmer-Lemeshow test. Clearly write out the hypotheses, test statistic and $p$-value, and conclusion.**
```{r}
library(generalhoslem)
logitgof(heart$target, fitted(logmodel))
```
For the logistic model, the null and alternative hypotheses are given below:

$H_0:$ The model fits the data well. $H_a:$ The model is not adequate for the data.

Since the $p$-value (about $0.69$) from the Hosmer-Lemeshow test is large, we fail to reject the null hypothesis; i.e., we do not have sufficient evidence to conclude that the model is inadequate for the data.
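For reference, with the observations partitioned into $G$ groups (deciles of fitted risk by default), the Hosmer-Lemeshow statistic is
$$
X^2_{HL} = \sum_{g=1}^{G} \frac{\left(O_g - n_g \bar{\pi}_g\right)^2}{n_g \bar{\pi}_g \left(1 - \bar{\pi}_g\right)} \;\overset{\text{approx.}}{\sim}\; \chi^2_{G-2},
$$
where $O_g$ is the observed number of events, $n_g$ the group size, and $\bar{\pi}_g$ the average fitted probability in group $g$.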
**Problem 6: Consider the first $100$ and the last $100$ subjects of the data and fit the logistic regression based on these data only. You don't need to re-scale the data again; just take the corresponding subset of the data you created previously.**
```{r}
# Training set: the first 100 and last 100 subjects (rows 101-203 are held out).
train <- mydata[-(101:203), ]
logmodeltrain <- glm(target ~ age + sex + cp + trestbps + chol + fbs + restecg +
                       thalach + exang + oldpeak + slope + ca + thal,
                     data = train, family = "binomial")
summary(logmodeltrain)
```
**Problem 7: Next, apply this fitted model to predict the target variable for the remaining set of observations (test data). Show the confusion matrix for the predictions when you use $0.5$, $0.6$ and $0.7$ as the cutoff value, declaring target equal to one if the estimated probability exceeds the cutoff. Comment on the results.**
```{r}
library(e1071)
library(caret)
# Test set: the held-out rows 101-203.
test <- mydata[101:203, ]
# Predict on the probability scale so the cutoffs apply directly.
P2 <- predict(logmodeltrain, newdata = test, type = "response")
cat("Probability estimates for the test data are:\n", P2)
confusionMatrix(data = as.factor(as.numeric(P2 > 0.5)), reference = as.factor(test$target))
confusionMatrix(data = as.factor(as.numeric(P2 > 0.6)), reference = as.factor(test$target))
confusionMatrix(data = as.factor(as.numeric(P2 > 0.7)), reference = as.factor(test$target))
```
One measure of goodness of prediction is a higher value of sensitivity + specificity. Comparing the different cutoff values, the cutoff of $0.6$ gives the highest sensitivity + specificity, whereas the cutoff of $0.5$ gives the smallest sum. However, all three cutoffs yield fairly close sensitivity and specificity values.
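To make that comparison explicit, a small sketch (reusing `P2` and `test` from the chunk above) computes sensitivity, specificity, and Youden's $J$ = sensitivity + specificity $- 1$ at each cutoff, treating target = 1 as the positive class:
```{r}
cutoffs <- c(0.5, 0.6, 0.7)
sapply(cutoffs, function(ct) {
  pred <- as.numeric(P2 > ct)
  sens <- mean(pred[test$target == 1] == 1)  # true positive rate for target = 1
  spec <- mean(pred[test$target == 0] == 0)  # true negative rate for target = 0
  c(cutoff = ct, sensitivity = sens, specificity = spec, youden = sens + spec - 1)
})
```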
**Problem 8: Draw an $ROC$ curve for the test data mentioned in the previous question and then comment on the discriminatory power of the model.**
```{r}
library(pROC)
# Score the test subset with the model fitted on the training subset (Problem 6).
out <- predict(logmodeltrain, newdata = test, type = "response")
ROC <- roc(test$target ~ out)
ROC
plot(ROC)
```
Looking at the ROC curve, the discriminatory power of the model looks good: the curve stays close to the top-left corner and well above the 45-degree line.
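The printed AUC summarizes this: it is the probability that a randomly chosen diseased subject receives a higher predicted risk than a randomly chosen disease-free subject,
$$
\mathrm{AUC} = P\left(\hat{p}_i > \hat{p}_j \mid \text{target}_i = 1,\ \text{target}_j = 0\right),
$$
so values near $1$ indicate strong discrimination, while $0.5$ corresponds to the 45-degree (no-information) line.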
**Problem 9: Re-do the analysis stated in questions $6$ and $8$ without $ca$, $cp$, and $thal$. Comment on the discriminatory power of this model.**
```{r}
logmodel2 <- glm(target ~ age + sex + trestbps + chol + fbs + restecg +
                   thalach + exang + oldpeak + slope,
                 data = train, family = "binomial")
summary(logmodel2)
# Again score the test subset with the train-fitted (reduced) model.
out2 <- predict(logmodel2, newdata = test, type = "response")
ROC2 <- roc(test$target ~ out2)
ROC2
plot(ROC2)
```
Looking at the new model's ROC curve, it does a poorer job than the previous model: the curve is closer to the 45-degree line and farther from the top-left corner. Additionally, the new model clearly has lower sensitivity at comparable specificity values, so dropping $ca$, $cp$, and $thal$ reduces the discriminatory power.
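As a follow-up sketch (assuming `ROC` and `ROC2` were both built on the same test responses, as above), pROC's `roc.test` can formally compare the two curves' AUCs using DeLong's test:
```{r}
# Paired DeLong test for the difference in AUC between the full and reduced models.
roc.test(ROC, ROC2)
```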