forked from asadoughi/stat-learning
---
title: "Chapter 10: Exercise 6"
output: html_document
---
### a
Saying that the first principal component "explains 10% of the variation" means
that 90% of the variance in the gene data set is lost when the tissue sample
observations are projected onto the first principal component alone.
Equivalently, 90% of the variance in the data is carried by the remaining
principal components.
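As a rough illustration of how a "proportion of variance explained" (PVE)
figure is obtained, the sketch below computes it from the `sdev` values that
`prcomp` returns. The matrix `toy` and its dimensions are made up for this
example and are not the gene data from the exercise.

```{r}
set.seed(1)
# Hypothetical toy data: 100 observations, 10 features (not the gene data)
toy = matrix(rnorm(100 * 10), ncol = 10)
toy.pr = prcomp(toy, scale. = TRUE)
# PVE of each component: its variance divided by the total variance
pve = toy.pr$sdev^2 / sum(toy.pr$sdev^2)
pve[1]      # share of variance explained by the first component
1 - pve[1]  # share of variance left to the remaining components
```

The second value is the quantity the answer above describes: whatever the first
component does not explain.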
### b
The pre-analysis reveals a flaw: the tissue samples' first principal component
shows a linear trend over time. To account for it, I would advise the
researcher to include the machine used (A vs B) as a feature of the data set.
This should increase the PVE of the first principal component before the
two-sample t-test is applied.
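For reference, a two-sample t-test on a single gene might look like the sketch
below. The vectors `control.gene` and `treatment.gene`, and the mean shift of
0.5, are assumptions made purely for illustration; they are not taken from the
exercise.

```{r}
set.seed(1)
# Hypothetical expression values for one gene: 50 control, 50 treatment samples
control.gene = rnorm(50)
treatment.gene = rnorm(50, mean = 0.5)  # assumed shift, for illustration only
# Welch two-sample t-test (R's default, var.equal = FALSE)
tt = t.test(control.gene, treatment.gene)
tt$p.value
```

In the researcher's setting this test would be repeated for each gene.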
### c
```{r}
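set.seed(1)
# Simulate 1000 genes (rows) for 50 control and 50 treatment samples (columns)
Control = matrix(rnorm(50*1000), ncol=50)
Treatment = matrix(rnorm(50*1000), ncol=50)
X = cbind(Control, Treatment)
# Overwrite the first row with 100 evenly spaced values from -18 to 17.64
X[1,] = seq(-18, 18 - .36, .36) # linear trend in one dimension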
```
```{r}
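# PCA on the column-scaled data
pr.out = prcomp(scale(X))
# Standard deviation, proportion of variance and cumulative proportion for PC1
summary(pr.out)$importance[,1]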
```
The first principal component explains 9.911% of the variance.

Now add the machine (A vs B) as an extra row, using a 10 vs 0 encoding.
```{r}
X = rbind(X, c(rep(10, 50), rep(0, 50)))
pr.out = prcomp(scale(X))
summary(pr.out)$importance[,1]
```
The first principal component now explains 11.54% of the variance. That's an
improvement of 1.629 percentage points.

(*) I'm sure a better simulation could be derived by someone more versed in
PCA.