mv figure reproducing to issue szcf-weiya#245
szcf-weiya committed Oct 17, 2021
1 parent 82682ab commit 494de82
Showing 7 changed files with 76 additions and 38 deletions.
@@ -4,12 +4,9 @@
| ---- | ---------------------------------------- |
| Translator | szcf-weiya |
| Published | 2017-03-14 |
| Updated | {{ git_revision_date }} |
| Status | Done |

!!! note "Update notes"
    @2017-12-28 The full simulation of this section's example is documented in an R Notebook; see [**here**](../notes/HighDim/sim18_1/index.html).

In this chapter we discuss prediction problems in which the number of features $p$ is much larger than the number of observations $N$, often written $p \gg N$. Such problems are of growing importance, especially in genomics and other areas of computational biology. We will see that **high variance** and **overfitting** are the chief concerns in this setting; as a result, simple, highly regularized approaches often become the methods of choice. The first part of the chapter focuses on prediction problems in classification and regression, while the second part discusses the more basic questions of feature selection and feature assessment.

First, Figure 18.1 summarizes a small simulation that illustrates the "**less fitting is better**" principle when $p \gg N$. For each of $N=100$ samples, we generated $p$ standard Gaussian features $X$ with pairwise correlation $0.2$. The response $Y$ was generated according to the linear model
@@ -27,6 +24,12 @@

> Figure 18.1. Test-error results for the simulation experiment. Shown are boxplots of the relative test error over 100 simulations, for three different values of $p$, the number of features. The relative error is the test error divided by the Bayes error $\sigma^2$. From left to right, results are shown for ridge regression with three different values of the regularization parameter $\lambda$: 0.001, 100, 1000. The (average) effective degrees of freedom in each fit is indicated below each plot.
!!! info "weiya's note:"
    Figure 18.1 is largely reproduced; the p = 20 panel matches perfectly, while the others differ slightly (see [Issue 245](https://github.com/szcf-weiya/ESL-CN/issues/245) for details):

    - for p = 100, the mean squared error at $\lambda = 0.001$ blows up; raising $\lambda$ a little, say to 1, also reproduces the figure;
    - for p = 1000, the differences between the $\lambda$ values appear small; with a lower signal-to-noise ratio there is a decreasing trend, but the boxplots never separate as clearly as in the figure above.

We fit ridge regression to the data with three different values of the regularization parameter $\lambda$: 0.001, 100, 1000. When $\lambda=0.001$, the fit is nearly the same as least squares, with just enough regularization to keep the problem non-singular when $p > N$. Figure 18.1 shows boxplots of the relative test error achieved by the different estimators in each scenario. The corresponding average effective degrees of freedom (\eqref{3.50}) used in each ridge fit is also indicated.

!!! note "Recall: \eqref{3.50}"
    $$
    \begin{align}
    \mathrm{df}(\lambda) &= \mathrm{tr}[\mathbf{X}(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T]\notag\\
    &= \sum_{j=1}^p\frac{d_j^2}{d_j^2+\lambda}\tag{3.50}
    \end{align}
    $$
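The equivalence in \eqref{3.50} between the trace form and the singular-value form can be checked numerically. Below is a small illustration in Python/NumPy (the repository's own code is in R); the variable names and the seed are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, lam = 100, 20, 100.0

# correlated Gaussian features as in the simulation: pairwise correlation 0.2
Sigma = np.full((p, p), 0.2) + 0.8 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=N)

# effective df as the trace of the ridge "hat" matrix
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
df_trace = np.trace(H)

# the same quantity via the singular values of X: sum_j d_j^2 / (d_j^2 + lambda)
d = np.linalg.svd(X, compute_uv=False)
df_svd = (d**2 / (d**2 + lam)).sum()

print(df_trace, df_svd)  # the two forms agree
```

As $\lambda \to 0$ this approaches $\min(N, p)$, and as $\lambda \to \infty$ it shrinks to 0, which is why df is reported alongside $\lambda$ in the figure.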

The degrees of freedom is a more interpretable parameter than $\lambda$. From the figure we see that ridge regression with $\lambda=0.001$ (20 df) is best at $p=20$; $\lambda=100$ (35 df) is best at $p=100$; and $\lambda=1000$ (43 df) is best at $p=1000$.

These results can be explained as follows. When $p=20$, we can fit the model fully and identify as many of the significant coefficients as possible with low bias. When $p=100$, we can identify some of the nonzero coefficients by applying moderate shrinkage. Finally, when $p=1000$, even though there are many nonzero coefficients, we have no hope of finding them and need to shrink them all the way down. To quantify this, let $t_j=\hat\beta_j/\widehat{se}_j$, where $\hat\beta_j$ is the ridge estimate and $\widehat{se}_j$ is its estimated standard error. Using the best ridge parameter in each of the three cases, the average $\vert t_j\vert$ was 2.0, 0.6 and 0.2, and the average number of $\vert t_j\vert$ exceeding 2 was 9.8, 1.2 and 0.0, respectively.
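The $t_j$ statistics above can be sketched numerically. The Python/NumPy illustration below uses one common approximation for ridge standard errors, $\mathrm{Var}(\hat\beta)=\sigma^2 A\mathbf{X}^T\mathbf{X}A$ with $A=(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}$, and plugs in the true $\sigma^2$ for simplicity; this is our own construction, not necessarily the exact computation behind the numbers quoted in the text:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, lam = 100, 20, 0.001  # the p = 20 case, where lambda = 0.001 was best

Sigma = np.full((p, p), 0.2) + 0.8 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=N)
beta = rng.standard_normal(p)
f = X @ beta
sigma2 = f.var() / 2.0                       # noise level chosen for a signal-to-noise ratio of 2
y = f + rng.normal(0.0, np.sqrt(sigma2), N)

# ridge estimate
A = np.linalg.inv(X.T @ X + lam * np.eye(p))
beta_hat = A @ X.T @ y

# approximate standard errors from Var(beta_hat) = sigma^2 * A X'X A
se = np.sqrt(sigma2 * np.diag(A @ X.T @ X @ A))
t = beta_hat / se

print(np.mean(np.abs(t)), int((np.abs(t) > 2).sum()))
```

With heavier shrinkage (larger $\lambda$), the $\hat\beta_j$ are pulled toward zero faster than their standard errors shrink, so the $\vert t_j\vert$ fall, mirroring the 2.0 / 0.6 / 0.2 pattern described above.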

Ridge regression with $\lambda=0.001$ successfully exploits the correlation among the features when $p < N$, but cannot do so when $p \gg N$: in the latter case there is not enough information in the relatively small number of samples to estimate the high-dimensional covariance matrix efficiently. In that regime, heavier regularization gives better prediction performance.
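The point about the covariance matrix can be illustrated directly: with $N$ fixed, the sample covariance drifts away from the truth as $p$ grows. A Python/NumPy sketch, mirroring the correlation-0.2 design (the relative Frobenius error is our own choice of discrepancy measure):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100
rel_err = {}
for p in (20, 1000):
    Sigma = np.full((p, p), 0.2) + 0.8 * np.eye(p)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=N)
    S = X.T @ X / N                  # sample covariance (the mean is known to be zero)
    rel_err[p] = np.linalg.norm(S - Sigma) / np.linalg.norm(Sigma)

print(rel_err)  # the relative error is larger for p = 1000 than for p = 20
```

For $p = 1000$ the sample covariance has rank at most $N = 100$, so it cannot even be inverted, which is why near-least-squares ridge ($\lambda = 0.001$) fails there.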

65 changes: 34 additions & 31 deletions imgs/fig.18.1/ex18_1.R
@@ -6,6 +6,7 @@
## refer to https://esl.hohoweiya.xyz/notes/HighDim/sim18_1/index.html for more details
##
## ####################################################################################
library(MASS)

genX <- function(p, N = 100){
# mu
@@ -14,7 +15,6 @@ genX <- function(p, N = 100){
# covariance matrix equals to correlation matrix
Sigma = matrix(0.2, nrow=p, ncol=p)
diag(Sigma) = 1
X = mvrnorm(N, mu, Sigma)
if (is.null(dim(X)))
X = matrix(X, ncol = p)
@@ -31,9 +31,12 @@ genY <- function(X){
epsi = rnorm(N, 0, sqrt(epsi.var))
Y = fX + epsi
res = list(Y=Y, epsi.var=epsi.var)
  # if X is standardized, the SNR should equal ||beta||^2 / sigma2
# cat("||beta||^2 / sigma2 = ", sum(beta^2)/epsi.var, "\n")
return(res)
}

# simple linear regression
for (p in c(20, 100, 1000)) {
X = genX(p)
Y = genY(X)$Y
@@ -50,6 +53,7 @@ for (p in c(20, 100, 1000)) {
cat("p = ", p, " sig.num = ", sig.num, "\n")
}

# single variable linear regression
for (p in c(20, 100, 1000)){
num = 0
for (j in 1:100){
@@ -67,41 +71,40 @@ for (p in c(20, 100, 1000)){
cat("p = ", p, " num of significant term: ", num ,"\n")
}

calc_df <- function(X, lambda) {
return(sum(diag(X %*% solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X))))
}

test.err.full = matrix(0, nrow=0, ncol=3)
epsi.var.full = c()
dfs = matrix(0, nrow = 0, ncol = 3)
for (p in c(20, 100, 1000)){
  for (j in 1:100){
    # training set
    Xall = genX(p, N = 200)
    X = Xall[1:100, ]
    Y.res = genY(Xall)
    Y = Y.res$Y[1:100]
    epsi.var = Y.res$epsi.var
    epsi.var.full = c(epsi.var.full, epsi.var)
    model = lm.ridge(Y ~ 0 + X, lambda = c(0.001, 100, 1000))
    dfs = rbind(dfs, sapply(c(0.001, 100, 1000), function(lambda) calc_df(X, lambda)))
    # test set
    X.test = Xall[101:200, ]
    Y.test = Y.res$Y[101:200]
    # pred = as.matrix(cbind(1, X.test)) %*% t(coef(model))
    pred = X.test %*% t(coef(model))
    cat("p = ", p, ", ", epsi.var, " err = ", colMeans((pred - Y.test)^2), "\n")
    test.err = colMeans((pred - Y.test)^2) / epsi.var
    test.err.full = rbind(test.err.full, test.err)
  }
}

png("../ESL-CN/imgs/fig.18.1/p20.png")
boxplot(test.err.full[1:100, ], main = "20 features", col = "green")
dev.off()
png("../ESL-CN/imgs/fig.18.1/p100.png")
boxplot(test.err.full[101:200, ], main = "100 features", col = "green")
dev.off()
png("../ESL-CN/imgs/fig.18.1/p1000.png")
boxplot(test.err.full[201:300, ], main = "1000 features", col = "green")
dev.off()
32 changes: 32 additions & 0 deletions imgs/fig.18.1/ex18_1.jl
@@ -0,0 +1,32 @@
using Distributions
using LinearAlgebra
using StatsBase
using StatsPlots

lmridge(X, Y, λ) = inv(X' * X + λ * 1.0I) * X' * Y

function run(;p = 20, N = 100, nrep = 100, λs = [0.001, 100, 1000], snr = 2)
rel_errs = zeros(nrep, 3)
dfs = zeros(nrep, 3)
for i = 1:nrep
Σ = ones(p, p) * 0.2 + 0.8I
dist = MvNormal(Σ)
X = rand(dist, N)'
β = randn(p)

# σ2 = β' * Σ * β / snr
σ2 = var(X * β) / snr
# σ2 = p / 2
y = X * β + randn(N) * sqrt(σ2)

dfs[i,:] = [tr(X * inv(X' * X + λ * 1.0I) * X') for λ in λs]
βhat = [lmridge(X, y, λ) for λ in λs]

# calc test error
Xtest = rand(dist, N)'
ytest = Xtest * β + randn(N) * sqrt(σ2)
test_err = [norm(ytest - Xtest * βhat[i], 2)^2 / N for i = 1:3]
rel_errs[i, :] = test_err / σ2
end
    return boxplot(rel_errs, title = "p = $p", xticks = (1:3, string.(round.(vec(mean(dfs, dims=1)), digits = 1))))
end
Binary file modified imgs/fig.18.1/p100.png
Binary file modified imgs/fig.18.1/p1000.png
Binary file modified imgs/fig.18.1/p20.png
2 changes: 1 addition & 1 deletion mkdocs.yml
@@ -203,7 +203,7 @@ pages:
- 模拟 Fig. 7.9: 'notes/ModelSelection/sim7_9.md'
- 模拟 Fig. 13.5: 'notes/Prototype/sim13_5.md'
- 模拟 Fig. 14.42: 'notes/ICA/sim14_42.md'
# - 模拟 Fig. 18.1: 'notes/HighDim/sim18_1.md'
- 模拟 Eq. 10.2: 'notes/boosting/sim-eq-10-2.md'
- 模拟 Tab. 12.2: 'notes/SVM/skin-of-the-orange.md'
- 模拟 Fig. 9.7: 'notes/tree/sim-9-7.md'
