forked from topepo/caret
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathSA.Rhtml
511 lines (426 loc) · 22.3 KB
/
SA.Rhtml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
<!--begin.rcode results='hide', echo=FALSE, message=FALSE
library(caret)
data(BloodBrain)
library(parallel)
library(doMC)
registerDoMC(cores=detectCores()-1)
library(randomForest)
theme1 <- caretTheme()
theme1$superpose.symbol$col = c(rgb(1, 0, 0, .4), rgb(0, 0, 1, .4),
rgb(0.3984375, 0.7578125,0.6445312, .6))
theme1$superpose.symbol$pch = c(15, 16, 17)
theme1$superpose.cex = .8
theme1$superpose.line$col = c(rgb(1, 0, 0, .9), rgb(0, 0, 1, .9), rgb(0.3984375, 0.7578125,0.6445312, .6))
theme1$superpose.line$lwd <- 2
theme1$superpose.line$lty = 1:3
theme1$plot.symbol$col = c(rgb(.2, .2, .2, .4))
theme1$plot.symbol$pch = 16
theme1$plot.cex = .8
theme1$plot.line$col = c(rgb(1, 0, 0, .7))
theme1$plot.line$lwd <- 2
theme1$plot.line$lty = 1
hook_inline = knit_hooks$get('inline')
knit_hooks$set(inline = function(x) {
if (is.character(x)) highr::hi_html(x) else hook_inline(x)
})
opts_chunk$set(comment=NA, digits = 3)
session <- paste(format(Sys.time(), "%a %b %d %Y"),
"using caret version",
packageDescription("caret")$Version,
"and",
R.Version()$version.string)
end.rcode-->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!--
Design by Free CSS Templates
http://www.freecsstemplates.org
Released for free under a Creative Commons Attribution 2.5 License
Name : Emerald
Description: A two-column, fixed-width design with dark color scheme.
Version : 1.0
Released : 20120902
-->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="keywords" content="" />
<meta name="description" content="" />
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title>Feature Selection using Simulated Annealing</title>
<link href='http://fonts.googleapis.com/css?family=Abel' rel='stylesheet' type='text/css'>
<link href="style.css" rel="stylesheet" type="text/css" media="screen" />
</head>
<body>
<div id="wrapper">
<div id="header-wrapper" class="container">
<div id="header" class="container">
<div id="logo">
<h1><a href="#">Feature Selection using Simulated Annealing</a></h1>
</div>
<!--
<div id="menu">
<ul>
<li class="current_page_item"><a href="#">Homepage</a></li>
<li><a href="#">Blog</a></li>
<li><a href="#">Photos</a></li>
<li><a href="#">About</a></li>
<li><a href="#">Contact</a></li>
</ul>
</div>
-->
</div>
<div><img src="images/img03.png" width="1000" height="40" alt="" /></div>
</div>
<!-- end #header -->
<div id="page">
<div id="content">
<h1>Contents</h1>
<ul>
<li><a href="#sa">Simulated Annealing</a></li>
<li><a href="#performance">Internal and External Performance Estimates</a></li>
<li><a href="#syntax">Basic Syntax</a></li>
<li><a href="#example">Example</a></li>
<li><a href="#custom">Customizing the Search</a></li>
</ul>
<div id="sa"></div>
<h1>Simulated Annealing</h1>
<p>
Simulated annealing (SA) is a global search method that makes small random changes (i.e. perturbations) to an initial candidate solution. If the performance value for the perturbed value is better than the previous solution, the new solution is accepted. If not, an acceptance probability is determined based on the difference between the two performance values and the current iteration of the search. From this, a sub-optimal solution can be accepted on the off-change that it may eventually produce a better solution in subsequent iterations. See <a href="http://scholar.google.com/scholar?hl=en&q=%22Optimization+by+simulated+annealing">Kirkpatrick (1984)</a> or <a href="http://scholar.google.com/scholar?q=%22Simulated+annealing+algorithms%3A+An+overview">Rutenbar (1989)</a> for better descriptions.
</p>
<p>
In the context of feature selection, a solution is a binary vector that describes the current subset. The subset is perturbed by randomly changing a small number of members in the subset.
</p>
<div id="performance"></div>
<h1>Internal and External Performance Estimates</h1>
<p>
Much of the discussion on this subject in the <a href="ga.html#performance">genetic algorithm page</a> is relevant here, although SA search is less aggressive than GA search. In any case, the implementation here conducts the SA search inside the resampling loops and uses an external performance estimate to choose how many iterations of the search are appropriate.
</p>
<div id="syntax"></div>
<h1>Basic Syntax</h1>
<p>
The syntax of this function is very similar to the previous information for genetic algorithm searches.
</p>
<p>
The most basic usage of the function is:
</p>
<!--begin.rcode sa_syntax_basic,eval=FALSE
obj <- safs(x = predictors,
y = outcome,
iters = 100)
end.rcode-->
<p>
where
</p>
<ul>
<li> <span class="mx arg">x</span>:
a data frame or matrix of predictor values
</li>
<li> <span class="mx arg">y</span>:
a factor or numeric vector of outcomes
</li>
<li> <span class="mx arg">iters</span>:
the number of iterations for the SA
</li>
</ul>
<p>
This isn't very specific. All of the action is in the control function. That can be used to specify the model to be fit, how predictions are made and summarized as well as the genetic operations.
</p>
<p>
Suppose that we want to fit a linear regression model. To do this, we can use <span class="mx funCall">train</span> as an interface and pass arguments to that function through <span class="mx funCall">safs</span>:
</p>
<!--begin.rcode sa_syntax_basic_lm,eval=FALSE
ctrl <- safsControl(functions = caretSA)
obj <- safs(x = predictors,
y = outcome,
iters = 100,
safsControl = ctrl,
## Now pass options to `train`
method = "lm")
end.rcode-->
<p>
Other options, such as <span class="mx arg">preProcess</span>, can be passed in as well.
</p>
<p>
Some important options to <span class="mx funCall">safsControl</span> are:
</p>
<ul>
<li> <span class="mx arg">method</span>, <span class="mx arg">number</span>, <span class="mx arg">repeats</span>, <span class="mx arg">index</span>, <span class="mx arg">indexOut</span>, etc:
options similar to those for <a href="http://topepo.github.io/caret/training.html#control"><span class="mx funCall">train</span></a> top control resampling.
</li>
<li> <span class="mx arg">metric</span>:
this is similar to <a href="http://topepo.github.io/caret/training.html#control"><span class="mx funCall">train</span></a>'s option but, in this case, the value should be a named vector with values for the internal and external metrics. If none are specified, the first value returned by the summary functions (see details below) are used and a warning is issued. A similar two-element vector for the option <span class="mx arg">maximize</span> is also required. See the <a href="#example2">last example here</a> for an illustration.
</li>
<li> <span class="mx arg">holdout</span>:
this is a number between [0, 1) that can be used to hold out samples for computing the internal fitness value. Note that this is independent of the external resampling step. Suppose 10-fold CV is being used. Within a resampling iteration, <span class="mx arg">holdout</span> can be used to sample an additional proportion of the 90% resampled data to use for estimating fitness. This may not be a good idea unless you have a very large training set and want to avoid an internal resampling procedure to estimate fitness.
</li>
<li> <span class="mx arg">improve</span>:
an integer (or infinity) defining how many iterations should pass without an improvement in fitness before the current subset is reset to the last known improvement.
</li>
<li> <span class="mx arg">allowParallel</span>: should the external resampling loop be run in parallel?.
</li>
</ul>
<p>
There are a few built-in sets of functions to use with <span class="mx funCall">safs</span>: <code>caretSA</code>, <code>rfSA</code>, and <code>treebagSA</code>. The first is a simple interface to
<span class="mx funCall">train</span>. When using this, as shown above, arguments can be passed to <span class="mx funCall">train</span> using the <span class="mx arg">...</span> structure and the resampling estimates of performance can be used as the internal fitness value. The functions provided by <code>rfSA</code> and <code>treebagSA</code> avoid using <span class="mx funCall">train</span> and their internal estimates of fitness come from using the out-of-bag estimates generated from the model.
</p>
<div id="example"></div>
<h1>Example</h1>
<p>
Using the example from the <a href="rfe.html#example">previous page</a> where there are five real predictors and 40 noise predictors.
</pa>
<!--begin.rcode sa_load_sim,echo=FALSE,message=FALSE
library(mlbench)
n <- 100
p <- 40
sigma <- 1
set.seed(1)
sim <- mlbench.friedman1(n, sd = sigma)
colnames(sim$x) <- c(paste("real", 1:5, sep = ""),
paste("bogus", 1:5, sep = ""))
bogus <- matrix(rnorm(n * p), nrow = n)
colnames(bogus) <- paste("bogus", 5+(1:ncol(bogus)), sep = "")
x <- cbind(sim$x, bogus)
y <- sim$y
normalization <- preProcess(x)
x <- predict(normalization, x)
x <- as.data.frame(x)
end.rcode-->
<p>
We'll fit a random forest model and use the out-of-bag RMSE estimate as the internal performance metric and use the same repeated 10-fold cross-validation process used with the search. To do this, we'll use the built-in <code>rfSA</code> object for this purpose. The default SA operators will be used with 1000 iterations of the algorithm.
</p>
<!--begin.rcode sa_rf,cache=TRUE
sa_ctrl <- safsControl(functions = rfSA,
method = "repeatedcv",
repeats = 5,
improve = 50)
set.seed(10)
rf_sa <- safs(x = x, y = y,
iters = 500,
safsControl = sa_ctrl)
rf_sa
end.rcode-->
<p>
As with the GA, we can plot the internal and external performance over iterations.
</p>
<!--begin.rcode sa_rf_profile,fig.width=8,fig.height=4
plot(rf_sa) + theme_bw()
end.rcode-->
<p>
The performance here isn't as good as the previous GA or RFE solutions. Based on these results, the iteration associated with the best external RMSE estimate was <!--rinline I(rf_sa$optIter) --> with a corresponding RMSE estimate of <!--rinline I(round(mean(subset(rf_sa$external, Iter == rf_sa$optIter)$RMSE), 2)) -->.
</p>
<p>
Using the entire training set, the final SA is conducted and, at iteration <!--rinline I(rf_sa$optIter) -->, there were <!--rinline I(length(rf_sa$optVariables)) --> selected: <!--rinline I(paste(rf_sa$optVariables, sep = "", collapse = ", ")) -->. The random forest model with these predictors is created using the entire training set is trained and this is the model that is used when <span class="mx funCall">predict.safs</span> is executed.
</p>
<div id="custom"></div>
<h1>Customizing the Search</h1>
<!-- ---------------------------------------------------------------------------------------- -->
<b><p><i>The <span class="mx funCall">fit</span> Function</p></i></b>
<p>
This function builds the model based on a proposed current subset. The arguments for the function must be:
</p>
<ul>
<li> <span class="mx arg">x</span>: the current training set of predictor data with the appropriate subset of variables</li>
<li> <span class="mx arg">y</span>: the current outcome data (either a numeric or factor vector)</li>
<li> <span class="mx arg">lev</span>: a character vector with the class levels (or <code>NULL</code> for regression problems)</li>
<li> <span class="mx arg">last</span>: a logical that is <span class="mx arg">TRUE</span> when the final SA search is conducted on the entire data set </li>
<li> <span class="mx arg">...</span>: optional arguments to pass to the fit function in the call to <span class="mx funCall">safs</span></li>
</ul>
<p>
The function should return a model object that can be used to generate predictions. For random forest, the fit function is simple:
</p>
<!--begin.rcode ga_fit
rfSA$fit
end.rcode-->
<!-- ---------------------------------------------------------------------------------------- -->
<b><p><i>The <span class="mx funCall">pred</span> Function</p></i></b>
<p>
This function returns a vector of predictions (numeric or factors) from the current model . The input arguments must be
</p>
<ul>
<li> <span class="mx arg">object</span>: the model generated by the <span class="mx funCall">fit</span> function</li>
<li> <span class="mx arg">x</span>: the current set of predictor set for the held-back samples</li>
</ul>
<p>
For random forests, the function is a simple wrapper for the predict function:
</p>
<!--begin.rcode ga_pred
rfSA$pred
end.rcode-->
<p>
For classification, it is probably a good idea to ensure that the resulting factor variables of predictions has the same levels as the input data.
</p>
<!-- ---------------------------------------------------------------------------------------- -->
<b><p><i>The <span class="mx funCall">fitness_intern</span> Function</p></i></b>
<p>
The <span class="mx funCall">fitness_intern</span> function takes the fitted model and computes one or more performance metrics. The inputs to this function are:
</p>
<ul>
<li> <span class="mx arg">object</span>: the model generated by the <span class="mx funCall">fit</span> function</li>
<li> <span class="mx arg">x</span>: the current set of predictor set. If the option <span class="mx funCall">safsControl</span><code>$</code><span class="mx arg">holdout</span> is zero, these values will be from the current resample (i.e. the same data used to fit the model). Otherwise, the predictor values are from the hold-out set created by <span class="mx funCall">safsControl</span><code>$</code><span class="mx arg">holdout</span>.
</li>
<li> <span class="mx arg">y</span>: outcome values. See the note for the <span class="mx arg">x</span> argument to understand which data are presented to the function. </li>
<li> <span class="mx arg">maximize</span>: a logical from <span class="mx funCall">safsControl</span> that indicates whether the metric should be maximized or minimized</li>
<li> <span class="mx arg">p</span>: the total number of possible predictors</li>
</ul>
<p>
The output should be a <b>named</b> numeric vector of performance values.
</p>
<p>
In many cases, some resampled measure of performance is used. In the example above using random forest, the OOB error was used. In other cases, the resampled performance from <span class="mx funCall">train</span> can be used and, if <span class="mx funCall">safsControl</span><code>$</code><span class="mx arg">holdout</span> is not zero, a static hold-out set can be used. This depends on the data and problem at hand. If left
</p>
<p>
The example function for random forest is:
</p>
<!--begin.rcode ga_int_func
rfSA$fitness_intern
end.rcode-->
<!-- ---------------------------------------------------------------------------------------- -->
<b><p><i>The <span class="mx funCall">fitness_extern</span> Function</p></i></b>
<p>
The <span class="mx funCall">fitness_extern</span> function takes the observed and predicted values form the external resampling process and computes one or more performance metrics. The input arguments are:
</p>
<ul>
<li> <span class="mx arg">data</span>: a data frame or predictions generated by the <span class="mx funCall">fit</span> function. For regression, the predicted values in a column called <code>pred</code>. For classification, <code>pred</code> is a factor vector. Class probabilities are usually attached as columns whose names are the class levels (see the random forest example for the <span class="mx funCall">fit</span> function above)
</li>
<li> <span class="mx arg">lev</span>: a character vector with the class levels (or <code>NULL</code> for regression problems)</li>
</ul>
<p>
The output should be a <b>named</b> numeric vector of performance values.
</p>
<p>
The example function for random forest is:
</p>
<!--begin.rcode ga_ext_func
rfSA$fitness_extern
end.rcode-->
<p>
Two functions in <a href="http://cran.r-project.org/web/packages/caret/index.html"><strong>caret</strong></a> that can be used as the summary
function are <span class="mx funCall">defaultSummary</span> and <span class="mx funCall">twoClassSummary</span> (for
classification problems with two classes).
</p>
<!-- ---------------------------------------------------------------------------------------- -->
<b><p><i>The <span class="mx funCall">initial</span> Function</p></i></b>
<p>
This function creates an initial subset. Inputs are:
</p>
<ul>
<li> <span class="mx arg">vars</span>: the number of possible predictors</li>
<li> <span class="mx arg">prob</span>: the probability that a feature is in the subset</li>
<li> <span class="mx arg">...</span>: not currently used</li>
</ul>
<p>
The output should be a vector of integers indicating which predictors are in the initial subset.
</p>
<p>
Alternatively, instead of a function, a vector of integers can be used in this slot.
</p>.
<!-- ---------------------------------------------------------------------------------------- -->
<b><p><i>The <span class="mx funCall">perturb</span> Function</p></i></b>
<p>
This function perturbs the subset. Inputs are:
</p>
<ul>
<li> <span class="mx arg">x</span>: the integers defining the current subset</li>
<li> <span class="mx arg">vars</span>: the number of possible predictors</li>
<li> <span class="mx arg">number</span>: the number of predictors to randomly change</li>
<li> <span class="mx arg">...</span>: not currently used</li>
</ul>
<p>
The output should be a vector of integers indicating which predictors are in the new subset.
</p>
<!-- ---------------------------------------------------------------------------------------- -->
<b><p><i>The <span class="mx funCall">prob</span> Function</p></i></b>
<p>
This function computes the acceptance probability. Inputs are:
</p>
<ul>
<li> <span class="mx arg">old</span>: the fitness value for the current subset</li>
<li> <span class="mx arg">new</span>: the fitness value for the new subset</li>
<li> <span class="mx arg">iteration</span>: the current iteration number or, if the <span class="mx arg">improve</span>argument of <span class="mx funCall">safsControl</span> is used, the number of iterations since the last restart</li>
<li> <span class="mx arg">...</span>: not currently used</li>
</ul>
<p>
The output should be a numeric value between zero and one.
</p>
<p>
One of the biggest difficulties in using simulated annealing is the specification of the acceptance probability calculation. There are many references on different methods for doing this but the general consensus is that 1) the probability should decrease as the difference between the current and new solution increases and 2) the probability should decrease over iterations. One issue is that the difference in fitness values can be scale-dependent. In this package, the default probability calculations uses the percent difference, i.e. <tt>(current - new)/current</tt> to normalize the difference. The basic form of the probability simply takes the difference, multiplies by the iteration number and exponentiates this product:
</p>
<pre>
prob = exp[(current - new)/current*iteration]
</pre>
<p>
To demonstrate this, the plot below shows the probability profile for different fitness values of the current subset and different (absolute) differences. For the example data that were simulated, the RMSE values ranged between values greater than 4 to just under 3. In the plot below, the red curve in the right-hand panel shows how the probability changes over time when comparing a current value of 4 with a new values of 4.5 (smaller values being better). While this difference would likely be accepted in the first few iterations, it is unlikely to be accepted after 30 or 40. Also, larger differences are uniformly disfavored relative to smaller differences.
</p>
<!--begin.rcode sa_prob_plot,fig.width=8,fig.height=4
grid <- expand.grid(old = c(4, 3.5),
new = c(4.5, 4, 3.5) + 1,
iter = 1:40)
grid <- subset(grid, old < new)
grid$prob <- apply(grid, 1,
function(x)
safs_prob(new = x["new"],
old= x["old"],
iteration = x["iter"]))
grid$Difference <- factor(grid$new - grid$old)
grid$Group <- factor(paste("Current Value", grid$old))
ggplot(grid, aes(x = iter, y = prob, color = Difference)) +
geom_line() + facet_wrap(~Group) + theme_bw() +
ylab("Probability") + xlab("Iteration")
end.rcode-->
<p>
While this is the default, any user-written function can be used to assign probabilities.
</p>
<!-- ------------------------ end #content------------------------ -->
<div style="clear: both;"> </div>
</div>
<!-- end #content -->
<div id="sidebar">
<ul>
<li>
<h2>General Topics</h2>
<ul>
<li><a href="index.html">Front Page</a></li>
<li><a href="visualizations.html">Visualizations</a></li>
<li><a href="preprocess.html">Pre-Processing</a><li>
<li><a href="splitting.html">Data Splitting</a></li>
<li><a href="varimp.html">Variable Importance</a></li>
<li><a href="other.html">Model Performance</a></li>
<li><a href="parallel.html">Parallel Processing</a></li>
</ul>
<h2>Model Training and Tuning</h2>
<ul>
<li><a href="training.html">Basic Syntax</a></li>
<li><a href="modelList.html">Sortable Model List</a></li>
<li><a href="bytag.html">Models By Tag</a></li>
<li><a href="similarity.html">Models By Similarity</a></li>
<li><a href="custom_models.html">Using Custom Models</a></li>
<li><a href="sampling.html">Sampling for Class Imbalances</a></li>
<li><a href="random.html">Random Search</a></li>
<li><a href="adaptive.html">Adaptive Resampling</a></li>
</ul>
<h2>Feature Selection</h2>
<ul>
<li><a href="featureselection.html">Overview</a>
<li><a href="rfe.html">RFE</a></li>
<li><a href="filters.html">Filters</a></li>
<li><a href="GA.html">GA</a></li>
<li><a href="SA.html">SA</a></li>
</ul>
</li>
</ul>
</div>
<!-- end #sidebar -->
<div style="clear: both;"> </div>
</div>
<div class="container"><img src="images/img03.png" width="1000" height="40" alt="" /></div>
<!-- end #page -->
</div>
<div id="footer-content"></div>
<div id="footer">
<!--begin.rcode echo = FALSE
knit_hooks$set(inline = hook_inline)
end.rcode-->
<p>Created on <!--rinline I(session) -->.</p>
</div>
<!-- end #footer -->
</body>
</html>