<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="shortcut icon" href="img/favicon.ico">
<title>Pruning - Neural Network Distiller</title>
<link href='https://fonts.googleapis.com/css?family=Lato:400,700|Roboto+Slab:400,700|Inconsolata:400,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="css/theme.css" type="text/css" />
<link rel="stylesheet" href="css/theme_extra.css" type="text/css" />
<link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/github.min.css">
<link href="extra.css" rel="stylesheet">
<script>
// Current page data
var mkdocs_page_name = "Pruning";
var mkdocs_page_input_path = "algo_pruning.md";
var mkdocs_page_url = null;
</script>
<script src="js/jquery-2.1.1.min.js" defer></script>
<script src="js/modernizr-2.8.3.min.js" defer></script>
<script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
</head>
<body class="wy-body-for-nav" role="document">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side stickynav">
<div class="wy-side-nav-search">
<a href="index.html" class="icon icon-home"> Neural Network Distiller</a>
<div role="search">
<form id ="rtd-search-form" class="wy-form" action="./search.html" method="get">
<input type="text" name="q" placeholder="Search docs" title="Type search term here" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<ul class="current">
<li class="toctree-l1">
<a class="" href="index.html">Home</a>
</li>
<li class="toctree-l1">
<a class="" href="install.html">Installation</a>
</li>
<li class="toctree-l1">
<a class="" href="usage.html">Usage</a>
</li>
<li class="toctree-l1">
<a class="" href="schedule.html">Compression Scheduling</a>
</li>
<li class="toctree-l1">
<a class="" href="prepare_model_quant.html">Preparing a Model for Quantization</a>
</li>
<li class="toctree-l1">
<span class="caption-text">Compressing Models</span>
<ul class="subnav">
<li class="">
<a class="" href="pruning.html">Pruning</a>
</li>
<li class="">
<a class="" href="regularization.html">Regularization</a>
</li>
<li class="">
<a class="" href="quantization.html">Quantization</a>
</li>
<li class="">
<a class="" href="knowledge_distillation.html">Knowledge Distillation</a>
</li>
<li class="">
<a class="" href="conditional_computation.html">Conditional Computation</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<span class="caption-text">Algorithms</span>
<ul class="subnav">
<li class=" current">
<a class="current" href="algo_pruning.html">Pruning</a>
<ul class="subnav">
<li class="toctree-l3"><a href="#weights-pruning-algorithms">Weights Pruning Algorithms</a></li>
<ul>
<li><a class="toctree-l4" href="#magnitude-pruner">Magnitude Pruner</a></li>
<li><a class="toctree-l4" href="#sensitivity-pruner">Sensitivity Pruner</a></li>
<li><a class="toctree-l4" href="#level-pruner">Level Pruner</a></li>
<li><a class="toctree-l4" href="#splicing-pruner">Splicing Pruner</a></li>
<li><a class="toctree-l4" href="#automated-gradual-pruner-agp">Automated Gradual Pruner (AGP)</a></li>
<li><a class="toctree-l4" href="#rnn-pruner">RNN Pruner</a></li>
</ul>
<li class="toctree-l3"><a href="#structure-pruners">Structure Pruners</a></li>
<ul>
<li><a class="toctree-l4" href="#structure-ranking-pruners">Structure Ranking Pruners</a></li>
<li><a class="toctree-l4" href="#automated-gradual-pruner-agp-for-structures">Automated Gradual Pruner (AGP) for Structures</a></li>
<li><a class="toctree-l4" href="#hybrid-pruning">Hybrid Pruning</a></li>
</ul>
</ul>
</li>
<li class="">
<a class="" href="algo_quantization.html">Quantization</a>
</li>
<li class="">
<a class="" href="algo_earlyexit.html">Early Exit</a>
</li>
</ul>
</li>
<li class="toctree-l1">
<a class="" href="model_zoo.html">Model Zoo</a>
</li>
<li class="toctree-l1">
<a class="" href="jupyter.html">Jupyter Notebooks</a>
</li>
<li class="toctree-l1">
<a class="" href="design.html">Design</a>
</li>
<li class="toctree-l1">
<span class="caption-text">Tutorials</span>
<ul class="subnav">
<li class="">
<a class="" href="tutorial-struct_pruning.html">Pruning Filters and Channels</a>
</li>
<li class="">
<a class="" href="tutorial-lang_model.html">Pruning a Language Model</a>
</li>
<li class="">
<a class="" href="tutorial-lang_model_quant.html">Quantizing a Language Model</a>
</li>
<li class="">
<a class="" href="tutorial-gnmt_quant.html">Quantizing GNMT</a>
</li>
</ul>
</li>
</ul>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" role="navigation" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="index.html">Neural Network Distiller</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="index.html">Docs</a> »</li>
<li>Algorithms »</li>
<li>Pruning</li>
<li class="wy-breadcrumbs-aside">
</li>
</ul>
<hr/>
</div>
<div role="main">
<div class="section">
<h1 id="weights-pruning-algorithms">Weights Pruning Algorithms</h1>
<p><center><img alt="algorithms" src="imgs/algorithms-pruning.png" /></center></p>
<h2 id="magnitude-pruner">Magnitude Pruner</h2>
<p>This is the most basic pruner: it applies a thresholding function, \(thresh(.)\), to each element, \(w_i\), of a weights tensor. A different threshold can be used for each layer's weights tensor.<br>
Because the threshold is applied to individual elements, this pruner belongs to the element-wise pruning algorithm family.</p>
<p>\[ thresh(w_i)=\left\lbrace
\matrix{{{w_i: \; if \;|w_i| \; \gt}\;\lambda}\cr {0: \; if \; |w_i| \leq \lambda} }
\right\rbrace \]</p>
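<p>As an illustration of this element-wise thresholding (a minimal sketch, not Distiller's actual implementation), here is how such a mask could be computed for a PyTorch weights tensor, with <code>lam</code> standing in for the per-layer threshold \(\lambda\):</p>
<pre><code>import torch

def magnitude_threshold(weights, lam):
    """Zero every element whose absolute value is not greater than lam."""
    mask = (weights.abs() > lam).float()   # 1.0 where |w_i| > lambda, else 0.0
    return weights * mask

# Example: prune a small random tensor with lambda = 0.1
w = torch.randn(4, 4)
w_pruned = magnitude_threshold(w, lam=0.1)
</code></pre>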
<h2 id="sensitivity-pruner">Sensitivity Pruner</h2>
<p>Finding a threshold magnitude per layer is daunting, especially since each layer's elements have different average absolute values. To avoid using a direct threshold based on the values of each specific tensor, we can take advantage of the fact that the weights of convolutional and fully-connected layers exhibit a Gaussian distribution with a mean value of roughly zero.
<br>
The diagram below shows the distributions of the weights tensors of the first convolutional layer and the first fully-connected layer in TorchVision's pre-trained Alexnet model. You can see that they have an approximately Gaussian distribution.<br>
<center><img alt="conv1" src="imgs/alexnet-conv1-hist.png" /> <img alt="fc1" src="imgs/alexnet-fc1-hist.png" /></center>
<center>The distributions of Alexnet conv1 and fc1 layers</center></p>
<p>We use the standard deviation of the weights tensor as a sort of normalizing factor between the different weights tensors. For example, if a tensor is Normally distributed, then about 68% of the elements have an absolute value less than the standard deviation (\(\sigma\)) of the tensor. Thus, by setting the threshold to \(s\cdot\sigma\) we control the fraction of thresholded elements through a single multiplier \(s\) (with \(s=1\) zeroing roughly 68% of the elements, and larger values of \(s\) pruning a correspondingly larger fraction).</p>
<p>\[ thresh(w_i)=\left\lbrace
\matrix{{{w_i: \; if \;|w_i| \; \gt}\;\lambda}\cr {0: \; if \; |w_i| \leq \lambda} }
\right\rbrace \]</p>
<p>\[
\lambda = s \cdot \sigma_l \quad \text{where } \sigma_l \text{ is the std of layer } l \text{ as measured on the dense model}
\]</p>
<p>How do we choose this \(s\) multiplier?</p>
<p>In <a href="https://arxiv.org/abs/1506.02626">Learning both Weights and Connections for Efficient Neural Networks</a> the authors write:</p>
<blockquote>
<p>"We used the sensitivity results to find each layer’s threshold: for example, the smallest threshold was applied to the most sensitive layer, which is the first convolutional layer... The pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layer’s weights</p>
</blockquote>
<p>So the results of executing a pruning sensitivity analysis on the tensor give us a good starting guess for \(s\). Sensitivity analysis is an empirical method, and we still have to spend time homing in on the exact multiplier value.</p>
<h3 id="method-of-operation">Method of Operation</h3>
<ol>
<li>Start by running a pruning sensitivity analysis on the model. </li>
<li>Then use the results to set and tune the threshold of each layer: instead of a direct threshold value, use a sensitivity parameter which is multiplied by the standard-deviation of the initial weight-tensor's distribution (sketched below).</li>
</ol>
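<p>A minimal sketch of this procedure (illustrative only, not the <code>SensitivityPruner</code> code itself): the threshold is derived from the standard deviation of the dense tensor and a per-layer sensitivity multiplier <code>s</code>:</p>
<pre><code>import torch

def sensitivity_mask(weights, s):
    """Build a pruning mask using lambda = s * std(weights), measured on the dense tensor."""
    lam = s * weights.std().item()          # sigma is computed once, on the dense model
    return (weights.abs() > lam).float()    # keep elements with |w_i| > lambda

w = torch.randn(256, 256)
mask = sensitivity_mask(w, s=0.4)           # e.g. a sensitivity of 0.4 for this layer
w_pruned = w * mask
</code></pre>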
<h3 id="schedule">Schedule</h3>
<p>In their <a href="https://arxiv.org/abs/1506.02626">paper</a> Song Han et al. use iterative pruning and change the value of the \(s\) multiplier at each pruning step. Distiller's <code>SensitivityPruner</code> works differently: the threshold is set once, based on a one-time calculation of the standard-deviation of the tensor (the first time we prune), and relies on the fact that as the tensor is pruned, more elements are "pulled" toward the center of the distribution and thus more elements get pruned.</p>
<p>This actually works quite well as we can see in the diagram below. This is a TensorBoard screen-capture from Alexnet training, which shows how this method starts off pruning very aggressively, but then slowly reduces the pruning rate.
<center><img alt="conv1" src="imgs/alexnet-fc1-training-plot.png" /></center></p>
<p>We use a simple iterative-pruning schedule such as: <em>Prune every second epoch starting at epoch 0, and ending at epoch 38.</em> This excerpt from <code>alexnet.schedule_sensitivity.yaml</code> shows how this iterative schedule is conveyed in Distiller scheduling configuration YAML:</p>
<pre><code>pruners:
my_pruner:
class: 'SensitivityPruner'
sensitivities:
'features.module.0.weight': 0.25
'features.module.3.weight': 0.35
'features.module.6.weight': 0.40
'features.module.8.weight': 0.45
'features.module.10.weight': 0.55
'classifier.1.weight': 0.875
'classifier.4.weight': 0.875
'classifier.6.weight': 0.625
policies:
- pruner:
instance_name : 'my_pruner'
starting_epoch: 0
ending_epoch: 38
frequency: 2
</code></pre>
<h2 id="level-pruner">Level Pruner</h2>
<p>Class <code>SparsityLevelParameterPruner</code> uses a similar magnitude-based method, but sidesteps the need to specify explicit threshold values.
Instead of specifying a threshold magnitude, you specify a target sparsity level (expressed as a fraction, so 0.5 means 50% sparsity). Essentially this pruner also uses a pruning criterion based on the magnitude of each tensor element, but it has the advantage that you can aim for an exact and specific sparsity level.<br>
This pruner is much more stable than <code>SensitivityPruner</code> because the target sparsity level is not coupled to the actual magnitudes of the elements. Distiller's <code>SensitivityPruner</code> is unstable because the final sparsity level depends on the convergence pattern of the tensor distribution. Song Han's methodology of using several different values for the multiplier \(s\), and recalculating the standard-deviation at each pruning phase, probably gives it stability, but requires many more hyper-parameters (which is the reason we have not implemented it thus far).</p>
<p>To set the target sparsity levels, you can once again use pruning sensitivity analysis to make better guesses at the correct sparsity level of each layer.</p>
<h3 id="method-of-operation_1">Method of Operation</h3>
<ol>
<li>Sort the weights in the specified layer by their absolute values.</li>
<li>Mask to zero the smallest-magnitude weights until the desired sparsity level is reached (see the sketch below).</li>
</ol>
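<p>A sketch of these two steps (illustrative only; the function and parameter names are ours, not Distiller's API): find the magnitude threshold corresponding to the requested sparsity fraction, then mask everything at or below it:</p>
<pre><code>import torch

def prune_to_level(weights, sparsity):
    """Zero the smallest-magnitude elements so that `sparsity` of the tensor is zero."""
    num_to_prune = int(sparsity * weights.numel())
    if num_to_prune == 0:
        return weights
    # The threshold is the k-th smallest absolute value in the tensor
    threshold = weights.abs().flatten().kthvalue(num_to_prune).values
    mask = (weights.abs() > threshold).float()
    return weights * mask

w = torch.randn(512, 512)
w_pruned = prune_to_level(w, sparsity=0.5)   # aim for 50% element-wise sparsity
</code></pre>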
<h2 id="splicing-pruner">Splicing Pruner</h2>
<p>In <a href="https://arxiv.org/abs/1608.04493">Dynamic Network Surgery for Efficient DNNs</a> Guo et al. propose that network pruning and splicing work in tandem. A <code>SplicingPruner</code> is a pruner that both prunes and splices connections, and it works best with a Dynamic Network Surgery schedule which, for example, configures the <code>PruningPolicy</code> to mask weights only during the forward pass.</p>
<h2 id="automated-gradual-pruner-agp">Automated Gradual Pruner (AGP)</h2>
<p>In <a href="https://arxiv.org/abs/1710.01878">To prune, or not to prune: exploring the efficacy of pruning for model compression</a>, authors Michael Zhu and Suyog Gupta provide an algorithm for scheduling a Level Pruner, which Distiller implements in <code>AutomatedGradualPruner</code>.
<center><img alt="agp formula" src="imgs/agp_formula.png" /></center></p>
<blockquote>
<p>"We introduce a new automated gradual pruning algorithm in which the sparsity is increased from an initial sparsity value \(s_i\) (usually 0) to a final sparsity value \(s_f\) over a span of n pruning steps.
The intuition behind this sparsity function in equation (1) is to prune the network rapidly in the initial phase when the redundant connections are
abundant and gradually reduce the number of weights being pruned each time as there are fewer and fewer weights remaining in the network.""</p>
</blockquote>
<p><center><img alt="Gradual Pruning" src="imgs/gradual_pruning.png" /></center></p>
<p>You can play with the scheduling parameters in the <a href="https://github.com/IntelLabs/distiller/blob/master/jupyter/agp_schedule.ipynb">agp_schedule.ipynb notebook</a>.</p>
<p>The authors describe AGP:</p>
<blockquote>
<ul>
<li>Our automated gradual pruning algorithm prunes the smallest magnitude weights to achieve a preset level of network sparsity.</li>
<li>Doesn't require much hyper-parameter tuning</li>
<li>Shown to perform well across different models</li>
<li>Does not make any assumptions about the structure of the network or its constituent layers, and is therefore more generally applicable.</li>
</ul>
</blockquote>
<h2 id="rnn-pruner">RNN Pruner</h2>
<p>The authors of <a href="https://arxiv.org/abs/1704.05119">Exploring Sparsity in Recurrent Neural Networks</a>, Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta, "propose a technique to reduce the parameters of a network by pruning weights during the initial training of the network." They use a gradual pruning schedule, reminiscent of the schedule used in AGP, for element-wise pruning of RNNs, which they apply during training. They show pruning of RNN, GRU, LSTM and embedding layers.</p>
<p>Distiller's <code>distiller.pruning.BaiduRNNPruner</code> class implements this pruning algorithm.</p>
<p><center><img alt="Baidu RNN Pruning" src="imgs/baidu_rnn_pruning.png" /></center></p>
<h1 id="structure-pruners">Structure Pruners</h1>
<p>Element-wise pruning can create very sparse models which can be compressed to a smaller memory footprint and lower bandwidth, but without specialized hardware that can compute directly on the sparse representation of the tensors, we don't gain any speedup of the computation. Structure pruners remove entire "structures", such as kernels, filters, and even entire feature-maps.</p>
<h2 id="structure-ranking-pruners">Structure Ranking Pruners</h2>
<p>Ranking pruners use some criterion to rank the structures in a tensor, and then prune the tensor to a specified level. In principle, these pruners perform one-shot pruning, but can be combined with automatic pruning-level scheduling, such as AGP (see below).
In <a href="https://arxiv.org/abs/1608.08710">Pruning Filters for Efficient ConvNets</a> the authors use filter ranking, with <strong>one-shot pruning</strong> followed by fine-tuning. The authors of <a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6288897">Exploiting Sparseness in Deep Neural Networks for Large Vocabulary Speech Recognition</a> also use a one-shot pruning schedule, for fully-connected layers, and they provide an explanation:</p>
<blockquote>
<p>First, after sweeping through the full training set several times the weights become relatively stable — they tend to remain either large or small magnitudes. Second, in a stabilized model, the importance of the connection is approximated well by the magnitudes of the weights (times the magnitudes of the corresponding input values, but these are relatively uniform within each layer since on the input layer, features are normalized to zero-mean and unit-variance, and hidden-layer values are probabilities)</p>
</blockquote>
<h3 id="l1rankedstructureparameterpruner">L1RankedStructureParameterPruner</h3>
<p>The <code>L1RankedStructureParameterPruner</code> pruner calculates the magnitude of each "structure", orders all of the structures by this magnitude function, and prunes away the <em>m</em> lowest-ranking structures. It ranks structures using the mean of the absolute value of the structure as the representative of the structure magnitude. The absolute mean does not depend on the size of the structure, so it is easier to use than the raw \(L_1\)-norm of the structure, while remaining a good proxy for it. Basically, you can think of <code>mean(abs(t))</code> as the structure's \(L_1\)-norm normalized by the length of the structure. <code>L1RankedStructureParameterPruner</code> currently prunes weight filters, channels, and rows (for linear layers).</p>
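<p>As a sketch of this ranking for 4-D convolution weights (illustrative only; this is not the class's actual code), each filter is scored by <code>mean(abs(filter))</code> and the <em>m</em> lowest-scoring filters are zeroed:</p>
<pre><code>import torch

def prune_lowest_filters(conv_weights, m):
    """conv_weights has shape (num_filters, in_channels, k, k); zero the m weakest filters."""
    # mean(abs(filter)) is the per-filter magnitude score (a length-normalized L1-norm)
    scores = conv_weights.abs().mean(dim=(1, 2, 3))
    _, lowest = torch.topk(scores, m, largest=False)   # indices of the m lowest-ranked filters
    pruned = conv_weights.clone()
    pruned[lowest] = 0.0
    return pruned

w = torch.randn(64, 3, 3, 3)                 # e.g. a conv layer with 64 filters
w_pruned = prune_lowest_filters(w, m=16)
</code></pre>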
<h3 id="activationapozrankedfilterpruner">ActivationAPoZRankedFilterPruner</h3>
<p>The <code>ActivationAPoZRankedFilterPruner</code> pruner uses the activation channels mean APoZ (average percentage of zeros) to rank weight filters and prune a specified percentage of filters.
This method is called <em>Network Trimming</em> from the research paper:
"Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures",
Hengyuan Hu, Rui Peng, Yu-Wing Tai, Chi-Keung Tang, ICLR 2016
https://arxiv.org/abs/1607.03250 </p>
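<p>As a sketch of the ranking criterion (our illustration, assuming a post-ReLU activation tensor of shape (batch, channels, height, width)), APoZ is simply the fraction of zero activations in each channel, averaged over the batch:</p>
<pre><code>import torch

def apoz_per_channel(activations):
    """Average Percentage of Zeros for each channel of an (N, C, H, W) activation tensor."""
    zeros = (activations == 0).float()
    return zeros.mean(dim=(0, 2, 3))   # a higher APoZ means the channel's filter matters less

acts = torch.relu(torch.randn(8, 64, 14, 14))
scores = apoz_per_channel(acts)        # rank filters by these scores; prune the highest-APoZ ones
</code></pre>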
<h3 id="gradientrankedfilterpruner">GradientRankedFilterPruner</h3>
<p>The <code>GradientRankedFilterPruner</code> tries to assess the importance of weight filters using the product of their gradients and the filter values.</p>
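<p>A brief sketch of one way such a score can be formed (our assumption of the criterion, not the pruner's exact code): given a convolution layer's weight tensor and its gradient, score each filter by the magnitude of their element-wise product:</p>
<pre><code>import torch

def gradient_filter_scores(weights, grads):
    """Per-filter importance ~ |gradient * weight|, summed over each (in_channels, k, k) filter."""
    return (weights * grads).abs().sum(dim=(1, 2, 3))
</code></pre>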
<h3 id="randomrankedfilterpruner">RandomRankedFilterPruner</h3>
<p>For research purposes we may want to compare the results of some structure-ranking pruner to a random structure-ranking. The <code>RandomRankedFilterPruner</code> pruner can be used for this purpose.</p>
<h2 id="automated-gradual-pruner-agp-for-structures">Automated Gradual Pruner (AGP) for Structures</h2>
<p>The idea of a mathematical formula controlling the sparsity level growth is very useful and <code>StructuredAGP</code> extends the implementation to structured pruning.</p>
<h3 id="pruner-compositions">Pruner Compositions</h3>
<p>Pruners can be combined to create new pruning schemes. Specifically, with a few lines of code we currently marry the AGP sparsity level scheduler with our filter-ranking classes to create pruner compositions. For each of these, we use AGP to decide how many filters to prune at each step, and we choose the filters to remove using one of the filter-ranking methods:</p>
<ul>
<li><code>L1RankedStructureParameterPruner_AGP</code></li>
<li><code>ActivationAPoZRankedFilterPruner_AGP</code></li>
<li><code>GradientRankedFilterPruner_AGP</code></li>
<li><code>RandomRankedFilterPruner_AGP</code></li>
</ul>
<h2 id="hybrid-pruning">Hybrid Pruning</h2>
<p>In a single schedule we can mix different pruning techniques: for example, pruning and regularization, or structured pruning and element-wise pruning. We can even apply different methods to the same tensor. For example, we might want to perform filter pruning for a few epochs, then perform <em>thinning</em>, and continue with element-wise pruning of the smaller network tensors. We call this technique of mixing different methods Hybrid Pruning, and Distiller includes a few example schedules.</p>
</div>
</div>
<footer>
<div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
<a href="algo_quantization.html" class="btn btn-neutral float-right" title="Quantization">Next <span class="icon icon-circle-arrow-right"></span></a>
<a href="conditional_computation.html" class="btn btn-neutral" title="Conditional Computation"><span class="icon icon-circle-arrow-left"></span> Previous</a>
</div>
<hr/>
<div role="contentinfo">
<!-- Copyright etc -->
</div>
Built with <a href="http://www.mkdocs.org">MkDocs</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<div class="rst-versions" role="note" style="cursor: pointer">
<span class="rst-current-version" data-toggle="rst-current-version">
<span><a href="conditional_computation.html" style="color: #fcfcfc;">« Previous</a></span>
<span style="margin-left: 15px"><a href="algo_quantization.html" style="color: #fcfcfc">Next »</a></span>
</span>
</div>
<script>var base_url = '.';</script>
<script src="js/theme.js" defer></script>
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML" defer></script>
<script src="search/main.js" defer></script>
</body>
</html>