<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>小骑士</title>
<subtitle>subtitle</subtitle>
<link href="/atom.xml" rel="self"/>
<link href="http://yoursite.com/"/>
<updated>2018-01-23T04:26:22.729Z</updated>
<id>http://yoursite.com/</id>
<author>
<name>knightyang</name>
</author>
<generator uri="http://hexo.io/">Hexo</generator>
<entry>
<title>A Comparison of the Prophet Platform and Google Cloud AutoML</title>
<link href="http://yoursite.com/2018/01/23/%E7%AC%AC%E5%9B%9B%E8%8C%83%E5%BC%8F%E5%85%88%E7%9F%A5%E5%B9%B3%E5%8F%B0%E4%B8%8EGoogle%20Cloud%20automl%E5%AF%B9%E6%AF%94%E5%88%86%E6%9E%90/"/>
<id>http://yoursite.com/2018/01/23/第四范式先知平台与Google Cloud automl对比分析/</id>
<published>2018-01-23T03:44:59.477Z</published>
<updated>2018-01-23T04:26:22.729Z</updated>
<content type="html"><![CDATA[<h3 id="先知平台与Google-Cloud-automl的对比分析"><a href="#先知平台与Google-Cloud-automl的对比分析" class="headerlink" title="先知平台与Google Cloud automl的对比分析"></a>先知平台与Google Cloud automl的对比分析</h3><h4 id="1-第四范式先知平台"><a href="#1-第四范式先知平台" class="headerlink" title="1. 第四范式先知平台"></a>1. 第四范式先知平台</h4><ul>
<li><p>Overview</p>
<p>"Prophet" (先知) bundles ten headline capabilities: cleaning-free data ingestion, self-learning models, one-click deployment, elastic computing, real-time data streams, intelligent data integration, automatic feature combination, an AI-oriented computing framework, model interpretation, and support for customized requirements. With these, the platform aims to automate and "intelligentize" the entire machine-learning workflow.</p>
</li>
<li><p>Analysis</p>
<p>Under the hood, Prophet runs on GDBT (General Distributed Brain Technology), 4Paradigm's in-house large-scale distributed machine-learning framework.</p>
<p><strong>Its defining trait is that machine-learning functionality is packaged as independent operators: SQL, data-splitting and data-cleaning operators for preprocessing; feature-extraction and automatic feature-combination operators for feature engineering; logistic-regression and DNN operators for classification; plus operators for prediction and evaluation. The output of one operator can serve as the input of another; for example, the feature-extraction operator's output can be passed directly to the logistic-regression operator.</strong> (A minimal sketch of this operator-composition idea appears after this list.)</p>
<p>The overall architecture combines offline computing (a YARN/Spark cluster, typically used for model training) and real-time computing (high-performance cloud servers that accept HTTP and RPC requests, typically used for online prediction). The architecture diagram:<br><img src="https://res.infoq.com/articles/the-fourth-paradigm-prophet-platform/zh/resources/01.jpg" alt="Prophet platform architecture"></p>
</li>
<li><p>Usage</p>
<p>Log in to the platform, drag and drop operators onto the canvas, and fill in their parameters.</p>
<p><img src="web.jpg" alt="Prophet platform user interface"></p>
</li>
<li><p>Summary</p>
<p>Prophet packages machine-learning data preprocessing steps as operators, wraps the learning algorithms themselves as operators, and lets related operators form data dependencies. Backed by high-performance cloud servers and a distributed YARN cluster, it offers efficient online prediction and classification as well as offline training, and its graphical interface makes it easy to build and serve models simply by configuring operators.</p>
<p>Packaging algorithms as operators also has a downside: if the platform's built-in algorithms cannot meet a business requirement, the platform cannot be used, because it does not support user-defined operators.</p>
</li>
</ul>
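<p>To make the operator-composition idea concrete, here is a minimal, hypothetical sketch (not the actual GDBT API; every class and function name below is invented for illustration) of operators whose outputs feed other operators in a small DAG:</p>
<pre><code>class Operator:
    """A toy operator: wraps a function plus the operators it depends on."""
    def __init__(self, name, fn, inputs=()):
        self.name, self.fn, self.inputs = name, fn, list(inputs)

    def run(self):
        # Recursively run upstream operators, then apply this operator.
        return self.fn(*(op.run() for op in self.inputs))

# Build a tiny DAG: sql -> feature extraction -> logistic-regression training.
sql_op      = Operator("sql",      lambda: [[1.0, 2.0], [3.0, 4.0]])
feature_op  = Operator("features", lambda rows: [[x * 0.1 for x in r] for r in rows], [sql_op])
lr_train_op = Operator("lr_train", lambda feats: {"weights": [sum(c) for c in zip(*feats)]}, [feature_op])

print(lr_train_op.run())   # the DAG is executed from the leaves upward
</code></pre>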
<h4 id="2-Google-Cloud-automl"><a href="#2-Google-Cloud-automl" class="headerlink" title="2. Google Cloud automl"></a>2. Google Cloud automl</h4><ul>
<li><p>简介</p>
<p> Google的自动训练模型平台,基于监督学习创建,开发者只需要通过鼠标拖拽的方式上传一组图片、导入标签,随后谷歌系统就会自动生成一个定制化的机器学习模型,几乎不需要任何人为的干预。</p>
</li>
</ul>
<ul>
<li><p>分析</p>
<p>该平台目前只提供了Cloud AutoML Vision(处理图片、视频相关),后续会推出更多功能。现阶段处于试用期,需要提交申请才可能使用(ps:以个人名义申请快一周了还未有答复)。</p>
<p><strong>最大的特点是运用迁移学习技术,基于已训练好的旧场景模型和少量新场景数据,重新训练一个适用于新场景的模型;还通过learning2learn功能自动挑选适合的模型,搭配超参数调整技术(Hyperparameter tuning technologies)自动调整参数。这样模型训练和调参都能自动</strong></p>
</li>
<li><p>用法</p>
<p>尚未申请到试用资格,只能从官方博客视频中查看用法:上传数据配置参数、点击训练即可,支持模型线上部署,提供Restful接口。</p>
</li>
<li><p>小结</p>
<p>基于相似场景已有的模型和新场景的数据,自动调参训练出适用于新场景的模型。</p>
</li>
</ul>
<h3 id="3-先知平台与Google-Cloud-automl的对比分析"><a href="#3-先知平台与Google-Cloud-automl的对比分析" class="headerlink" title="3. 先知平台与Google Cloud automl的对比分析"></a>3. 先知平台与Google Cloud automl的对比分析</h3><ul>
<li><p><strong>用途</strong></p>
<p>先知平台侧重于机器学习,将机器学习相关功能点算子化,业务方通过配置算子、构建算子DAG依赖图来执行。</p>
<p>Google cloud automl侧重于基于已有模型和少量新场景数据,快速训练出新模型。</p>
</li>
<li><p><strong>可借鉴点</strong></p>
<ul>
<li>功能算子化。代码封装、解耦</li>
<li>迁移学习。少量样本数据、模型复用</li>
</ul>
</li>
</ul>
<h3 id="4-参考资料"><a href="#4-参考资料" class="headerlink" title="4. 参考资料"></a>4. 参考资料</h3><ul>
<li><a href="http://blog.csdn.net/dzJx2EOtaA24Adr/article/details/79091614" target="_blank" rel="external">谷歌重磅:不用写代码也能建模调参,Cloud AutoML要实现全民玩AI</a></li>
<li><a href="https://www.blog.google/topics/google-cloud/cloud-automl-making-ai-accessible-every-business/" target="_blank" rel="external">Cloud AutoML: Making AI accessible to every business</a></li>
<li><a href="http://www.sohu.com/a/119200448_465975" target="_blank" rel="external">专栏 | 第四范式先知平台的整体架构和实现细节</a></li>
<li><a href="https://prophet.4paradigm.com/#/prophets" target="_blank" rel="external">第四范式先知平台</a></li>
</ul>
]]></content>
<summary type="html">
<h3 id="先知平台与Google-Cloud-automl的对比分析"><a href="#先知平台与Google-Cloud-automl的对比分析" class="headerlink" title="先知平台与Google Cloud automl的对比分析"></a
</summary>
<category term="automl" scheme="http://yoursite.com/categories/automl/"/>
<category term="automl" scheme="http://yoursite.com/tags/automl/"/>
</entry>
<entry>
<title>Tensorflow "hello world"源码分析</title>
<link href="http://yoursite.com/2017/11/27/Tensorflow%20%E2%80%9Chello%20world%E2%80%9D%E7%A8%8B%E5%BA%8F%E6%BA%90%E7%A0%81%E5%88%86%E6%9E%90%EF%BC%88%E4%B8%80%EF%BC%89/"/>
<id>http://yoursite.com/2017/11/27/Tensorflow “hello world”程序源码分析(一)/</id>
<published>2017-11-27T02:13:55.935Z</published>
<updated>2017-11-27T02:15:14.798Z</updated>
<content type="html"><![CDATA[<h3 id="“hello-world”-代码"><a href="#“hello-world”-代码" class="headerlink" title="“hello world” 代码"></a>“hello world” 代码</h3><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div></pre></td><td class="code"><pre><div class="line">import tensorflow as tf</div><div class="line"></div><div class="line"># 创建一个常量 op, 产生一个 1x2 矩阵. 这个 op 被作为一个节点</div><div class="line"># 加到默认图中.</div><div class="line">#</div><div class="line"># 构造器的返回值代表该常量 op 的返回值.</div><div class="line">matrix1 = tf.constant([[3., 3.]])</div><div class="line"></div><div class="line"># 创建另外一个常量 op, 产生一个 2x1 矩阵.</div><div class="line">matrix2 = tf.constant([[2.],[2.]])</div><div class="line"></div><div class="line"># 创建一个矩阵乘法 matmul op , 把 'matrix1' 和 'matrix2' 作为输入.</div><div class="line"># 返回值 'product' 代表矩阵乘法的结果.</div><div class="line">product = tf.matmul(matrix1, matrix2)</div><div class="line"></div><div class="line"># 启动默认图.</div><div class="line">sess = tf.Session()</div><div class="line"></div><div class="line"># 调用 sess 的 'run()' 方法来执行矩阵乘法 op, 传入 'product' 作为该方法的参数.</div><div class="line"># 上面提到, 'product' 代表了矩阵乘法 op 的输出, 传入它是向方法表明, 我们希望取回</div><div class="line"># 矩阵乘法 op 的输出.</div><div class="line">#</div><div class="line"># 整个执行过程是自动化的, 会话负责传递 op 所需的全部输入. op 通常是并发执行的.</div><div class="line">#</div><div class="line"># 函数调用 'run(product)' 触发了图中三个 op (两个常量 op 和一个矩阵乘法 op) 的执行.</div><div class="line">#</div><div class="line"># 返回值 'result' 是一个 numpy `ndarray` 对象.</div><div class="line">result = sess.run(product)</div><div class="line">print result</div><div class="line"># ==> [[ 12.]]</div><div class="line"></div><div class="line"># 任务完成, 关闭会话.</div><div class="line">sess.close()</div></pre></td></tr></table></figure>
<h4 id="一-构建Const-Tensor"><a href="#一-构建Const-Tensor" class="headerlink" title="一. 构建Const Tensor"></a>一. 构建Const Tensor</h4><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">matrix1 = tf.constant([[3., 3.]])</div></pre></td></tr></table></figure>
<ol>
<li>Get the global default graph; when the graph is first created it fetches every registered operator<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">self._registered_ops = op_def_registry.get_registered_ops()</div></pre></td></tr></table></figure>
</li>
</ol>
<p>Operator registration is performed by gen_array_ops.py</p>
<blockquote>
<p>Note: gen_array_ops.py is generated dynamically when TensorFlow is built; its contents come from tensorflow\python\framework\python_op_gen.cc</p>
<p>What is the benefit of generating this file dynamically? Perhaps so that a gen_array_ops.py containing custom operators can be generated?</p>
</blockquote>
<ol>
<li>Convert [[3., 3.]] into a Tensor</li>
</ol>
<ul>
<li>Convert the Python list into a <code>numpy ndarray</code></li>
<li>Build a <code>TensorProto</code> object, filling in the three attributes <code>dtype</code>, <code>tensor_shape</code> and <code>tensor_content</code>: the enum value of the Tensor's data type (compact, e.g. float32 maps to <code>1</code>), its shape (e.g. 1 * 2), and the serialized bytes of the values (roughly ndarray.tostring()); see the sketch after this list<blockquote>
<p>The proto format works both for transport (Python -&gt; C++, between devices) and for saving and loading</p>
</blockquote>
</li>
</ul>
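<p>A quick way to see those TensorProto fields is <code>tf.make_tensor_proto</code>, a minimal sketch using the TensorFlow 1.x-era Python API referenced in this post:</p>
<pre><code>import numpy as np
import tensorflow as tf

# Build the TensorProto that tf.constant([[3., 3.]]) would embed in its NodeDef.
proto = tf.make_tensor_proto(np.array([[3., 3.]], dtype=np.float32))

print(proto.dtype)           # 1 -> the enum value for DT_FLOAT
print(proto.tensor_shape)    # dim { size: 1 } dim { size: 2 }
print(proto.tensor_content)  # raw bytes of the ndarray (used for multi-element numeric arrays)
</code></pre>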
<ol>
<li><p>Create the Const operator</p>
<ul>
<li>Build a NodeDef, specifying its name and attrs (i.e. dtype and value)</li>
<li>Build the list of output Tensors from output_types, one Tensor(i, output_type) per output type</li>
</ul>
</li>
<li><p>Return ConstOperator.output[0] as the Const Tensor</p>
</li>
</ol>
<h4 id="二-构建矩阵乘法Tensor"><a href="#二-构建矩阵乘法Tensor" class="headerlink" title="二. 构建矩阵乘法Tensor"></a>二. 构建矩阵乘法Tensor</h4><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">product = tf.matmul(matrix1, matrix2)</div></pre></td></tr></table></figure>
<ol>
<li>Again the inputs are first converted into Tensors (here matrix1 and matrix2 already are)</li>
<li>Build the matmul op; the number and types of the output tensors are inferred automatically<blockquote>
<p>The parameter definitions of every op live in E:\tensorflow-master\tensorflow\core\ops\ops.pbtxt</p>
</blockquote>
</li>
<li>Return the output tensor</li>
</ol>
<h4 id="三-启动Session"><a href="#三-启动Session" class="headerlink" title="三. 启动Session"></a>三. 启动Session</h4><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">sess = tf.Session()</div></pre></td></tr></table></figure>
<ol>
<li>In TensorFlow, Python acts as the client and C++ as the server; the session actually runs in C++. Through SWIG, Python can call C++ functions.</li>
<li>根据配置信息获取session<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div></pre></td><td class="code"><pre><div class="line">Status NewSession(const SessionOptions& options, Session** out_session) {</div><div class="line"> SessionFactory* factory;</div><div class="line"> Status s = SessionFactory::GetFactory(options, &factory);</div><div class="line"> if (!s.ok()) {</div><div class="line"> *out_session = nullptr;</div><div class="line"> LOG(ERROR) << s;</div><div class="line"> return s;</div><div class="line"> }</div><div class="line"> *out_session = factory->NewSession(options);</div><div class="line"> if (!*out_session) {</div><div class="line"> return errors::Internal("Failed to create session.");</div><div class="line"> }</div><div class="line"> return Status::OK();</div><div class="line">}</div></pre></td></tr></table></figure>
</li>
</ol>
<p>If the target is empty, a DirectSession is created, i.e. local mode;<br>if the target specifies an ip:port list, cluster mode is used. A minimal sketch of the two cases follows.</p>
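<p>A sketch of both cases with the TF 1.x Python API (the gRPC address is a placeholder, not taken from this post, and requires a reachable tf.train.Server):</p>
<pre><code>import tensorflow as tf

# Local mode: empty target -> a DirectSession is created in the C++ layer.
local_sess = tf.Session()

# Cluster mode: point the session at a running server in the cluster.
# "grpc://worker0.example.com:2222" is a placeholder address.
remote_sess = tf.Session("grpc://worker0.example.com:2222")
</code></pre>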
<h4 id="四-执行图计算"><a href="#四-执行图计算" class="headerlink" title="四. 执行图计算"></a>四. 执行图计算</h4><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">result = sess.run(product)</div></pre></td></tr></table></figure>
<ol>
<li>Convert the Python-side tensor structures into their C++ representation</li>
<li>Call directSession.run() <strong>(the actual execution: graph construction, pruning, optimization, device-based partitioning and so on; a longer write-up will follow)</strong></li>
</ol>
]]></content>
<summary type="html">
<h3 id="“hello-world”-代码"><a href="#“hello-world”-代码" class="headerlink" title="“hello world” 代码"></a>“hello world” 代码</h3><figure class="hi
</summary>
<category term="Tensorflow" scheme="http://yoursite.com/categories/Tensorflow/"/>
<category term="Tensorflow" scheme="http://yoursite.com/tags/Tensorflow/"/>
</entry>
<entry>
<title>A Summary of Common Ways to Deploy Python Algorithm Models</title>
<link href="http://yoursite.com/2017/10/18/python%E7%AE%97%E6%B3%95%E6%A8%A1%E5%9E%8B%E5%B8%B8%E7%94%A8%E9%83%A8%E7%BD%B2%E6%96%B9%E5%BC%8F%E6%80%BB%E7%BB%93/"/>
<id>http://yoursite.com/2017/10/18/python算法模型常用部署方式总结/</id>
<published>2017-10-18T12:41:41.600Z</published>
<updated>2017-10-18T12:42:16.040Z</updated>
<content type="html"><![CDATA[<h3 id="Python算法模型常用部署方式总结"><a href="#Python算法模型常用部署方式总结" class="headerlink" title="Python算法模型常用部署方式总结"></a>Python算法模型常用部署方式总结</h3><p>很绕口的标题,这段时间看了好些Tensorflow、Tensorflow-Serving相关的源码与文档,尝试搭建一个分布式的算法模型线上部署架构。</p>
<p>最开始调研了Tensorflow-Serving(毕竟名字大啊),对于其能自动识别并加装模型、可配置的模型版本管理策略、方便的A/B Test等等很是喜爱。可惜业务优先级调整,得优先部署算法团队基于python版本opencv、dlib实现的人脸识别算法,只好先将Tensorflow-Serving放一放,后续有时间会总结一些Tensorflow(-Serving)架构的源码分析。(后面也想到一种很低效的实现方式)。</p>
<p>言归正传,python算法模型常用部署方式总结:</p>
<h4 id="预期功能"><a href="#预期功能" class="headerlink" title="预期功能"></a>预期功能</h4><ul>
<li>分布式、可扩展、高可用、框架链路响应十毫秒内</li>
<li>支持Python服务</li>
<li>热部署(上线新服务时,无需重启/暂停框架)</li>
<li>多服务(可部署多个算法服务,同时提供服务,服务间无耦合)</li>
<li>多版本(支持单服务多版本同时在线)</li>
<li>服务自动发现、自动部署</li>
</ul>
<h4 id="部署方案总结"><a href="#部署方案总结" class="headerlink" title="部署方案总结"></a>部署方案总结</h4><ul>
<li><p>基于web-server(利用Django或Flask),部署python服务,提供restful api</p>
<ul>
<li>例如: <a href="https://www.r-bloggers.com/lang/uncategorized/1579" target="_blank" rel="external">python数据挖掘模型的API部署</a></li>
</ul>
</li>
<li><p>部署SOA + 算法服务 </p>
<blockquote>
<p>Note: 即基于SOA框架实现分布式服务,其麻烦点在于绝大部分SOA框架都是基于JAVA/C++编写(例如常用的dubbo),而这里的算法服务却是基于python</p>
</blockquote>
<ul>
<li><p>将python算法模型改写成JAVA or C++</p>
<ul>
<li><p>例如用PMML(JAVA)重新实现算法模型, 然后用JAVA重写个加载模型后的业务处理逻辑</p>
<blockquote>
<p>但是PMML无法封装dlib里的模型dlib_face_recognition_resnet_model_v1</p>
</blockquote>
</li>
</ul>
</li>
<li><p>JAVA服务类里直接调用shell命令执行python代码</p>
</li>
<li><p>实现Python SOA(饿了么自研,思路开源,代码未开源)</p>
<ul>
<li><a href="https://eleme.github.io/blog/2016/eleme-python-soa/" target="_blank" rel="external">饿了么的 Python SOA</a> </li>
</ul>
</li>
<li><p>或者自研:python server + zk 构建服务集群,java client thrift调取服务</p>
</li>
</ul>
</li>
<li><p>专门的框架,如tensorflow-serving,可以能很方便的上线部署、灰度测试等等</p>
<ul>
<li>若要自研python版本算法框架,可参考<a href="http://www.jianshu.com/p/26abf06ebecb" target="_blank" rel="external">AlphaML(系统篇)-机器学习serving框架</a></li>
</ul>
</li>
</ul>
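<p>For the first option, a minimal sketch of a Flask prediction endpoint (the model file name and feature layout are placeholders, and a scikit-learn-style predict() API is assumed):</p>
<pre><code>import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholder: load whatever serialized model the service wraps.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]       # e.g. [[1.0, 2.0, 3.0]]
    prediction = model.predict(features).tolist()   # scikit-learn style API assumed
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
</code></pre>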
]]></content>
<summary type="html">
<h3 id="Python算法模型常用部署方式总结"><a href="#Python算法模型常用部署方式总结" class="headerlink" title="Python算法模型常用部署方式总结"></a>Python算法模型常用部署方式总结</h3><p>很绕口的标题
</summary>
<category term="算法部署" scheme="http://yoursite.com/categories/%E7%AE%97%E6%B3%95%E9%83%A8%E7%BD%B2/"/>
<category term="算法部署" scheme="http://yoursite.com/tags/%E7%AE%97%E6%B3%95%E9%83%A8%E7%BD%B2/"/>
</entry>
<entry>
<title>Spark Memory Management</title>
<link href="http://yoursite.com/2017/08/10/Spark%E5%86%85%E5%AD%98%E7%AE%A1%E7%90%86/"/>
<id>http://yoursite.com/2017/08/10/Spark内存管理/</id>
<published>2017-08-10T09:22:31.562Z</published>
<updated>2017-08-10T09:23:39.107Z</updated>
<content type="html"><![CDATA[<h3 id="Spark内存管理"><a href="#Spark内存管理" class="headerlink" title="Spark内存管理"></a>Spark内存管理</h3><p>Spark作为基于内存的分布式计算引擎,理解Spark内存管理方式对于程序调优、Spark运行机制有着不错的效果。本文旨在介绍Spark内存管理的脉络,总结与沉淀。</p>
<p>在执行Spark程序时(yarn-cluster模式),会启动两种类型的 JVM : Driver、Work。前者为控制进程,负责Job提交、根据RDD图划分Stage、生成Task,并负责相关的任务调度;后者执行Task、反馈结果给Driver。由于 Driver 的内存管理相对来说较为简单,本文主要对 Executor 的内存管理进行分析,下文中的 Spark 内存均特指 Executor 的内存。</p>
<p>本文主要内容为:</p>
<ul>
<li>介绍Spark内存分类、内存额度计算方式</li>
<li>Executor内存管理方法</li>
<li>Spark如何使用内存</li>
</ul>
<h4 id="内存分类"><a href="#内存分类" class="headerlink" title="内存分类"></a>内存分类</h4><p>按存储对象分类,Spark将Executor内存分成三种类型:</p>
<ul>
<li>执行时内存(execution memory),task执行特定操作时使用的内存,数据正在被使用,如:shuffle、sort、join、agg等</li>
<li>存储内存(storage memory),常用于存储广播变量(Broadcast)、RDD缓存等,通常这类数据会被多读很少更新</li>
<li>其它,用于存储Spark内部代码的对象,用户代码的class对象等</li>
</ul>
<p>内存大小计算方式:</p>
<ul>
<li>JVM 系统总内存:<code>Runtime.getRuntime.maxMemory (-Xmx)</code></li>
<li>其它内存最低值:固定为300MB</li>
<li>存储内存:<code>(JVM 系统总内存 - 300) * spark.memory.fraction * spark.memory.storageFraction</code></li>
<li>执行时内存:<code>(JVM 系统总内存 - 300) * spark.memory.fraction * (1 - spark.memory.storageFraction)</code></li>
<li>其它:<code>(JVM 系统总内存 - 执行时内存 - 存储内存)</code></li>
</ul>
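<p>A worked example of the formulas above in Python, using the Spark 2.x defaults spark.memory.fraction=0.6 and spark.memory.storageFraction=0.5 (the 4GB heap is just an illustrative assumption):</p>
<pre><code># Split a hypothetical 4 GB Executor heap using the formulas above.
xmx_mb           = 4096          # -Xmx, i.e. Runtime.getRuntime.maxMemory
reserved_mb      = 300           # fixed reservation for "other"
memory_fraction  = 0.6           # spark.memory.fraction (Spark 2.x default)
storage_fraction = 0.5           # spark.memory.storageFraction (default)

usable_mb    = xmx_mb - reserved_mb
storage_mb   = usable_mb * memory_fraction * storage_fraction
execution_mb = usable_mb * memory_fraction * (1 - storage_fraction)
other_mb     = xmx_mb - storage_mb - execution_mb

print(storage_mb, execution_mb, other_mb)   # 1138.8 1138.8 1818.4
</code></pre>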
<p>Execution memory is further divided, by how it is allocated, into on-heap and off-heap memory: the former lives inside the JVM heap, the latter outside it, though the two share the same quota.</p>
<h4 id="Executor-内存管理"><a href="#Executor-内存管理" class="headerlink" title="Executor 内存管理"></a>Executor 内存管理</h4><p>Spark提供了2种Executor Memory Manager: <code>StaticMemoryManager</code>、<code>UnifiedMemoryManager</code>,<strong>它们均是管理整个Executor的内存(即整个JVM)</strong> 。逻辑管理,虽然内含MemoryAllocator,但并不会进行实际上的内存申请与回收。</p>
<p>前者(<code>StaticMemoryManager</code>)会固定execution memory 和 storage memory 的内存最大值,运行期间不能超过该阈值;后者可以允许在一定条件下,使用的内存超过该阈值,比如execution memory额度用完后,可以使用storage free memory(即: storage未用完的内存额度)。</p>
<h5 id=""><a href="#" class="headerlink" title=" "></a> </h5><p>Spark自1.6开始,就默认采用<code>UnifiedMemoryManager</code>管理Executor内存。<code>UnifiedMemoryManager</code>借助<strong>内存池(Memory Pool)</strong>记录每个taskAttempt的内存使用情况,处理taskAttempt的申请、释放内存的请求。为了避免单个taskAttempt占用大量内存,设定了内存限制:每个task内存使用率最多1/N,最少1/2N(N为task个数)。</p>
<p>内存池被分成三类:storageMemoryPool、onHeapExecutionMemoryPool、offHeapExecutionMemoryPool。</p>
<figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line"><span class="meta">@GuardedBy</span>(<span class="string">"this"</span>)</div><div class="line"><span class="keyword">protected</span> <span class="keyword">val</span> storageMemoryPool = <span class="keyword">new</span> <span class="type">StorageMemoryPool</span>(<span class="keyword">this</span>)</div><div class="line"><span class="meta">@GuardedBy</span>(<span class="string">"this"</span>)</div><div class="line"><span class="keyword">protected</span> <span class="keyword">val</span> onHeapExecutionMemoryPool = <span class="keyword">new</span> <span class="type">ExecutionMemoryPool</span>(<span class="keyword">this</span>, <span class="string">"on-heap execution"</span>)</div><div class="line"><span class="meta">@GuardedBy</span>(<span class="string">"this"</span>)</div><div class="line"><span class="keyword">protected</span> <span class="keyword">val</span> offHeapExecutionMemoryPool = <span class="keyword">new</span> <span class="type">ExecutionMemoryPool</span>(<span class="keyword">this</span>, <span class="string">"off-heap execution"</span>)</div></pre></td></tr></table></figure>
<p>The acquisition logic for each is as follows (a simplified sketch of the per-task bounds appears after these lists):</p>
<h5 id="申请storage-memory:"><a href="#申请storage-memory:" class="headerlink" title="申请storage memory:"></a>Acquiring storage memory:</h5><ul>
<li><p>If storageMemoryPool has enough free memory, grant the request and update the pool's used and free counters</p>
<blockquote>
<p>Note: all pool operations are guarded by lock.synchronized</p>
</blockquote>
</li>
<li><p>If storageMemoryPool does not have enough,</p>
</li>
<li><p>borrow from onHeapExecutionMemoryPool.memoryFree;</p>
</li>
<li><p>if still not enough, evict some designated blocks (the caller specifies the BlockIds)</p>
</li>
<li><p>if that still does not suffice, the request fails and returns false</p>
</li>
</ul>
<h5 id="申请onHeap-Execution-Memory-基于active-task-attempt-id-:"><a href="#申请onHeap-Execution-Memory-基于active-task-attempt-id-:" class="headerlink" title="申请onHeap Execution Memory(基于active task attempt id):"></a>Acquiring on-heap execution memory (keyed by active task attempt id):</h5><ul>
<li>If the pool's memoryFree suffices, grant the request</li>
<li>Otherwise borrow memoryFree from storageMemoryPool</li>
<li>If still short, reclaim the memory that storageMemoryPool has borrowed from onHeapExecutionMemoryPool</li>
<li>If there is now enough, return</li>
<li>If not, and the task holds less than its guaranteed minimum (1/2N), wait; otherwise return whatever amount was granted</li>
</ul>
<h5 id="申请offHeap-Execution-Memory-基于active-task-attempt-id"><a href="#申请offHeap-Execution-Memory-基于active-task-attempt-id" class="headerlink" title="申请offHeap Execution Memory(基于active task attempt id):"></a>Acquiring off-heap execution memory (keyed by active task attempt id):</h5><ul>
<li>Enough memory: grant it</li>
<li>Not enough: return the amount granted so far, or wait</li>
</ul>
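<p>A simplified Python sketch of the 1/2N to 1/N policy described above (it mirrors the idea behind ExecutionMemoryPool's per-task bookkeeping, not the actual Scala code; locking and pool borrowing are omitted):</p>
<pre><code>def acquire_execution_memory(pool_size, task_used, task_id, request):
    """Grant a task memory while keeping it between 1/2N and 1/N of the pool.

    pool_size: total bytes in the execution pool
    task_used: dict of task_id -> bytes currently held
    request:   bytes the task is asking for
    Returns the bytes to grant now (0 means "block and retry later").
    """
    task_used.setdefault(task_id, 0)
    n = len(task_used)                       # number of active tasks
    max_per_task = pool_size // n            # upper bound: 1/N
    min_per_task = pool_size // (2 * n)      # guaranteed minimum: 1/2N

    free = pool_size - sum(task_used.values())
    # Never let this task exceed 1/N of the pool in total.
    grant = min(request, max(0, max_per_task - task_used[task_id]), free)

    if grant == 0 and task_used[task_id] &lt; min_per_task:
        return 0                             # caller should wait until memory is freed
    task_used[task_id] += grant
    return grant

# Example: a 1000-byte pool shared by two tasks.
used = {"task-0": 400}
print(acquire_execution_memory(1000, used, "task-1", 300))  # grants 300
</code></pre>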
<h4 id="Task-内存管理"><a href="#Task-内存管理" class="headerlink" title="Task 内存管理"></a>Task 内存管理</h4><p><code>TaskMemoryManager</code>以<code>MemoryConsumer</code>的方式管理当前单个Task Attempt的内存分配与回收(spill),以<code>pageTable</code>的方式(即: <code>MemoryBlock</code>数组)记录分配的内存地址。</p>
<p><code>MemoryConsumer</code> 顾名思义,为内存消费者(子类有<code>ShuffleExternalSorter</code>、<code>UnsafeExternalSorter</code>等),会向TaskMemoryManager 申请内存、释放内存,内存不够时还能写磁盘。</p>
<p>通常,<code>MemoryConsumer</code> 每次申请,都会申请memoryPage样式的一段连续内存,即由<code>MemoryBlock</code>记录了内存的起始地址与长度。</p>
<blockquote>
<p>Note: 若为offHeap Memory,起始地址 加 长度就能定位到分配的内存;</p>
<p> 若为onHeap Memory,由于GC后起始地址会改变,故用一object 加 偏移量 标注起始地址</p>
</blockquote>
<p>How a <code>MemoryConsumer</code>'s allocation request is handled:</p>
<ul>
<li><p><code>TaskMemoryManager</code> asks the Executor-level manager (<code>UnifiedMemoryManager</code>) for memory</p>
</li>
<li><p>If there is not enough, <code>TaskMemoryManager</code> iterates over the other <code>MemoryConsumer</code>s and frees their memory, i.e. spills them to disk</p>
</li>
<li><p>If still not enough, this <code>MemoryConsumer</code> itself spills to disk</p>
</li>
<li><p>Take the amount actually granted and generate the next page number</p>
</li>
<li><p>Call <code>MemoryAllocator</code> to perform the real allocation and record the page in the pageTable</p>
<ul>
<li>For on-heap memory there is a buffer pool: requests of 1MB and above first check the pool and reuse a buffer if one is available (on free, buffers go back into the pool; pool entries are weak references, reclaimable at full GC); if the pool has nothing suitable, a large array is allocated with new long[size].</li>
<li>For off-heap memory, <code>Platform.allocateMemory(size)</code> is called directly, with no need to worry about the JVM GC moving data around</li>
</ul>
</li>
<li><p>If page.size, the memory actually granted, is smaller than what was needed, the page is freed and an OutOfMemoryError is thrown</p>
<blockquote>
<p>This is a bit odd: why not check before actually allocating, instead of throwing as soon as the granted amount falls short? Perhaps to leave room for subclasses to override?</p>
</blockquote>
</li>
</ul>
<h4 id="结束语"><a href="#结束语" class="headerlink" title="结束语"></a>结束语</h4><p>Spark内存管理模块具有很强的层次性:</p>
<ul>
<li>Executor Memory Manager:管理整个JVM的storage\execution内存</li>
<li>Task Memory Manager:管理单个taskAttempt的内存,向executor申请内存</li>
<li>Memory Consumer:内存实际消费者,其子类与具体处理逻辑紧密相连</li>
</ul>
<h5 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>参考资料</h5><ul>
<li><a href="https://www.ibm.com/developerworks/cn/analytics/library/ba-cn-apache-spark-memory-management/index.html?ca=drs-&utm_source=tuicool&utm_medium=referral" target="_blank" rel="external">Apache Spark 内存管理详解</a></li>
</ul>
]]></content>
<summary type="html">
<h3 id="Spark内存管理"><a href="#Spark内存管理" class="headerlink" title="Spark内存管理"></a>Spark内存管理</h3><p>Spark作为基于内存的分布式计算引擎,理解Spark内存管理方式对于程序调优、Sp
</summary>
<category term="Spark" scheme="http://yoursite.com/categories/Spark/"/>
<category term="Spark" scheme="http://yoursite.com/tags/Spark/"/>
</entry>
<entry>
<title>Spark Job Submission Modes</title>
<link href="http://yoursite.com/2017/07/10/Spark%20%E4%BD%9C%E4%B8%9A%E6%8F%90%E4%BA%A4%E6%96%B9%E5%BC%8F/"/>
<id>http://yoursite.com/2017/07/10/Spark 作业提交方式/</id>
<published>2017-07-10T03:42:10.567Z</published>
<updated>2017-07-10T03:44:30.696Z</updated>
<content type="html"><![CDATA[<h3 id="Spark作业提交方式"><a href="#Spark作业提交方式" class="headerlink" title="Spark作业提交方式"></a>Spark作业提交方式</h3><p>最近碰到个问题,在CDH上采用Spark Streaming方式运行Elasticsearch(ES)相关的处理程序,先后遇到两个JAR冲突:</p>
<ul>
<li>ES的jackson.jar最低需求版本2.4,与CDH的<code>hive-jdbc-1.1.0-cdh5.7.2-standalone.jar</code>里包含的jackson版本冲突</li>
<li>Spark集群默认为JAVA1.7版本,而ES-5.4需要用JAVA1.8</li>
</ul>
<p>解决问题过程中发现,不同的Spark提交作业方式,得有不同的解决办法,不过基本解决思路却是一致的:</p>
<ul>
<li>设置Spark启动Driver、Work时的JAVA_HOME,使之指向JAVA1.8,而非默认指向的JAVA1.7</li>
<li>将ES依赖的jackson.jar放置在JAVA CLASS PATH最前面,以便优先使用。</li>
</ul>
<p>本文主要聊聊常见模式下Spark Driver、Work的JAVA启动参数:</p>
<ul>
<li>local模式</li>
<li>yarn-client模式</li>
<li>yarn-cluster模式</li>
</ul>
<h3 id="local模式"><a href="#local模式" class="headerlink" title="local模式"></a>local模式</h3><p>local模式下,spark driver、worker都在本地线程池里运行,位于同一个 JAVA进程。</p>
<figure class="highlight shell"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line">spark-submit \</div><div class="line"> --master local[2] \</div><div class="line"> --jars jackson-core-2.8.6.jar,my-dependent.jar \</div><div class="line"> --class clife.data.spark.ESUpdate </div><div class="line"> my-main.jar</div></pre></td></tr></table></figure>
<ol>
<li><p>The <code>SPARK_HOME/bin/spark-submit</code> script is invoked and hands off to spark-class</p>
<figure class="highlight shell"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"</div></pre></td></tr></table></figure>
</li>
<li><p><code>SPARK_HOME/bin/spark-class</code> loads the Spark env variables, starts a JVM running <code>org.apache.spark.launcher.Main</code> to build the follow-up command, and finally executes that command</p>
<figure class="highlight shell"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div></pre></td><td class="code"><pre><div class="line">build_command() {</div><div class="line"> "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"</div><div class="line"> printf "%d\0" $?</div><div class="line">}</div><div class="line"></div><div class="line">CMD=()</div><div class="line">while IFS= read -d '' -r ARG; do</div><div class="line"> CMD+=("$ARG")</div><div class="line">done < <(build_command "$@")</div><div class="line"></div><div class="line">CMD=("${CMD[@]:0:$LAST}")</div><div class="line">exec "${CMD[@]}"</div></pre></td></tr></table></figure>
</li>
<li><p><code>org.apache.spark.launcher.Main</code> runs and prints the Java command.</p>
<ul>
<li><strong>The value of SPARK_CLASSPATH is added to the Java classpath first</strong>, followed by SPARK_CONF, SPARK_JAR, lib and so on</li>
</ul>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div></pre></td><td class="code"><pre><div class="line"><span class="function">List<String> <span class="title">buildClassPath</span><span class="params">(String appClassPath)</span> <span class="keyword">throws</span> IOException </span>{</div><div class="line"> String sparkHome = getSparkHome();</div><div class="line"></div><div class="line"> List<String> cp = <span class="keyword">new</span> ArrayList<String>();</div><div class="line"> addToClassPath(cp, getenv(<span class="string">"SPARK_CLASSPATH"</span>));</div><div class="line"> addToClassPath(cp, appClassPath);</div><div class="line"></div><div class="line"> addToClassPath(cp, getConfDir());</div></pre></td></tr></table></figure>
<ul>
<li><p>The command it prints:</p>
<figure class="highlight shell"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div></pre></td><td class="code"><pre><div class="line"><span class="meta">$</span>{JAVA_HOME}/bin/java \</div><div class="line"> -cp /CDH-5.7.2-1.cdh5.7/jars/* \</div><div class="line"> -XX:MaxPermSize=256m \</div><div class="line"> org.apache.spark.deploy.SparkSubmit \</div><div class="line"> --master local[2] \</div><div class="line"> --class my.main.class.name \</div><div class="line"> ...</div><div class="line"> my-main.jar</div></pre></td></tr></table></figure>
</li>
</ul>
</li>
<li><p><code>SPARK_HOME/bin/spark-class</code> executes the command above, and the Spark program proper starts</p>
</li>
<li><p>How to resolve the conflicts:</p>
<ul>
<li>jackson conflict: <code>export SPARK_CLASSPATH=.:jackson-core-2.8.6.jar</code>, which puts the jar at the front of the JVM classpath so it is loaded first</li>
<li>Java version conflict: export JAVA_HOME=jdk1.8 locally</li>
</ul>
</li>
</ol>
<h4 id="yarn-client模式"><a href="#yarn-client模式" class="headerlink" title="yarn-client模式"></a>yarn-client模式</h4><p>会在本地启动spark driver,向yarn申请资源,集群运行task。</p>
<figure class="highlight shell"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">--master yarn \</div><div class="line">--deploy-mode client \</div></pre></td></tr></table></figure>
<h6 id="解决Driver端的冲突"><a href="#解决Driver端的冲突" class="headerlink" title="解决Driver端的冲突"></a>解决Driver端的冲突</h6><p>JAVA版本冲突:</p>
<ul>
<li>由于client模式的Driver运行在本地,可以本地export JAVA_HOME=jdk1.8,即可解决java版本问题。</li>
</ul>
<p>Jackson jar conflict:</p>
<ul>
<li><p>The script flow is the same as local mode up to the point where <code>org.apache.spark.deploy.SparkSubmit</code> runs; there the behaviour diverges. In short, different input arguments produce a different context for launching the Driver.</p>
</li>
<li><p>Passing <code>--verbose</code> prints the arguments used when <code>SparkSubmit</code> invokes <code>childMainClass</code> via reflection.</p>
</li>
<li><p>The key is childClasspath, which is added to the classloader first:</p>
<figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">if</span> (verbose) {</div><div class="line"> printStream.println(<span class="string">s"Main class:\n<span class="subst">$childMainClass</span>"</span>)</div><div class="line"> printStream.println(<span class="string">s"Arguments:\n<span class="subst">${childArgs.mkString("\n")}</span>"</span>)</div><div class="line"> printStream.println(<span class="string">s"System properties:\n<span class="subst">${sysProps.mkString("\n")}</span>"</span>)</div><div class="line"> printStream.println(<span class="string">s"Classpath elements:\n<span class="subst">${childClasspath.mkString("\n")}</span>"</span>)</div><div class="line"> printStream.println(<span class="string">"\n"</span>)</div><div class="line">}</div><div class="line"></div><div class="line"><span class="keyword">val</span> loader =</div><div class="line"> <span class="keyword">if</span> (sysProps.getOrElse(<span class="string">"spark.driver.userClassPathFirst"</span>, <span class="string">"false"</span>).toBoolean) {</div><div class="line"> <span class="keyword">new</span> <span class="type">ChildFirstURLClassLoader</span>(<span class="keyword">new</span> <span class="type">Array</span>[<span class="type">URL</span>](<span class="number">0</span>),</div><div class="line"> <span class="type">Thread</span>.currentThread.getContextClassLoader)</div><div class="line"> } <span class="keyword">else</span> {</div><div class="line"> <span class="keyword">new</span> <span class="type">MutableURLClassLoader</span>(<span class="keyword">new</span> <span class="type">Array</span>[<span class="type">URL</span>](<span class="number">0</span>),</div><div class="line"> <span class="type">Thread</span>.currentThread.getContextClassLoader)</div><div class="line"> }</div><div class="line"><span class="type">Thread</span>.currentThread.setContextClassLoader(loader)</div><div class="line"></div><div class="line"><span class="keyword">for</span> (jar <- childClasspath) {</div><div class="line"> addJarToClasspath(jar, loader)</div><div class="line">}</div></pre></td></tr></table></figure>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">Classpath elements:</div><div class="line">file:/home/my/my-main.jar</div><div class="line">file:/home/my/jackson-core-2.8.6.jar</div></pre></td></tr></table></figure>
</li>
<li><p>In client mode, jars submitted via <code>--jars</code> are loaded and handed to <code>childClasspath</code></p>
<figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">if</span> (deployMode == <span class="type">CLIENT</span>) {</div><div class="line"> childMainClass = args.mainClass</div><div class="line"> <span class="keyword">if</span> (isUserJar(args.primaryResource)) {</div><div class="line"> childClasspath += args.primaryResource</div><div class="line"> }</div><div class="line"> <span class="keyword">if</span> (args.jars != <span class="literal">null</span>) { childClasspath ++= args.jars.split(<span class="string">","</span>) }</div><div class="line"> <span class="keyword">if</span> (args.childArgs != <span class="literal">null</span>) { childArgs ++= args.childArgs }</div><div class="line">}</div></pre></td></tr></table></figure>
</li>
<li><p>That resolves the jar conflict on the Driver side.</p>
</li>
</ul>
<h6 id="解决Executor-Task的冲突"><a href="#解决Executor-Task的冲突" class="headerlink" title="解决Executor Task的冲突"></a>Resolving the conflicts for Executor tasks</h6><p><code>ApplicationMaster</code> is what Spark uses to request resources from YARN, and <code>ExecutorRunnable</code> builds the parameters with which the Executor JVM is launched (printed under Spark's default configuration).</p>
<figure class="highlight"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div></pre></td><td class="code"><pre><div class="line">val commands = prepareCommand(masterAddress, slaveId, hostname, executorMemory, executorCores,</div><div class="line"> appId, localResources)</div><div class="line"></div><div class="line">logInfo(s"""</div><div class="line"> |===============================================================================</div><div class="line"> |YARN executor launch context:</div><div class="line"> | env:</div><div class="line"> |${env.map { case (k, v) => s" $k -> $v\n" }.mkString}</div><div class="line"> | command:</div><div class="line"> | ${commands.mkString(" ")}</div><div class="line"> |===============================================================================</div><div class="line"> """.stripMargin)</div></pre></td></tr></table></figure>
<p>Java version conflict:</p>
<ul>
<li><p>JAVA_HOME is taken from the environment variables</p>
<figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">val</span> commands = prefixEnv ++ <span class="type">Seq</span>(</div><div class="line"> <span class="type">YarnSparkHadoopUtil</span>.expandEnvironment(<span class="type">Environment</span>.<span class="type">JAVA_HOME</span>) + <span class="string">"/bin/java"</span>,</div><div class="line"> <span class="string">"-server"</span>,</div></pre></td></tr></table></figure>
</li>
<li><p>The environment the Executor starts with (environment variables, classpath, related parameters) is also prepared in <code>ExecutorRunnable</code></p>
<figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">lazy</span> <span class="keyword">val</span> env = prepareEnvironment(container) <span class="comment">//准备Container启动的环境(环境变量、class-path、jvm参数、Executor参数)</span></div><div class="line"></div><div class="line">sparkConf.getExecutorEnv.foreach { <span class="keyword">case</span> (key, value) =></div><div class="line"> <span class="comment">// This assumes each executor environment variable set here is a path</span></div><div class="line"> <span class="comment">// This is kept for backward compatibility and consistency with hadoop</span></div><div class="line"> <span class="type">YarnSparkHadoopUtil</span>.addPathToEnvironment(env, key, value)</div><div class="line"> }</div><div class="line"></div><div class="line"> <span class="comment">/** Get all executor environment variables set on this SparkConf */</span></div><div class="line"> <span class="function"><span class="keyword">def</span> <span class="title">getExecutorEnv</span></span>: <span class="type">Seq</span>[(<span class="type">String</span>, <span class="type">String</span>)] = {</div><div class="line"> <span class="keyword">val</span> prefix = <span class="string">"spark.executorEnv."</span></div><div class="line"> getAll.filter{<span class="keyword">case</span> (k, v) => k.startsWith(prefix)}</div><div class="line"> .map{<span class="keyword">case</span> (k, v) => (k.substring(prefix.length), v)}</div><div class="line"> }</div><div class="line"></div><div class="line"> <span class="function"><span class="keyword">def</span> <span class="title">addPathToEnvironment</span></span>(env: <span class="type">HashMap</span>[<span class="type">String</span>, <span class="type">String</span>], key: <span class="type">String</span>, value: <span class="type">String</span>): <span class="type">Unit</span> = {</div><div class="line"> <span class="keyword">val</span> newValue = <span class="keyword">if</span> (env.contains(key)) { env(key) + getClassPathSeparator + value } <span class="keyword">else</span> value</div><div class="line"> env.put(key, newValue)</div><div class="line"> }</div></pre></td></tr></table></figure>
</li>
<li><p>Fix: submit the job with <code>--conf spark.executorEnv.JAVA_HOME=/JAVA1.8-dir/</code> to override the default JAVA_HOME</p>
</li>
</ul>
<p>Jackson.jar conflict:</p>
<ul>
<li><p><code>ExecutorRunnable</code> sets the classpath in this load order:</p>
<ul>
<li><code>spark.executor.extraClassPath</code></li>
<li><code>Environment.PWD</code></li>
<li><code>spark.yarn.jar</code> or <code>SPARK_JAR</code></li>
<li><code>HadoopClasspath</code></li>
<li><code>SPARK_DIST_CLASSPATH</code></li>
</ul>
</li>
<li><p>CDH's <code>hive-jdbc-1.1.0-cdh5.7.2-standalone.jar</code> (which contains the conflicting jackson) sits in <code>HadoopClasspath</code></p>
</li>
<li><p>Fix: submit the job with <code>--conf spark.executor.extraClassPath=/jackson.jar</code> (both overrides are shown in code after this list)</p>
</li>
</ul>
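<p>The same two overrides can also be set from code instead of the command line. A minimal PySpark sketch (the JDK path and jar path are placeholders and must match what actually exists on the cluster nodes):</p>
<pre><code>from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("es-streaming-job")
    # Point the Executor JVMs at Java 1.8 (placeholder path).
    .set("spark.executorEnv.JAVA_HOME", "/usr/java/jdk1.8.0_151")
    # Put the newer jackson jar ahead of the CDH-provided one (placeholder path).
    .set("spark.executor.extraClassPath", "/opt/libs/jackson-core-2.8.6.jar")
)

sc = SparkContext(conf=conf)
</code></pre>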
<h4 id="yarn-cluster模式"><a href="#yarn-cluster模式" class="headerlink" title="yarn-cluster模式"></a>yarn-cluster模式</h4><p>向yarn申请资源,运行driver、task。</p>
<figure class="highlight shell"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">--master yarn \</div><div class="line">--deploy-mode cluster \</div></pre></td></tr></table></figure>
<p>When the job is submitted from the command line, <code>spark-submit</code> automatically starts a JVM running the main class <code>org.apache.spark.deploy.Client</code>, which requests resources from YARN, calls submitApplication, and runs the Driver there.</p>
<h6 id="解决Driver端冲突"><a href="#解决Driver端冲突" class="headerlink" title="解决Driver端冲突"></a>Resolving the conflicts on the Driver</h6><p><code>yarn.Client.createContainerLaunchContext()</code> builds the environment in which the Driver is submitted.</p>
<p>Java version conflict:</p>
<ul>
<li>As with the executors in yarn-client mode, JAVA_HOME is taken from the environment, and <code>spark.yarn.appMasterEnv.JAVA_HOME</code> can override it.</li>
</ul>
<p>jackson.jar conflict:</p>
<ul>
<li>As with the executors in yarn-client mode, setting <code>spark.driver.extraClassPath=/jackson.jar</code> fixes it</li>
</ul>
<h6 id="解决Executor-Task端冲突"><a href="#解决Executor-Task端冲突" class="headerlink" title="解决Executor Task端冲突"></a>Resolving the conflicts for Executor tasks</h6><ul>
<li>Same fixes as for the executors in yarn-client mode</li>
</ul>
<h4 id="总结"><a href="#总结" class="headerlink" title="总结"></a>Summary</h4><p>When the CDH cluster itself cannot be upgraded, resolving Spark jar conflicts and Java version mismatches is actually quite simple:</p>
<ul>
<li>Locate the JVM launch commands of the Driver and Executors</li>
<li>Put the jars your program needs at the front of the classpath</li>
<li>Override the default JAVA_HOME</li>
</ul>
]]></content>
<summary type="html">
<h3 id="Spark作业提交方式"><a href="#Spark作业提交方式" class="headerlink" title="Spark作业提交方式"></a>Spark作业提交方式</h3><p>最近碰到个问题,在CDH上采用Spark Streaming方式运行
</summary>
<category term="Spark" scheme="http://yoursite.com/categories/Spark/"/>
<category term="Spark" scheme="http://yoursite.com/tags/Spark/"/>
<category term="spark-submit" scheme="http://yoursite.com/tags/spark-submit/"/>
</entry>
<entry>
<title>Serving SQL Queries with Elasticsearch</title>
<link href="http://yoursite.com/2017/06/26/%E7%94%A8Elasticsearch%E5%A4%84%E7%90%86SQL%E6%9F%A5%E8%AF%A2/"/>
<id>http://yoursite.com/2017/06/26/用Elasticsearch处理SQL查询/</id>
<published>2017-06-26T09:59:54.584Z</published>
<updated>2017-06-26T10:00:35.962Z</updated>
<content type="html"><![CDATA[<h2 id="用Elasticsearch处理SQL查询"><a href="#用Elasticsearch处理SQL查询" class="headerlink" title="用Elasticsearch处理SQL查询"></a>用Elasticsearch处理SQL查询</h2><h3 id="简介"><a href="#简介" class="headerlink" title="简介"></a>简介</h3><p>最近碰到个需求:业务方mysql单机查询耗时无法满足需求、部分sql查询不出来,希望能提供个实时查询框架解决这个问题,最好是分布式的,好扩展嘛。(PS:最开始产品经理聊时,提到mysql 库表太大、单表count(<em>)都不能出结果。目测单表数据量估计快10亿了,结果后来和业务后台开发核对细节时,发现单表数据量最多也就百万级,最慢sql也就耗时几秒,哪儿有撒count(</em>)不出结果的,也是醉了,还是得多和搞技术的交流啊。真实需求是百万级数据量需要查询耗时毫秒级)</p>
<p>调研了各种实时查询框架:Elasticsearch、druid、phoenix、solr、impala,最终选定了Elasticsearch,<strong>版本5.4.0</strong>。原因简单来说:ES更简单、更快,当然缺点是对join的支持也不好、没有分区的概念。</p>
<p>本文主要聊聊如何用ES处理SQL业务,碰到的问题、解决版本。不会涉及ES的基本概念。</p>
<p>主要聊聊:</p>
<ul>
<li>ES文档结构设计</li>
<li>Mysql数据如何导入ES</li>
<li>写ES Query DSL中碰到的问题</li>
</ul>
<h3 id="文档结构的设计"><a href="#文档结构的设计" class="headerlink" title="文档结构的设计"></a>文档结构的设计</h3><p> 在文档结构设计上尝试了2种方案:</p>
<h5 id="方案1:将ES当关系数据库使,数据库表结构与ES文档结构一一对应。-参考系列文章:把Elasticsearch-当数据库使"><a href="#方案1:将ES当关系数据库使,数据库表结构与ES文档结构一一对应。-参考系列文章:把Elasticsearch-当数据库使" class="headerlink" title="方案1:将ES当关系数据库使,数据库表结构与ES文档结构一一对应。 参考系列文章:把Elasticsearch 当数据库使"></a>方案1:将ES当关系数据库使,数据库表结构与ES文档结构一一对应。 参考系列文章:<a href="https://segmentfault.com/a/1190000004433446" target="_blank" rel="external">把Elasticsearch 当数据库使</a></h5><p>简单说,就是对于数据库中的每一个表,都在ES中创建一个type(类似数据库里的表),ES type里的feild 名称/类型 与数据库表的字段名/类型 保持一致。</p>
<p>建表简单、mysql数据导入ES也简单,但是将SQL转换成ES Query DSL时 碰到一系列问题:</p>
<ul>
<li><p>问题一:ES Join支持不好</p>
<ul>
<li><p><a href="https://www.elastic.co/guide/cn/elasticsearch/guide/current/relations.html" target="_blank" rel="external">ES处理join的几种方式</a>(select * from tbl1 join tbl2 on tbl1.id = tbl2.id;)</p>
<ul>
<li><a href="https://www.elastic.co/guide/cn/elasticsearch/guide/current/application-joins.html" target="_blank" rel="external">应用层联接</a><ul>
<li>ES转换成<ul>
<li>执行 select id, * from tbl1; 结果存数组变量tbl1_id_array</li>
<li>执行 select * from tbl2 where id in [tbl1_id_array]</li>
<li>结果再拼接</li>
</ul>
</li>
<li>适用于tbl1的数据量很小 </li>
</ul>
</li>
<li><a href="https://www.elastic.co/guide/cn/elasticsearch/guide/current/denormalization.html" target="_blank" rel="external">表字段冗余</a><ul>
<li>ES 在建表时,将table1的信息冗余存入tabl2</li>
</ul>
</li>
<li><a href="https://www.elastic.co/guide/cn/elasticsearch/guide/current/nested-objects.html" target="_blank" rel="external">嵌套对象</a> <ul>
<li>比如SQL里 一个表存储所有文章,一个表存储所有评论,评论表中存储文章id,通过join能查找一篇文章的所有评论</li>
<li>ES 可以只创建一个表,每篇文章里存储所有的评论,来规避join</li>
</ul>
</li>
<li><a href="https://www.elastic.co/guide/cn/elasticsearch/guide/current/parent-child.html" target="_blank" rel="external">父-子关系文档</a><ul>
<li>ES 需要在设计表的时候 就指定好2表之间的关系</li>
<li>可以处理2张表都很大的情况,但是需要在创建表的时候就建立join关系,不大实用</li>
</ul>
</li>
</ul>
</li>
<li><p>其他资料</p>
<ul>
<li><a href="https://segmentfault.com/a/1190000004468130" target="_blank" rel="external">把Elasticsearch当数据库使:Join</a> </li>
</ul>
</li>
<li><p>Join总结</p>
<ul>
<li>ES不太适合关系数据库的join。ES JOin只会返回单表的数据,Mysql Join 会返回左右2表的数据。另外join又分为left join、right join、inner join等等,ES得用户处理这些join逻辑,不大方便。</li>
</ul>
</li>
<li><p>问题二:Union ALL、子查询不支持</p>
</li>
</ul>
</li>
</ul>
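<p>A minimal sketch of the application-side join above, using the official Python client (index and field names are illustrative; the ES 5.x-era body-style search API is assumed):</p>
<pre><code>from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Step 1: "select id from tbl1" -> collect the ids.
resp1 = es.search(index="tbl1",
                  body={"query": {"match_all": {}}, "_source": ["id"], "size": 1000})
ids = [hit["_source"]["id"] for hit in resp1["hits"]["hits"]]

# Step 2: "select * from tbl2 where id in (...)" -> terms filter on the collected ids.
resp2 = es.search(index="tbl2",
                  body={"query": {"terms": {"id": ids}}, "size": 1000})

# Step 3: stitch the two result sets together in the application.
tbl2_by_id = {}
for hit in resp2["hits"]["hits"]:
    tbl2_by_id.setdefault(hit["_source"]["id"], []).append(hit["_source"])
joined = [(i, tbl2_by_id.get(i, [])) for i in ids]
print(joined[:3])
</code></pre>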
<h5 id="方案2:将Mysql的库表结构转换成ES容易处理的文档结构"><a href="#方案2:将Mysql的库表结构转换成ES容易处理的文档结构" class="headerlink" title="方案2:将Mysql的库表结构转换成ES容易处理的文档结构"></a>方案2:将Mysql的库表结构转换成ES容易处理的文档结构</h5><p> 简单说,就是通过新的文档结构查询时,不再需要join、union all、子查询了。</p>
<p> 不过这个设计过程就比较痛苦了,需要比较深入的了解业务库表结构,各种摸爬滚打后,将库表结构转换成2种ES文档结构:</p>
<ul>
<li><p>由单个mysql表构成</p>
<ul>
<li><p>普通文档</p>
<ol>
<li>ES字段与mysql字段的名字、类型保持一致</li>
<li><p>若mysql表含有主键,就将主键作为ES的_id</p>
<blockquote>
<p>例如: mysql table 主键为(col1, col2),则可将col1_col2作为ES的_id</p>
</blockquote>
</li>
<li><p>若mysql表里没有主键,则视情况由ES自动生成_id 或者用别的方式</p>
</li>
</ol>
</li>
<li>父文档<ul>
<li>mysql里必须含有主键,并将主键值作为ES文档的_id值</li>
</ul>
</li>
<li>子文档<ul>
<li>mysql表里必须要有父文档_id对应的字段<blockquote>
<p>例如 父文档为school,_id为”school_id”对应的值,那子文档class里必须存有字段”school_id”</p>
</blockquote>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Documents built from multiple MySQL tables</p>
<ul>
<li>Pick a single main table; data from the other tables hangs off it.</li>
<li>For each secondary table, decide whether it becomes a nested common field or a nested array field in ES, i.e. whether one row of the main table matches at most one row or many rows of the secondary table.</li>
</ul>
</li>
<li><p>Example: designing the ES device type</p>
<ul>
<li><p>Decide which tables it is built from. The device type pulls from seven MySQL tables:</p>
<ul>
<li>tb_device_mac (device_id, product_id, status)</li>
<li>tb_device_mac_history (device_id, mac_address, status)</li>
<li>tb_device_auth (devce_id, user_id, auth_time)</li>
<li>tb_device_auth_history (devce_id, user_id, auth_time)</li>
<li>tb_device_bind (device_id, bind_time)</li>
<li>tb_device_bind_history (device_id, bind_time, unbind_time)</li>
<li>tb_product (product_id, product_name)</li>
</ul>
</li>
<li><p>Main table: tb_device_mac</p>
</li>
<li>Secondary table mapped to a nested common field: tb_product</li>
<li>Table merging: tables with similar structure are merged into one field, e.g. tb_device_mac and tb_device_mac_history are stored in a single field in ES, so tb_device_mac is stored redundantly</li>
<li><p>The device ES mapping</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div></pre></td><td class="code"><pre><div class="line">"properties": {</div><div class="line"> "device_id" : {}, // 来自主表tb_device_mac的字段</div><div class="line"> "mac_address": {},</div><div class="line"> "status": {},</div><div class="line"> "macs": { // 包含来自tb_device_mac 和 tb_device_mac_history的字段</div><div class="line"> "type": "nested",</div><div class="line"> "properties": {</div><div class="line"> "device_id" : {},</div><div class="line"> "product_id": {},</div><div class="line"> "status": {},</div><div class="line"> "is_history": {"type": "boolean"} // 新增字段</div><div class="line"> }</div><div class="line"> },</div><div class="line"> "auths": {} // 存储来自tb_device_auth 和 tb_dvice_auth_history的数据</div><div class="line"> "binds": {} // 存储来自tb_device_bind 和 tb_device_bind_history的数据</div><div class="line"> "product": {} //存储来自tb_product的数据</div><div class="line">}</div></pre></td></tr></table></figure>
</li>
<li><p>A device document</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div></pre></td><td class="code"><pre><div class="line">{</div><div class="line"> "device_id": 1,</div><div class="line"> "mac_address": "ABCDEFG",</div><div class="line"> "status": 0,</div><div class="line"> "macs": [ // tb_device_mac 与 (tb_device_mac union tb_device_mac_history) 是一对多关系,因此用数组存储</div><div class="line"> { // 来自tb_device_mac的数据会冗余存储一份在这里,便于处理tb_device_mac union all tb_device_mac_history 的数据</div><div class="line"> "device_id": 1,</div><div class="line"> "product_id": 12,</div><div class="line"> "status": 0,</div><div class="line"> "is_history": false</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "device_id": 1,</div><div class="line"> "product_id": 12,</div><div class="line"> "status": 1,</div><div class="line"> "is_history": true</div><div class="line"> }</div><div class="line"> ],</div><div class="line"> "binds": [{...}],</div><div class="line"> "auths": [],</div><div class="line"> "product": { // tb_product与tb_device_mac是一对一关系,因此这里不用数组存储</div><div class="line"> "product_id": 12,</div><div class="line"> "product_nam": "hello"</div><div class="line"> }</div><div class="line">}</div></pre></td></tr></table></figure>
</li>
</ul>
</li>
<li><p>Document design takeaway</p>
<ul>
<li>Option 2 is harder to design than Option 1, but it effectively sidesteps ES's weak support for join, UNION ALL and subqueries.<blockquote>
<p>Tip: Option 2 folds the tables that the old SQL joined, subqueried or UNION ALL-ed into a single ES document, so the formerly complex SQL becomes a simple Query DSL statement</p>
</blockquote>
</li>
</ul>
</li>
</ul>
<h4 id="数据导入ES"><a href="#数据导入ES" class="headerlink" title="数据导入ES"></a>数据导入ES</h4><p>应业务方需求,现网数据的写入、更新方式不变。也就是依旧还是会把数据更新到Mysql,不会直接写ES,而仅仅是查询ES。所以,这边的有个机制,把Mysql上的更新操作实时的同步到ES。</p>
<ul>
<li>这里采用了阿里开源的canal监听mysql binlog</li>
<li>将mysql insert、update、delete涉及的数据发Kafka,</li>
<li>写了个ES程序从Kafka读数据,</li>
<li>封装成ES的IndexRequest/UpdateRequest/DeleteRequest,批量处理(构造BulkRequest,设置批量阈值为1000)。</li>
</ul>
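<p>A minimal sketch of the Kafka-to-ES side of this pipeline, assuming kafka-python and the official elasticsearch client (topic name, index name and message format are placeholders; a real loader also has to map canal events onto index/update/delete operations):</p>
<pre><code>import json
from kafka import KafkaConsumer
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])
consumer = KafkaConsumer("mysql-binlog", bootstrap_servers="localhost:9092")

BATCH = 1000
actions = []
for msg in consumer:
    row = json.loads(msg.value)                    # placeholder: one canal row-change event
    actions.append({
        "_op_type": "index",                       # or "update"/"delete" depending on the event
        "_index": "device",
        "_type": "device",                         # ES 5.x still uses types
        "_id": row["device_id"],
        "_source": row,
    })
    if len(actions) >= BATCH:
        helpers.bulk(es, actions)                  # one bulk request per 1000 actions
        actions = []
</code></pre>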
<p>The sync program converts some MySQL inserts into ES UpdateRequests; for the device document described earlier, a MySQL insert into bind becomes an ES update of device.binds. Surprisingly, this process even triggered an OutOfMemory on the ES server and crashed the server process.</p>
<p>Analysis of the OutOfMemory:</p>
<ul>
<li>The ES server was running the default JVM settings with a 2G heap; <code>-XX:+HeapDumpOnOutOfMemoryError</code> dumped the heap to an hprof file.</li>
<li><p>The dump showed that more than 85% of the heap was held by BulkShardRequest.items, and most of the items were IndexRequests rather than UpdateRequests!</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div></pre></td><td class="code"><pre><div class="line">Class Name | Shallow Heap | Retained Heap</div><div class="line">---------------------------------------------------------------------------------------------------------------------------------</div><div class="line">org.elasticsearch.action.bulk.BulkShardRequest @ 0xbf1945d0 | 64 | 1,654,253,216</div><div class="line">|- items org.elasticsearch.action.bulk.BulkItemRequest[1000] @ 0xbf194610 | 4,016 | 1,654,253,152</div><div class="line">| |- [736] org.elasticsearch.action.bulk.BulkItemRequest @ 0x8b9cb678 | 32 | 1,968,936</div><div class="line">| | |- request org.elasticsearch.action.index.IndexRequest @ 0x8b7eab88 | 120 | 1,968,792</div><div class="line">| | | |- source org.elasticsearch.common.bytes.PagedBytesReference @ 0x8b9cb5d8 | 32 | 1,968,600</div><div class="line">| | | |- opType org.elasticsearch.action.DocWriteRequest$OpType @ 0xafba9758 | 32 | 88</div><div class="line">| | | |- index java.lang.String @ 0xbf0bccd0 bigdata-realtime-v1 | 24 | 80</div><div class="line">| | | |- type java.lang.String @ 0xbf0bcd60 device | 24 | 56</div><div class="line">| | | |- id java.lang.String @ 0xbf0bcd98 1006685 | 24 | 56</div><div class="line">| | | '- Total: 13 entries | | </div><div class="line">| | |- primaryResponse org.elasticsearch.action.bulk.BulkItemResponse @ 0x8b9cb698 | 32 | 112</div><div class="line">| | |- <class> class org.elasticsearch.action.bulk.BulkItemRequest @ 0xb1e7ded0 | 8 | 8</div><div class="line">| | '- Total: 3 entries | | </div><div class="line">| |- [737] org.elasticsearch.action.bulk.BulkItemRequest @ 0x8bbac2e0 | 32 | 1,968,936</div><div class="line">| |- [738] org.elasticsearch.action.bulk.BulkItemRequest @ 0x8bd8ce88 | 32 | 1,968,936</div><div class="line"></div><div class="line">---------------------------------------------------------------------------------------------------------------------------------</div></pre></td></tr></table></figure>
</li>
<li><p>Reading the ES source: an UpdateRequest is converted into an IndexRequest, and because a single device document had more than 30,000 nested entries (device.binds was an array with 30,000+ elements), each such document was about 3MB. Unluckily, all 1000 UpdateRequests in one bulk hit documents like this, which finally caused the OOM</p>
</li>
<li>Fixes:<ul>
<li>Increase the ES JVM heap</li>
<li>Ask the business side to check for dirty data; device.binds should normally have fewer than 3000 elements</li>
</ul>
</li>
</ul>
<h4 id="SQL转换成Query-DSL"><a href="#SQL转换成Query-DSL" class="headerlink" title="SQL转换成Query DSL"></a>SQL转换成Query DSL</h4><p>这块改写比较简单,目前只碰到一个问题:</p>
<ul>
<li>ES 对于Count(distinct col) 采用基数估算,结果不精确,而业务方需要精确值<ul>
<li>解决办法:改写成select col from tbl group by col,客户端再count</li>
</ul>
</li>
</ul>
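<p>The two request bodies side by side, as Python dicts (field and index names are illustrative): the cardinality aggregation is approximate, while the terms aggregation plus a client-side count of the returned buckets is exact as long as size is large enough to return every distinct value:</p>
<pre><code># Approximate: count(distinct col) via a cardinality aggregation.
approx_body = {
    "size": 0,
    "aggs": {"distinct_col": {"cardinality": {"field": "col"}}},
}

# Exact (for modest cardinalities): group by col, then count the buckets client-side.
exact_body = {
    "size": 0,
    "aggs": {"group_by_col": {"terms": {"field": "col", "size": 100000}}},
}

# With the elasticsearch Python client (assumed):
#   resp = es.search(index="tbl", body=exact_body)
#   exact_count = len(resp["aggregations"]["group_by_col"]["buckets"])
</code></pre>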
<h4 id="总结"><a href="#总结" class="headerlink" title="总结"></a>总结</h4><p>杀鸡用了牛刀,可惜没用上ES核心功能(搜索引擎)。后续会考虑用ELK实现日志实时搜索框架。</p>
]]></content>
<summary type="html">
<h2 id="用Elasticsearch处理SQL查询"><a href="#用Elasticsearch处理SQL查询" class="headerlink" title="用Elasticsearch处理SQL查询"></a>用Elasticsearch处理SQL查询</
</summary>
<category term="ElasticSearch" scheme="http://yoursite.com/categories/ElasticSearch/"/>
<category term="ElasticSearch" scheme="http://yoursite.com/tags/ElasticSearch/"/>
</entry>
<entry>
<title>ElasticSearch调研总结</title>
<link href="http://yoursite.com/2017/05/10/ElasticSearch%E8%B0%83%E7%A0%94%E6%80%BB%E7%BB%93/"/>
<id>http://yoursite.com/2017/05/10/ElasticSearch调研总结/</id>
<published>2017-05-10T09:37:53.142Z</published>
<updated>2017-05-10T09:39:22.961Z</updated>
<content type="html"><![CDATA[<h3 id="ElasticSearch调研总结"><a href="#ElasticSearch调研总结" class="headerlink" title="ElasticSearch调研总结"></a>ElasticSearch调研总结</h3><h4 id="ElasticSearch简介"><a href="#ElasticSearch简介" class="headerlink" title="ElasticSearch简介"></a>ElasticSearch简介</h4><p> Elasticsearch 是一个分布式可扩展的<strong>实时</strong>搜索和分析引擎,一个建立在全文搜索引擎 Apache Lucene(TM) 基础上的搜索引擎。ElasticSearch可以做的事儿:</p>
<ul>
<li>Distributed, real-time document storage with every field indexed, stored on local disk by default</li>
<li>A distributed search engine with real-time analytics</li>
<li>A cluster that scales dynamically, covering the distributed basics: dynamic scale-out, fault tolerance and load balancing</li>
</ul>
<h3 id="基本概念"><a href="#基本概念" class="headerlink" title="基本概念"></a>基本概念</h3><ul>
<li><p>Cluster (集群)</p>
</li>
<li><p>Node</p>
<blockquote>
<p>Each node runs one ElasticSearch instance. After startup a node broadcasts to discover peers, and nodes configured with the same cluster.name form a cluster</p>
</blockquote>
</li>
<li><p>Shard</p>
<blockquote>
<p>Similar to a block in HDFS: a large data set is split into multiple shards, and different shards can be stored on different nodes of the cluster</p>
</blockquote>
</li>
<li><p>Replica</p>
<blockquote>
<p>A replica is an exact copy of a shard; each shard can have zero or more replicas. Replicas are used for fault tolerance and concurrent queries</p>
</blockquote>
</li>
<li><p>Document oriented</p>
<p>ElasticSearch stores data as documents: the basic unit of storage is a document, one record is one document, and every document is JSON, for example:</p>
<figure class="highlight json"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div></pre></td><td class="code"><pre><div class="line">{</div><div class="line"> <span class="attr">"name"</span> : <span class="string">"John"</span>,</div><div class="line"> <span class="attr">"sex"</span> : <span class="string">"Male"</span>,</div><div class="line"> <span class="attr">"age"</span> : <span class="number">25</span>,</div><div class="line"> <span class="attr">"birthDate"</span>: <span class="string">"1990/05/01"</span>,</div><div class="line"> <span class="attr">"about"</span> : <span class="string">"I love to go rock climbing"</span>,</div><div class="line"> <span class="attr">"interests"</span>: [ <span class="string">"sports"</span>, <span class="string">"music"</span> ]</div><div class="line">}</div></pre></td></tr></table></figure>
</li>
<li><p>index, type, id</p>
<p>ES uniquely identifies a document by index/type/id (see the indexing sketch after this list). Compared with MySQL concepts:</p>
<table><thead><tr><th>Mysql</th><th>ES</th></tr></thead><tbody><tr><td>database</td><td>index</td></tr><tr><td>table</td><td>type</td></tr><tr><td>row</td><td>document</td></tr><tr><td>column</td><td>field</td></tr><tr><td>schema</td><td>mapping</td></tr><tr><td>index</td><td>everything is indexed</td></tr><tr><td>sql</td><td>Query DSL</td></tr></tbody></table>
</li>
<li><p>Query DSL</p>
<p>MySQL is queried with SQL statements, while ES uses the Query DSL throughout; for details see <a href="http://www.voidcn.com/blog/bigbigtreewhu/article/p-6323823.html" target="_blank" rel="external">Elasticsearch——Query DSL</a>. The equivalent of <code>select * from bank;</code>:</p>
<figure class="highlight"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">curl -XPOST 'localhost:9200/bank/_search?pretty' -d '</div><div class="line">{</div><div class="line"> "query": { "match_all": {} }</div><div class="line">}</div></pre></td></tr></table></figure>
</li>
</ul>
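<p>As a small, hedged illustration of the document model and addressing (Java client this time; "megacorp" and "employee" are made-up index and type names), the sketch below indexes the JSON document above under an explicit index/type/id and reads it back, roughly the ES counterpart of an INSERT plus a primary-key SELECT in MySQL.</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div></pre></td><td class="code"><pre><div class="line">// Sketch only: "megacorp"/"employee" are made-up index and type names; ES 2.x Java client assumed.</div><div class="line">import org.elasticsearch.action.get.GetResponse;</div><div class="line">import org.elasticsearch.action.index.IndexResponse;</div><div class="line">import org.elasticsearch.client.Client;</div><div class="line"></div><div class="line">public class DocumentCrud {</div><div class="line">    static void indexAndGet(Client client) {</div><div class="line">        // index/type/id addresses one document, like database/table/primary key in MySQL</div><div class="line">        IndexResponse indexed = client.prepareIndex("megacorp", "employee", "1")</div><div class="line">                .setSource("{\"name\":\"John\",\"age\":25,\"interests\":[\"sports\",\"music\"]}")</div><div class="line">                .get();</div><div class="line">        System.out.println("created: " + indexed.isCreated());</div><div class="line"></div><div class="line">        GetResponse got = client.prepareGet("megacorp", "employee", "1").get();</div><div class="line">        System.out.println(got.getSourceAsString());          // the stored JSON document</div><div class="line">    }</div><div class="line">}</div></pre></td></tr></table></figure>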
<h4 id="集群搭建"><a href="#集群搭建" class="headerlink" title="集群搭建"></a>集群搭建</h4><ul>
<li><p>版本选择与下载、解压,<a href="[http://www.elasticsearch.org/download/](http://www.elasticsearch.org/download/">下载地址</a> )</p>
<ul>
<li><strong>Note: the evaluation uses version 2.4.1, not the latest 5.*</strong> (after 2.4 the version number jumps straight to 5.*), because:<ul>
<li>5.* requires Java 8, and so do the official 5.* plugins</li>
<li>Many commonly used third-party plugins (bigdesk, head, sql, jdbc-sql) do not support 5.* yet</li>
<li>Measured performance of 2.4 and 5.* differs little; see <a href="http://www.ctolib.com/topics/79270.html" target="_blank" rel="external">ElasticSearch 5.0 测评以及使用体验</a></li>
</ul>
</li>
</ul>
</li>
<li><p>Configure every node by setting cluster.name in $ES_HOME/config/elasticsearch.yml to the same value (a client-side connection sketch using this cluster.name follows this section)</p>
</li>
<li><p>$ES_HOME/bin/elasticsearch</p>
<blockquote>
<p>Run this on every node to start ES; the nodes discover each other via broadcast and form the cluster automatically.</p>
</blockquote>
<p>Default JVM options:</p>
<figure class="highlight shell"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">-Xms256m -Xmx1g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Dfile.encoding=UTF-8 -Djna.nosys=true</div></pre></td></tr></table></figure>
</li>
<li><p>Cluster HTTP endpoint: host:9200</p>
<blockquote>
<p><a href="http://200.200.200.64:9200/" target="_blank" rel="external">http://200.200.200.64:9200/</a></p>
</blockquote>
</li>
<li><p>Plugin installation</p>
<ul>
<li><p>Note: the ES distribution itself only covers the basics: storing data locally in a distributed fashion, building indexes, serving queries, and so on. Extra functionality such as cluster status monitoring, SQL-to-DSL-Query translation, performance monitoring and importing MySQL data into ES comes from third-party plugins. See <a href="http://chenghuiz.iteye.com/blog/2310377" target="_blank" rel="external">elasticsearch以及其常用插件安装</a></p>
</li>
<li><p>Plugins are reached at host:9200/_plugin/plugin_name, for example:</p>
<blockquote>
<p><a href="http://200.200.200.64:9200/_plugin/head/" target="_blank" rel="external">http://200.200.200.64:9200/_plugin/head/</a></p>
</blockquote>
</li>
</ul>
</li>
</ul>
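<p>Client code joins the cluster through the transport layer rather than the HTTP port. Below is a hedged sketch (ES 2.x TransportClient API assumed; the host and cluster name are placeholders to replace with your own). The cluster.name in the settings must match what was configured in elasticsearch.yml above.</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div></pre></td><td class="code"><pre><div class="line">// Sketch: host and cluster.name are placeholders; ES 2.x TransportClient API assumed.</div><div class="line">import java.net.InetAddress;</div><div class="line">import org.elasticsearch.client.transport.TransportClient;</div><div class="line">import org.elasticsearch.common.settings.Settings;</div><div class="line">import org.elasticsearch.common.transport.InetSocketTransportAddress;</div><div class="line"></div><div class="line">public class EsClientFactory {</div><div class="line">    static TransportClient newClient() throws Exception {</div><div class="line">        Settings settings = Settings.settingsBuilder()</div><div class="line">                .put("cluster.name", "my-es-cluster")          // must match elasticsearch.yml on the nodes</div><div class="line">                .build();</div><div class="line">        return TransportClient.builder()</div><div class="line">                .settings(settings)</div><div class="line">                .build()</div><div class="line">                .addTransportAddress(new InetSocketTransportAddress(</div><div class="line">                        InetAddress.getByName("200.200.200.64"), 9300)); // transport port, not the 9200 HTTP port</div><div class="line">    }</div><div class="line">}</div></pre></td></tr></table></figure>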
<h4 id="性能测试"><a href="#性能测试" class="headerlink" title="性能测试"></a>性能测试</h4><ul>
<li><p>将Mysql单表中的数据导入ES</p>
<ul>
<li><p>MySQL table size: 100 million rows, 16 GB; after the import it occupies 32 GB in ES</p>
<ul>
<li><strong>Note: the first import lost data: after importing 100 million rows from MySQL, count(*) in ES returned only 76 million. The cause is analyzed below</strong></li>
</ul>
</li>
<li><p>Import method:</p>
<ul>
<li><p>Install the third-party tool elasticsearch-jdbc and follow the instructions in its repository: <a href="https://github.com/jprante/elasticsearch-jdbc" target="_blank" rel="external">elasticsearch-jdbc</a>. Import script:</p>
<figure class="highlight shell"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div></pre></td><td class="code"><pre><div class="line"><span class="meta">#</span>!/bin/sh</div><div class="line">export JAVA_HOME=/home/knightyang/jdk1.8.0_131</div><div class="line">export PATH=${JAVA_HOME}/bin/:$PATH;</div><div class="line">JDBC_IMPORTER_HOME=/home/knightyang/elasticsearch-jdbc-2.3.4.0</div><div class="line">bin=$JDBC_IMPORTER_HOME/bin</div><div class="line">lib=$JDBC_IMPORTER_HOME/lib</div><div class="line">echo '{</div><div class="line">"type" : "jdbc",</div><div class="line">"jdbc": {</div><div class="line">"elasticsearch.autodiscover":true,</div><div class="line">"url":"jdbc:mysql://host:3306/test", </div><div class="line">"user":"test", </div><div class="line">"password":"test", </div><div class="line">"sql":"select *,id as _id from user_info_yqj ",</div><div class="line">"elasticsearch" : {</div><div class="line"> "host" : "host",</div><div class="line"> "port" : 9300</div><div class="line">},</div><div class="line">"index" : "yqj_info_index_more", </div><div class="line">"type" : "yqj_info_type_more" </div><div class="line">}</div><div class="line">}'| java \</div><div class="line"> -cp "${lib}/*" \</div><div class="line"> -Dlog4j.configurationFile=${bin}/log4j2.xml \</div><div class="line"> org.xbib.tools.Runner \</div><div class="line"> org.xbib.tools.JDBCImporter</div></pre></td></tr></table></figure>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Install the bigdesk plugin to monitor cluster resources (memory, CPU, GC, thread pools, etc.)</p>
<blockquote>
<p>Used to watch resource consumption while queries are running</p>
</blockquote>
</li>
<li><p>Install the sql plugin, which conveniently translates SQL into a DSL Query</p>
<blockquote>
<p>SQL: select count(*) from yqj_info_index_more</p>
</blockquote>
</li>
<li><p>Install the head plugin, which makes it easy to submit DSL Queries</p>
</li>
<li><p>count(*) over 100 million rows, elapsed time (query caches cleared, several runs averaged; a Java client sketch of this count follows this list)</p>
<table><thead><tr><th>Mysql</th><th>ES</th></tr></thead><tbody><tr><td>51 s</td><td>900 ms</td></tr></tbody></table>
<blockquote>
<p>Note: MySQL execution time fluctuated a lot, from 31 s to 80 s, presumably because of machine load. ES execution time was essentially stable.</p>
</blockquote>
</li>
</ul>
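<p>On the ES side the comparison boils down to a search that asks for zero hits and reads the total. A hedged sketch with the Java client (index name taken from the import script above; this uses a plain match_all rather than the sql plugin):</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div></pre></td><td class="code"><pre><div class="line">// Sketch of count(*) against the imported index via the Java client (match_all + size 0).</div><div class="line">import org.elasticsearch.action.search.SearchResponse;</div><div class="line">import org.elasticsearch.client.Client;</div><div class="line">import org.elasticsearch.index.query.QueryBuilders;</div><div class="line"></div><div class="line">public class CountAll {</div><div class="line">    static long countAll(Client client) {</div><div class="line">        SearchResponse resp = client.prepareSearch("yqj_info_index_more")</div><div class="line">                .setQuery(QueryBuilders.matchAllQuery())</div><div class="line">                .setSize(0)                                    // only the total is needed, not the hits</div><div class="line">                .get();</div><div class="line">        return resp.getHits().getTotalHits();                  // the ES equivalent of select count(*)</div><div class="line">    }</div><div class="line">}</div></pre></td></tr></table></figure>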
<h4 id="Mysql导入ES丢数据原因分析"><a href="#Mysql导入ES丢数据原因分析" class="headerlink" title="Mysql导入ES丢数据原因分析"></a>Mysql导入ES丢数据原因分析</h4><ul>
<li><p>导入过程:同性能测试中的导入方法</p>
</li>
<li><p>Root-cause analysis:</p>
<ul>
<li><p>Are there duplicate _id values in ES?</p>
<ul>
<li>No. The single primary key of the MySQL table is used as _id, so <strong>duplicates cannot occur</strong></li>
</ul>
</li>
<li><p>Do some MySQL rows contain unusual values (e.g. overly long strings or oddly encoded characters)?</p>
<ul>
<li>No. By binary search we located concrete rows that were missing from ES; the data itself was in no way special</li>
</ul>
</li>
<li><p>Read the source of the third-party import tool: elasticsearch-jdbc</p>
<ul>
<li><p>The importer log contains error messages</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">[<span class="number">15</span>:<span class="number">28</span>:<span class="number">15</span>,<span class="number">983</span>][ERROR][org.xbib.elasticsearch.helper.client.BulkTransportClient][elasticsearch[importer][listener][T<span class="comment">#4]] bulk [45] failed with 8020 failed items, failure message = failure in bulk execution:</span></div><div class="line">[<span class="number">0</span>]: index [yqj_info_index_more], type [yqj_info_type_more], id [<span class="number">11598123</span>], message [RemoteTransportException[[Leader][ip:<span class="number">9300</span>][indices:data/write/bulk[s][p]]]; nested: EsRejectedExecutionException[rejected execu</div><div class="line">tion of org.elasticsearch.transport.TransportService$<span class="number">4</span>@<span class="number">7691</span>eb0a on EsThreadPoolExecutor[bulk, queue capacity = <span class="number">50</span>, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@c2f9664[Running, pool size = <span class="number">8</span>, active threads</div><div class="line">= <span class="number">8</span>, queued tasks = <span class="number">50</span>, completed tasks = <span class="number">21008</span>]]];]</div></pre></td></tr></table></figure>
</li>
<li><p>Cause: the importer pushes data to ES faster than ES can process it, so bulk items get rejected.</p>
</li>
<li><p>Fixes (see the BulkProcessor sketch after this list):</p>
<ol>
<li>Lower the importer's send rate, or</li>
<li>Increase the size of the ES bulk thread pool and its request queue</li>
</ol>
</li>
</ul>
</li>
</ul>
</li>
</ul>
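<p>Both fixes can be approximated on the client side with the stock BulkProcessor helper: bounding the number of concurrent bulks throttles the send rate, and (from ES 2.2 on) a backoff policy retries items that were rejected because the bulk queue was full. The following is a hedged sketch of that idea, not a patch to elasticsearch-jdbc:</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div></pre></td><td class="code"><pre><div class="line">// Sketch: client-side throttling and retry with BulkProcessor (ES 2.2+ Java client assumed).</div><div class="line">import org.elasticsearch.action.bulk.BackoffPolicy;</div><div class="line">import org.elasticsearch.action.bulk.BulkProcessor;</div><div class="line">import org.elasticsearch.action.bulk.BulkRequest;</div><div class="line">import org.elasticsearch.action.bulk.BulkResponse;</div><div class="line">import org.elasticsearch.client.Client;</div><div class="line">import org.elasticsearch.common.unit.TimeValue;</div><div class="line"></div><div class="line">public class ThrottledImporter {</div><div class="line">    static BulkProcessor newProcessor(Client client) {</div><div class="line">        return BulkProcessor.builder(client, new BulkProcessor.Listener() {</div><div class="line">                    @Override public void beforeBulk(long id, BulkRequest request) { }</div><div class="line">                    @Override public void afterBulk(long id, BulkRequest request, BulkResponse response) {</div><div class="line">                        if (response.hasFailures()) {          // log failed items instead of dropping them silently</div><div class="line">                            System.err.println(response.buildFailureMessage());</div><div class="line">                        }</div><div class="line">                    }</div><div class="line">                    @Override public void afterBulk(long id, BulkRequest request, Throwable failure) {</div><div class="line">                        failure.printStackTrace();</div><div class="line">                    }</div><div class="line">                })</div><div class="line">                .setBulkActions(1000)                          // smaller bulks than the importer default of 10000</div><div class="line">                .setConcurrentRequests(1)                      // at most one bulk in flight = a lower send rate</div><div class="line">                .setBackoffPolicy(BackoffPolicy.exponentialBackoff(</div><div class="line">                        TimeValue.timeValueMillis(100), 8))    // retry items rejected by a full bulk queue</div><div class="line">                .build();</div><div class="line">    }</div><div class="line">}</div></pre></td></tr></table></figure>
<p>The server-side alternative is raising threadpool.bulk.queue_size in elasticsearch.yml (the 2.x setting name), at the cost of more memory pinned by queued requests.</p>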
<h4 id="elasticsearch-jdbc-原理分析"><a href="#elasticsearch-jdbc-原理分析" class="headerlink" title="elasticsearch-jdbc 原理分析"></a>elasticsearch-jdbc 原理分析</h4><ul>
<li><p>git地址:<a href="https://github.com/jprante/elasticsearch-jdbc" target="_blank" rel="external">elasticsearch-jdbc</a></p>
</li>
<li><p>Processing sequence diagram (source shown below):</p>
</li>
<li><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div></pre></td><td class="code"><pre><div class="line">title: elasticseach-jdbc时序图</div><div class="line">Runner.main()-> JDBCImport.execute(): 1. 开始向ES导入数据</div><div class="line">Note right of JDBCImport.execute(): 启动一个ExecutorServer\n默认线程数为1</div><div class="line">JDBCImport.execute()-> Context.execute(): 2. JDBCImport中构造上下文context\n包含Source(mysql)、Sink(ES)</div><div class="line">Context.execute() -> Context.beforeFetch(): 3. 创建Source、Sink,设置相关参数</div><div class="line">Context.beforeFetch() -> Sink.beforeFetch(): 4. 创建BulkTransportClient,处理fetch前的准备工作\n如:创建ES索引</div><div class="line">Context.beforeFetch() -> Source.beforeFetch(): 5. 这里直接返回</div><div class="line">Context.execute() --> Source.fetch(): 6. 开始fetch数据了</div><div class="line">Source.fetch() -> Source.executeQuery(): 7. 通过JDBC执行SQL</div><div class="line">Source.executeQuery() -> Source.fetch(): 8. 返回SQL结果的迭代器\n</div><div class="line">Source.fetch() -> Source.processRow(): 9. 遍历处理每一行数据\n给数据加上ES的元数据\n如index、type、_id、version等</div><div class="line">Source.processRow() -> Sink.index(): 10. 将数据丢给Sink,构造成IndexRequest,缓存\n当缓存的IndexRequest数量达到一定阈值(10000)\n或经过一定时间后,将这批Request构造成BulkRequest</div><div class="line">Sink.index() -> BulkRequestHandler.execute(): 11. 将BulkRequest经底层Netty发送给ES Server\n若ES处理Failed,log the message</div></pre></td></tr></table></figure>
</li>
<li><p>Key steps</p>
<ul>
<li><p>JDBCImport builds a context that contains a Source (MySQL) and a Sink (ES)</p>
</li>
<li><p>Sink initialization creates a BulkTransportClient and does the pre-fetch preparation</p>
<ul>
<li><p>e.g. creating the ES index, setting the buffering threshold for IndexRequests and the maximum number of BulkRequests processed concurrently</p>
<blockquote>
<p>Note: the ES thread pool that handles BulkRequests has at most 8 threads by default, with a request queue capped at 50.</p>
<p>If the queue is full, any further request is rejected with an exception.</p>
</blockquote>
</li>
</ul>
</li>
<li><p>The Source runs the SQL over JDBC, gets an iterator over the result set and processes it row by row</p>
<ul>
<li>If a result column has the same name as an ES control key, its value overrides the corresponding ES metadata.</li>
<li>For example, if the result contains a column named _id, that value becomes the id of the stored document</li>
</ul>
</li>
<li><p>The Sink wraps each row in an IndexRequest, buffers it and flushes in batches (a stripped-down sketch follows this list)</p>
<ul>
<li>When the number of buffered IndexRequests reaches the threshold (10000), they are all wrapped into one BulkRequest and sent to the ES server over Netty</li>
<li>When the timer thread fires (every 30 seconds by default), whatever IndexRequests are currently buffered are likewise wrapped into a BulkRequest and sent to the ES server</li>
</ul>
</li>
</ul>
</li>
</ul>
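<p>A stripped-down, hypothetical illustration of that buffer-and-flush behaviour (this is not the actual elasticsearch-jdbc code): rows accumulate until either the count threshold or the flush interval is hit, and only then does a BulkRequest go out.</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div></pre></td><td class="code"><pre><div class="line">// Hypothetical sketch of the Sink's buffer-and-flush logic described above (not the real importer code).</div><div class="line">import java.util.concurrent.Executors;</div><div class="line">import java.util.concurrent.ScheduledExecutorService;</div><div class="line">import java.util.concurrent.TimeUnit;</div><div class="line">import org.elasticsearch.action.bulk.BulkRequest;</div><div class="line">import org.elasticsearch.action.index.IndexRequest;</div><div class="line">import org.elasticsearch.client.Client;</div><div class="line"></div><div class="line">public class BufferingSink {</div><div class="line">    private final Client client;</div><div class="line">    private final int maxActions = 10000;                      // count threshold described above</div><div class="line">    private BulkRequest bulk = new BulkRequest();</div><div class="line"></div><div class="line">    BufferingSink(Client client) {</div><div class="line">        this.client = client;</div><div class="line">        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();</div><div class="line">        timer.scheduleAtFixedRate(this::flush, 30, 30, TimeUnit.SECONDS); // 30 s flush interval</div><div class="line">    }</div><div class="line"></div><div class="line">    synchronized void index(IndexRequest request) {            // called once per row</div><div class="line">        bulk.add(request);</div><div class="line">        if (bulk.numberOfActions() &gt;= maxActions) {</div><div class="line">            flush();</div><div class="line">        }</div><div class="line">    }</div><div class="line"></div><div class="line">    synchronized void flush() {</div><div class="line">        if (bulk.numberOfActions() == 0) {</div><div class="line">            return;</div><div class="line">        }</div><div class="line">        client.bulk(bulk);                                     // fire-and-forget, like the importer</div><div class="line">        bulk = new BulkRequest();</div><div class="line">    }</div><div class="line">}</div></pre></td></tr></table></figure>
<p>Note that, as in the importer itself, a failed bulk here is at best logged afterwards; nothing retries the failed rows, which is how data silently goes missing.</p>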
<h4 id="ES-处理BulkRequest的流程"><a href="#ES-处理BulkRequest的流程" class="headerlink" title="ES 处理BulkRequest的流程"></a>ES 处理BulkRequest的流程</h4><ul>
<li><p>ES启动时,NettyTransport会创建Netty的ServerBootstrap,默认监听端口9300-9400</p>
</li>
<li><p>Incoming messages are passed through the ChannelPipeline's upstream handlers in order; the key handler class is <code>MessageChannelHandler</code></p>
</li>
<li><p>At startup ES registers many action -> RequestHandler mappings; when a BulkRequest arrives it is handled by the corresponding HandledTransportAction (the action is "indices:data/write/bulk")</p>
</li>
<li><p>After a chain of calls, control reaches TransportBulkAction.doExecute, which groups the items by index and automatically creates any index that does not exist yet</p>
</li>
<li><p>TransportBulkAction.executeBulk() takes all the IndexRequests, groups them by shard and builds one BulkShardRequest per shard</p>
<ul>
<li>Requests belonging to the same shard are kept together; the target shard of an IndexRequest is derived from a hash of its routing key (the document id by default)</li>
</ul>
</li>
<li><p>For each BulkShardRequest, TransportReplicationAction.doExecute() is invoked and the reroute phase begins (i.e. the request is forwarded to the node that holds the shard)</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line"><span class="meta">@Override</span></div><div class="line"><span class="function"><span class="keyword">protected</span> <span class="keyword">void</span> <span class="title">doExecute</span><span class="params">(Task task, Request request, ActionListener<Response> listener)</span> </span>{</div><div class="line"> <span class="keyword">new</span> ReroutePhase((ReplicationTask) task, request, listener).run();</div><div class="line">}</div></pre></td></tr></table></figure>
</li>
<li><p>A thread is taken from the bulk thread pool to handle the request; the subsequent storage and indexing steps have not been analyzed yet (a minimal thread-pool analogy follows this list)</p>
<ul>
<li>Default thread count: 8, i.e. min(number of processors, 32)</li>
<li>The pool is backed by a bounded blocking queue: if no thread is free the request is queued, and if the queue has no room left the request is rejected</li>
</ul>
</li>
</ul>
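<p>The rejections seen in the import log earlier (pool size = 8, queue capacity = 50) are essentially the behaviour of a bounded thread pool. The sketch below mimics it with plain JDK classes; it is an analogy, not ES's own EsThreadPoolExecutor:</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div></pre></td><td class="code"><pre><div class="line">// JDK analogy for the fixed bulk pool: 8 threads, a 50-slot queue, rejection once both are full.</div><div class="line">import java.util.concurrent.ArrayBlockingQueue;</div><div class="line">import java.util.concurrent.RejectedExecutionException;</div><div class="line">import java.util.concurrent.ThreadPoolExecutor;</div><div class="line">import java.util.concurrent.TimeUnit;</div><div class="line"></div><div class="line">public class BulkPoolAnalogy {</div><div class="line">    public static void main(String[] args) {</div><div class="line">        ThreadPoolExecutor pool = new ThreadPoolExecutor(</div><div class="line">                8, 8,                                          // fixed size: min(#processors, 32), here 8</div><div class="line">                0L, TimeUnit.MILLISECONDS,</div><div class="line">                new ArrayBlockingQueue&lt;Runnable&gt;(50),    // bounded queue of 50 pending requests</div><div class="line">                new ThreadPoolExecutor.AbortPolicy());         // full queue =&gt; RejectedExecutionException</div><div class="line"></div><div class="line">        for (int i = 0; i &lt; 100; i++) {</div><div class="line">            try {</div><div class="line">                pool.execute(() -&gt; sleepQuietly(100));      // stand-in for handling one bulk shard request</div><div class="line">            } catch (RejectedExecutionException e) {</div><div class="line">                System.err.println("rejected task " + i);      // ES wraps this as EsRejectedExecutionException</div><div class="line">            }</div><div class="line">        }</div><div class="line">        pool.shutdown();</div><div class="line">    }</div><div class="line"></div><div class="line">    private static void sleepQuietly(long millis) {</div><div class="line">        try { Thread.sleep(millis); } catch (InterruptedException ignored) { }</div><div class="line">    }</div><div class="line">}</div></pre></td></tr></table></figure>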
<h4 id="总结"><a href="#总结" class="headerlink" title="总结"></a>总结</h4><ul>
<li>ElasticSearch自身安装简单,但是好些功能(如集群监控、SQL转 DSL Query等)依赖第三方插件<ul>
<li>但是第三方插件稳定性有待商榷。如:Mysql导入ES的插件可能丢失数据(插入失败后未重试)</li>
</ul>
</li>
<li>ElasticSearch meets the real-time query requirement, its query latency is acceptable, and so is its hardware footprint<ul>
<li>On roughly 100 million rows, count(*) takes about 1 second</li>
<li>It runs smoothly with a 1 GB heap, and JVM memory showed no obvious fluctuation while queries were executing</li>
</ul>
</li>
</ul>
<h4 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>参考资料</h4><ul>
<li><a href="http://i.zhcy.tk/blog/elasticsearchyu-solr/" target="_blank" rel="external">搜索引擎选择: Elasticsearch与Solr的调研文档</a></li>
<li><a href="http://www.ctolib.com/topics/79270.html" target="_blank" rel="external">ElasticSearch 5.0 测评以及使用体验</a></li>
<li><a href="http://blog.pengqiuyuan.com/ji-chu-jie-shao-ji-suo-yin-yuan-li-fen-xi/" target="_blank" rel="external">Elasticsearch-基础介绍及索引原理分析</a></li>
<li><a href="http://blog.csdn.net/laoyang360/article/details/52244917" target="_blank" rel="external">Elasticsearch学习,请先看这一篇!</a></li>
<li><a href="http://blog.csdn.net/laoyang360/article/details/52227541" target="_blank" rel="external">Elasticsearch的使用场景深入详解</a></li>
<li><a href="http://www.voidcn.com/blog/bigbigtreewhu/article/p-6323823.html" target="_blank" rel="external">Elasticsearch——Query DSL</a></li>
<li><a href="https://my.oschina.net/jhao104/blog/644909" target="_blank" rel="external">Elasticsearch笔记(一)—Elasticsearch安装配置</a></li>
<li><a href="http://chenghuiz.iteye.com/blog/2310377" target="_blank" rel="external">elasticsearch以及其常用插件安装</a></li>
</ul>
]]></content>
<summary type="html">
<h3 id="ElasticSearch调研总结"><a href="#ElasticSearch调研总结" class="headerlink" title="ElasticSearch调研总结"></a>ElasticSearch调研总结</h3><h4 id="Elast
</summary>
<category term="ElasticSearch" scheme="http://yoursite.com/categories/ElasticSearch/"/>
<category term="ElasticSearch" scheme="http://yoursite.com/tags/ElasticSearch/"/>
</entry>
<entry>
<title>Spark Streaming处理流程源码走读</title>
<link href="http://yoursite.com/2017/05/03/SparkStreaming%E5%A4%84%E7%90%86%E6%B5%81%E7%A8%8B/"/>
<id>http://yoursite.com/2017/05/03/SparkStreaming处理流程/</id>
<published>2017-05-03T00:59:37.651Z</published>
<updated>2017-05-03T02:17:34.120Z</updated>
<content type="html"><![CDATA[<h4 id="Spark-Streaming处理流程简介"><a href="#Spark-Streaming处理流程简介" class="headerlink" title="Spark Streaming处理流程简介"></a>Spark Streaming处理流程简介</h4><p> Spark Streaming 是基于Spark的流式处理框架,会将流式计算分解成一系列短小的批处理作业。Spark Streaming会不停地接收、存储外部数据(如Kafka、MQTT、Socket等),然后每隔一定时间(称之为batch,通常为秒级别的)启动Spark Job来处理这段时间内接收到的数据。</p>
<p> 简单来说,Spark Streaming处理流程为:</p>
<ul>
<li>Continuously receive and store external data</li>
<li>Periodically launch a Spark job to process the data of one interval (a minimal example follows this list)</li>
</ul>
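<p>As a minimal, hedged illustration of that model (using the Java binding of the public Spark Streaming API; the socket source and the 1-second batch are arbitrary choices for the example): a receiver keeps storing incoming lines, and every batch interval a small Spark job counts what arrived.</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div></pre></td><td class="code"><pre><div class="line">// Minimal Spark Streaming example (Java API): 1-second batches over a socket source.</div><div class="line">import org.apache.spark.SparkConf;</div><div class="line">import org.apache.spark.streaming.Durations;</div><div class="line">import org.apache.spark.streaming.api.java.JavaDStream;</div><div class="line">import org.apache.spark.streaming.api.java.JavaStreamingContext;</div><div class="line"></div><div class="line">public class StreamingSketch {</div><div class="line">    public static void main(String[] args) throws InterruptedException {</div><div class="line">        SparkConf conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]");</div><div class="line">        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1)); // batch interval = 1 s</div><div class="line"></div><div class="line">        JavaDStream&lt;String&gt; lines = jssc.socketTextStream("localhost", 9999); // a receiver keeps storing lines</div><div class="line">        lines.count().print();                                 // each batch, a Spark job processes that interval</div><div class="line"></div><div class="line">        jssc.start();                                          // starts the ReceiverTracker and JobGenerator</div><div class="line">        jssc.awaitTermination();</div><div class="line">    }</div><div class="line">}</div></pre></td></tr></table></figure>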
<h4 id="存储外部数据"><a href="#存储外部数据" class="headerlink" title="存储外部数据"></a>存储外部数据</h4><ul>
<li><p>程序调用流程</p>
<blockquote>
<p>StreamingContext.start() -> JobScheduler.start() -> ReceiverTracker.start() && JobGenerator.start()</p>
</blockquote>
<ul>
<li>ReceiverTracker.start() keeps receiving and storing external data</li>
<li>JobGenerator.start() drives the processing of that data</li>
</ul>
</li>
<li><p>ReceiverTracker</p>
<ul>
<li><p>Lives on the driver and manages all Receivers</p>
<ul>
<li><p>Note: every DStream of type ReceiverInputDStream has a corresponding Receiver (which ingests the external data)</p>
<blockquote>
<p>ReceiverInputDStream.getReceiver() 返回Receiver</p>
</blockquote>
</li>
</ul>
</li>
<li><p>Holds a member of type ReceivedBlockTracker</p>
<ul>
<li><p>Two important ReceivedBlockTracker methods</p>
<ul>
<li><p>def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean</p>
<blockquote>
<p>Records the BlockInfo reported by the workers, keyed as streamId -> mutable.Queue[ReceivedBlockInfo]</p>
</blockquote>
</li>
<li><p>def allocateBlocksToBatch(batchTime: Time): Unit</p>
<blockquote>
<p>Builds Time -> Map[Int, Seq[ReceivedBlockInfo]], i.e. the mapping from each batch to its streams and the blocks belonging to them</p>
</blockquote>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Creates an endpoint that handles local messages of type ReceiverTrackerLocalMessage</p>
<ul>
<li>case StartAllReceivers(receivers)</li>
<li>case RestartReceiver(receiver)</li>
<li>case c: CleanupOldBlocks</li>
<li>case UpdateReceiverRateLimit(streamUID, newRate)</li>
<li>case ReportError(streamId, message, error)</li>
</ul>
</li>
<li><p>start()</p>
<ul>
<li><p>Starts the endpoint and sends a StartAllReceivers message</p>
</li>
<li><p>Note: before the StartAllReceivers message is sent, runDummySparkJob is executed to avoid all receivers being scheduled onto the same executor. (Using 50 partitions makes it very unlikely that all launched tasks end up on the same host; and because this Spark job finishes without calling SparkContext.stop, the per-worker executor information kept by BlockManagerMaster is not cleared and can be reused later.)</p>
<figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div></pre></td><td class="code"><pre><div class="line"><span class="comment">/**</span></div><div class="line"> * Run the dummy Spark job to ensure that all slaves have registered. This avoids all the</div><div class="line"> * receivers to be scheduled on the same node.</div><div class="line"> *</div><div class="line"> * TODO Should poll the executor number and wait for executors according to</div><div class="line"> * "spark.scheduler.minRegisteredResourcesRatio" and</div><div class="line"> * "spark.scheduler.maxRegisteredResourcesWaitingTime" rather than running a dummy job.</div><div class="line"> */</div><div class="line"><span class="keyword">private</span> <span class="function"><span class="keyword">def</span> <span class="title">runDummySparkJob</span></span>(): <span class="type">Unit</span> = {</div><div class="line"> <span class="keyword">if</span> (!ssc.sparkContext.isLocal) {</div><div class="line"> ssc.sparkContext.makeRDD(<span class="number">1</span> to <span class="number">50</span>, <span class="number">50</span>).map(x => (x, <span class="number">1</span>)).reduceByKey(_ + _, <span class="number">20</span>).collect()</div><div class="line"> }</div><div class="line"> assert(getExecutors.nonEmpty)</div><div class="line">}</div></pre></td></tr></table></figure>
</li>
<li><p>startReceiver(): once the ReceiverTracker receives its own StartAllReceivers message, it runs startReceiver() for every receiver</p>
<ul>
<li><p>Build the RDD</p>
<figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">val</span> receiverRDD: <span class="type">RDD</span>[<span class="type">Receiver</span>[_]] = ssc.sc.makeRDD(<span class="type">Seq</span>(receiver), <span class="number">1</span>)</div></pre></td></tr></table></figure>
</li>
<li><p>Specify the function the RDD will execute</p>
<figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">val</span> startReceiverFunc: <span class="type">Iterator</span>[<span class="type">Receiver</span>[_]] => <span class="type">Unit</span> =</div><div class="line">(iterator: <span class="type">Iterator</span>[<span class="type">Receiver</span>[_]]) => {</div><div class="line"> <span class="keyword">if</span> (!iterator.hasNext) {</div><div class="line"> <span class="keyword">throw</span> <span class="keyword">new</span> <span class="type">SparkException</span>(</div><div class="line"> <span class="string">"Could not start receiver as object not found."</span>)</div><div class="line"> }</div><div class="line"> <span class="keyword">if</span> (<span class="type">TaskContext</span>.get().attemptNumber() == <span class="number">0</span>) {</div><div class="line"> <span class="keyword">val</span> receiver = iterator.next()</div><div class="line"> assert(iterator.hasNext == <span class="literal">false</span>)</div><div class="line"> <span class="keyword">val</span> supervisor = <span class="keyword">new</span> <span class="type">ReceiverSupervisorImpl</span>(</div><div class="line"> receiver, <span class="type">SparkEnv</span>.get, serializableHadoopConf.value, checkpointDirOption)</div><div class="line"> supervisor.start()</div><div class="line"> supervisor.awaitTermination()</div><div class="line"> } <span class="keyword">else</span> {</div><div class="line"> <span class="comment">// It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.</span></div><div class="line"> }</div><div class="line">}</div></pre></td></tr></table></figure>
<blockquote>
<p>Note: the RDD actually runs ReceiverSupervisorImpl.start(); when the task fails and is retried, the receiver is rescheduled</p>
</blockquote>
</li>
<li><p>Submit the Spark job</p>
<figure class="highlight scala"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line"> <span class="keyword">val</span> future = ssc.sparkContext.submitJob[<span class="type">Receiver</span>[_], <span class="type">Unit</span>, <span class="type">Unit</span>](</div><div class="line">receiverRDD, startReceiverFunc, <span class="type">Seq</span>(<span class="number">0</span>), (_, _) => <span class="type">Unit</span>, ())</div></pre></td></tr></table></figure>