Magpie
------
Magpie contains a number of scripts for running Big Data software in
HPC environments. Thus far, Hadoop, Spark, Hbase, Hive, Storm, Pig,
Phoenix, Kafka, Zeppelin, and Zookeeper are supported. It
currently supports running over the parallel file system Lustre and
running over any generic network filesystem. There is
scheduler/resource manager support for Slurm, Moab, Torque, and LSF.
Some of the features presently supported:
- Run jobs interactively or via scripts.
- Run MapReduce 1.0 or 2.0 jobs via Hadoop 1.0 or 2.0
- Run against a number of filesystem options, such as HDFS, HDFS over
Lustre, HDFS over a generic network filesystem, Lustre directly, or
a generic network filesystem.
- Take advantage of SSDs/NVRAM for local caching if available
- Make decent optimizations for your hardware
Experimental support for several distributed machine learning
frameworks has also been added. Presently TensorFlow and TensorFlow
w/ Horovod are supported.
Basic Idea
----------
The basic idea behind these scripts is to:
1) Submit a Magpie batch script to allocate nodes on a cluster using
your HPC scheduler/resource manager. Slurm, Slurm+mpirun,
Moab+Slurm, Moab+Torque and LSF+mpirun are currently supported.
(An example submission command is sketched after this list.)
2) The batch script will create configuration files for all
appropriate projects (Hadoop, Spark, etc.). The configuration files
will be setup so the rank 0 node is the "master". All compute
nodes will have configuration files created that point to the node
designated as the master server.
The configuration files will be populated with values for your
filesystem choice and the hardware that exists in your cluster.
Reasonable attempts are made to determine optimal values for your
system and hardware (they are almost certainly better than the
default values). A number of options exist in the batch scripts to
adjust these values for individual jobs.
3) Launch daemons on all nodes. The rank 0 node will run master
daemons, such as the Hadoop Namenode. All remaining nodes will run
appropriate worker daemons, such as the Hadoop Datanodes.
4) Now you have a mini big data cluster to do whatever you want with.
You can log into the master node and interact with it directly, or
you can have Magpie run a script to execute your big data
calculation instead.
5) When your job completes or your allocation time has run out, Magpie
will clean up your job by tearing down daemons. When appropriate,
Magpie may also do some additional cleanup work to hopefully make
re-execution on later runs cleaner and faster.
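For example, with Slurm a job submission could look like the
following. This is only a sketch; the exact submission script name
and location depend on which submission scripts you use or generate
(see the submission-scripts/ directory).

  # Edit the generated submission script (paths, node counts, job
  # settings), then submit it to the scheduler, e.g. with Slurm:
  sbatch magpie.sbatch-srun-hadoop

  # Monitor the job as usual
  squeue -u $USER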
Requirements
------------
1) Magpie and all big data projects (Hadoop, Spark, etc.) should be
installed on all cluster nodes. It can be in a known location or
perhaps via a network file system location. Many users may simply
install them into their NFS home directories. These paths will be
later specified in job submission scripts.
Note that not all distributions of big data projects (Hadoop,
Spark, etc.) are supported. Generally speaking, only versions
from Apache have been tested. Your mileage may vary with other
distributions.
Some projects may need patches applied. You can find patches in
Magpie's 'patches' directory. Most patches are only needed against
scripts within the projects, but on occasion a recompilation of
the source may also be necessary.
If you are unfamiliar with patches, see documentation for the
`patch` command. In most cases you can patch your project
via:
cd PROJECT-VERSION
patch -p1 < PATH-TO-MAGPIE/patches/PROJECT/PROJECT-VERSION.patch
For example, to apply the alternate-ssh patch to Hadoop:
cd hadoop-2.9.2
patch -p1 < ../magpie/patches/hadoop/hadoop-2.9.2-alternate-ssh.patch
2) A passwordless remote shell execution mechanism must be available
for scripts to launch big data daemons (e.g. Hadoop Datanodes) on
all appropriate nodes. The most popular (and the default) mechanism
is passwordless ssh, though other mechanisms are also suitable. A
quick way to verify passwordless ssh is sketched after this list.
3) A temporary local scratch space is needed on each node for Magpie
to store configuration files, log files, and other miscellaneous
files. A very small amount of scratch space is needed.
This local scratch space need not be a local disk. It could
hypothetically be memory-based tmpfs.
Beginning with Magpie 1.60, network file paths can be used for
"local scratch" space, but this requires some extra work. See
README.no-local-dir for details.
4) Magpie and the projects it supports generally assume that all
software and the OS environment consistently use short hostnames or
fully qualified domain names. For example, if the
"hostname" command returns a short hostname (e.g. 'foo' and not
'foo.host.com'), then the scheduler/resource manager should output
shortened hostnames in its output environment variables
(e.g. SLURM_JOB_NODELIST w/ Slurm, MOAB_NODELIST w/ Moab, etc.)
There are mechanisms in place to work around this if your
environment does not match in this way. See README.hostname for
details.
5) A small set of software dependencies is required, depending on your
environment.
The Moab+Torque submission scripts use Pdsh
(https://github.com/chaos/pdsh) to launch/run scripts across
cluster nodes.
The LSF submission scripts use mpirun to launch/run scripts across
cluster nodes.
The 'hostlist' command from lua-hostlist
(https://github.com/grondo/lua-hostlist) is preferred for a variety
of hostrange parsing needs in Magpie. If it is not available,
Magpie will use its internal tool 'magpie-expand-nodes', which
should be sufficient for most hostrange parsing, but may not
function for a number of nuanced corner cases.
Several checks for Zookeeper functionality assume netcat (the 'nc'
command) is available. If it is not available, those checks cannot
be performed.
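As a quick check for requirement 2 above (a sketch, assuming ssh is
your remote shell mechanism and 'node5' is a hypothetical compute
node in your allocation), the following should print the remote
hostname without prompting for a password:

  # BatchMode makes ssh fail instead of prompting for a password
  ssh -o BatchMode=yes node5 hostname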
Local Configuration
-------------------
All HPC sites will have local differences and nuances to running jobs.
The job submission scripts in submission-scripts/ have a number of
defaults, such as the default location for network file systems, local
scratch space, etc.
You can adjust these defaults by editing
submission-scripts/script-templates/Makefile and running 'make'
afterwards.
In addition, if your site has special local requirements, such as
setting unique paths or loading specific modules before executing a
job, these can also be configured via the LOCAL_REQUIREMENTS
setting in the same Makefile.
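For example, regenerating the submission scripts after adjusting
defaults could look like the following. This is only a sketch; the
exact defaults available in the Makefile depend on your version of
Magpie.

  cd submission-scripts/script-templates
  # Edit the defaults in the Makefile (e.g. network filesystem and
  # local scratch paths, LOCAL_REQUIREMENTS), then regenerate:
  make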
Supported Packages & Versions
-----------------------------
The following packages and their versions have been tested for minimal
support in this version of Magpie.
Versions not listed below should work with Magpie if the
configuration/setup of those versions is compatible with the versions
listed below. However, certain features or options may not work with
those versions.
* + - Requires patch against binary distro's scripts, no re-compilation needed
* ^ - Requires patch against source, requires re-compilation
* ! - Some issues may exist, see project readmes (i.e. README.hadoop) for details
Hadoop - 2.2.0+, 2.3.0+, 2.4.0+, 2.4.1+, 2.5.0+, 2.5.1+, 2.5.2+,
2.6.0+, 2.6.1+, 2.6.2+, 2.6.3+, 2.6.4+, 2.6.5+, 2.7.0+,
2.7.1+, 2.7.2+, 2.7.3+, 2.7.4+, 2.7.5+, 2.7.6+, 2.7.7+,
2.8.0+, 2.8.1+, 2.8.2+, 2.8.3+, 2.8.4+, 2.8.5+, 2.9.0+,
2.9.1+, 2.9.2+, 3.0.0+, 3.0.1+, 3.0.2+, 3.0.3+, 3.1.0+,
3.1.1+, 3.1.2+, 3.1.3+, 3.1.4+, 3.2.0+, 3.2.1+, 3.2.2+,
3.2.3+, 3.2.4+, 3.3.0+, 3.3.1+, 3.3.2+, 3.3.3+, 3.3.4+,
3.3.5+, 3.3.6+
Spark - 1.1.0-bin-hadoop2.3+, 1.1.0-bin-hadoop2.4+,
1.1.1-bin-hadoop2.3+, 1.1.1-bin-hadoop2.4+,
1.2.0-bin-hadoop2.3+, 1.2.0-bin-hadoop2.4+,
1.2.1-bin-hadoop2.3+, 1.2.1-bin-hadoop2.4+,
1.2.2-bin-hadoop2.3+, 1.2.2-bin-hadoop2.4+,
1.3.0-bin-hadoop2.3+, 1.3.0-bin-hadoop2.4+,
1.3.1-bin-hadoop2.3+, 1.3.1-bin-hadoop2.4+,
1.3.1-bin-hadoop2.6+, 1.4.0-bin-hadoop2.3+,
1.4.0-bin-hadoop2.4+, 1.4.0-bin-hadoop2.6+,
1.4.1-bin-hadoop2.3+, 1.4.1-bin-hadoop2.4+,
1.4.1-bin-hadoop2.6+, 1.5.0-bin-hadoop2.6+,
1.5.1-bin-hadoop2.6+, 1.5.2-bin-hadoop2.6+,
1.6.0-bin-hadoop2.6+, 1.6.1-bin-hadoop2.6+,
1.6.2-bin-hadoop2.6+, 1.6.3-bin-hadoop2.6+,
2.0.0-bin-hadoop2.6+, 2.0.0-bin-hadoop2.7+,
2.0.1-bin-hadoop2.6+, 2.0.1-bin-hadoop2.7+,
2.0.2-bin-hadoop2.6+, 2.0.2-bin-hadoop2.7+,
2.1.0-bin-hadoop2.6+, 2.1.0-bin-hadoop2.7+,
2.1.1-bin-hadoop2.6+, 2.1.1-bin-hadoop2.7+,
2.1.2-bin-hadoop2.6+, 2.1.2-bin-hadoop2.7+,
2.2.0-bin-hadoop2.6+!, 2.2.0-bin-hadoop2.7+!,
2.2.1-bin-hadoop2.6+!, 2.2.1-bin-hadoop2.7+!,
2.3.0-bin-hadoop2.6+!, 2.3.0-bin-hadoop2.7+!,
2.3.1-bin-hadoop2.6+!, 2.3.1-bin-hadoop2.7+!,
2.3.2-bin-hadoop2.6+!, 2.3.2-bin-hadoop2.7+!,
2.3.3-bin-hadoop2.6+!, 2.3.3-bin-hadoop2.7+!,
2.3.4-bin-hadoop2.6+!, 2.3.4-bin-hadoop2.7+!,
2.4.0-bin-hadoop2.6+!, 2.4.0-bin-hadoop2.7+!,
2.4.1-bin-hadoop2.6+!, 2.4.1-bin-hadoop2.7+!,
2.4.2-bin-hadoop2.6+!, 2.4.2-bin-hadoop2.7+!,
2.4.3-bin-hadoop2.6+!, 2.4.3-bin-hadoop2.7+!,
2.4.4-bin-hadoop2.6+!, 2.4.4-bin-hadoop2.7+!,
2.4.5-bin-hadoop2.6+!, 2.4.5-bin-hadoop2.7+!,
2.4.6-bin-hadoop2.6+!, 2.4.6-bin-hadoop2.7+!,
2.4.7-bin-hadoop2.6+!, 2.4.7-bin-hadoop2.7+!,
2.4.8-bin-hadoop2.6+!, 2.4.8-bin-hadoop2.7+!,
3.0.0-bin-hadoop2.7+!, 3.0.0-bin-hadoop3.2+!,
3.0.1-bin-hadoop2.7+!, 3.0.1-bin-hadoop3.2+!,
3.0.2-bin-hadoop2.7+!, 3.0.2-bin-hadoop3.2+!,
3.0.3-bin-hadoop2.7+!, 3.0.3-bin-hadoop3.2+!,
3.1.1-bin-hadoop2.7+!, 3.1.1-bin-hadoop3.2+!,
3.1.2-bin-hadoop2.7+!, 3.1.2-bin-hadoop3.2+!,
3.1.3-bin-hadoop2.7+!, 3.1.3-bin-hadoop3.2+!,
3.2.0-bin-hadoop2.7+!, 3.2.0-bin-hadoop3.2+!,
3.2.1-bin-hadoop2.7+!, 3.2.1-bin-hadoop3.2+!,
3.2.2-bin-hadoop2.7+!, 3.2.2-bin-hadoop3.2+!,
3.2.3-bin-hadoop2.7+!, 3.2.3-bin-hadoop3.2+!,
3.2.4-bin-hadoop2.7+!, 3.2.4-bin-hadoop3.2+!,
3.3.0-bin-hadoop2.7+!, 3.3.0-bin-hadoop3.2+!,
3.3.1-bin-hadoop2.7+!, 3.3.1-bin-hadoop3.2+!,
3.3.2-bin-hadoop2.7+!, 3.3.2-bin-hadoop3.2+!,
3.3.3-bin-hadoop3+!
TensorFlow - 1.9, 1.12
Hbase - 1.0.0+, 1.0.1+, 1.0.1.1+, 1.0.2+, 1.0.3+, 1.1.0+, 1.1.0.1+,
1.1.1+, 1.1.2+, 1.1.3+, 1.1.4+, 1.1.5+, 1.1.6+, 1.1.7+,
1.1.8+, 1.1.9+, 1.1.10+, 1.1.11+, 1.1.12+, 1.1.13+, 1.2.0+,
1.2.1+, 1.2.2+, 1.2.3+, 1.2.4+, 1.2.5+, 1.2.6+, 1.2.6.1+,
1.2.7+, 1.3.0+, 1.3.1+, 1.3.2+, 1.3.2.1+, 1.3.3+, 1.3.4+,
1.3.5+, 1.4.0+!, 1.4.1+, 1.4.2+, 1.4.3+, 1.4.4+, 1.4.5+,
1.4.6+, 1.4.7+, 1.4.8+, 1.4.9+, 1.4.10+, 1.4.13+, 1.5.0+,
1.6.0+
Hive - 2.3.0 [HiveNote]
Pig - 0.13.0, 0.14.0, 0.15.0, 0.16.0, 0.17.0
Zookeeper - 3.4.0, 3.4.1, 3.4.2, 3.4.3, 3.4.4, 3.4.5, 3.4.6, 3.4.7,
3.4.8, 3.4.9, 3.4.10, 3.4.11, 3.4.12, 3.4.13, 3.4.14
Storm - 0.9.3, 0.9.4, 0.9.5, 0.9.6, 0.9.7, 0.10.0, 0.10.1, 0.10.2,
1.0.0, 1.0.1, 1.0.2, 1.0.3, 1.0.4, 1.1.0, 1.1.1, 1.1.2, 1.2.0,
1.2.1, 1.2.2, 1.2.3
Phoenix - 4.5.0-Hbase-1.0+, 4.5.0-Hbase-1.1+, 4.5.1-Hbase-1.0+,
4.5.1-Hbase-1.1+, 4.5.2-HBase-1.0+, 4.5.2-HBase-1.1+,
4.6.0-Hbase-1.0+, 4.6.0-Hbase-1.1, 4.7.0-Hbase-1.0+,
4.7.0-Hbase-1.1, 4.8.0-Hbase-1.0+, 4.8.0-Hbase-1.1,
4.8.0-Hbase-1.2, 4.8.1-Hbase-1.0+, 4.8.1-Hbase-1.1,
4.8.1-Hbase-1.2, 4.8.2-Hbase-1.0+, 4.8.2-Hbase-1.1,
4.8.2-Hbase-1.2, 4.9.0-Hbase-1.1, 4.9.0-Hbase-1.2,
4.10.0-Hbase-1.1, 4.10.0-Hbase-1.2, 4.11.0-Hbase-1.1,
4.11.0-Hbase-1.2, 4.11.0-Hbase-1.3, 4.12.0-Hbase-1.1,
4.12.0-Hbase-1.2, 4.12.0-Hbase-1.3, 4.13.0-Hbase-1.3,
4.13.1-Hbase-1.1, 4.13.1-Hbase-1.2, 4.13.1-Hbase-1.3,
4.14.0-Hbase-1.1, 4.14.0-Hbase-1.2, 4.14.0-Hbase-1.3,
4.14.0-Hbase-1.4
Kafka - 2.11-0.9.0.0
Zeppelin - 0.6.0, 0.6.1, 0.6.2, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0,
0.8.1, 0.8.2
[HiveNote] - Hive uses PostgreSQL; the minimum version required is 9.1.13
PostgreSQL can be found at: https://www.postgresql.org/download/
Package Version Combinations
----------------------------
Many packages function together, for example Pig requires Hadoop,
Spark may use Hadoop to access HDFS, Hbase and Storm require
Zookeeper, and Phoenix requires Hbase. While the range of project
versions that work together is very large, we've found the following
to be a good starting point to use in running jobs.
Pig 0.13.X, 0.14.X w/ Hadoop 2.6.X
Pig 0.15.X -> 0.17.X w/ Hadoop 2.7.X
Hbase 1.0.X -> 1.6.X w/ Hadoop 2.7.X, Zookeeper 3.4.X
Phoenix 4.4.X -> 4.13.X - Beginning w/ Phoenix 4.4.0, versions
prebuilt for Hbase 1.0, 1.1, etc. are available. Use the version
prebuilt for the Hbase version you are running.
Spark 1.X - Beginning w/ Spark 1.1, versions prebuilt for Hadoop 2.3,
2.4, 2.6, etc. are available. Use the version prebuilt for the
Hadoop version you are running. See above for supported versions.
Spark 2.X - Builds against Hadoop 2.3, 2.4, 2.6, and 2.7 are
available. Use the version prebuilt for the Hadoop version you are
running. See above for supported versions.
Spark 3.X - Builds against Hadoop 2.7 and 3.2 are available. Use the
version prebuilt for the Hadoop version you are running. See above
for supported versions.
Storm 0.9.X, 0.10.0, 1.X.0 w/ Zookeeper 3.4.X
Kafka 2.11-0.9.0.0 w/ Zookeeper 3.4.X
Zeppelin 0.6.0 w/ Spark 1.6.X
Package Java Versions
---------------------
Some package versions from Apache require minimum Java versions.
Although the minimums may be lower than those listed here, these are
our recommendations based on testing & experience.
Hadoop 2.0 -> 2.5 - Java 1.6
Hadoop 2.6 -> 2.7.3 - Java 1.7
Hadoop 2.7.4 -> 2.7.X - Java 1.8
Hadoop 2.8.0 -> ... - Java 1.7
Hadoop 3.0.0 -> ... - Java 1.8
Hbase 1.0 -> ... - Java 1.7
Hbase 1.5 -> ... - Java 1.8
Spark 1.1 -> 1.3 - Java 1.6
Spark 1.4 -> 1.6 - Java 1.7
Spark 2.0 -> 2.1 - Java 1.7
Spark 2.2 - ... - Java 1.8
Storm 0.9.3 -> 0.9.4 - Java 1.6
Storm 0.9.5 -> ... - Java 1.7
Zeppelin 0.6 -> 0.7 - Java 1.7
Zeppelin 0.8 -> ... - Java 1.8
Package Attention
-----------------
Not all software packages and features have been given the same level
of attention in Magpie, so we feel it is important to inform you of
the level of trust you can have in Magpie support for individual
projects and/or features.
Core packages/features are the most mature and best supported in
Magpie. Magpie developers are confident in their functionality under
a wide range of use cases and scenarios.
Well supported packages/features are not given quite the same
attention as core ones. Magpie developers are confident they will
work with common use cases, but less common scenarios may not have
been tested or tried.
Experimentally supported packages/features are not maintained with
deep attention. They may have been developed against a specific
project version or for a specific use scenario, and their support
should be considered experimental.
- Core
Packages: Hadoop, Spark, Hbase, Pig, Zookeeper
- Well
Packages: Storm, Phoenix
Features: No-local-dir
- Experimental
Packages: Kafka, Zeppelin, Hive, TensorFlow w/ & w/o
Horovod, Ray
Documentation
-------------
General information about all of Magpie can be found below. For
information on individual projects, please see the following README
files.
Hadoop - See README.hadoop
Pig - See README.pig
Hbase - See README.hbase
Hive - See README.hive
Spark - See README.spark
TensorFlow - See README.tensorflow
TensorFlow Horovod - See README.tensorflow-horovod
Ray - See README.ray
Storm - See README.storm
Phoenix - See README.phoenix
Kafka - See README.kafka
Zeppelin - See README.zeppelin
Zookeeper - See README.zookeeper
Documentation on some optional features:
- Support HPC systems without (or very small) /tmp filesystems - See README.no-local-dir
Some miscellaneous documentation:
- Testsuite information - See README.testsuite
- FAQ of random common questions - See README.faq
Exported Environment Variables
------------------------------
The following environment variables are exported when your job is run
and may be useful in scripts in your run or in pre/post run scripts.
Note that they may not be automatically exported if you remotely log
in to your master node. See MAGPIE_ENVIRONMENT_VARIABLE_SCRIPT for a
convenient mechanism to export commonly used environment variables
during a remote login session.
Project specific environment variable exports are also available, see
those sections for more information.
MAGPIE_CLUSTER_NODERANK : the rank of the node you are on. It's often
convenient to do something like
if [ "${MAGPIE_CLUSTER_NODERANK}" == "0" ]
then
....
fi
to only do something on one node of your allocation.
MAGPIE_NODE_COUNT : Number of nodes in this allocation.
MAGPIE_NODELIST : Nodes in your allocation.
MAGPIE_JOB_NAME : Job name
MAGPIE_JOB_ID : Job ID
MAGPIE_TIMELIMIT_MINUTES : Timelimit of job in minutes
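As an illustration, a script run as your Magpie job (e.g. via
MAGPIE_JOB_SCRIPT, described below) could use these variables as
follows. This is only a sketch; adapt it to your own job.

  #!/bin/bash
  # Print basic information about the allocation
  echo "Job ${MAGPIE_JOB_NAME} (${MAGPIE_JOB_ID})"
  echo "Running on ${MAGPIE_NODE_COUNT} nodes: ${MAGPIE_NODELIST}"
  echo "Time limit: ${MAGPIE_TIMELIMIT_MINUTES} minutes"

  # Do setup work on the rank 0 node only
  if [ "${MAGPIE_CLUSTER_NODERANK}" == "0" ]
  then
      echo "This is the master node"
  fi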
Convenience Scripts
-------------------
A number of convenience scripts are included in the scripts/
directory, both for possible usefulness and as examples. They are
organized within the directory as follows:
job-scripts - These are scripts that you would run as a possible job
in Magpie. You would set these scripts in the MAGPIE_JOB_SCRIPT
environment variable.
pre-job-run-scripts - These are scripts that you would run before the
actual calculation is executed. You would set these scripts in the
MAGPIE_PRE_JOB_RUN environment variable.
post-job-run-scripts - These are scripts that you would run after the
actual calculation is executed. You would set these scripts in the
MAGPIE_POST_JOB_RUN environment variable.
Notable scripts worth mentioning:
pre-job-run-scripts/magpie-output-config-files-script.sh - This script
will output all of the conf files from your job. It's convenient for
debugging.
post-job-run-scripts/magpie-gather-config-files-and-logs-script.sh -
This script will get all of the conf files and log files from Hadoop,
Hbase, Pig, Spark, Storm, and/or Zookeeper and store them in a location
for post-analysis of your job. It's convenient for debugging. By
default files are stored in ${HOME}/${MAGPIE_JOB_NAME}, but the base
directory can be altered with the first argument passed into the
script.
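For example, a submission script could point the environment
variables described above at these scripts. The paths below are
hypothetical (they assume Magpie is installed in ${HOME}/magpie), and
your submission script may require additional settings (such as a job
type) not shown here.

  # Dump configuration files at the start of the job
  export MAGPIE_PRE_JOB_RUN="${HOME}/magpie/scripts/pre-job-run-scripts/magpie-output-config-files-script.sh"

  # Run your own calculation as the job
  export MAGPIE_JOB_SCRIPT="${HOME}/my-scripts/my-big-data-job.sh"

  # Gather conf and log files for post-analysis when the job finishes
  export MAGPIE_POST_JOB_RUN="${HOME}/magpie/scripts/post-job-run-scripts/magpie-gather-config-files-and-logs-script.sh"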
In addition, the misc/magpie-download-and-setup.sh script may be
convenient for initially downloading and patching Apache projects for
you so you don't have to manually download them. It'll also configure
several paths for you in the launch scripts automatically.
General Advanced Usage
----------------------
The following are additional tips for advanced usage of Magpie.
1) The Magpie environment variables of MAGPIE_PRE_JOB_RUN and
MAGPIE_POST_JOB_RUN can be used to run scripts before and after
your primary job script executes.
The MAGPIE_POST_JOB_RUN is particularly useful, as it can gather
logs and/or other debugging data for you. The convenience script
post-job-run-scripts/magpie-gather-config-files-and-logs-script.sh
gathers most configuration and log data and stores it to your home
directory.
2) The Magpie environment variable MAGPIE_ENVIRONMENT_VARIABLE_SCRIPT
is useful for creating a file of popular and useful environment
variables. The file it creates can be used within scripts you
write, or it can be sourced into your environment when you
interact with your job (see the sketch after this list).
3) All configuration files in conf/ can be modified to be tuned for
individual applications. For the brave and adventurous, various
configurations such as JVM options and other tunables can be
adjusted. If you wish to experiment with different sets of
configuration files, consider making different directories with
different conf files in them. Then a quick change to project
CONF_FILE settings (e.g. HADOOP_CONF_FILES, SPARK_CONF_FILES,
HBASE_CONF_FILES, etc.) can quickly allow different files to be
experimented with.
4) It is possible to run multiple instances of Hadoop, Hbase,
etc. simultaneously on a cluster. However, it is important to
isolate each of those instances. In particular, if using default
configurations, multiple instances may attempt to read/write
identical locations on network filesystems, leading to problems
between jobs. For example, if you configure HDFS to operate out of
/lustre/hdfsoverlustre/ on multiple jobs, only one namenode will be
able to operate correctly at a time.
In order to solve this problem, all you need to do is create
different directories for each service operating out of a network
file system. For example, /lustre/hdfsoverlustre1 and
/lustre/hdfsoverlustre2 for two different jobs using HDFS.
If you are not concerned about the specific path you are using,
perhaps because you never intend to reuse those paths, consider
using MAGPIE_ONE_TIME_RUN. This setting may be particularly useful
if you are initially running tests/experiments on different CPU
counts, node counts, settings, etc. and want to run many jobs in
parallel. Be careful to clean up these directories from time to
time, as Magpie will not clear data from prior jobs.
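As a sketch of tip 2 above (the path below is hypothetical), the
environment variable script can be pointed at a file in your
submission script and later sourced from an interactive login on the
master node:

  # In the submission script: where Magpie should write the variables
  export MAGPIE_ENVIRONMENT_VARIABLE_SCRIPT="/tmp/${USER}/magpie-env.sh"

  # Later, after remotely logging into the master node:
  source /tmp/${USER}/magpie-env.sh
  # Variables such as MAGPIE_NODELIST and various project paths should
  # now be set, depending on what Magpie wrote to the file.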
Security
--------
Users should be aware that running Magpie w/ the big data software
supported here may be insecure in your environment. While Magpie
makes attempts to configure software with good "sanity"
configurations, they are not foolproof. In addition, some software
may not yet have security infrastructure built in.
If you are not running in an environment where your cluster allocation
is isolated (through a private virtualized network or something
similar) other users on the cluster may be able to communicate with a
number of the big data services setup by Magpie.
These issues are due to a variety of factors, including:
1) In "traditional" big data clusters, system administrators control
what users are allowed on the cluster and who is not, limiting the
exposure of data stored there. In the Magpie model, a "big data
cluster" is instantiated within a larger multi-user HPC cluster. The
Magpie user cannot control what other users have access to the HPC
cluster. This population of HPC users could access the data of the
Magpie user without the Magpie user's knowledge.
2) In "traditional" big data clusters, important daemons are
owned/executed by a special user (e.g. hdfs, yarn, etc.). This may
limit the type of the exposure a nefarious/rogue process can have on
the system. When running in an HPC environment with Magpie, the
processes are run under the user's ownership. Since users are
typically not root, they have no way to change the ownership of the
process to a "special" user.
3) Some big data software has Kerberos or similar security functions
built in. However, it is beyond the scope of most HPC users to
get a proper Kerberos configuration of Hadoop, HDFS, etc. from their
site staff before running their job.
4) Some big data software just doesn't really have any security built
in at all.
A few examples of security issues are listed below:
Hadoop HDFS - The Hadoop Namenode is generally available on an open
and public port. While HDFS has been configured with a good
default umask and ACLs, other users on the system can override this by
setting the HADOOP_USER_NAME environment variable.
Hadoop YARN - Similar to Hadoop HDFS, good default configurations have
been setup. However they can be overridden with the HADOOP_USER_NAME
environment variable. This allows users to potentially run jobs as
another user on the cluster. This in turn can open up all of a user's
data to others within the system.
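As a hypothetical illustration of the above, in a non-Kerberized
setup another user on the cluster could claim a different identity
simply by setting an environment variable:

  # Browse another user's HDFS directory by overriding the identity
  HADOOP_USER_NAME=some_other_user hdfs dfs -ls /user/some_other_user

This is why isolating your allocation, or accepting the risk, matters.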
Spark - Spark shared secret keys have been configured as a sanity
measure. However, since the shared secret may be easy to
determine, it may allow a user to run jobs as another user on the
cluster. This in turn can open up all of a user's data to others
within the system.
Web UIs - Generally speaking, most web UIs will be viewable by other
users on the cluster if firewall rules (or similar) are not set up on
your cluster by default.
Contributions
-------------
Feel free to send me patches for new environment variables, new
adjustments, new optimization possibilities, alternate defaults that
you feel are better, etc.
Any patches you submit to me for fixes will be appreciated. I am by
no means a bash expert ... in fact I'm quite bad at it.
Other Projects
--------------
We welcome additions of other projects into Magpie. Here's a somewhat
general guide to including other projects in Magpie. This is very
high level. Please see the internal implementation for details.
Hopefully it's somewhat obvious.
1) Add appropriate "templates" into
submission-scripts/script-templates/ for the new project so the
project can be setup. You can copy templates from other projects
to begin. Although there can be variations depending on the
project's purpose, you'll most likely want to add:
magpie-XXX
magpie-magpie-customizations-job-XXX
magpie-magpie-customizations-testall-XXX
files for project XXX.
Then after that, update
submission-scripts/script-templates/Makefile to add your project
into the primary job submission files. Generate additional
submission scripts for the new project if you desire to. After
this you can run make and ensure your new project has been added
correctly into the job submission scripts.
2) Add appropriate input checks to 'magpie-check-inputs'
3) Add an appropriate "setup" file to magpie/setup/
4) Add an appropriate "run" file to magpie/run/
5) Update 'magpie-setup-projects' and 'magpie-run' appropriately for
new calls.
6) If necessary, create new directories and set up master/worker
files in 'magpie-setup-core'
7) If necessary, the following libraries could warrant updates:
magpie/lib/magpie-lib-node-identification - to identify
master/worker nodes
magpie/lib/magpie-lib-paths - set various path defaults
magpie/lib/magpie-lib-defaults - set various defaults
8) Add any necessary patches to patches/
9) Add new tests into Magpie's testsuite
- Add new test-generate-XXX.sh file to generate new tests.
- Update test-generate.sh appropriately for new test generation.
- Add a new test-submit-XXX file to submit new tests.
- Update test-validate.sh to validate that jobs succeeded.
- Update test-download-projects.sh to download & patch projects if
necessary.
10) (Optional) Add download options for the project in
misc/magpie-download-and-setup.sh
Other Schedulers/Resource Managers
----------------------------------
While Slurm, Moab+Slurm, Moab+Torque, and LSF+mpirun are the currently
supported schedulers/resource managers, there's no reason to believe
that other schedulers/resource managers couldn't be supported. I'd
gladly welcome patches to support them.
To support another scheduler or resource manager, you'll want to make
your equivalent scheduler/resource manager header, similar to
submission-scripts/script-templates/magpie-config-sbatch-srun. You
may also need to create a new job running variant, such as
submission-scripts/script-templates/magpie-run-job-srun. Then add an
appropriate new section to
submission-scripts/script-templates/Makefile and a new directory for
these new submission scripts in submission-scripts.
If a new MAGPIE_SUBMISSION_TYPE is needed, you'll want to update
magpie/exports/magpie-exports-submission-type and add appropriate
input checks in magpie-check-inputs.
I'd be glad to accept patches back for other schedulers/resource
managers. Please send me a pull request.
Author
------
This is me. Feel free to contact me about Magpie; however, please
consider posting support questions to Github's issue tracker so
everyone can see the questions & solutions to your problem.
Albert Chu
Credit
------
Credit must be given to Kevin Regimbal @ PNNL. Initial experiments
were done using heavily modified versions of scripts Kevin developed
for running Hadoop w/ Slurm & Lustre. A number of the ideas from
Kevin's scripts continue in spirit in these scripts.
Special thanks to David Buttler who came up with the clever name for
this project.
Thanks
------
Thanks to the following for contributions:
Felix-Antoine Fortin ([email protected]) - Msub-Torque-Pdsh support & other misc patches
Brian Panneton ([email protected]) - LSF support, Phoenix, Kafka and Zeppelin support, & Number of misc patches
Adam Childs ([email protected]) - Hive/Tez support