Distributed XGBoost Training
============================
This is a tutorial on distributed XGBoost training.
Currently xgboost supports distributed training via the CLI program with a configuration file.
There are also plans to add distributed Python and other language bindings; please open an issue
if you are interested in contributing.

Build XGBoost with Distributed Filesystem Support
-------------------------------------------------
To use distributed xgboost, you only need to turn on the options to build
with distributed filesystems (HDFS or S3) in ```xgboost/make/config.mk```.

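As a sketch, the relevant switches look roughly like the following (flag names follow the dmlc-core style config and may differ between releases, so check the comments in your copy of the file):

```
# whether to build with HDFS support (requires libhdfs)
USE_HDFS = 1
# whether to build with Amazon S3 support
USE_S3 = 1
```

After setting these, rebuild xgboost as usual with ```make```.
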
How to Use
----------
* Input data format: LIBSVM format. The example here uses the generated data in the ../data folder.
* Put the data into a distributed filesystem (S3 or HDFS).
* Use the tracker script in dmlc-core/tracker to submit the jobs.
* Like all other DMLC tools, xgboost supports taking a path to a folder as an input argument.
  - All the files in the folder will be used as input.
* Quick start on Hadoop YARN: run ```bash run_yarn.sh <n_hadoop_workers> <n_thread_per_worker> <path_in_HDFS>``` (an example invocation follows below).

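For instance, a quick-start run with 4 workers and 2 threads per worker, staging data under a hypothetical HDFS directory ```/user/me/xgboost-demo``` (substitute a path you can write to), would be:

```
bash run_yarn.sh 4 2 /user/me/xgboost-demo
```
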
Example
-------
* [run_yarn.sh](run_yarn.sh) shows how to submit a job to Hadoop via YARN; the full script is listed below.

Single machine vs Distributed Version
-------------------------------------
If you have used xgboost (single machine version) before, this section will show you how to run xgboost on Hadoop with slight modifications to the conf file.
* IO: instead of reading and writing files locally, we now use HDFS; put the ```hdfs://``` prefix on the address of any file you would like to access (see the conf sketch after this list).
* File cache: ```dmlc_yarn.py``` also provides several ways to cache necessary files, including the binary (xgboost) and the conf file.
  - ```dmlc_yarn.py``` will automatically cache files named in the command line. For example, ```dmlc_yarn.py -n 3 $localPath/xgboost.dmlc mushroom.hadoop.conf``` will cache "xgboost.dmlc" and "mushroom.hadoop.conf".
  - You can also use "-f" to manually cache one or more files, like ```-f file1 -f file2```.
  - The local path of cached files in the command is "./".
* For more details on submission, see the usage of ```dmlc_yarn.py```.
* The model saved by the Hadoop version is compatible with the single machine version.

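As a sketch, a conf file adapted for HDFS keeps the same keys as the single machine version, with paths pointing into HDFS (the host-less ```hdfs:///``` paths below are placeholders):

```
# mushroom.hadoop.conf (sketch)
booster = gbtree
objective = binary:logistic
data = "hdfs:///user/me/data/agaricus.txt.train"
eval[test] = "hdfs:///user/me/data/agaricus.txt.test"
model_out = "hdfs:///user/me/mushroom.final.model"
```

To cache an extra file, say a feature map, alongside the automatically cached binary and conf, add ```-f``` to the submission:

```
../../dmlc-core/tracker/dmlc_yarn.py -n 3 -f ../data/featmap.txt \
    $localPath/xgboost.dmlc mushroom.hadoop.conf
```
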
Notes
-----
* The code is optimized with multi-threading, so you will want to run xgboost with multiple vcores for best performance.
  - You will want to set <n_thread_per_worker> to the number of cores you have on each machine.

External Memory Version
-----------------------
XGBoost supports external memory, which makes each process cache data to local disk during computation instead of holding the entire dataset in memory.
See [external memory](https://github.com/dmlc/xgboost/tree/master/doc/external_memory.md) for the syntax of using external memory.

You only need to add a cache prefix to the input file to enable external memory mode. For example, set the training data as
```
data=hdfs:///path-to-my-data/#dtrain.cache
```
This makes xgboost more memory efficient and allows you to run xgboost on larger-scale datasets.
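The same cache-prefix syntax should apply to the evaluation data as well, for example (mirroring the placeholder path above):

```
eval[test]=hdfs:///path-to-my-test-data/#dtest.cache
```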
run_yarn.sh
-----------
#!/bin/bash
if [ "$#" -lt 3 ]; then
    echo "Usage: $0 <nworkers> <nthreads> <path_in_HDFS>"
    exit 1
fi

# put the local training files into HDFS
hadoop fs -mkdir "$3"/data
hadoop fs -put ../data/agaricus.txt.train "$3"/data
hadoop fs -put ../data/agaricus.txt.test "$3"/data

# run the job through the YARN tracker (rabit), passing HDFS addresses
../../dmlc-core/tracker/dmlc_yarn.py -n "$1" --vcores "$2" ../../xgboost mushroom.hadoop.conf \
    nthread="$2" \
    "data=hdfs://$3/data/agaricus.txt.train" \
    "eval[test]=hdfs://$3/data/agaricus.txt.test" \
    "model_out=hdfs://$3/mushroom.final.model"

# get the final model file
hadoop fs -get "$3"/mushroom.final.model final.model

# the commands below use dmlc-core/yarn/run_hdfs_prog.py to set up the appropriate env
# output prediction: task=pred
#../../xgboost.dmlc mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
# print the boosters of final.model into dump.raw.txt
#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
# use the feature map when dumping for better readability
#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
cat dump.nice.txt