During parallel training, metrics of the training set are sometimes bad #886
Comments
@qrqpjxq You can add a parameter:
Thanks, but after setting valid=trainset and num_machines=2, the auc and logloss in worker0 and worker1 are not the same in each iteration.
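A minimal sketch of this setup in train.conf form, assuming the suggested parameter is valid (pointing the validation set at the training file), as this reply implies:

# evaluate the training file as a validation set on each worker
valid = trainset
# two distributed workers
num_machines = 2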
@qrqpjxq
@qrqpjxq The result:
worker 2:
I tried again; if I delete categorical_column = 0,4, there's nothing wrong. The metrics (logloss and auc) in each worker are the same.
OK, I've sent the trainset to your email.
@qrqpjxq updates?
What do you mean?
@qrqpjxq You can email it to me.
@qrqpjxq
Is there anything wrong in split_info.hpp line 77, &cat_threshold.data()?
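For context, a minimal C++ sketch of why that expression looks suspicious; the names come from the question, not from the actual split_info.hpp source:

#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical serialization of categorical thresholds into a byte buffer.
// cat_threshold.data() already returns a pointer to the elements; that
// pointer is a temporary, so &cat_threshold.data() does not compile in
// standard C++, and copying from a pointer's own address would sync the
// pointer bits between workers instead of the thresholds.
void CopyCatThreshold(const std::vector<uint32_t>& cat_threshold, char* buffer) {
  std::memcpy(buffer, cat_threshold.data(),
              cat_threshold.size() * sizeof(uint32_t));
}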
@qrqpjxq can you delete the LightGBM folder, then re-clone?
It's OK, it works very well. Thank you~
Hi, there's another problem: when I set workers = 16, the metrics for the validation data are bad:
2017-09-11 16:53:34,441 INFO Iteration:1, valid_1 auc : 0.635971
May more workers make it work worse?
@qrqpjxq what is the number of your training data?
Yes, the dataset is a little small, only 20w (200k rows).
[LightGBM] [Info] Finished loading parameters
My main config file is:
But in another try I hadn't converted the file to libsvm, so it is csv type; the result is below:
[LightGBM] [Info] Finished loading parameters
In the last try, I set categorical_column = 0,16,17,18,20,24
@qrqpjxq it seems there is a bug when #data is large.
@qrqpjxq
I tried training with 1 worker (1000w, i.e. 10M rows) / 4 workers (1200w rows) / 8 workers (600w rows). One worker is about the same as 4 workers. 8 workers:
@qrqpjxq
Yes, my config is like:
task = train
is_training_metric = true
max_bin = 255
bagging_freq = 5
@qrqpjxq your data is pre-partitioned? and you set
Yes, I changed data_loader.cpp a little, just to read the file from HDFS. The file is not pre-partitioned, but I think this will not affect training?
I think the problem is here: // data_parallel_tree_learner.cpp
after SyncUpGlobalBestSplit: then chose best_leaf: 26
before SyncUpGlobalBestSplit:
after SyncUpGlobalBestSplit: then chose best_leaf: 196
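For readers following along: conceptually, a SyncUpGlobalBestSplit step has every worker propose its local best split per leaf and keeps the one with the highest gain, so that all workers grow identical trees. A minimal sketch of that reduction, with illustrative names rather than the actual LightGBM code:

// Illustrative split candidate; LightGBM's real SplitInfo carries more fields.
struct SplitCandidate {
  int leaf;
  double gain;
};

// Reducer applied across all workers' candidates for the same leaf: keep the
// larger gain. If the candidates are built from unsynced local statistics,
// workers can end up choosing different leaves to split next.
SplitCandidate ReduceBest(const SplitCandidate& a, const SplitCandidate& b) {
  return (a.gain >= b.gain) ? a : b;
}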
@qrqpjxq
I find that in iteration 1's first split, the gain is 6659, but in iteration 18 the first split gain is 136497, and 1.04708e+06 in the third split.
Iteration:1
before SyncUpGlobalBestSplit:
Iteration:18:
before SyncUpGlobalBestSplit:
@qrqpjxq
OK, in iteration 18, third split:
sum_gradient: 12404.6; sum_hessian: 333.905
after SyncUpGlobalBestSplit: then chose best_leaf: 0
This worker maybe:
sum_gradient: 12404.6; sum_hessian: 333.905
after SyncUpGlobalBestSplit: then chose best_leaf: 0
@qrqpjxq
The message is like:
after SyncUpGlobalBestSplit: then chose best_leaf: 0
before FindBestThreshold small leaf sum_gradients: 5561.03; sum_hessians: 22.7829
after SyncUpGlobalBestSplit: then chose best_leaf: 3
Is it that left_sum_hessian + right_sum_hessian != sum_hessian?
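One caveat when testing that hypothesis: the sums are floating-point accumulations over differently ordered data, so exact equality will rarely hold even when everything is correct. A minimal sketch of a tolerance-based check, with hypothetical names:

#include <algorithm>
#include <cmath>

// Returns true if the parent's sum matches left + right up to a relative
// tolerance; a strict != comparison would also flag harmless rounding noise.
bool SumsConsistent(double parent, double left, double right) {
  const double tol = 1e-6 * std::max(1.0, std::fabs(parent));
  return std::fabs(parent - (left + right)) < tol;
}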
@qrqpjxq can you provide your print code? I want to check whether the sum_hessian is the local one or the global one.
serial_tree_learner.cpp:
// data_parallel_tree_learner.cpp
std::cout << "before SyncUpGlobalBestSplit: " << std::endl;
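A self-contained sketch of a print helper in the same spirit; the tag strings match the log lines quoted in this thread, but the helper itself is an assumption. Whether it reports worker-local or globally synced sums depends only on whether it is called before or after SyncUpGlobalBestSplit:

#include <iostream>

// Hypothetical debug helper producing lines like
// "before FindBestThreshold small leaf sum_gradients: ...; sum_hessians: ..."
void PrintLeafSums(const char* tag, double sum_gradients, double sum_hessians) {
  std::cout << tag << " sum_gradients: " << sum_gradients
            << "; sum_hessians: " << sum_hessians << std::endl;
}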
In the split below, the father node has sum_gradients = 490.476:
before FindBestThreshold small leaf sum_gradients: 490.476; sum_hessians: 494193
after SyncUpGlobalBestSplit: then chose best_leaf: 0
@qrqpjxq it is normal for the sum_gradient to be bigger.
I tried parallel learning and single-machine training; in both, sum_hessian != left + right. In serial_tree_learner.cpp my print code is:
@qrqpjxq
But I tried the example in examples/binary_classification:
[LightGBM] [Info] Finished loading parameters
before FindBestThreshold small leaf sum_gradients: -216; sum_hessians: 1750; sum_count: 7000
before FindBestThreshold small leaf sum_gradients: -216; sum_hessians: 1750; sum_count: 7000
before FindBestThreshold small leaf sum_gradients: -216; sum_hessians: 1750; sum_count: 7000
before FindBestThreshold small leaf sum_gradients: -216; sum_hessians: 1750; sum_count: 7000
before FindBestThreshold small leaf sum_gradients: -216; sum_hessians: 1750; sum_count: 7000
before FindBestThreshold small leaf sum_gradients: -216; sum_hessians: 1750; sum_count: 7000
For the first feature sum = left + right, but beginning with the second feature it is not the same.
Then I tried examples/parallel_learning/ with data = binary.train and workers = 2:
[LightGBM] [Info] Finished loading parameters
before FindBestThreshold small leaf sum_gradients: -216; sum_hessians: 1750; sum_count: 3479
before FindBestThreshold small leaf sum_gradients: -216; sum_hessians: 1750; sum_count: 3479
before FindBestThreshold small leaf sum_gradients: -216; sum_hessians: 1750; sum_count: 3479
@qrqpjxq
OK, I will try it. But could you add my WeChat: qjxtqrm? Then if there are any questions I can ask you promptly.
When doing parallel training, the metrics on the training set are sometimes bad, and each worker reports different values at the same iteration.
I'm trying to train on 5 workers; the metrics on the training set:
worker0:
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=19
[LightGBM] [Info] Iteration:1, training auc : 0.624454
[LightGBM] [Info] Iteration:1, training binary_logloss : 0.680496
[LightGBM] [Info] 1.889427 seconds elapsed, finished iteration 1
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=16
[LightGBM] [Info] Iteration:2, training auc : 0.618313
[LightGBM] [Info] Iteration:2, training binary_logloss : 0.67641
[LightGBM] [Info] 3.768745 seconds elapsed, finished iteration 2
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=14
[LightGBM] [Info] Iteration:3, training auc : 0.603125
[LightGBM] [Info] Iteration:3, training binary_logloss : 0.676286
[LightGBM] [Info] 5.559710 seconds elapsed, finished iteration 3
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=17
[LightGBM] [Info] Iteration:4, training auc : 0.56756
[LightGBM] [Info] Iteration:4, training binary_logloss : 0.70803
[LightGBM] [Info] 7.490879 seconds elapsed, finished iteration 4
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=15
[LightGBM] [Info] Iteration:5, training auc : 0.588966
[LightGBM] [Info] Iteration:5, training binary_logloss : 0.697803
[LightGBM] [Info] 9.537317 seconds elapsed, finished iteration 5
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=17
[LightGBM] [Info] Iteration:6, training auc : 0.606358
[LightGBM] [Info] Iteration:6, training binary_logloss : 0.689927
[LightGBM] [Info] 11.492579 seconds elapsed, finished iteration 6
worker1:
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=19
[LightGBM] [Info] Iteration:1, training auc : 0.610784
[LightGBM] [Info] Iteration:1, training binary_logloss : 0.684069
[LightGBM] [Info] 1.917517 seconds elapsed, finished iteration 1
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=16
[LightGBM] [Info] Iteration:2, training auc : 0.625668
[LightGBM] [Info] Iteration:2, training binary_logloss : 0.677478
[LightGBM] [Info] 3.796817 seconds elapsed, finished iteration 2
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=14
[LightGBM] [Info] Iteration:3, training auc : 0.60758
[LightGBM] [Info] Iteration:3, training binary_logloss : 0.670345
[LightGBM] [Info] 5.561043 seconds elapsed, finished iteration 3
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=17
[LightGBM] [Info] Iteration:4, training auc : 0.594692
[LightGBM] [Info] Iteration:4, training binary_logloss : 0.682121
[LightGBM] [Info] 7.489096 seconds elapsed, finished iteration 4
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=15
[LightGBM] [Info] Iteration:5, training auc : 0.611087
[LightGBM] [Info] Iteration:5, training binary_logloss : 0.675639
[LightGBM] [Info] 9.505840 seconds elapsed, finished iteration 5
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=17
[LightGBM] [Info] Iteration:6, training auc : 0.622268
[LightGBM] [Info] Iteration:6, training binary_logloss : 0.670829
[LightGBM] [Info] 11.492364 seconds elapsed, finished iteration 6
worker2:
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=19
[LightGBM] [Info] Iteration:1, training auc : 0.605063
[LightGBM] [Info] Iteration:1, training binary_logloss : 0.692491
[LightGBM] [Info] 1.946944 seconds elapsed, finished iteration 1
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=16
[LightGBM] [Info] Iteration:2, training auc : 0.607614
[LightGBM] [Info] Iteration:2, training binary_logloss : 0.689697
[LightGBM] [Info] 3.822381 seconds elapsed, finished iteration 2
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=14
[LightGBM] [Info] Iteration:3, training auc : 0.592087
[LightGBM] [Info] Iteration:3, training binary_logloss : 0.688123
[LightGBM] [Info] 5.589290 seconds elapsed, finished iteration 3
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=17
[LightGBM] [Info] Iteration:4, training auc : 0.55509
[LightGBM] [Info] Iteration:4, training binary_logloss : 0.69431
[LightGBM] [Info] 7.516317 seconds elapsed, finished iteration 4
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=15
[LightGBM] [Info] Iteration:5, training auc : 0.58136
[LightGBM] [Info] Iteration:5, training binary_logloss : 0.685505
[LightGBM] [Info] 9.563150 seconds elapsed, finished iteration 5
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=17
[LightGBM] [Info] Iteration:6, training auc : 0.591031
[LightGBM] [Info] Iteration:6, training binary_logloss : 0.680882
[LightGBM] [Info] 11.554436 seconds elapsed, finished iteration 6
worker3:
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=19
[LightGBM] [Info] Iteration:1, training auc : 0.594895
[LightGBM] [Info] Iteration:1, training binary_logloss : 0.697414
[LightGBM] [Info] 1.913700 seconds elapsed, finished iteration 1
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=16
[LightGBM] [Info] Iteration:2, training auc : 0.607586
[LightGBM] [Info] Iteration:2, training binary_logloss : 0.690211
[LightGBM] [Info] 3.792895 seconds elapsed, finished iteration 2
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=14
[LightGBM] [Info] Iteration:3, training auc : 0.611669
[LightGBM] [Info] Iteration:3, training binary_logloss : 0.678102
[LightGBM] [Info] 5.528864 seconds elapsed, finished iteration 3
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=17
[LightGBM] [Info] Iteration:4, training auc : 0.585117
[LightGBM] [Info] Iteration:4, training binary_logloss : 0.683948
[LightGBM] [Info] 7.485415 seconds elapsed, finished iteration 4
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=15
[LightGBM] [Info] Iteration:5, training auc : 0.601168
[LightGBM] [Info] Iteration:5, training binary_logloss : 0.677983
[LightGBM] [Info] 9.530298 seconds elapsed, finished iteration 5
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=17
[LightGBM] [Info] Iteration:6, training auc : 0.615217
[LightGBM] [Info] Iteration:6, training binary_logloss : 0.672196
[LightGBM] [Info] 11.488655 seconds elapsed, finished iteration 6
worker4:
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=19
[LightGBM] [Info] Iteration:1, training auc : 0.639623
[LightGBM] [Info] Iteration:1, training binary_logloss : 0.680161
[LightGBM] [Info] 1.913545 seconds elapsed, finished iteration 1
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=16
[LightGBM] [Info] Iteration:2, training auc : 0.651382
[LightGBM] [Info] Iteration:2, training binary_logloss : 0.674585
[LightGBM] [Info] 3.759978 seconds elapsed, finished iteration 2
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=14
[LightGBM] [Info] Iteration:3, training auc : 0.630517
[LightGBM] [Info] Iteration:3, training binary_logloss : 0.66882
[LightGBM] [Info] 5.556613 seconds elapsed, finished iteration 3
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=17
[LightGBM] [Info] Iteration:4, training auc : 0.571659
[LightGBM] [Info] Iteration:4, training binary_logloss : 0.703941
[LightGBM] [Info] 7.452347 seconds elapsed, finished iteration 4
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=15
[LightGBM] [Info] Iteration:5, training auc : 0.58287
[LightGBM] [Info] Iteration:5, training binary_logloss : 0.69834
[LightGBM] [Info] 9.527238 seconds elapsed, finished iteration 5
[LightGBM] [Info] Trained a tree with leaves=63 and max_depth=17
[LightGBM] [Info] Iteration:6, training auc : 0.600549
[LightGBM] [Info] Iteration:6, training binary_logloss : 0.693129
[LightGBM] [Info] 11.483836 seconds elapsed, finished iteration 6
train.conf:
task = train
boosting_type = gbdt
objective = binary
metric = binary_logloss,auc
metric_freq = 1
is_training_metric = true
max_bin = 255
data = trainset
num_trees = 50
learning_rate = 0.1
num_leaves = 63
tree_learner = data
is_pre_partition = false
categorical_column = 0,4
bagging_freq = 5
min_data_in_leaf = 20
min_sum_hessian_in_leaf = 5.0
is_enable_sparse = true
use_two_round_loading = false
is_save_binary_file = false
num_machines = 5
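For completeness, a num_machines = 5 run also needs the workers' addresses; in LightGBM releases of this era that means a machine list file plus a listen port. A sketch with an assumed file name and the default port:

# mlist.txt holds one "ip port" pair per worker
machine_list_file = mlist.txt
local_listen_port = 12400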
trainset:
1 8.0 -0.635 0.226 0.327 2.0 0.754 -0.249 -1.092
1 8.0 0.329 0.359 1.498 2.0 1.096 -0.558 -1.588
1 8.0 1.471 -1.636 0.454 5.0 1.105 1.282 1.382
0 9.0 -0.877 0.936 1.992 6.0 1.786 -1.647 -0.942
1 8.0 0.321 1.522 0.883 2.0 0.681 -1.07 -0.922
head(5) of trainset.
In the test, there are 7000 training rows.