train.log
[2023-09-18 13:22:24,610] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-18 13:22:26,172] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-09-18 13:22:26,193] [INFO] [runner.py:570:main] cmd = /home/hyx/anaconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train.py --method gc_bc --steps 300000 --warmup_steps 10000 --save_dir gc_bc_save --random_seed 42
[2023-09-18 13:22:28,248] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-18 13:22:29,763] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-09-18 13:22:29,763] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-09-18 13:22:29,763] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-09-18 13:22:29,763] [INFO] [launch.py:163:main] dist_world_size=2
[2023-09-18 13:22:29,763] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-09-18 13:22:32,071] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-18 13:22:32,078] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Namespace(local_rank=1, sample_weights='balance', num_workers=8, relabel_actions=True, goal_relabeling_strategy='uniform', augment=True, dtype='fp32', encoder='resnetv1-34-bridge', save_dir='gc_bc_save', ckpt_id=None, train_batch_size=256, gradient_accumulation_steps=1, eval_batch_size=256, max_lr=0.0003, min_lr=1e-05, weight_decay=1e-06, max_grad_norm=5.0, epochs=None, steps=300000, warmup_steps=10000, decay_steps=None, log_interval=5000, eval_interval=10000, save_interval=10000, save_best=True, main_metric='log_probs', method='gc_bc', random_seed=42, datasets=['/data/hyx/raw_icra_trajs', '/data/hyx/raw_flap_trajs', '/data/hyx/raw_bridge_data_v1_trajs', '/data/hyx/raw_rss_trajs', '/data/hyx/raw_bridge_data_v2_trajs', '/home/hyx/bridge_data_v2/bridge_torch/data_processing/scipted_trajs'], act_mean=[0.00019296819, 0.00013667766, -0.00014583133, -0.00018390431, -0.00030808983, 0.0002742527, 0.59716219], act_std=[0.00912848, 0.0127196, 0.01229497, 0.02606696, 0.02875283, 0.07807977, 0.48710242], goal_relabeling_kwargs={'reached_proportion': 0.0}, augment_kwargs={'random_resized_crop': {'size': [128, 128], 'scale': [0.8, 1.0], 'ratio': [0.9, 1.1], 'antialias': True}, 'color_jitter': {'brightness': 0.2, 'contrast': [0.8, 1.2], 'saturation': [0.8, 1.2], 'hue': 0.1}, 'augment_order': ['random_resized_crop', 'color_jitter']}, encoder_kwargs={'pooling_method': 'avg', 'add_spatial_coordinates': True, 'act': 'SiLU', 'input_img_shape': [128, 128], 'input_channels': 6})
[2023-09-18 13:22:33,753] [INFO] [comm.py:637:init_distributed] cdb=None
Namespace(local_rank=0, sample_weights='balance', num_workers=8, relabel_actions=True, goal_relabeling_strategy='uniform', augment=True, dtype='fp32', encoder='resnetv1-34-bridge', save_dir='gc_bc_save', ckpt_id=None, train_batch_size=256, gradient_accumulation_steps=1, eval_batch_size=256, max_lr=0.0003, min_lr=1e-05, weight_decay=1e-06, max_grad_norm=5.0, epochs=None, steps=300000, warmup_steps=10000, decay_steps=None, log_interval=5000, eval_interval=10000, save_interval=10000, save_best=True, main_metric='log_probs', method='gc_bc', random_seed=42, datasets=['/data/hyx/raw_icra_trajs', '/data/hyx/raw_flap_trajs', '/data/hyx/raw_bridge_data_v1_trajs', '/data/hyx/raw_rss_trajs', '/data/hyx/raw_bridge_data_v2_trajs', '/home/hyx/bridge_data_v2/bridge_torch/data_processing/scipted_trajs'], act_mean=[0.00019296819, 0.00013667766, -0.00014583133, -0.00018390431, -0.00030808983, 0.0002742527, 0.59716219], act_std=[0.00912848, 0.0127196, 0.01229497, 0.02606696, 0.02875283, 0.07807977, 0.48710242], goal_relabeling_kwargs={'reached_proportion': 0.0}, augment_kwargs={'random_resized_crop': {'size': [128, 128], 'scale': [0.8, 1.0], 'ratio': [0.9, 1.1], 'antialias': True}, 'color_jitter': {'brightness': 0.2, 'contrast': [0.8, 1.2], 'saturation': [0.8, 1.2], 'hue': 0.1}, 'augment_order': ['random_resized_crop', 'color_jitter']}, encoder_kwargs={'pooling_method': 'avg', 'add_spatial_coordinates': True, 'act': 'SiLU', 'input_img_shape': [128, 128], 'input_channels': 6})
[2023-09-18 13:22:33,841] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-09-18 13:22:33,842] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-09-18 13:22:35,212] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.2, git-hash=unknown, git-branch=unknown
[2023-09-18 13:22:38,192] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /home/hyx/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Using /home/hyx/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/hyx/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.12854266166687012 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.1022942066192627 seconds
[2023-09-18 13:22:38,851] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adam as basic optimizer
[2023-09-18 13:22:38,858] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2023-09-18 13:22:38,858] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adam
[2023-09-18 13:22:38,858] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = WarmupDecayLR
[2023-09-18 13:22:38,859] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupDecayLR object at 0x7fda1fbb3be0>
[2023-09-18 13:22:38,859] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.001], mom=[(0.9, 0.999)]
[2023-09-18 13:22:38,859] [INFO] [config.py:963:print] DeepSpeedEngine configuration:
[2023-09-18 13:22:38,859] [INFO] [config.py:967:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-09-18 13:22:38,859] [INFO] [config.py:967:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-09-18 13:22:38,859] [INFO] [config.py:967:print] amp_enabled .................. False
[2023-09-18 13:22:38,859] [INFO] [config.py:967:print] amp_params ................... False
[2023-09-18 13:22:38,859] [INFO] [config.py:967:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-09-18 13:22:38,859] [INFO] [config.py:967:print] bfloat16_enabled ............. False
[2023-09-18 13:22:38,859] [INFO] [config.py:967:print] checkpoint_parallel_write_pipeline False
[2023-09-18 13:22:38,859] [INFO] [config.py:967:print] checkpoint_tag_validation_enabled True
[2023-09-18 13:22:38,859] [INFO] [config.py:967:print] checkpoint_tag_validation_fail False
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fda1fbb3880>
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] communication_data_type ...... None
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] curriculum_enabled_legacy .... False
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] curriculum_params_legacy ..... False
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] data_efficiency_enabled ...... False
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] dataloader_drop_last ......... False
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] disable_allgather ............ False
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] dump_state ................... False
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] dynamic_loss_scale_args ...... None
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] eigenvalue_enabled ........... False
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] eigenvalue_gas_boundary_resolution 1
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] eigenvalue_layer_num ......... 0
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] eigenvalue_max_iter .......... 100
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] eigenvalue_stability ......... 1e-06
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] eigenvalue_tol ............... 0.01
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] eigenvalue_verbose ........... False
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] elasticity_enabled ........... False
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] fp16_auto_cast ............... None
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] fp16_enabled ................. False
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] fp16_master_weights_and_gradients False
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] global_rank .................. 0
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] grad_accum_dtype ............. None
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] gradient_accumulation_steps .. 1
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] gradient_clipping ............ 5.0
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] gradient_predivide_factor .... 1.0
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] initial_dynamic_scale ........ 65536
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] load_universal_checkpoint .... False
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] loss_scale ................... 0
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] memory_breakdown ............. False
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] mics_hierarchial_params_gather False
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] mics_shard_size .............. -1
[2023-09-18 13:22:38,860] [INFO] [config.py:967:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] optimizer_legacy_fusion ...... False
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] optimizer_name ............... adam
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] optimizer_params ............. {'weight_decay': 1e-06}
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] pld_enabled .................. False
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] pld_params ................... False
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] prescale_gradients ........... False
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] scheduler_name ............... WarmupDecayLR
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] scheduler_params ............. {'total_num_steps': 300000, 'warmup_min_lr': 1e-05, 'warmup_max_lr': 0.0003, 'warmup_num_steps': 10000}
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] sparse_attention ............. None
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] sparse_gradients_enabled ..... False
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] steps_per_print .............. 5000
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] train_batch_size ............. 256
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] train_micro_batch_size_per_gpu 128
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] use_node_local_storage ....... False
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] wall_clock_breakdown ......... False
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] world_size ................... 2
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] zero_allow_untested_optimizer False
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] zero_enabled ................. False
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] zero_force_ds_cpu_optimizer .. True
[2023-09-18 13:22:38,861] [INFO] [config.py:967:print] zero_optimization_stage ...... 0
[2023-09-18 13:22:38,861] [INFO] [config.py:953:print_user_config] json = {
"train_batch_size": 256,
"gradient_accumulation_steps": 1,
"steps_per_print": 5.000000e+03,
"optimizer": {
"type": "Adam",
"params": {
"weight_decay": 1e-06
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"total_num_steps": 3.000000e+05,
"warmup_min_lr": 1e-05,
"warmup_max_lr": 0.0003,
"warmup_num_steps": 1.000000e+04
}
},
"gradient_clipping": 5.0,
"bf16": {
"enabled": false
},
"fp16": {
"enabled": false,
"fp16_master_weights_and_grads": false,
"loss_scale": 0,
"loss_scale_window": 500,
"hysteresis": 2,
"min_loss_scale": 1,
"initial_scale_power": 15
}
}
2057 trajs in /data/hyx/raw_icra_trajs/train
796 trajs in /data/hyx/raw_flap_trajs/train
11442 trajs in /data/hyx/raw_bridge_data_v1_trajs/train
8195 trajs in /data/hyx/raw_rss_trajs/train
22072 trajs in /data/hyx/raw_bridge_data_v2_trajs/train
8675 trajs in /home/hyx/bridge_data_v2/bridge_torch/data_processing/scipted_trajs/train
[# train trajs before repeating]: 53237
[# train trajs after repeating]: 109316
237 trajs in /data/hyx/raw_icra_trajs/val
148 trajs in /data/hyx/raw_flap_trajs/val
1749 trajs in /data/hyx/raw_bridge_data_v1_trajs/val
966 trajs in /data/hyx/raw_rss_trajs/val
2752 trajs in /data/hyx/raw_bridge_data_v2_trajs/val
1024 trajs in /home/hyx/bridge_data_v2/bridge_torch/data_processing/scipted_trajs/val
[# val trajs before repeating]: 6876
[# val trajs after repeating]: 13752
237 trajs in /data/hyx/raw_icra_trajs/val
148 trajs in /data/hyx/raw_flap_trajs/val
1749 trajs in /data/hyx/raw_bridge_data_v1_trajs/val
966 trajs in /data/hyx/raw_rss_trajs/val
2752 trajs in /data/hyx/raw_bridge_data_v2_trajs/val
1024 trajs in /home/hyx/bridge_data_v2/bridge_torch/data_processing/scipted_trajs/val
[# val trajs before repeating]: 6876
[# val trajs after repeating]: 13752
[2023-09-18 13:59:13,416] [INFO] [logging.py:96:log_dist] [Rank 0] step=5000, skipped=0, lr=[0.00027817532531436136], mom=[(0.9, 0.999)]
[2023-09-18 13:59:13,419] [INFO] [timer.py:260:stop] epoch=0/micro_step=5000/global_step=5000, RunningAvgSamplesPerSec=1222.2744055077999, CurrSamplesPerSec=337.4697961054039, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 5000] loss: 9.15594084777832
[2023-09-18 14:56:53,990] [INFO] [logging.py:96:log_dist] [Rank 0] step=10000, skipped=0, lr=[0.0002999999999999999], mom=[(0.9, 0.999)]
[2023-09-18 14:56:53,991] [INFO] [timer.py:260:stop] epoch=0/micro_step=10000/global_step=10000, RunningAvgSamplesPerSec=609.2016370291394, CurrSamplesPerSec=3650.9785002907206, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 10000] loss: 8.793172532844544
[Step 10000] evaluating...
{'log_probs': -8.597314862315411, 'mse': 4.329490734318821, 'pi_actions': -0.017135016633149338} [Local Rank]: 0
{'log_probs': -8.597314862315411, 'mse': 4.329490734318821, 'pi_actions': -0.017135016633149338} [Local Rank]: 1
[2023-09-18 15:00:42,970] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint 10000 is about to be saved!
/home/hyx/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
/home/hyx/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
[2023-09-18 15:00:42,981] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: gc_bc_save/10000/mp_rank_00_model_states.pt
[2023-09-18 15:00:42,981] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving gc_bc_save/10000/mp_rank_00_model_states.pt...
[2023-09-18 15:00:42,984] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 10000 is ready now!
[2023-09-18 15:00:43,329] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved gc_bc_save/10000/mp_rank_00_model_states.pt.
[2023-09-18 15:00:43,329] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 10000 is ready now!
[2023-09-18 15:30:20,669] [INFO] [logging.py:96:log_dist] [Rank 0] step=15000, skipped=0, lr=[0.000295001], mom=[(0.9, 0.999)]
[2023-09-18 15:30:20,670] [INFO] [timer.py:260:stop] epoch=0/micro_step=15000/global_step=15000, RunningAvgSamplesPerSec=836.7568187787807, CurrSamplesPerSec=3684.63038114553, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 15000] loss: 8.587415156650543
[2023-09-18 15:59:53,938] [INFO] [logging.py:96:log_dist] [Rank 0] step=20000, skipped=0, lr=[0.00029000099999999996], mom=[(0.9, 0.999)]
[2023-09-18 15:59:53,939] [INFO] [timer.py:260:stop] epoch=0/micro_step=20000/global_step=20000, RunningAvgSamplesPerSec=923.3578112774292, CurrSamplesPerSec=3768.7449729209884, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 20000] loss: 8.551584357357026
[Step 20000] evaluating...
{'log_probs': -8.399529860076573, 'mse': 3.93392072700915, 'pi_actions': -0.00349259346524391} [Local Rank]: 0
{'log_probs': -8.399529860076573, 'mse': 3.93392072700915, 'pi_actions': -0.00349259346524391} [Local Rank]: 1
[2023-09-18 16:03:38,689] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint 20000 is about to be saved!
[2023-09-18 16:03:38,692] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: gc_bc_save/20000/mp_rank_00_model_states.pt
[2023-09-18 16:03:38,692] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving gc_bc_save/20000/mp_rank_00_model_states.pt...
[2023-09-18 16:03:38,693] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 20000 is ready now!
[2023-09-18 16:03:39,034] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved gc_bc_save/20000/mp_rank_00_model_states.pt.
[2023-09-18 16:03:39,035] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 20000 is ready now!
[2023-09-18 16:32:50,919] [INFO] [logging.py:96:log_dist] [Rank 0] step=25000, skipped=0, lr=[0.000285001], mom=[(0.9, 0.999)]
[2023-09-18 16:32:50,922] [INFO] [timer.py:260:stop] epoch=0/micro_step=25000/global_step=25000, RunningAvgSamplesPerSec=890.4391378697214, CurrSamplesPerSec=3449.1745175134274, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 25000] loss: 8.493643455410004
[2023-09-18 17:02:56,545] [INFO] [logging.py:96:log_dist] [Rank 0] step=30000, skipped=0, lr=[0.000280001], mom=[(0.9, 0.999)]
[2023-09-18 17:02:56,546] [INFO] [timer.py:260:stop] epoch=0/micro_step=30000/global_step=30000, RunningAvgSamplesPerSec=879.6906404630806, CurrSamplesPerSec=3625.730044403924, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 30000] loss: 8.278657208442688
[Step 30000] evaluating...
{'log_probs': -8.284678819122645, 'mse': 3.7042186387037814, 'pi_actions': -0.010512826201073537} [Local Rank]: 0
{'log_probs': -8.284678819122645, 'mse': 3.7042186387037814, 'pi_actions': -0.010512826201073537} [Local Rank]: 1
[2023-09-18 17:06:46,145] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint 30000 is about to be saved!
[2023-09-18 17:06:46,149] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: gc_bc_save/30000/mp_rank_00_model_states.pt
[2023-09-18 17:06:46,149] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving gc_bc_save/30000/mp_rank_00_model_states.pt...
[2023-09-18 17:06:46,150] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 30000 is ready now!
[2023-09-18 17:06:46,510] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved gc_bc_save/30000/mp_rank_00_model_states.pt.
[2023-09-18 17:06:46,510] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 30000 is ready now!
[2023-09-18 17:35:47,304] [INFO] [logging.py:96:log_dist] [Rank 0] step=35000, skipped=0, lr=[0.000275001], mom=[(0.9, 0.999)]
[2023-09-18 17:35:47,305] [INFO] [timer.py:260:stop] epoch=0/micro_step=35000/global_step=35000, RunningAvgSamplesPerSec=979.9296330370739, CurrSamplesPerSec=3560.6359750496586, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 35000] loss: 8.297751696586609
[2023-09-18 18:05:07,770] [INFO] [logging.py:96:log_dist] [Rank 0] step=40000, skipped=0, lr=[0.00027000099999999996], mom=[(0.9, 0.999)]
[2023-09-18 18:05:07,771] [INFO] [timer.py:260:stop] epoch=0/micro_step=40000/global_step=40000, RunningAvgSamplesPerSec=1070.183104507772, CurrSamplesPerSec=2840.8348453700983, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 40000] loss: 8.196546172332763
[Step 40000] evaluating...
{'log_probs': -8.19576238292934, 'mse': 3.5263857204952815, 'pi_actions': 0.01146076119811706} [Local Rank]: 0
{'log_probs': -8.19576238292934, 'mse': 3.5263857204952815, 'pi_actions': 0.01146076119811706} [Local Rank]: 1
[2023-09-18 18:08:59,607] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint 40000 is about to be saved!
[2023-09-18 18:08:59,611] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: gc_bc_save/40000/mp_rank_00_model_states.pt
[2023-09-18 18:08:59,611] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving gc_bc_save/40000/mp_rank_00_model_states.pt...
[2023-09-18 18:08:59,611] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 40000 is ready now!
[2023-09-18 18:08:59,944] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved gc_bc_save/40000/mp_rank_00_model_states.pt.
[2023-09-18 18:08:59,944] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 40000 is ready now!
[2023-09-18 18:37:52,632] [INFO] [logging.py:96:log_dist] [Rank 0] step=45000, skipped=0, lr=[0.00026500099999999995], mom=[(0.9, 0.999)]
[2023-09-18 18:37:52,635] [INFO] [timer.py:260:stop] epoch=0/micro_step=45000/global_step=45000, RunningAvgSamplesPerSec=1152.0392496483662, CurrSamplesPerSec=3370.970733943226, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 45000] loss: 8.221417389774322
[2023-09-18 19:06:46,029] [INFO] [logging.py:96:log_dist] [Rank 0] step=50000, skipped=0, lr=[0.000260001], mom=[(0.9, 0.999)]
[2023-09-18 19:06:46,030] [INFO] [timer.py:260:stop] epoch=0/micro_step=50000/global_step=50000, RunningAvgSamplesPerSec=1146.1907064809222, CurrSamplesPerSec=3391.1670251303576, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 50000] loss: 8.098470838069916
[Step 50000] evaluating...
{'log_probs': -8.147740361385386, 'mse': 3.430341704661289, 'pi_actions': 0.008123599326412217} [Local Rank]: 0
{'log_probs': -8.147740361385386, 'mse': 3.430341704661289, 'pi_actions': 0.008123599326412217} [Local Rank]: 1
[2023-09-18 19:10:57,278] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint 50000 is about to be saved!
[2023-09-18 19:10:57,281] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: gc_bc_save/50000/mp_rank_00_model_states.pt
[2023-09-18 19:10:57,281] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving gc_bc_save/50000/mp_rank_00_model_states.pt...
[2023-09-18 19:10:57,281] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 50000 is ready now!
[2023-09-18 19:10:57,613] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved gc_bc_save/50000/mp_rank_00_model_states.pt.
[2023-09-18 19:10:57,614] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 50000 is ready now!
[2023-09-18 19:40:18,788] [INFO] [logging.py:96:log_dist] [Rank 0] step=55000, skipped=0, lr=[0.000255001], mom=[(0.9, 0.999)]
[2023-09-18 19:40:18,790] [INFO] [timer.py:260:stop] epoch=0/micro_step=55000/global_step=55000, RunningAvgSamplesPerSec=1152.1513556292994, CurrSamplesPerSec=3482.161223265392, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 55000] loss: 8.128771208763123
[2023-09-18 20:09:40,063] [INFO] [logging.py:96:log_dist] [Rank 0] step=60000, skipped=0, lr=[0.00025000099999999997], mom=[(0.9, 0.999)]
[2023-09-18 20:09:40,064] [INFO] [timer.py:260:stop] epoch=0/micro_step=60000/global_step=60000, RunningAvgSamplesPerSec=1219.4037156546792, CurrSamplesPerSec=3397.7666305923153, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 60000] loss: 8.044233550071716
[Step 60000] evaluating...
{'log_probs': -8.12783839418159, 'mse': 3.390537810464543, 'pi_actions': 0.02757127570123346} [Local Rank]: 0
{'log_probs': -8.12783839418159, 'mse': 3.390537810464543, 'pi_actions': 0.02757127570123346} [Local Rank]: 1
[2023-09-18 20:13:34,089] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint 60000 is about to be saved!
[2023-09-18 20:13:34,093] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: gc_bc_save/60000/mp_rank_00_model_states.pt
[2023-09-18 20:13:34,093] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving gc_bc_save/60000/mp_rank_00_model_states.pt...
[2023-09-18 20:13:34,094] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 60000 is ready now!
[2023-09-18 20:13:34,425] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved gc_bc_save/60000/mp_rank_00_model_states.pt.
[2023-09-18 20:13:34,425] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 60000 is ready now!
[2023-09-18 20:42:46,504] [INFO] [logging.py:96:log_dist] [Rank 0] step=65000, skipped=0, lr=[0.00024500099999999995], mom=[(0.9, 0.999)]
[2023-09-18 20:42:46,505] [INFO] [timer.py:260:stop] epoch=0/micro_step=65000/global_step=65000, RunningAvgSamplesPerSec=1282.7562153740132, CurrSamplesPerSec=3894.912992694375, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 65000] loss: 8.042824406337738
[2023-09-18 21:12:02,539] [INFO] [logging.py:96:log_dist] [Rank 0] step=70000, skipped=0, lr=[0.00024000099999999997], mom=[(0.9, 0.999)]
[2023-09-18 21:12:02,540] [INFO] [timer.py:260:stop] epoch=0/micro_step=70000/global_step=70000, RunningAvgSamplesPerSec=1343.2475226774636, CurrSamplesPerSec=3370.5263052158407, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 70000] loss: 8.01523163576126
[Step 70000] evaluating...
{'log_probs': -8.084249163397997, 'mse': 3.303359313872268, 'pi_actions': 0.021130069268499328} [Local Rank]: 0
{'log_probs': -8.084249163397997, 'mse': 3.303359313872268, 'pi_actions': 0.021130069268499328} [Local Rank]: 1
[2023-09-18 21:15:43,525] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint 70000 is about to be saved!
[2023-09-18 21:15:43,528] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: gc_bc_save/70000/mp_rank_00_model_states.pt
[2023-09-18 21:15:43,528] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving gc_bc_save/70000/mp_rank_00_model_states.pt...
[2023-09-18 21:15:43,528] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 70000 is ready now!
[2023-09-18 21:15:43,866] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved gc_bc_save/70000/mp_rank_00_model_states.pt.
[2023-09-18 21:15:43,867] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 70000 is ready now!
[2023-09-18 21:45:06,412] [INFO] [logging.py:96:log_dist] [Rank 0] step=75000, skipped=0, lr=[0.00023500099999999995], mom=[(0.9, 0.999)]
[2023-09-18 21:45:06,414] [INFO] [timer.py:260:stop] epoch=0/micro_step=75000/global_step=75000, RunningAvgSamplesPerSec=1296.117667728439, CurrSamplesPerSec=3689.225914625766, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 75000] loss: 7.9314358229637145
[2023-09-18 22:14:15,648] [INFO] [logging.py:96:log_dist] [Rank 0] step=80000, skipped=0, lr=[0.00023000099999999994], mom=[(0.9, 0.999)]
[2023-09-18 22:14:15,649] [INFO] [timer.py:260:stop] epoch=0/micro_step=80000/global_step=80000, RunningAvgSamplesPerSec=1242.6942810985956, CurrSamplesPerSec=3975.717204480237, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 80000] loss: 7.97638144788742
[Step 80000] evaluating...
{'log_probs': -8.076254759434764, 'mse': 3.2873704888091133, 'pi_actions': -0.001978929603174748} [Local Rank]: 0
{'log_probs': -8.076254759434764, 'mse': 3.2873704888091133, 'pi_actions': -0.001978929603174748} [Local Rank]: 1
[2023-09-18 22:17:49,084] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint 80000 is about to be saved!
[2023-09-18 22:17:49,087] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: gc_bc_save/80000/mp_rank_00_model_states.pt
[2023-09-18 22:17:49,087] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving gc_bc_save/80000/mp_rank_00_model_states.pt...
[2023-09-18 22:17:49,088] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 80000 is ready now!
[2023-09-18 22:17:49,421] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved gc_bc_save/80000/mp_rank_00_model_states.pt.
[2023-09-18 22:17:49,422] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 80000 is ready now!
[2023-09-18 22:47:01,799] [INFO] [logging.py:96:log_dist] [Rank 0] step=85000, skipped=0, lr=[0.00022500099999999998], mom=[(0.9, 0.999)]
[2023-09-18 22:47:01,800] [INFO] [timer.py:260:stop] epoch=0/micro_step=85000/global_step=85000, RunningAvgSamplesPerSec=1200.8224543178499, CurrSamplesPerSec=3501.716462350758, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 85000] loss: 7.854315842437744
[2023-09-18 23:16:46,994] [INFO] [logging.py:96:log_dist] [Rank 0] step=90000, skipped=0, lr=[0.00022000099999999997], mom=[(0.9, 0.999)]
[2023-09-18 23:16:46,996] [INFO] [timer.py:260:stop] epoch=0/micro_step=90000/global_step=90000, RunningAvgSamplesPerSec=1188.9308473656113, CurrSamplesPerSec=3148.295213382006, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 90000] loss: 7.915860023212433
[Step 90000] evaluating...
{'log_probs': -8.066618900133575, 'mse': 3.268098768806657, 'pi_actions': -0.011847509348681617} [Local Rank]: 0
{'log_probs': -8.066618900133575, 'mse': 3.268098768806657, 'pi_actions': -0.011847509348681617} [Local Rank]: 1
[2023-09-18 23:20:52,792] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint 90000 is about to be saved!
[2023-09-18 23:20:52,796] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: gc_bc_save/90000/mp_rank_00_model_states.pt
[2023-09-18 23:20:52,797] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving gc_bc_save/90000/mp_rank_00_model_states.pt...
[2023-09-18 23:20:52,797] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 90000 is ready now!
[2023-09-18 23:20:53,134] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved gc_bc_save/90000/mp_rank_00_model_states.pt.
[2023-09-18 23:20:53,135] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 90000 is ready now!
[2023-09-18 23:50:28,149] [INFO] [logging.py:96:log_dist] [Rank 0] step=95000, skipped=0, lr=[0.00021500099999999996], mom=[(0.9, 0.999)]
[2023-09-18 23:50:28,150] [INFO] [timer.py:260:stop] epoch=0/micro_step=95000/global_step=95000, RunningAvgSamplesPerSec=1226.9545657312062, CurrSamplesPerSec=3402.4609573544417, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 95000] loss: 7.815212815761567
[2023-09-19 00:23:06,498] [INFO] [logging.py:96:log_dist] [Rank 0] step=100000, skipped=0, lr=[0.00021000099999999997], mom=[(0.9, 0.999)]
[2023-09-19 00:23:06,501] [INFO] [timer.py:260:stop] epoch=0/micro_step=100000/global_step=100000, RunningAvgSamplesPerSec=1266.075278158499, CurrSamplesPerSec=2628.80785012682, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 100000] loss: 7.887723042201996
[Step 100000] evaluating...
{'log_probs': -8.055112656078215, 'mse': 3.2450862897287402, 'pi_actions': 0.004071119671167497} [Local Rank]: 0
{'log_probs': -8.055112656078215, 'mse': 3.2450862897287402, 'pi_actions': 0.004071119671167497} [Local Rank]: 1
[2023-09-19 00:27:00,050] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint 100000 is about to be saved!
[2023-09-19 00:27:00,053] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: gc_bc_save/100000/mp_rank_00_model_states.pt
[2023-09-19 00:27:00,053] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving gc_bc_save/100000/mp_rank_00_model_states.pt...
[2023-09-19 00:27:00,054] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 100000 is ready now!
[2023-09-19 00:27:00,399] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved gc_bc_save/100000/mp_rank_00_model_states.pt.
[2023-09-19 00:27:00,400] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 100000 is ready now!
[2023-09-19 01:09:49,309] [INFO] [logging.py:96:log_dist] [Rank 0] step=105000, skipped=0, lr=[0.00020500099999999996], mom=[(0.9, 0.999)]
[2023-09-19 01:09:49,318] [INFO] [timer.py:260:stop] epoch=0/micro_step=105000/global_step=105000, RunningAvgSamplesPerSec=1290.1539682708838, CurrSamplesPerSec=1593.6798871985159, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 105000] loss: 7.831401924705506
[2023-09-19 01:53:41,415] [INFO] [logging.py:96:log_dist] [Rank 0] step=110000, skipped=0, lr=[0.00020000099999999997], mom=[(0.9, 0.999)]
[2023-09-19 01:53:41,424] [INFO] [timer.py:260:stop] epoch=0/micro_step=110000/global_step=110000, RunningAvgSamplesPerSec=1300.794954089489, CurrSamplesPerSec=2036.429027418746, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 110000] loss: 7.767519180393219
[Step 110000] evaluating...
{'log_probs': -8.035453913682453, 'mse': 3.205768836543762, 'pi_actions': 0.011725295149793681} [Local Rank]: 0
{'log_probs': -8.035453913682453, 'mse': 3.205768836543762, 'pi_actions': 0.011725295149793681} [Local Rank]: 1
[2023-09-19 01:58:04,395] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint 110000 is about to be saved!
[2023-09-19 01:58:04,408] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: gc_bc_save/110000/mp_rank_00_model_states.pt
[2023-09-19 01:58:04,408] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving gc_bc_save/110000/mp_rank_00_model_states.pt...
[2023-09-19 01:58:04,411] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 110000 is ready now!
[2023-09-19 01:58:04,950] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved gc_bc_save/110000/mp_rank_00_model_states.pt.
[2023-09-19 01:58:04,951] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 110000 is ready now!
[2023-09-19 02:38:37,885] [INFO] [logging.py:96:log_dist] [Rank 0] step=115000, skipped=0, lr=[0.00019500099999999996], mom=[(0.9, 0.999)]
[2023-09-19 02:38:37,887] [INFO] [timer.py:260:stop] epoch=0/micro_step=115000/global_step=115000, RunningAvgSamplesPerSec=1255.4634653714234, CurrSamplesPerSec=3625.852479443497, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 115000] loss: 7.786235194587707
[2023-09-19 03:20:28,569] [INFO] [logging.py:96:log_dist] [Rank 0] step=120000, skipped=0, lr=[0.00019000099999999997], mom=[(0.9, 0.999)]
[2023-09-19 03:20:28,571] [INFO] [timer.py:260:stop] epoch=0/micro_step=120000/global_step=120000, RunningAvgSamplesPerSec=1220.9090521382611, CurrSamplesPerSec=490.96607495754455, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 120000] loss: 7.707000985431671
[Step 120000] evaluating...
{'log_probs': -8.025492899847134, 'mse': 3.185846787024745, 'pi_actions': 0.009835024074176043} [Local Rank]: 0
{'log_probs': -8.025492899847134, 'mse': 3.185846787024745, 'pi_actions': 0.009835024074176043} [Local Rank]: 1
[2023-09-19 03:24:55,203] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint 120000 is about to be saved!
[2023-09-19 03:24:55,216] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: gc_bc_save/120000/mp_rank_00_model_states.pt
[2023-09-19 03:24:55,216] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving gc_bc_save/120000/mp_rank_00_model_states.pt...
[2023-09-19 03:24:55,218] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 120000 is ready now!
[2023-09-19 03:24:55,627] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved gc_bc_save/120000/mp_rank_00_model_states.pt.
[2023-09-19 03:24:55,627] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 120000 is ready now!
[2023-09-19 04:08:06,464] [INFO] [logging.py:96:log_dist] [Rank 0] step=125000, skipped=0, lr=[0.00018500099999999996], mom=[(0.9, 0.999)]
[2023-09-19 04:08:06,466] [INFO] [timer.py:260:stop] epoch=0/micro_step=125000/global_step=125000, RunningAvgSamplesPerSec=1218.0200970078688, CurrSamplesPerSec=126.63858293914379, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 125000] loss: 7.782660653877258
[2023-09-19 04:37:31,428] [INFO] [logging.py:96:log_dist] [Rank 0] step=130000, skipped=0, lr=[0.00018000099999999995], mom=[(0.9, 0.999)]
[2023-09-19 04:37:31,429] [INFO] [timer.py:260:stop] epoch=0/micro_step=130000/global_step=130000, RunningAvgSamplesPerSec=1190.3928469639832, CurrSamplesPerSec=121.22356222397792, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 130000] loss: 7.684497141933441
[Step 130000] evaluating...
{'log_probs': -8.0464293157201, 'mse': 3.2277196193888065, 'pi_actions': 0.011998693914251728} [Local Rank]: 0
{'log_probs': -8.0464293157201, 'mse': 3.2277196193888065, 'pi_actions': 0.011998693914251728} [Local Rank]: 1
[2023-09-19 05:10:37,541] [INFO] [logging.py:96:log_dist] [Rank 0] step=135000, skipped=0, lr=[0.000175001], mom=[(0.9, 0.999)]
[2023-09-19 05:10:37,542] [INFO] [timer.py:260:stop] epoch=0/micro_step=135000/global_step=135000, RunningAvgSamplesPerSec=1166.681474239383, CurrSamplesPerSec=120.1661583557496, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 135000] loss: 7.69478170785904
[2023-09-19 05:40:23,358] [INFO] [logging.py:96:log_dist] [Rank 0] step=140000, skipped=0, lr=[0.00017000099999999997], mom=[(0.9, 0.999)]
[2023-09-19 05:40:23,359] [INFO] [timer.py:260:stop] epoch=0/micro_step=140000/global_step=140000, RunningAvgSamplesPerSec=1143.6692896668085, CurrSamplesPerSec=119.4597722620474, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 140000] loss: 7.70408306684494
[Step 140000] evaluating...
{'log_probs': -8.035924074913572, 'mse': 3.2067091370730925, 'pi_actions': 0.0015470712540374258} [Local Rank]: 1
{'log_probs': -8.035924074913572, 'mse': 3.2067091370730925, 'pi_actions': 0.0015470712540374258} [Local Rank]: 0
[2023-09-19 06:13:05,346] [INFO] [logging.py:96:log_dist] [Rank 0] step=145000, skipped=0, lr=[0.00016500099999999996], mom=[(0.9, 0.999)]
[2023-09-19 06:13:05,348] [INFO] [timer.py:260:stop] epoch=0/micro_step=145000/global_step=145000, RunningAvgSamplesPerSec=1128.600013002278, CurrSamplesPerSec=3507.206605847403, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 145000] loss: 7.662439316177368
[2023-09-19 06:42:09,864] [INFO] [logging.py:96:log_dist] [Rank 0] step=150000, skipped=0, lr=[0.00016000099999999997], mom=[(0.9, 0.999)]
[2023-09-19 06:42:09,865] [INFO] [timer.py:260:stop] epoch=0/micro_step=150000/global_step=150000, RunningAvgSamplesPerSec=1110.7890168003066, CurrSamplesPerSec=3556.9794314752426, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 150000] loss: 7.685475374794007
[Step 150000] evaluating...
{'log_probs': -8.020658979395206, 'mse': 3.176178941946947, 'pi_actions': 0.016080594299464773} [Local Rank]: 1
{'log_probs': -8.020658979395206, 'mse': 3.176178941946947, 'pi_actions': 0.016080594299464773} [Local Rank]: 0
[2023-09-19 06:45:51,259] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint 150000 is about to be saved!
[2023-09-19 06:45:51,265] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: gc_bc_save/150000/mp_rank_00_model_states.pt
[2023-09-19 06:45:51,265] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving gc_bc_save/150000/mp_rank_00_model_states.pt...
[2023-09-19 06:45:51,266] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 150000 is ready now!
[2023-09-19 06:45:51,603] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved gc_bc_save/150000/mp_rank_00_model_states.pt.
[2023-09-19 06:45:51,603] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 150000 is ready now!
[2023-09-19 07:15:07,238] [INFO] [logging.py:96:log_dist] [Rank 0] step=155000, skipped=0, lr=[0.00015500099999999996], mom=[(0.9, 0.999)]
[2023-09-19 07:15:07,239] [INFO] [timer.py:260:stop] epoch=0/micro_step=155000/global_step=155000, RunningAvgSamplesPerSec=1094.5213260560727, CurrSamplesPerSec=3131.92185230342, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 155000] loss: 7.599652797317505
[2023-09-19 07:44:45,316] [INFO] [logging.py:96:log_dist] [Rank 0] step=160000, skipped=0, lr=[0.00015000099999999998], mom=[(0.9, 0.999)]
[2023-09-19 07:44:45,317] [INFO] [timer.py:260:stop] epoch=0/micro_step=160000/global_step=160000, RunningAvgSamplesPerSec=1081.543739351686, CurrSamplesPerSec=3670.0339200875005, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 160000] loss: 7.62955633468628
[Step 160000] evaluating...
{'log_probs': -8.035175667719313, 'mse': 3.205212329457414, 'pi_actions': 0.0016742034144919707} [Local Rank]: 1
{'log_probs': -8.035175667719313, 'mse': 3.205212329457414, 'pi_actions': 0.0016742034144919707} [Local Rank]: 0
[2023-09-19 08:16:46,194] [INFO] [logging.py:96:log_dist] [Rank 0] step=165000, skipped=0, lr=[0.00014500099999999996], mom=[(0.9, 0.999)]
[2023-09-19 08:16:46,195] [INFO] [timer.py:260:stop] epoch=0/micro_step=165000/global_step=165000, RunningAvgSamplesPerSec=1070.8975784514266, CurrSamplesPerSec=3619.6310189992046, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 165000] loss: 7.580167293548584
[2023-09-19 08:45:09,606] [INFO] [logging.py:96:log_dist] [Rank 0] step=170000, skipped=0, lr=[0.00014000099999999998], mom=[(0.9, 0.999)]
[2023-09-19 08:45:09,607] [INFO] [timer.py:260:stop] epoch=0/micro_step=170000/global_step=170000, RunningAvgSamplesPerSec=1059.5137859774009, CurrSamplesPerSec=3575.7781818424014, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 170000] loss: 7.563489454650879
[Step 170000] evaluating...
{'log_probs': -8.018541605819074, 'mse': 3.171944185575717, 'pi_actions': -0.0031861377639318222} [Local Rank]: 1
{'log_probs': -8.018541605819074, 'mse': 3.171944185575717, 'pi_actions': -0.0031861377639318222} [Local Rank]: 0
[2023-09-19 08:48:50,708] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint 170000 is about to be saved!
[2023-09-19 08:48:50,714] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: gc_bc_save/170000/mp_rank_00_model_states.pt
[2023-09-19 08:48:50,714] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving gc_bc_save/170000/mp_rank_00_model_states.pt...
[2023-09-19 08:48:50,714] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 170000 is ready now!
[2023-09-19 08:48:51,049] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved gc_bc_save/170000/mp_rank_00_model_states.pt.
[2023-09-19 08:48:51,050] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 170000 is ready now!
[2023-09-19 09:17:24,546] [INFO] [logging.py:96:log_dist] [Rank 0] step=175000, skipped=0, lr=[0.00013500099999999996], mom=[(0.9, 0.999)]
[2023-09-19 09:17:24,548] [INFO] [timer.py:260:stop] epoch=0/micro_step=175000/global_step=175000, RunningAvgSamplesPerSec=1048.9830212210568, CurrSamplesPerSec=3737.6532893339877, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 175000] loss: 7.588256110572815
[2023-09-19 09:46:40,954] [INFO] [logging.py:96:log_dist] [Rank 0] step=180000, skipped=0, lr=[0.00013000099999999998], mom=[(0.9, 0.999)]
[2023-09-19 09:46:40,955] [INFO] [timer.py:260:stop] epoch=0/micro_step=180000/global_step=180000, RunningAvgSamplesPerSec=1059.7379436958245, CurrSamplesPerSec=2415.11728506136, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 180000] loss: 7.565172501659394
[Step 180000] evaluating...
{'log_probs': -8.028141353549254, 'mse': 3.1911436979607233, 'pi_actions': -0.0013295874994503683} [Local Rank]: 1
{'log_probs': -8.028141353549254, 'mse': 3.1911436979607233, 'pi_actions': -0.0013295874994503683} [Local Rank]: 0
[2023-09-19 10:19:05,651] [INFO] [logging.py:96:log_dist] [Rank 0] step=185000, skipped=0, lr=[0.000125001], mom=[(0.9, 0.999)]
[2023-09-19 10:19:05,652] [INFO] [timer.py:260:stop] epoch=0/micro_step=185000/global_step=185000, RunningAvgSamplesPerSec=1049.93600888072, CurrSamplesPerSec=3456.5917150620016, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 185000] loss: 7.561333601951599
[2023-09-19 10:47:35,991] [INFO] [logging.py:96:log_dist] [Rank 0] step=190000, skipped=0, lr=[0.00012000099999999998], mom=[(0.9, 0.999)]
[2023-09-19 10:47:35,992] [INFO] [timer.py:260:stop] epoch=0/micro_step=190000/global_step=190000, RunningAvgSamplesPerSec=1040.5077523128878, CurrSamplesPerSec=3736.7297632139425, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 190000] loss: 7.508249244403839
[Step 190000] evaluating...
{'log_probs': -8.003352032411124, 'mse': 3.1415650523325565, 'pi_actions': 0.0012040147317874314} [Local Rank]: 1
{'log_probs': -8.003352032411124, 'mse': 3.1415650523325565, 'pi_actions': 0.0012040147317874314} [Local Rank]: 0
[2023-09-19 10:51:23,071] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint 190000 is about to be saved!
[2023-09-19 10:51:23,078] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: gc_bc_save/190000/mp_rank_00_model_states.pt
[2023-09-19 10:51:23,078] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving gc_bc_save/190000/mp_rank_00_model_states.pt...
[2023-09-19 10:51:23,078] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 190000 is ready now!
[2023-09-19 10:51:23,452] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved gc_bc_save/190000/mp_rank_00_model_states.pt.
[2023-09-19 10:51:23,453] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint 190000 is ready now!
[2023-09-19 11:20:27,360] [INFO] [logging.py:96:log_dist] [Rank 0] step=195000, skipped=0, lr=[0.00011500099999999998], mom=[(0.9, 0.999)]
[2023-09-19 11:20:27,362] [INFO] [timer.py:260:stop] epoch=0/micro_step=195000/global_step=195000, RunningAvgSamplesPerSec=1032.3220257332948, CurrSamplesPerSec=461.7769717435498, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 195000] loss: 7.5537709192276
[2023-09-19 11:49:54,923] [INFO] [logging.py:96:log_dist] [Rank 0] step=200000, skipped=0, lr=[0.00011000099999999999], mom=[(0.9, 0.999)]
[2023-09-19 11:49:54,924] [INFO] [timer.py:260:stop] epoch=0/micro_step=200000/global_step=200000, RunningAvgSamplesPerSec=1022.711106226442, CurrSamplesPerSec=113.33398605762339, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 200000] loss: 7.485577699184418
[Step 200000] evaluating...
{'log_probs': -8.009139987755235, 'mse': 3.1531410000630595, 'pi_actions': 0.007152957130714553} [Local Rank]: 1
{'log_probs': -8.009139987755235, 'mse': 3.1531410000630595, 'pi_actions': 0.007152957130714553} [Local Rank]: 0
[2023-09-19 12:22:17,025] [INFO] [logging.py:96:log_dist] [Rank 0] step=205000, skipped=0, lr=[0.00010500099999999998], mom=[(0.9, 0.999)]
[2023-09-19 12:22:17,028] [INFO] [timer.py:260:stop] epoch=0/micro_step=205000/global_step=205000, RunningAvgSamplesPerSec=1015.2062562258378, CurrSamplesPerSec=121.81297937813929, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 205000] loss: 7.489214339447021
[2023-09-19 12:50:53,223] [INFO] [logging.py:96:log_dist] [Rank 0] step=210000, skipped=0, lr=[0.00010000099999999998], mom=[(0.9, 0.999)]
[2023-09-19 12:50:53,224] [INFO] [timer.py:260:stop] epoch=0/micro_step=210000/global_step=210000, RunningAvgSamplesPerSec=1007.9050441247427, CurrSamplesPerSec=118.72810344080257, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 210000] loss: 7.488288639068603
[Step 210000] evaluating...
{'log_probs': -8.014695930274087, 'mse': 3.1642528367473863, 'pi_actions': 0.005336214542321473} [Local Rank]: 1
{'log_probs': -8.014695930274087, 'mse': 3.1642528367473863, 'pi_actions': 0.005336214542321473} [Local Rank]: 0
[2023-09-19 13:23:49,010] [INFO] [logging.py:96:log_dist] [Rank 0] step=215000, skipped=0, lr=[9.500099999999998e-05], mom=[(0.9, 0.999)]
[2023-09-19 13:23:49,011] [INFO] [timer.py:260:stop] epoch=0/micro_step=215000/global_step=215000, RunningAvgSamplesPerSec=1005.4321218502996, CurrSamplesPerSec=181.75948994894105, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 215000] loss: 7.497615436077118
[2023-09-19 13:52:49,676] [INFO] [logging.py:96:log_dist] [Rank 0] step=220000, skipped=0, lr=[9.000099999999998e-05], mom=[(0.9, 0.999)]
[2023-09-19 13:52:49,681] [INFO] [timer.py:260:stop] epoch=0/micro_step=220000/global_step=220000, RunningAvgSamplesPerSec=998.961101887498, CurrSamplesPerSec=120.37971824821315, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 220000] loss: 7.473186588191986
[Step 220000] evaluating...
{'log_probs': -8.013586050310776, 'mse': 3.1620330669264463, 'pi_actions': 0.0030152955375352436} [Local Rank]: 1
{'log_probs': -8.013586050310776, 'mse': 3.1620330669264463, 'pi_actions': 0.0030152955375352436} [Local Rank]: 0
[2023-09-19 14:31:52,344] [INFO] [logging.py:96:log_dist] [Rank 0] step=225000, skipped=0, lr=[8.5001e-05], mom=[(0.9, 0.999)]
[2023-09-19 14:31:52,364] [INFO] [timer.py:260:stop] epoch=0/micro_step=225000/global_step=225000, RunningAvgSamplesPerSec=1001.3929453262509, CurrSamplesPerSec=2181.3468769172637, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 225000] loss: 7.4388184783935545
[2023-09-19 15:10:29,819] [INFO] [logging.py:96:log_dist] [Rank 0] step=230000, skipped=0, lr=[8.000099999999998e-05], mom=[(0.9, 0.999)]
[2023-09-19 15:10:29,822] [INFO] [timer.py:260:stop] epoch=0/micro_step=230000/global_step=230000, RunningAvgSamplesPerSec=1008.2000332929935, CurrSamplesPerSec=3776.4328537212436, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 230000] loss: 7.495466359043121
[Step 230000] evaluating...
{'log_probs': -8.02150839311182, 'mse': 3.1778777700034544, 'pi_actions': 0.0024866384097912606} [Local Rank]: 0
{'log_probs': -8.02150839311182, 'mse': 3.1778777700034544, 'pi_actions': 0.0024866384097912606} [Local Rank]: 1
[2023-09-19 15:44:32,374] [INFO] [logging.py:96:log_dist] [Rank 0] step=235000, skipped=0, lr=[7.5001e-05], mom=[(0.9, 0.999)]
[2023-09-19 15:44:32,384] [INFO] [timer.py:260:stop] epoch=0/micro_step=235000/global_step=235000, RunningAvgSamplesPerSec=1010.6517369559525, CurrSamplesPerSec=1985.7263773047546, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 235000] loss: 7.411904956245422
[2023-09-19 16:26:39,164] [INFO] [logging.py:96:log_dist] [Rank 0] step=240000, skipped=0, lr=[7.0001e-05], mom=[(0.9, 0.999)]
[2023-09-19 16:26:39,166] [INFO] [timer.py:260:stop] epoch=0/micro_step=240000/global_step=240000, RunningAvgSamplesPerSec=994.0006632962738, CurrSamplesPerSec=2122.636070162676, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 240000] loss: 7.40043066740036
[Step 240000] evaluating...
{'log_probs': -8.01685020215082, 'mse': 3.1685613725017676, 'pi_actions': 0.003223420586395162} [Local Rank]: 0
{'log_probs': -8.01685020215082, 'mse': 3.1685613725017676, 'pi_actions': 0.003223420586395162} [Local Rank]: 1
[2023-09-19 17:13:08,977] [INFO] [logging.py:96:log_dist] [Rank 0] step=245000, skipped=0, lr=[6.5001e-05], mom=[(0.9, 0.999)]
[2023-09-19 17:13:08,978] [INFO] [timer.py:260:stop] epoch=0/micro_step=245000/global_step=245000, RunningAvgSamplesPerSec=982.5735747427854, CurrSamplesPerSec=412.09494893437864, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 245000] loss: 7.415575291442871
[2023-09-19 17:54:53,950] [INFO] [logging.py:96:log_dist] [Rank 0] step=250000, skipped=0, lr=[6.000099999999998e-05], mom=[(0.9, 0.999)]
[2023-09-19 17:54:53,951] [INFO] [timer.py:260:stop] epoch=0/micro_step=250000/global_step=250000, RunningAvgSamplesPerSec=977.4321220538416, CurrSamplesPerSec=1797.8397696068582, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 250000] loss: 7.427030375099182
[Step 250000] evaluating...
{'log_probs': -8.005931204880655, 'mse': 3.1467234177428405, 'pi_actions': 0.008540935257776201} [Local Rank]: 0
{'log_probs': -8.005931204880655, 'mse': 3.1467234177428405, 'pi_actions': 0.008540935257776201} [Local Rank]: 1
[2023-09-19 18:40:16,356] [INFO] [logging.py:96:log_dist] [Rank 0] step=255000, skipped=0, lr=[5.500099999999999e-05], mom=[(0.9, 0.999)]
[2023-09-19 18:40:16,363] [INFO] [timer.py:260:stop] epoch=0/micro_step=255000/global_step=255000, RunningAvgSamplesPerSec=965.0354604131526, CurrSamplesPerSec=70.66143383094497, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 255000] loss: 7.403717599105835
[2023-09-19 19:22:55,875] [INFO] [logging.py:96:log_dist] [Rank 0] step=260000, skipped=0, lr=[5.000099999999999e-05], mom=[(0.9, 0.999)]
[2023-09-19 19:22:55,894] [INFO] [timer.py:260:stop] epoch=0/micro_step=260000/global_step=260000, RunningAvgSamplesPerSec=951.3830689844654, CurrSamplesPerSec=1047.2097185703028, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 260000] loss: 7.364878678131103
[Step 260000] evaluating...
{'log_probs': -8.017824186213364, 'mse': 3.170509365812165, 'pi_actions': 0.0052874688371468475} [Local Rank]: 1
{'log_probs': -8.017824186213364, 'mse': 3.170509365812165, 'pi_actions': 0.0052874688371468475} [Local Rank]: 0
[2023-09-19 20:09:45,556] [INFO] [logging.py:96:log_dist] [Rank 0] step=265000, skipped=0, lr=[4.500099999999999e-05], mom=[(0.9, 0.999)]
[2023-09-19 20:09:45,558] [INFO] [timer.py:260:stop] epoch=0/micro_step=265000/global_step=265000, RunningAvgSamplesPerSec=940.2788659118129, CurrSamplesPerSec=3052.050322617322, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 265000] loss: 7.396695149326325
[2023-09-19 20:38:57,352] [INFO] [logging.py:96:log_dist] [Rank 0] step=270000, skipped=0, lr=[4.0001e-05], mom=[(0.9, 0.999)]
[2023-09-19 20:38:57,354] [INFO] [timer.py:260:stop] epoch=0/micro_step=270000/global_step=270000, RunningAvgSamplesPerSec=946.7470330181029, CurrSamplesPerSec=3489.9252898407053, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 270000] loss: 7.366204547595978
[Step 270000] evaluating...
{'log_probs': -8.013375094655796, 'mse': 3.1616111749948734, 'pi_actions': 0.007725271128825179} [Local Rank]: 1
{'log_probs': -8.013375094655796, 'mse': 3.1616111749948734, 'pi_actions': 0.007725271128825179} [Local Rank]: 0
[2023-09-19 21:11:46,978] [INFO] [logging.py:96:log_dist] [Rank 0] step=275000, skipped=0, lr=[3.500099999999999e-05], mom=[(0.9, 0.999)]
[2023-09-19 21:11:46,980] [INFO] [timer.py:260:stop] epoch=0/micro_step=275000/global_step=275000, RunningAvgSamplesPerSec=953.2006410328388, CurrSamplesPerSec=3529.58405322604, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 275000] loss: 7.3379805869102475
[2023-09-19 21:41:17,070] [INFO] [logging.py:96:log_dist] [Rank 0] step=280000, skipped=0, lr=[3.0001e-05], mom=[(0.9, 0.999)]
[2023-09-19 21:41:17,071] [INFO] [timer.py:260:stop] epoch=0/micro_step=280000/global_step=280000, RunningAvgSamplesPerSec=965.0968736051406, CurrSamplesPerSec=3482.9631897860413, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 280000] loss: 7.3530766962051395
[Step 280000] evaluating...
{'log_probs': -8.024061072881207, 'mse': 3.182983145525982, 'pi_actions': 0.004444020223077043} [Local Rank]: 1
{'log_probs': -8.024061072881207, 'mse': 3.182983145525982, 'pi_actions': 0.004444020223077043} [Local Rank]: 0
[2023-09-19 22:14:50,533] [INFO] [logging.py:96:log_dist] [Rank 0] step=285000, skipped=0, lr=[2.5001e-05], mom=[(0.9, 0.999)]
[2023-09-19 22:14:50,536] [INFO] [timer.py:260:stop] epoch=0/micro_step=285000/global_step=285000, RunningAvgSamplesPerSec=975.3299263339744, CurrSamplesPerSec=2961.6787543615274, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 285000] loss: 7.356830669307708
[2023-09-19 22:44:23,649] [INFO] [logging.py:96:log_dist] [Rank 0] step=290000, skipped=0, lr=[2.0001e-05], mom=[(0.9, 0.999)]
[2023-09-19 22:44:23,651] [INFO] [timer.py:260:stop] epoch=0/micro_step=290000/global_step=290000, RunningAvgSamplesPerSec=977.061332677786, CurrSamplesPerSec=3057.6735703932363, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 290000] loss: 7.343284949207306
[Step 290000] evaluating...
{'log_probs': -8.018855612914113, 'mse': 3.172572241673608, 'pi_actions': 0.0049114941149492825} [Local Rank]: 1
{'log_probs': -8.018855612914113, 'mse': 3.172572241673608, 'pi_actions': 0.0049114941149492825} [Local Rank]: 0
[2023-09-19 23:17:51,255] [INFO] [logging.py:96:log_dist] [Rank 0] step=295000, skipped=0, lr=[1.5001000000000001e-05], mom=[(0.9, 0.999)]
[2023-09-19 23:17:51,256] [INFO] [timer.py:260:stop] epoch=0/micro_step=295000/global_step=295000, RunningAvgSamplesPerSec=988.7937682620928, CurrSamplesPerSec=3608.8765561560595, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 295000] loss: 7.308890386676788
[2023-09-19 23:48:02,700] [INFO] [logging.py:96:log_dist] [Rank 0] step=300000, skipped=0, lr=[1.0001000000000001e-05], mom=[(0.9, 0.999)]
[2023-09-19 23:48:02,702] [INFO] [timer.py:260:stop] epoch=0/micro_step=300000/global_step=300000, RunningAvgSamplesPerSec=999.8572250766173, CurrSamplesPerSec=3394.232918066782, MemAllocated=0.27GB, MaxMemAllocated=2.15GB
[Step 300000] loss: 7.345600463199616
[Step 300000] evaluating...
{'log_probs': -8.014521787688944, 'mse': 3.1639045559057712, 'pi_actions': 0.002820882243218548} [Local Rank]: 1
{'log_probs': -8.013018625677278, 'mse': 3.160898217934492, 'pi_actions': 0.0030043455672759804} [Local Rank]: 0
Achieve best log_probs -8.003352032411124 on evaluation set at step 190000.
Achieve best log_probs -8.003352032411124 on evaluation set at step 190000.
[2023-09-19 23:52:01,649] [INFO] [launch.py:347:main] Process 449190 exits successfully.
[2023-09-19 23:52:02,657] [INFO] [launch.py:347:main] Process 449189 exits successfully.