Distributed DLRM

The implementation is based on the DLRM dist_exp branch, with Facebook features and optimizations added.

Currently you need to download the following PR to get the latest updates. (This will be fixed soon.)

Usage

Currently, the benchmark is launched with mpirun across multiple nodes. A hostfile needs to be created, or a host list given. The DLRM parameters are passed in the same way as on the single-node master branch.

mpirun -np 128 -hostfile hostfile python dlrm_s_pytorch.py ...
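mpirun reads the target machines from the hostfile. A minimal sketch, assuming four nodes with 32 MPI slots each (hostnames and slot counts are placeholders; adjust to your cluster):

node001 slots=32
node002 slots=32
node003 slots=32
node004 slots=32

With 4 nodes x 32 slots, this matches -np 128 in the command above.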

Example

large_arch_emb=$(printf '14000%.0s' {1..64})
large_arch_emb=${large_arch_emb//"01"/"0-1"}
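The first line repeats the string 14000 sixty-four times; the substitution then inserts a dash at every 0/1 boundary, producing 64 dash-separated embedding table sizes (14000-14000-...-14000). To sanity-check the table count (a quick sketch, not part of the benchmark):

echo "$large_arch_emb" | tr '-' '\n' | wc -l   # should print 64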

python dlrm_s_pytorch.py \
   --arch-sparse-feature-size=128 \
   --arch-mlp-bot="2000-1024-1024-128" \
   --arch-mlp-top="4096-4096-4096-1" \
   --arch-embedding-size=$large_arch_emb \
   --data-generation=random \
   --loss-function=bce \
   --round-targets=True \
   --learning-rate=0.1 \
   --mini-batch-size=2048 \
   --print-freq=10240 \
   --print-time \
   --test-mini-batch-size=16384 \
   --test-num-workers=16 \
   --num-indices-per-lookup-fixed=1 \
   --num-indices-per-lookup=100 \
   --arch-projection-size 30 \
   --use-gpu
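To run this example distributed, prefix it with the mpirun launcher from Usage, where the process count and hostfile are placeholders for your cluster and ... stands for the flags listed above:

mpirun -np 128 -hostfile hostfile python dlrm_s_pytorch.py ...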

Please check the README.md in the PR for more details.