Make Filament2's FBLearner flow launch partition servers if requested
Summary: Partition servers can either be spawned as subprocesses by each of the trainers (when `numPartitionServers == -1`) or run as separate processes that must be launched independently. The second case wasn't supported on FBLearner; now it is.

Reviewed By: lw

Differential Revision: D18547595

fbshipit-source-id: a5b899c1c28e223a0464d5eaf3c61600555db6f2
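
The two launch modes described in the summary can be sketched roughly as follows. This is only an illustration, not the code from this commit: `LaunchPlan` and `plan_launch` are hypothetical names, and treating any non-negative `num_partition_servers` as "the launcher starts that many standalone servers" is an assumption drawn from the summary's wording.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LaunchPlan:
    """Hypothetical description of which processes a launcher must start."""

    trainer_ranks: List[int]
    # Partition servers the launcher itself has to start as standalone processes.
    standalone_server_ranks: List[int]
    # True when each trainer is expected to spawn its own server subprocess.
    trainers_spawn_servers: bool


def plan_launch(num_machines: int, num_partition_servers: int) -> LaunchPlan:
    """Decide who is responsible for starting partition servers.

    Assumption: -1 means "each trainer spawns a server as a subprocess",
    and a value >= 0 means "the launcher starts that many separate servers".
    """
    if num_machines < 1:
        raise ValueError("need at least one trainer")
    if num_partition_servers < -1:
        raise ValueError("num_partition_servers must be -1 or non-negative")

    trainers = list(range(num_machines))
    if num_partition_servers == -1:
        # First case from the summary: nothing extra for the launcher to do.
        return LaunchPlan(trainers, [], True)
    # Second case: the launcher must spawn the servers itself.
    return LaunchPlan(trainers, list(range(num_partition_servers)), False)
```

Under these assumptions, `plan_launch(2, -1)` leaves server startup to the trainers, while `plan_launch(2, 1)` asks the launcher to start one standalone server alongside the two trainers.
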
adamlerer authored and facebook-github-bot committed Mar 11, 2020
1 parent 3c53aa9 commit e9142c9
Showing 2 changed files with 2 additions and 6 deletions.
4 changes: 0 additions & 4 deletions test/test_functional.py
@@ -576,10 +576,6 @@ def test_distributed_with_partition_servers(self):
if not trainer1.is_alive() and not done[1]:
self.assertEqual(trainer1.exitcode, 0)
done[1] = True
if not partition_server.is_alive():
self.fail("Partition server died with exit code %d"
% partition_server.exitcode)
partition_server.terminate() # Cannot be shut down gracefully.
partition_server.join()
logger.info(f"Partition server died with exit code {partition_server.exitcode}")
self.assertCheckpointWritten(train_config, version=1)
4 changes: 2 additions & 2 deletions torchbiggraph/checkpoint_manager.py
@@ -508,8 +508,8 @@ def close(self) -> None:
self.pool.close()
self.pool.join()

self.join()

def join(self) -> None:
# FIXME: this whole join thing doesn't work with torch.distributed
# can just get rid of it
if self.partition_client is not None:
self.partition_client.join()
