[DOCFIX] Update scalability doc
### What changes are proposed in this pull request?
Added a few improvements to the scalability doc

pr-link: Alluxio#15355
change-id: cid-f8691f17eec18b221091554bfa65639f7c939edf
jiacheliu3 authored Apr 21, 2022
1 parent 2f2894b commit 8d28dac
Showing 1 changed file with 54 additions and 14 deletions.
68 changes: 54 additions & 14 deletions docs/en/operation/Scalability-Tuning.md
@@ -55,14 +55,12 @@ Other metrics related to the number of files and the number of blocks
* Master.TotalPath
* Master.InodeHeapSize (Estimate only)
* Master.UniqueBlocks and Master.BlockReplicaCount
* Master.BlockHeapSize - In addition to the total path,
the total number of blocks also affects the heap usage, though to a lesser degree.
Be sure to monitor this metric if you expect your total data size to be very large or grow quickly.

* Cluster.CapacityTotal
* Cluster.CapacityUsed
* Cluster.CapacityFree
* Cluster.CapacityFree - Monitor whether the worker capacity is consistently full, and consider adding more workers if that is the case.
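
These values can be read from the master web UI, or pulled from the metrics endpoint. A minimal sketch, assuming the default master web port `19999` and a placeholder hostname (the exact endpoint path may differ between versions, so confirm it against the Metrics documentation):

```
# Dump the current metrics snapshot from the leading master (hostname is a placeholder)
curl -s http://<master_host>:19999/metrics/json/
```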

### Number of Concurrent Clients

@@ -160,7 +158,28 @@ size must be large enough to fit ALL inodes.
If using the `ROCKS` off-heap metastore, the master heap size must be large enough to fit the inode
cache. See the [RocksDB section]({{ '/en/operation/Metastore.html#rocksdb-metastore' | relativize_url }})
for more information.


Note that the master heap memory is not only allocated to metadata storage, but also to serving RPCs and
running internal management tasks. Be sure to leave sufficient memory for those as well.
The master heap allocation can be abstracted as:
```
METADATA_STORAGE_SIZE + RPC_CONSUMPTION + INTERNAL_MGMT_TASKS
```

The RPC memory consumption varies greatly with your workload pattern. In general, the known top memory consumers are:
1. Recursive `rm` operations
2. Large metadata sync operations, typically triggered by recursive `ls` or `loadMetadata` operations
3. Worker registration
4. Small but frequent RPC surges, such as scanning thousands of files

The internal management tasks typically execute at intervals. On each execution,
they scan a number of files and perform certain management operations.
For example, the ReplicationChecker regularly scans files in the namespace and checks their replica counts.

It is hard to estimate the memory consumption of `RPC_CONSUMPTION` and `INTERNAL_MGMT_TASKS`.
A good starting point is to leave at least 30% of the heap for these two factors, monitor the heap usage
over a longer period such as a week, adjust accordingly, and leave enough buffer for workload surges.
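
As a rough, hypothetical illustration of this estimation (the inode count and per-inode heap cost below are made-up placeholders, not measured values; use `Master.InodeHeapSize` and `Master.BlockHeapSize` from your own cluster instead):

```
METADATA_STORAGE_SIZE  ~= 100 million inodes * ~2 KB per inode ~= 200 GB   (assumed figures)
RPC + INTERNAL_MGMT    ~= 30% of the total heap                            (starting point above)
Total heap             ~= 200 GB / 0.7 ~= 285 GB, plus extra headroom for workload surges
```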

The following JVM options in `alluxio-env.sh` set the maximum heap size for both
the Alluxio master and standby master processes to `256 GB`:
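
A sketch of what these settings typically look like in `conf/alluxio-env.sh` (the variable name follows the standard Alluxio environment script; confirm the exact form against your version):

```
# Give the master JVM a 256 GB heap; standby masters read the same variable
# from their own local alluxio-env.sh
ALLUXIO_MASTER_JAVA_OPTS+=" -Xms256g -Xmx256g "
```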

@@ -311,15 +330,18 @@ nodes (across workers), we recommend having 1 Gbit/s bandwidth (across workers)
gives a ratio of at least 10:1. The UFS link throughput can be greatly decreased based on the
expected cache hit ratio.

You may use the [UfsIOBench]({{ '/en/operation/StressBench.html#ufs-io-bench' | relativize_url }})
tool to measure the worker-UFS network bandwidth.
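
For example, a benchmark run might look roughly like the following; the subcommand and flag names here are assumptions for illustration only, so consult the StressBench page linked above for the exact syntax in your version:

```
# Hypothetical invocation: measure read/write throughput between the workers and the UFS
bin/alluxio runUfsIOTest --path s3://my-bucket/ufs-io-bench --io-size 4G --threads 4
```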

### Disk

The Alluxio worker needs local disk space for writing logs and temporary files to object stores.

We recommend at least 8 GB of disk space for writing logs. The write speed of the disk should be at
least 128 MB/s.

We recommend `8 GB + expected number of concurrent writers * max size of file written to object
stores` of disk space for writes to an object store. This disk should be a dedicated SSD supporting
512 MB/s read and write.
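
As a hypothetical illustration of this sizing rule (the writer count and file size are made-up numbers for the arithmetic):

```
8 GB + 50 concurrent writers * 2 GB max file size = 8 GB + 100 GB = 108 GB of dedicated SSD
```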

### Worker Cache Storage
@@ -359,7 +381,7 @@ The following properties tune RPC retry intervals:

```properties
alluxio.user.rpc.retry.max.duration=2min
alluxio.user.rpc.retry.base.sleep=50ms
```

The retry duration and sleep duration should be increased if frequent timeouts are observed
@@ -371,7 +393,7 @@ The Alluxio client can also be configured to check the health of connected workers using heartbeat
pings.
This is controlled by the following properties:
```properties
alluxio.user.network.streaming.keepalive.time=Long.MAX_VALUE
alluxio.user.network.streaming.keepalive.timeout=30s
```
`alluxio.user.network.streaming.keepalive.time` controls the maximum wait time since a worker sent the last
@@ -434,8 +456,13 @@ So it is recommended to leave at least 10-15GB for that purpose.
The Presto coordinator / Spark driver needs sufficient memory to launch and complete queries.
So allocating a sufficient `COMPUTE_JVM_SIZE` takes the next highest priority.

The Alluxio master's memory requirement depends on the amount of file metadata and the workload.
See the [Master Heap Size Estimation](#heap-size) section for more details.
If the heap size is sufficient for all the metadata in your namespace, you may keep all metadata on
the heap to achieve the best performance. Otherwise, we recommend using RocksDB as the metadata store
to reduce the Alluxio master heap consumption. It is wise to always leave enough space to
prepare for unexpected workload surges. If you are using RocksDB as the metadata store but have
extra space on the heap, you may increase `alluxio.metadata.cache.max.size` to allocate more cache space for RocksDB.
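
As a sketch, switching to the RocksDB metastore and enlarging its on-heap cache might look like the following in `alluxio-site.properties` (the directory path and cache size are illustrative placeholders, and the cache property name is taken from the paragraph above):

```properties
alluxio.master.metastore=ROCKS
alluxio.master.metastore.dir=/mnt/ssd1/alluxio_metastore
# Number of inodes cached on the master heap in front of RocksDB (placeholder value)
alluxio.metadata.cache.max.size=20000000
```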

#### Co-located Workers

@@ -610,6 +637,16 @@ That means your Alluxio master process requires tuning.
You should either increase the resources/concurrency allowed on the master,
or reduce the pressure.
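
For instance, one way to allow more RPC concurrency on the master is to enlarge its RPC executor pool; the property name below is an assumption based on the master configuration reference, and the value is only an illustration:

```properties
# Allow more concurrent RPC handler threads on the master (illustrative value)
alluxio.master.rpc.executor.max.pool.size=1000
```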

The master distinguishes workers by (hostname, ports, domain socket path, tiered identity).
If any of these attributes changes, the master will treat the worker as a new one
and allocate a new worker ID to it.
Consequently, the "old" worker will be removed and the "new" worker will be added,
both of which are expensive metadata updates.
Therefore, to avoid a worker restarting with different ports, it is recommended to
configure workers to use static ports. In other words, we do not recommend setting worker ports to 0, such as
`alluxio.worker.rpc.port=0`, because the port will then be chosen dynamically at runtime.
All port configurations in Alluxio are static by default.
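
The default ports are already static; a sketch of pinning them explicitly in `alluxio-site.properties` (the values shown are the defaults):

```properties
alluxio.worker.rpc.port=29999
alluxio.worker.web.port=30000
```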

#### Worker-side
The frequency with which a worker checks in with the master is set by the following property:
```properties
@@ -643,10 +680,13 @@ The blocks that belong to pinned files cannot be evicted from the worker.
In other words, if the worker storage is full, the worker will evict blocks that do not
belong to pinned files.

Currently, the worker will sync with the master on the pinned file list once every 1 second.
```properties
alluxio.worker.block.heartbeat.interval.ms=1sec
```

If the cluster scale is large or there are many pinned files, this will create significant
pressure on the master. In that case we recommend increasing this interval.
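
For example, a larger sync interval might be configured as follows (the 10-second value is only an illustration; choose a value appropriate for your cluster):

```properties
alluxio.worker.block.heartbeat.interval.ms=10sec
```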
The pinned files come from:
1. Files that are manually pinned using the `alluxio fs pin` command (see the example below).
2. Files written with ASYNC_THROUGH are pinned until they are persisted into UFS.
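
For reference, a path can be pinned and later unpinned from the CLI like so (the path is a placeholder):

```
./bin/alluxio fs pin /data/hot-table
./bin/alluxio fs unpin /data/hot-table
```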
