---
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more contributor
  license agreements. See the NOTICE file distributed with this work for additional
  information regarding copyright ownership. The ASF licenses this file to You under
  the Apache License, Version 2.0 (the "License"); you may not use this file except
  in compliance with the License. You may obtain a copy of the License at

      https://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software distributed
  under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
  CONDITIONS OF ANY KIND, either express or implied. See the License for the
  specific language governing permissions and limitations under the License.
---

# Upgrading

## Rolling upgrade

A Celeborn cluster needs to support a fast rolling upgrade. To keep the upgrade fast and non-disruptive, files already written on a worker must be committed and remain readable after the worker restarts. Celeborn provides the following mechanisms to support rolling upgrades.

### Background

#### Fixed fetch port and client retry

On the shuffle reduce side, the read client obtains the worker's host/port and the information of the file to be read. To ensure data can still be read after the worker's rolling restart completes, the worker needs a fixed fetch service port. The configuration is `celeborn.worker.fetch.port`; the default value is 0, which makes the worker pick a free port at startup, so users need to set a fixed value such as 9092.
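
A minimal sketch of pinning the port, assuming the worker reads its settings from `celeborn-defaults.conf` (the usual Celeborn worker configuration file) and that port 9092 is free on every worker host:

```properties
# Pin the fetch service port so shuffle clients can reconnect to the
# same address after the worker restarts. The default value 0 would
# pick a random free port on each start.
celeborn.worker.fetch.port 9092
```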

At the same time, users need to adjust the client's retry count and retry wait time according to the cluster's rolling restart behavior, so that the shuffle client can read data through retries after a worker restarts. The fetch retry count is configured by `celeborn.client.fetch.maxRetriesForEachReplica` (default 3), and the retry wait time by `celeborn.data.io.retryWait` (default 5s). Users can increase these values as needed.
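
As a rough rule of thumb, a fetch can survive a worker being unavailable for about `maxRetriesForEachReplica × retryWait` per replica, so the two values together should cover the expected restart time of a single worker. A hedged example for the Spark client configuration, using the `spark.celeborn.` prefix from the example settings below; the concrete values are only illustrative:

```properties
# Roughly tolerate 5 retries * 10s wait ≈ 50s of worker downtime
# per replica (illustrative values, tune to your restart time).
spark.celeborn.client.fetch.maxRetriesForEachReplica 5
spark.celeborn.data.io.retryWait 10s
```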

#### Worker store file meta information

The shuffle client records each shuffle partition location's host, service port, and file name. To let a worker serve its existing shuffle data again after a restart, the worker stores the metadata of its shuffle partition files in LevelDB during shutdown and restores it after restarting. Users should set `celeborn.worker.graceful.shutdown.enabled` to true to enable graceful shutdown. During this process the worker first waits for all partitions allocated on it to be committed, within a timeout of `celeborn.worker.graceful.shutdown.checkSlotsFinished.timeout` (default 480s), and then waits for the partition sorter to finish all sort tasks, within a timeout of `celeborn.worker.graceful.shutdown.partitionSorter.shutdownTimeout` (default 120s). The whole graceful shutdown process must finish within `celeborn.worker.graceful.shutdown.timeout` (default 600s).
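
A minimal worker-side sketch of these settings, again assuming `celeborn-defaults.conf`; note that the overall timeout should be large enough to cover both stages (with the defaults, 480s + 120s = 600s):

```properties
# Enable graceful shutdown so partition file metadata is persisted
# on shutdown and restored after the worker restarts.
celeborn.worker.graceful.shutdown.enabled true
# Stage 1: wait for all allocated partitions to be committed.
celeborn.worker.graceful.shutdown.checkSlotsFinished.timeout 480s
# Stage 2: wait for the partition sorter to finish pending sort tasks.
celeborn.worker.graceful.shutdown.partitionSorter.shutdownTimeout 120s
# Upper bound for the whole graceful shutdown (>= stage 1 + stage 2).
celeborn.worker.graceful.shutdown.timeout 600s
```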

#### Hard split of allocated partitions and pre-commit of hard-split partitions

As mentioned in the previous section, the worker needs to wait for all allocated partition files to be committed during the restart process. Without further help, this would mean waiting for every shuffle running on the worker to finish before restarting it; otherwise part of the metadata would be lost, abnormal partition files would be left behind, and reading could not be resumed.

To speed up the restart process, a worker that is shutting down answers all push data requests with the HARD_SPLIT flag, and the shuffle client re-applies for a new partition location for each of these allocated partitions. The client side can then record all HARD_SPLIT partition information and pre-commit those partitions, so the partitions allocated on the worker can be committed in a very short time. Users should enable `celeborn.client.shuffle.batchHandleCommitPartition.enabled`; the default value is false.
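
On the client side this only requires the batch-commit switch; a hedged Spark configuration example (again using the `spark.celeborn.` prefix from the example settings below):

```properties
# Batch and pre-commit HARD_SPLIT partitions so a shutting-down
# worker can commit its allocated partitions quickly.
spark.celeborn.client.shuffle.batchHandleCommitPartition.enabled true
```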

### Example setting

#### Worker

| Key | Value |
| --- | --- |
| celeborn.worker.graceful.shutdown.enabled | true |
| celeborn.worker.graceful.shutdown.checkSlotsFinished.timeout | 480s |
| celeborn.worker.graceful.shutdown.partitionSorter.shutdownTimeout | 120s |
| celeborn.worker.graceful.shutdown.timeout | 600s |
| celeborn.worker.fetch.port | 9092 |

#### Client

| Key | Value |
| --- | --- |
| spark.celeborn.client.shuffle.batchHandleCommitPartition.enabled | true |
| spark.celeborn.client.fetch.maxRetriesForEachReplica | 5 |
| spark.celeborn.data.io.retryWait | 10s |