GEODE-9656: Document Async disk writer exit behavior (apache#7062)
davebarnes97 authored Oct 28, 2021
1 parent b4deab0 commit ec9fd00
Showing 2 changed files with 43 additions and 10 deletions.
@@ -57,5 +57,5 @@ While a member is running, its disk stores are online. When the member exits and

- Online, a disk store is owned and managed by its member process. To run operations on an online disk store, use API calls in the member process, or use the `gfsh` command-line interface.
- Offline, the disk store is just a collection of files in the host file system. The files are accessible based on file system permissions. You can copy the files for backup or to move the member’s disk store location. You can also run some maintenance operations, such as file compaction and validation, by using the `gfsh` command-line interface. When offline, the disk store's information is unavailable to the cluster.
For partitioned regions, region data is split between multiple members, and therefore the start up of a member is dependent onall members, and must wait for all members to be online. An attempt to access an entry that is stored on disk by an offline member results in a `PartitionOfflineException`.
For partitioned regions, region data is split between multiple members, and therefore the start up of a member is dependent on all members, and must wait for all members to be online. An attempt to access an entry that is stored on disk by an offline member results in a `PartitionOfflineException`.

@@ -23,7 +23,7 @@ This section describes alerts for and appropriate responses to various kinds of

If a system member withdraws from the cluster involuntarily because the member, host, or network fails, the other members automatically adapt to the loss and continue to operate. The cluster does not experience any disturbance such as timeouts.

## <a id="sys_failure__section_846B00118184487FB8F1E0CD1DC3A81B" class="no-quick-link"></a>Planning for Data Recovery
## <a id="sys_failure__section_846B00118184487FB8F1E0CD1DC3A81B"></a>Planning for Data Recovery

In planning a strategy for data recovery, consider these factors:

@@ -37,7 +37,7 @@ In planning a strategy for data recovery, consider these factors:

The rest of this section provides recovery instructions for various kinds of system failures.

## <a id="sys_failure__section_2C390F0783724048A6E12F7F369EB8DC" class="no-quick-link"></a>Network Partitioning, Slow Response, and Member Removal Alerts
## <a id="sys_failure__section_2C390F0783724048A6E12F7F369EB8DC"></a>Network Partitioning, Slow Response, and Member Removal Alerts

When a network partition is detected or slow responses occur, these alerts are generated:

@@ -49,7 +49,7 @@ When a network partition detection or slow responses occur, these alerts are gen

For information on configuring system members to help avoid a network partition condition when a network failure occurs or when members lose the ability to communicate with each other, refer to [Understanding and Recovering from Network Outages](recovering_from_network_outages.html#rec_network_crash).

### <a id="sys_failure__section_D52D902E665F4F038DA4B8298E3F8681" class="no-quick-link"></a>Network Partitioning Detected
### <a id="sys_failure__section_D52D902E665F4F038DA4B8298E3F8681"></a>Network Partitioning Detected

Alert:

@@ -71,7 +71,7 @@ Response:

Check the network connectivity and health of the listed cache processes.

### <a id="sys_failure__section_2C5E8A37733D4B31A12F22B9155796FD" class="no-quick-link"></a>Member Taking Too Long to Respond
### <a id="sys_failure__section_2C5E8A37733D4B31A12F22B9155796FD"></a>Member Taking Too Long to Respond

Alert:

@@ -71,7 +167,7 @@ Response:

None.

### <a id="sys_failure__section_AF4F913C244044E7A541D89EC6BCB961" class="no-quick-link"></a>No Locators Can Be Found
### <a id="sys_failure__section_AF4F913C244044E7A541D89EC6BCB961"></a>No Locators Can Be Found

**Note:**
It is likely that all processes using the locators will exit with the same message.
@@ -167,7 +234,7 @@ Response:

The operator should examine and restart the disconnected process.

### <a id="sys_failure__section_77BDB0886A944F87BDA4C5408D9C2FC4" class="no-quick-link"></a>Warning Notifications Before Removal
### <a id="sys_failure__section_77BDB0886A944F87BDA4C5408D9C2FC4"></a>Warning Notifications Before Removal

Alert:

@@ -234,7 +265,7 @@ Response:

The operator can turn this off by setting the system property `gemfire.disable-same-machine-warnings` to `true`. However, it is best to run locator processes, which act as membership coordinators when network partition detection is enabled, on separate machines from cache processes.

### <a id="sys_failure__section_E777C6EC8DEC4FE692AC5863C4420238" class="no-quick-link"></a>Member Is Forced Out
### <a id="sys_failure__section_E777C6EC8DEC4FE692AC5863C4420238"></a>Member Is Forced Out

Alert:

@@ -285,7 +285,41 @@ Response:

The operator should examine the locator processes and logs.

## How Data is Recovered From Persistent Regions
### <a id="sys_failure__section_disk_access_exceptions"></a>Disk Access Exceptions

Alert:

``` pre
A DiskAccessException has occurred while writing to the disk for region <region-name>.
The cache will be closed. For Region: <region-name>: Failed writing key
to <disk-store-name>
```

or

``` pre
A DiskAccessException has occurred while writing to the disk for region <region-name>.
The cache will be closed.
For DiskStore: <disk-store-name>: Could not schedule asynchronous write because
the flusher thread had been terminated
```

Description:

A write was prevented by an underlying disk issue, such as a full disk.

The first alert form is reported when disk writes are synchronous (`disk-synchronous=true`),
and the second form is reported when disk writes are asynchronous (`disk-synchronous=false`).

In either case, the member shuts down when an operation attempts to update the disk store.
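
As an illustration only, a log-monitoring script might tell the two alert forms apart by their distinguishing phrases. The sketch below is a hypothetical helper, not part of Geode; the function name and return labels are assumptions, and the marker strings are copied from the alert text above.

```python
# Hypothetical helper (not a Geode API): classify a DiskAccessException alert
# by which of the two documented forms it matches.

# Phrases copied from the alert text shown above.
COMMON_PREFIX = "A DiskAccessException has occurred while writing to the disk for region"
SYNC_MARKER = "Failed writing key"
ASYNC_MARKER = "Could not schedule asynchronous write"


def classify_disk_alert(message: str) -> str:
    """Return 'synchronous', 'asynchronous', 'unknown-form', or 'not-disk-alert'."""
    # Alerts may wrap across several lines, so collapse all whitespace first.
    text = " ".join(message.split())
    if COMMON_PREFIX not in text:
        return "not-disk-alert"
    if SYNC_MARKER in text:
        return "synchronous"   # first form: disk-synchronous=true
    if ASYNC_MARKER in text:
        return "asynchronous"  # second form: disk-synchronous=false
    return "unknown-form"
```

Matching on phrases rather than on the full message keeps the check independent of the region and disk store names, which vary per deployment.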

Response:

You must address the underlying disk issue and restart the server.
See [Preventing and Recovering from Disk Full Errors](prevent_and_recover_disk_full_errors.html) for suggestions.
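
Before restarting the server, it can help to confirm that the disk store's volume actually has room again. The following is a minimal sketch using Python's standard `shutil.disk_usage`; the function names and the 10% threshold are assumptions chosen for illustration, not Geode recommendations.

```python
import shutil


def free_fraction(path: str) -> float:
    """Return the fraction (0.0 to 1.0) of free space on the volume holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Defensive guard for unusual file systems that report no capacity.
        return 0.0
    return usage.free / usage.total


def has_headroom(path: str, min_free_fraction: float = 0.10) -> bool:
    """Return True if the volume has at least min_free_fraction free.

    The 10% default is an arbitrary illustrative threshold.
    """
    return free_fraction(path) >= min_free_fraction
```

An operator script could call `has_headroom` with the disk store directory as `path` and restart the member only when it returns `True`.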


## <a id="sys_failure__section_how_data_is_recovered"></a>How Data is Recovered From Persistent Regions

A persistent region is one whose contents (keys and values) can be restored from disk. Upon
restart, data recovery of a persistent region always recovers keys. Under the default behavior, the
@@ -338,4 +372,3 @@ properties allow the developer to modify the recovery behavior for persistent re
When `true`, prolongs restart time, but ensures that when available for use, the cache is fully
populated and data retrieval times will be optimal.

