GEODE-9656: Document Async disk writer exit behavior (apache#7062)

ringles · Oct 28, 2021 · ec9fd00
1 parent b4deab0
commit ec9fd00

Showing 2 changed files with 43 additions and 10 deletions.
diff --git a/geode-docs/managing/disk_storage/how_disk_stores_work.html.md.erb b/geode-docs/managing/disk_storage/how_disk_stores_work.html.md.erb
@@ -57,5 +57,5 @@ While a member is running, its disk stores are online. When the member exits and
 
 -   Online, a disk store is owned and managed by its member process. To run operations on an online disk store, use API calls in the member process, or use the `gfsh` command-line interface.
 -   Offline, the disk store is just a collection of files in the host file system. The files are accessible based on file system permissions. You can copy the files for backup or to move the member’s disk store location. You can also run some maintenance operations, such as file compaction and validation, by using the `gfsh` command-line interface. When offline, the disk store's information is unavailable to the cluster. 
-For partitioned regions, region data is split between multiple members, and therefore the start up of a member is dependent onall members, and must wait for all members to be online. An attempt to access an entry that is stored on disk by an offline member results in a `PartitionOfflineException`.
+For partitioned regions, region data is split between multiple members, and therefore the start up of a member is dependent on all members, and must wait for all members to be online. An attempt to access an entry that is stored on disk by an offline member results in a `PartitionOfflineException`.
 
diff --git a/geode-docs/managing/troubleshooting/system_failure_and_recovery.html.md.erb b/geode-docs/managing/troubleshooting/system_failure_and_recovery.html.md.erb
@@ -23,7 +23,7 @@ This section describes alerts for and appropriate responses to various kinds of
 
 If a system member withdraws from the cluster involuntarily because the member, host, or network fails, the other members automatically adapt to the loss and continue to operate. The cluster does not experience any disturbance such as timeouts.
 
-## <a id="sys_failure__section_846B00118184487FB8F1E0CD1DC3A81B" class="no-quick-link"></a>Planning for Data Recovery
+## <a id="sys_failure__section_846B00118184487FB8F1E0CD1DC3A81B"></a>Planning for Data Recovery
 
 In planning a strategy for data recovery, consider these factors:
 
@@ -37,7 +37,7 @@ In planning a strategy for data recovery, consider these factors:
 
 The rest of this section provides recovery instructions for various kinds system failures.
 
-## <a id="sys_failure__section_2C390F0783724048A6E12F7F369EB8DC" class="no-quick-link"></a>Network Partitioning, Slow Response, and Member Removal Alerts
+## <a id="sys_failure__section_2C390F0783724048A6E12F7F369EB8DC"></a>Network Partitioning, Slow Response, and Member Removal Alerts
 
 When a network partition detection or slow responses occur, these alerts are generated:
 
@@ -49,7 +49,7 @@ When a network partition detection or slow responses occur, these alerts are gen
 
 For information on configuring system members to help avoid a network partition configuration condition in the presence of a network failure or when members lose the ability to communicate to each other, refer to [Understanding and Recovering from Network Outages](recovering_from_network_outages.html#rec_network_crash).
 
-### <a id="sys_failure__section_D52D902E665F4F038DA4B8298E3F8681" class="no-quick-link"></a>Network Partitioning Detected
+### <a id="sys_failure__section_D52D902E665F4F038DA4B8298E3F8681"></a>Network Partitioning Detected
 
 Alert:
 
@@ -71,7 +71,7 @@ Response:
 
 Check the network connectivity and health of the listed cache processes.
 
-### <a id="sys_failure__section_2C5E8A37733D4B31A12F22B9155796FD" class="no-quick-link"></a>Member Taking Too Long to Respond
+### <a id="sys_failure__section_2C5E8A37733D4B31A12F22B9155796FD"></a>Member Taking Too Long to Respond
 
 Alert:
 
@@ -167,7 +167,7 @@ Response:
 
 None.
 
-### <a id="sys_failure__section_AF4F913C244044E7A541D89EC6BCB961" class="no-quick-link"></a>No Locators Can Be Found
+### <a id="sys_failure__section_AF4F913C244044E7A541D89EC6BCB961"></a>No Locators Can Be Found
 
 **Note:**
 It is likely that all processes using the locators will exit with the same message.
@@ -234,7 +234,7 @@ Response:
 
 The operator should examine and restart the disconnected process.
 
-### <a id="sys_failure__section_77BDB0886A944F87BDA4C5408D9C2FC4" class="no-quick-link"></a>Warning Notifications Before Removal
+### <a id="sys_failure__section_77BDB0886A944F87BDA4C5408D9C2FC4"></a>Warning Notifications Before Removal
 
 Alert:
 
@@ -265,7 +265,7 @@ Response:
 
 The operator can turn this off by setting the system property gemfire.disable-same-machine-warnings to true. However, it is best to run locator processes, which act as membership coordinators when network partition detection is enabled, on separate machines from cache processes.
 
-### <a id="sys_failure__section_E777C6EC8DEC4FE692AC5863C4420238" class="no-quick-link"></a>Member Is Forced Out
+### <a id="sys_failure__section_E777C6EC8DEC4FE692AC5863C4420238"></a>Member Is Forced Out
 
 Alert:
 
@@ -285,7 +285,41 @@ Response:
 
 The operator should examine the locator processes and logs.
 
-## How Data is Recovered From Persistent Regions
+### <a id="sys_failure__section_disk_access_exceptions"></a>Disk Access Exceptions
+
+Alert:
+
+``` pre
+A DiskAccessException has occurred while writing to the disk for region <region-name>.
+The cache will be closed.  For Region: <region-name>: Failed writing key
+to <disk-store-name>
+```
+
+or
+
+``` pre
+A DiskAccessException has occurred while writing to the disk for region <region-name>.
+The cache will be closed.
+For DiskStore: <disk-store-name>: Could not schedule asynchronous write because
+the flusher thread had been terminated
+```
+
+Description:
+
+A write was prevented by an underlying disk issue, such as a full disk.
+
+The first alert form is reported when disk writes are synchronous (`disk-synchronous=true`),
+and the second form is reported when disk writes are asynchronous (`disk-synchronous=false`).
+
+In either case, the member shuts down when an operation attempts to update the disk store.
+
+Response:
+
+You must address the underlying disk issue and restart the server.
+See [Preventing and Recovering from Disk Full Errors](prevent_and_recover_disk_full_errors.html) for suggestions.
+
+
+## <a id="sys_failure__section_how_data_is_recovered"></a>How Data is Recovered From Persistent Regions
 
 A persistent region is one whose contents (keys and values) can be restored from disk.  Upon
 restart, data recovery of a persistent region always recovers keys.  Under the default behavior, the
@@ -338,4 +372,3 @@ properties allow the developer to modify the recovery behavior for persistent re
   When `true`, prolongs restart time, but ensures that when available for use, the cache is fully
   populated and data retrieval times will be optimal.
 
-