GEODE-7955: Docs for redundancy internal API and redundancy commands (apache#5005)

* GEODE-7955: Docs for redundancy internal API and redundancy commands

- Document restore redundancy gfsh command
- Document status redundancy gfsh command
- Document new RestoreRedundancyOperation API usage
- Add rebalance and restore redundancy stats to stats list
- Removed suggestion to use show metrics command to determine redundancy
status
- Removed suggestions to use rebalance when only restoring redundancy is
wanted

Authored-by: Donal Evans <[email protected]>
DonalEvans authored May 1, 2020
1 parent 84387ce commit a820c59
Showing 10 changed files with 224 additions and 41 deletions.
9 changes: 9 additions & 0 deletions geode-book/master_middleman/source/subnavs/geode-subnav.erb
@@ -931,6 +931,9 @@ limitations under the License.
<li>
<a href="/docs/guide/<%=vars.product_version_nodot%>/developing/partitioned_regions/checking_region_redundancy.html">Checking Redundancy in Partitioned Regions</a>
</li>
<li>
<a href="/docs/guide/<%=vars.product_version_nodot%>/developing/partitioned_regions/restoring_region_redundancy.html">Restoring Redundancy in Partitioned Regions</a>
</li>
<li>
<a href="/docs/guide/<%=vars.product_version_nodot%>/developing/partitioned_regions/moving_partitioned_data.html">Moving Partitioned Region Data to Another Member</a>
</li>
@@ -1931,6 +1934,9 @@ gfsh</a>
<li>
<a href="/docs/guide/<%=vars.product_version_nodot%>/tools_modules/gfsh/command-pages/remove.html">remove</a>
</li>
<li>
<a href="/docs/guide/<%=vars.product_version_nodot%>/tools_modules/gfsh/command-pages/restore.html">restore redundancy</a>
</li>
<li class="has_submenu">
<a href="/docs/guide/<%=vars.product_version_nodot%>/tools_modules/gfsh/command-pages/resume.html">resume</a>
<ul>
@@ -2025,6 +2031,9 @@ gfsh</a>
<li>
<a href="/docs/guide/<%=vars.product_version_nodot%>/tools_modules/gfsh/command-pages/status.html#topic_E96D0EFA513C4CD79B833FCCDD69C832">status locator</a>
</li>
<li>
<a href="/docs/guide/<%=vars.product_version_nodot%>/tools_modules/gfsh/command-pages/status.html#topic_status_redundancy">status redundancy</a>
</li>
<li>
<a href="/docs/guide/<%=vars.product_version_nodot%>/tools_modules/gfsh/command-pages/status.html#topic_E5DB49044978404D9D6B1971BF5D400D">status server</a>
</li>
@@ -21,34 +21,28 @@ limitations under the License.

Under some circumstances, it can be important to verify that your partitioned region data is redundant and that upon member restart, redundancy has been recovered properly across partitioned region members.

You can verify partitioned region redundancy by making sure that the `numBucketsWithoutRedundancy` statistic is **zero** for all your partitioned regions. To check this statistic, use the following `gfsh` command:

``` pre
gfsh>show metrics --categories=partition --region=region_name
```

For example:

``` pre
gfsh>show metrics --categories=partition --region=posts

Cluster-wide Region Metrics
--------- | --------------------------- | -----
partition | putLocalRate                | 0
          | putRemoteRate               | 0
          | putRemoteLatency            | 0
          | putRemoteAvgLatency         | 0
          | bucketCount                 | 1
          | primaryBucketCount          | 1
          | numBucketsWithoutRedundancy | 1
          | minBucketSize               | 1
          | maxBucketSize               | 0
          | totalBucketSize             | 1
          | averageBucketSize           | 1

```

If you have `start-recovery-delay=-1` configured for your partitioned region, you will need to perform a rebalance on your region after you restart any members in your cluster in order to recover redundancy.
Initiate an operation to report the current redundancy status of regions using one of the following:

- `gfsh` command. Start `gfsh` and connect to the cluster. Then type the following command:

``` pre
gfsh>status redundancy
```

Optionally, you can specify regions to include in or exclude from the redundancy status report. Type `help status redundancy` or see [status redundancy](../../tools_modules/gfsh/command-pages/status.html#topic_status_redundancy) for more information.

- API call:

``` pre
ResourceManager manager = cache.getResourceManager();
RestoreRedundancyResults currentStatus = manager.createRestoreRedundancyOperation().redundancyStatus();
// These are some of the details we can get about the redundancy status from the API
System.out.println("Status for all regions: " + currentStatus.getMessage());
System.out.println("Number of regions with no redundant copies: " + currentStatus.getZeroRedundancyRegionResults().size());
System.out.println("Status for region " + regionName + ": " + currentStatus.getRegionResult(regionName).getMessage());
```
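
As a follow-up, a minimal sketch (reusing the `currentStatus` object from the API call above) shows one way to act on the reported status, for example to detect regions that have lost all redundant copies:

``` pre
// If any region reports buckets with no redundant copies, a restore redundancy
// operation (see Restoring Redundancy in Partitioned Regions) could be started
// to recreate them.
if (currentStatus.getZeroRedundancyRegionResults().size() > 0) {
  System.out.println("Redundancy lost for some regions: " + currentStatus.getMessage());
}
```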

If you have `start-recovery-delay=-1` configured for your partitioned region, you will need to trigger a restore redundancy operation on your region after you restart any members in your cluster in order to recover redundancy. See [Restoring Redundancy in Partitioned Regions](restoring_region_redundancy.html).

If you have `start-recovery-delay` set to a low number, you may need to wait extra time until the region has recovered redundancy.
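
For reference, a minimal sketch of a partitioned region configured so that redundancy is not recovered automatically at member startup. This assumes the `PartitionAttributesFactory` API, where `setStartupRecoveryDelay(-1)` corresponds to the `start-recovery-delay=-1` setting described above; the class and region names are illustrative.

``` pre
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.PartitionAttributesFactory;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.RegionShortcut;

public class NoAutomaticRecoveryExample {
  public static void main(String[] args) {
    Cache cache = new CacheFactory().create();

    // Keep one redundant copy of each bucket, but do not recover redundancy
    // automatically when a member starts up.
    PartitionAttributesFactory<String, String> attributes = new PartitionAttributesFactory<>();
    attributes.setRedundantCopies(1);
    attributes.setStartupRecoveryDelay(-1);

    Region<String, String> posts = cache.<String, String>createRegionFactory(RegionShortcut.PARTITION)
        .setPartitionAttributes(attributes.create())
        .create("posts");
  }
}
```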

@@ -33,6 +33,8 @@ Here are the main steps for configuring high availability for a partitioned regi
5. Decide how many buckets <%=vars.product_name%> should attempt to recover in parallel when performing redundancy recovery. By default, the system recovers up to 8 buckets in parallel. Use the `gemfire.MAX_PARALLEL_BUCKET_RECOVERIES` system property to increase or decrease the maximum number of buckets to recover in parallel any time redundancy recovery is performed, as shown in the example below.
6. For all but fixed partitioned regions, review the points at which you kick off rebalancing. Redundancy recovery is done automatically at the start of any rebalancing. This is most important if you run with no automated recovery after member crashes or joins. See [Rebalancing Partitioned Region Data](rebalancing_pr_data.html#rebalancing_pr_data).
For all partitioned regions, redundancy can be restored by using a restore redundancy operation, which does not move buckets between members. See [Restoring Redundancy in Partitioned Regions](restoring_region_redundancy.html).
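
Step 5 above mentions the `gemfire.MAX_PARALLEL_BUCKET_RECOVERIES` system property. A hypothetical sketch of raising it when starting a server with `gfsh` (the server name and value are illustrative):

``` pre
gfsh>start server --name=server1 --J=-Dgemfire.MAX_PARALLEL_BUCKET_RECOVERIES=16
```
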
During runtime, you can add capacity by adding new members for the region. For regions that do not use fixed partitioning, you can also kick off a rebalancing operation to spread the region buckets among all members.
- **[Set the Number of Redundant Copies](set_pr_redundancy.html)**
@@ -87,10 +87,9 @@ Partitioned region rebalancing:

You typically want to trigger rebalancing when capacity is increased or reduced through member startup, shut down or failure.

You may also need to rebalance when:
You may also need to rebalance when you have uneven hashing of data. Uneven hashing can occur if your keys do not have a hash code method that ensures uniform distribution, or if you use a `PartitionResolver` to colocate your partitioned region data (see [Colocate Data from Different Partitioned Regions](colocating_partitioned_region_data.html#colocating_partitioned_region_data)). In either case, some buckets may receive more data than others. Rebalancing can be used to even out the load between data stores by putting fewer buckets on members that are hosting large buckets.

- You use redundancy for high availability and have configured your region to not automatically recover redundancy after a loss. In this case, <%=vars.product_name%> only restores redundancy when you invoke a rebalance. See [Configure High Availability for a Partitioned Region](configuring_ha_for_pr.html).
- You have uneven hashing of data. Uneven hashing can occur if your keys do not have a hash code method, which ensures uniform distribution, or if you use a `PartitionResolver` to colocate your partitioned region data (see [Colocate Data from Different Partitioned Regions](colocating_partitioned_region_data.html#colocating_partitioned_region_data)). In either case, some buckets may receive more data than others. Rebalancing can be used to even out the load between data stores by putting fewer buckets on members that are hosting large buckets.
If you use redundancy for high availability and have configured your region to not automatically recover redundancy after a loss, you do not need to rebalance solely to restore the lost redundancy. Instead, trigger a restore redundancy operation. See [Restoring Redundancy in Partitioned Regions](restoring_region_redundancy.html).

## <a id="rebalancing_pr_data__section_495FEE48ED60433BADB7D36C73279C89" class="no-quick-link"></a>How to Simulate Region Rebalancing

@@ -0,0 +1,59 @@
---
title: Restoring Redundancy in Partitioned Regions
---

<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
Restoring redundancy is a member operation. It affects all partitioned regions defined by the member, regardless of whether the member hosts data for the regions.

Restoring redundancy creates new redundant copies of buckets on members hosting the region and, by default, reassigns which members host the primary buckets to improve load balancing. It does not move buckets from one member to another. The reassignment of primary hosts can be prevented using the appropriate flags, as described below. See [Configure High Availability for a Partitioned Region](configuring_ha_for_pr.html) for further detail on redundancy.

For efficiency, when starting multiple members, trigger the restore redundancy operation a single time, after you have added all members.

Initiate a restore redundancy operation using one of the following:

- `gfsh` command. First, start a `gfsh` prompt and connect to the cluster. Then type the following command:

``` pre
gfsh>restore redundancy
```

Optionally, you can specify regions to include or exclude from restoring redundancy, and prevent the operation from reassigning which members host primary copies. Type `help restore redundancy` or see [restore redundancy](../../tools_modules/gfsh/command-pages/restore.html) for more information.

- API call:

``` pre
ResourceManager manager = cache.getResourceManager();
CompletableFuture<RestoreRedundancyResults> future = manager.createRestoreRedundancyOperation()
.includeRegions(regionsToInclude)
.excludeRegions(regionsToExclude)
.shouldReassignPrimaries(false)
.start();
//Get the results
RestoreRedundancyResults results = future.get();
//These are some of the details we can get about the run from the API
System.out.println("Restore redundancy operation status is " + results.getStatus());
System.out.println("Results for each included region: " + results.getMessage());
System.out.println("Number of regions with no redundant copies: " + results.getZeroRedundancyRegionResults().size();
System.out.println("Results for region " + regionName + ": " + results.getRegionResult(regionName).getMessage();
```
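
The `regionsToInclude`, `regionsToExclude`, and `regionName` variables above are placeholders. A minimal sketch of how they might be supplied, together with a non-blocking way to consume the result using standard `CompletableFuture` methods, is shown below; the region names are hypothetical, and the include/exclude methods are assumed to accept sets of region names as in the snippet above.

``` pre
// Hypothetical placeholder values (java.util imports assumed).
Set<String> regionsToInclude = new HashSet<>(Arrays.asList("region2", "region3"));
Set<String> regionsToExclude = Collections.singleton("region1");
String regionName = "region3";

// Instead of blocking on future.get(), the returned CompletableFuture can be
// consumed asynchronously.
manager.createRestoreRedundancyOperation()
    .includeRegions(regionsToInclude)
    .excludeRegions(regionsToExclude)
    .start()
    .thenAccept(r -> System.out.println("Restore redundancy finished: " + r.getMessage()));
```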

If you have `start-recovery-delay=-1` configured for your partitioned region, you will need to perform a restore redundancy operation on your region after you restart any members in your cluster in order to recover any lost redundancy.

If you have `start-recovery-delay` set to a low number, you may need to wait extra time until the region has recovered redundancy.


22 changes: 13 additions & 9 deletions geode-docs/reference/statistics_list.html.md.erb
@@ -1060,15 +1060,19 @@ These statistics are for outgoing gateway queue and its connection. The primary
Statistics related to the <%=vars.product_name%>'s resource manager. Use these to help analyze and tune your JVM memory settings and the <%=vars.product_name%> resource manager settings. The primary statistics are:
| Statistic | Description |
|-----------------------|--------------------------------------------------------------------------------------------------------------|
| `criticalThreshold` | The cache resource-manager setting critical-heap-percentage. |
| `evictionStartEvents` | Number of times eviction activities were started due to the heap use going over the eviction threshold. |
| `evictionStopEvents` | Number of times eviction activities were stopped due to the heap use going below the eviction threshold. |
| `evictionThreshold` | The cache resource-manager setting eviction-heap-percentage. |
| `heapCriticalEvents` | Number of times incoming cache activities were blocked due to heap use going over the critical threshold. |
| `heapSafeEvents` | Number of times incoming cache activities were unblocked due to heap use going under the critical threshold. |
| `tenuredHeapUsed` | Percentage of tenured heap currently in use. |
| Statistic | Description |
|---------------------------------|--------------------------------------------------------------------------------------------------------------|
| `criticalThreshold` | The cache resource-manager setting critical-heap-percentage. |
| `evictionStartEvents` | Number of times eviction activities were started due to the heap use going over the eviction threshold. |
| `evictionStopEvents` | Number of times eviction activities were stopped due to the heap use going below the eviction threshold. |
| `evictionThreshold` | The cache resource-manager setting eviction-heap-percentage. |
| `heapCriticalEvents` | Number of times incoming cache activities were blocked due to heap use going over the critical threshold. |
| `heapSafeEvents` | Number of times incoming cache activities were unblocked due to heap use going under the critical threshold. |
| `rebalancesCompleted` | Total number of cache rebalance operations that have occurred. |
| `rebalancesInProgress` | Current number of cache rebalance operations in process. |
| `restoreRedundanciesCompleted` | Total number of cache restore redundancy operations that have occurred. |
| `restoreRedundanciesInProgress` | Current number of cache restore redundancy operations in process. |
| `tenuredHeapUsed` | Percentage of tenured heap currently in use. |
## JVM Java Runtime (VMStats)
64 changes: 64 additions & 0 deletions geode-docs/tools_modules/gfsh/command-pages/restore.html.md.erb
@@ -0,0 +1,64 @@
---
title: restore redundancy
---

<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

Restore redundancy to partitioned regions and optionally reassign which members host the primary copies.

By default, all partitioned regions have their redundancy restored and primary hosts are reassigned. If any region that would have redundancy restored is a member of a colocated group, all other regions that are part of that group also have their redundancy restored. This behavior takes precedence over any included or excluded regions specified as part of the command. See [Data Colocation Between Regions](../../../developing/partitioned_regions/custom_partitioning_and_data_colocation.html#custom_partitioning_and_data_colocation__section_D2C66951FE38426F9C05050D2B9028D8).

**Availability:** Online. You must be connected in `gfsh` to a JMX Manager member to use this command.

**Syntax:**

``` pre
restore redundancy [--include-region=value(,value)*] [--exclude-region=value(,value)*] [--reassign-primaries(=value)]
```

| Name | Description | Default Value |
|------|-------------|---------------|
| &#8209;&#8209;include&#8209;region | Partitioned Region paths to be included in the restore redundancy operation. Includes take precedence over excludes. | |
| &#8209;&#8209;exclude&#8209;region | Partitioned Region paths to be excluded from the restore redundancy operation. | |
| &#8209;&#8209;reassign&#8209;primaries | If false, this operation will not attempt to reassign which members host primary buckets. | true |

**Example Commands:**

``` pre
restore redundancy
restore redundancy --include-region=/region3,/region2 --exclude-region=/region1
```
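
An additional hypothetical invocation, using the `--reassign-primaries` option from the table above to leave the current primary bucket assignments in place:

``` pre
restore redundancy --include-region=/region3 --reassign-primaries=false
```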

**Sample Output:**

``` pre
restore redundancy --include-region=/region3,/region2 --exclude-region=/region1

Number of regions with zero redundant copies = 0
Number of regions with partially satisfied redundancy = 0
Number of regions with fully satisfied redundancy = 2

Redundancy is fully satisfied for regions:
region3 redundancy status: SATISFIED. Desired redundancy is 2 and actual redundancy is 2.
region2 redundancy status: SATISFIED. Desired redundancy is 1 and actual redundancy is 1.

Total primary transfers completed = 224
Total primary transfer time (ms) = 4134
```

