Merge pull request ceph#35906 from gregsfortytwo/wip-stretch-mode

Add a new stretch mode for 2-site Ceph clusters

Reviewed-by: Josh Durgin <[email protected]>
Showing 96 changed files with 4,733 additions and 193 deletions.
=================
Monitor Elections
=================

The Original Algorithm
======================
Historically, monitor leader elections have been very simple: the lowest-ranked
monitor wins!

This is accomplished using a low-state "Elector" module (though it has now
been split into an Elector that handles message-passing, and an ElectionLogic
that makes the voting choices). It tracks the election epoch and not much
else. Odd epochs are elections; even epochs have a leader and let the monitor
do its ongoing work. When a timeout occurs or the monitor asks for a
new election, we bump the epoch and send out Propose messages to all known
monitors.
In general, if we receive an old message we either drop it or trigger a new
election (if we think the sender is newly-booted and needs to join quorum). If
we receive a message from a newer epoch, we bump up our epoch to match and
either Defer to the Proposer or else bump the epoch again and Propose
ourselves if we expect to win over them. When we receive a Propose within
our current epoch, we either Defer to the sender or ignore them (we ignore them
if they are of a higher rank than us, or higher than the rank we have already
deferred to).
(Note that if we have the highest rank it is possible for us to defer to every
other monitor in sequence within the same election epoch!)
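
That propose/defer rule can be sketched as follows (a hypothetical illustration only; the names and structure are not the real Elector code, and in Ceph lower rank values win):

```python
# Hypothetical sketch of the classic "lowest rank wins" deferral rule.
# Illustrative only -- not the real Ceph Elector/ElectionLogic code.

def handle_propose(my_rank, proposer_rank, deferred_to=None):
    """React to a Propose within the current epoch.

    deferred_to is the rank we already deferred to this epoch, or None.
    Returns "defer" or "ignore".
    """
    # We defer only if the proposer outranks (has a lower rank than)
    # both us and anyone we have already deferred to.
    best = my_rank if deferred_to is None else min(my_rank, deferred_to)
    if proposer_rank < best:
        return "defer"
    return "ignore"
```

Note that a high-ranked monitor can "defer" repeatedly within one epoch as successively lower-ranked proposers appear, which matches the parenthetical above.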

This resolves under normal circumstances because all monitors agree on the
priority voting order, and epochs are only bumped when a monitor isn't
participating or sees a possible conflict with the known proposers.

The Problems
============
The original algorithm didn't work at all under a variety of netsplit
conditions. This didn't manifest often in practice but has become
important as the community and commercial vendors move Ceph into
spaces requiring the use of "stretch clusters".

The New Algorithms
==================
We still default to the original ("classic") election algorithm, but
support letting users change to new ones via the CLI. These
algorithms are implemented as different functions and switch statements
within the ElectionLogic class.

The first algorithm is very simple: "disallow" lets you add monitors
to a list of disallowed leaders.
The second, "connectivity", incorporates connection score ratings
and elects the monitor with the best score.

Algorithm: disallow
===================
If a monitor is in the disallowed list, it always defers to another
monitor, no matter the rank. Otherwise, it behaves the same as the
classic algorithm.
Since changing the disallowed list requires a paxos update, monitors
in an election together should always have the same set. This means
the election order is constant and static across the full monitor set
and elections resolve trivially (assuming a connected network).
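
As a rough sketch (our own illustration, not the ElectionLogic code), the only new decision the disallow mode adds on top of the classic rule is:

```python
# Illustrative sketch: a monitor on the shared disallowed list never
# proposes itself and always defers; otherwise classic rules apply.

def should_propose(my_name, disallowed):
    """disallowed is the paxos-synchronized set of disallowed leaders."""
    return my_name not in disallowed
```

Because the disallowed set is agreed on via paxos, every monitor evaluates this check identically, preserving the static election order described above.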

This algorithm really just exists as a demo and stepping-stone to
the more advanced connectivity mode, but it may have utility in asymmetric
networks and clusters.

Algorithm: connectivity
=======================
This algorithm takes as input scores for each connection
(both ways, discussed in the next section) and attempts to elect the monitor
with the highest total score. We keep the same basic message-passing flow as the
classic algorithm, in which elections are driven by reacting to Propose messages.
But this has several challenges since, unlike ranks, scores are not static (and
might change during an election!). To guarantee an election epoch does not
produce multiple leaders, we must maintain two key invariants:

* Monitors must maintain static scores during an election epoch
* Any deferral must be transitive -- if A defers to B and then to C,
  B had better defer to C as well!

We handle these very explicitly: by branching a copy, stable_peer_tracker,
of our peer_tracker scoring object whenever starting an election (or
bumping the epoch), and by refusing to defer to a monitor if it won't
be deferred to by our current leader choice. (All Propose messages include
a copy of the scores the leader is working from, so peers can evaluate them.)

Of course, those modifications can easily block. To guarantee forward progress,
we make several further adjustments:

* If we want to defer to a new peer, but have already deferred to a peer
  whose scores don't allow that, we bump the election epoch and start()
  the election over again.
* All election messages include the scores the sender is aware of.

This guarantees we will resolve the election as long as the network is
reasonably stable (even if disconnected): as long as all score "views"
result in the same deferral order, an election will complete normally. And by
broadly sharing scores across the full set of monitors, monitors rapidly
converge on the global newest state.
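
A toy model of the leader choice itself (purely illustrative; it assumes total scores are frozen for the epoch, as the invariants above require) might pick the peer with the highest total score, falling back to rank on ties:

```python
# Toy model of connectivity-mode leader choice: highest total score
# wins, lowest rank breaks ties. Not the real ElectionLogic code.

def pick_leader(scores, ranks):
    """scores/ranks: dicts mapping monitor name -> total score / rank."""
    # Sort by descending score, then ascending rank, and take the best.
    return min(scores, key=lambda m: (-scores[m], ranks[m]))
```

Because every monitor evaluates this over the same (stable, shared) score set, all deferrals point at the same winner, which is what makes deferral transitive within an epoch.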

This algorithm has one further important feature compared to the classic and
disallow handlers: it can ignore out-of-quorum peers. Normally, whenever
a monitor B receives a Propose from an out-of-quorum peer C, B will itself trigger
a new election to give C an opportunity to join. But because the
highest-scoring monitor A may be netsplit from C, this is not desirable. So in
the connectivity election algorithm, B only "forwards" Propose messages when B's
scores indicate the cluster would choose a leader other than A.

Connection Scoring
==================
We implement scoring within the ConnectionTracker class, which is
driven by the Elector and provided to ElectionLogic as a resource. Elector
is responsible for sending out MMonPing messages, and for reporting the
results in to the ConnectionTracker as calls to report_[live|dead]_connection
with the relevant peer and the time units the call counts for. (These time units
are seconds in the monitor, but the ConnectionTracker is agnostic and our unit
tests count simple time steps.)

We configure a "half life" and each report updates the peer's current status
(alive or dead) and its total score. The new score is::

  new_score = current_score * (1 - units_alive / (2 * half_life)) + (units_alive / (2 * half_life))

(For a dead report, we of course subtract the new delta, rather than adding it.)
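
As a sketch of that update (our own Python, not the ConnectionTracker code; clamping a dead report's result at zero is our assumption, since the text only says the delta is subtracted):

```python
# Sketch of the half-life score update described above (illustrative).

def update_score(current, units, half_life, alive=True):
    """Apply one live/dead report's effect on a peer's score in [0, 1]."""
    delta = units / (2.0 * half_life)
    if alive:
        # Decay the old score and add the delta, trending toward 1.
        return current * (1 - delta) + delta
    # Dead report: subtract the delta instead, trending toward 0
    # (clamping at 0 is our assumption).
    return max(0.0, current * (1 - delta) - delta)
```

With this form, half_life units of live reports move a fresh peer's score from 0 to 0.5, and a peer already at 1.0 stays at 1.0.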

We can further encode and decode the ConnectionTracker for wire transmission,
and receive_peer_report()s of a full ConnectionTracker (containing all
known scores) or a ConnectionReport (representing a single peer's scores)
to slurp up the scores from peers. These scores are of course all versioned so
we are in no danger of accidentally going backwards in time.
We can query an individual connection score (if the connection is down, it's 0)
or the total score of a specific monitor, which is the connection score from all
other monitors going in to that one.
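
For instance (a sketch of the aggregation just described, not the real ConnectionTracker API), a monitor's total score sums the per-connection scores other monitors report toward it, with down connections contributing 0:

```python
# Illustrative total-score aggregation; the report layout is hypothetical.

def total_score(target, reports):
    """reports: {reporter: {peer: (score, alive)}}.

    A down connection contributes 0, per the rule above.
    """
    total = 0.0
    for reporter, peers in reports.items():
        if reporter == target:
            continue  # a monitor doesn't score its own connection
        score, alive = peers.get(target, (0.0, False))
        total += score if alive else 0.0
    return total
```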

By default, we consider pings failed after 2 seconds (mon_elector_ping_timeout)
and ping live connections every second (mon_elector_ping_divisor). The half life
is 12 hours (mon_con_tracker_score_halflife).
.. _changing_monitor_elections:

=====================================
Configure Monitor Election Strategies
=====================================

By default, the monitors will use the classic mode they have always used. We
recommend you stay in this mode unless you require features in the other
modes.

If you want to switch modes BEFORE constructing the cluster, change
the ``mon election default strategy`` option. This option is an integer value:

* 1 for "classic"
* 2 for "disallow"
* 3 for "connectivity"
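
The pairs above can be captured as a small lookup (a convenience sketch of the same table, nothing more):

```python
# The integer values of "mon election default strategy" and their names,
# exactly as listed above.
ELECTION_STRATEGIES = {1: "classic", 2: "disallow", 3: "connectivity"}

def strategy_value(name):
    """Look up the integer to set for a given strategy name."""
    return {v: k for k, v in ELECTION_STRATEGIES.items()}[name]
```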

Once your cluster is running, you can change strategies by running ::

  $ ceph mon set election_strategy {classic|disallow|connectivity}

Choosing a mode
===============
The modes other than classic provide different features. We recommend
you stay in classic mode if you don't need the extra features as it is
the simplest mode.

The disallow Mode
=================
This mode lets you mark monitors as disallowed, in which case they will
participate in the quorum and serve clients, but cannot be elected leader. You
may wish to use this if you have some monitors which are known to be far away
from clients.
You can disallow a leader by running ::

  $ ceph mon add disallowed_leader {name}

You can remove a monitor from the disallowed list, and allow it to become
a leader again, by running ::

  $ ceph mon rm disallowed_leader {name}

The list of disallowed_leaders is included when you run ::

  $ ceph mon dump
The connectivity Mode
=====================
This mode evaluates connection scores provided by each monitor for its
peers and elects the monitor with the highest score. This mode is designed
to handle netsplits, which may happen if your cluster is stretched across
multiple data centers or is otherwise susceptible to network partitions.

This mode also supports disallowing monitors from being the leader
using the same commands as in disallow mode above.

Examining connectivity scores
=============================
The monitors maintain connection scores even if they aren't in
the connectivity election mode. You can examine the scores a monitor
has by running ::

  ceph daemon mon.{name} connection scores dump

Scores for individual connections range from 0-1 inclusive, and also
include whether the connection is considered alive or dead (determined by
whether it returned its latest ping within the timeout).

While this would be an unexpected occurrence, if for some reason you experience
problems and troubleshooting makes you think your scores have become invalid,
you can forget history and reset them by running ::

  ceph daemon mon.{name} connection scores reset

While resetting scores has low risk (monitors will still quickly determine
if a connection is alive or dead, and trend back to the previous scores if they
were accurate!), it should also not be needed and is not recommended unless
requested by your support team or a developer.
@@ -41,6 +41,8 @@ CRUSH algorithm.
    upmap
    crush-map
    crush-map-edits
    stretch-mode
    change-mon-elections
.. _stretch_mode:

================
Stretch Clusters
================


Stretch Clusters
================
Ceph generally expects all parts of its network and overall cluster to be
equally reliable, with failures randomly distributed across the CRUSH map.
So you may lose a switch that knocks out a big segment of OSDs, but we expect
the remaining OSDs and monitors to route around that.

This is usually a good choice, but may not work well in some
stretched cluster configurations where a significant part of your cluster
is stuck behind a single network component. For instance, you may have a
single cluster located in multiple data centers, and want to sustain the
loss of a full DC.

There are two standard configurations we've seen deployed, with either
two or three data centers (or, in clouds, availability zones). With two
zones, we expect each site to hold a copy of the data, and for a third
site to have a tiebreaker monitor (this can be a VM, or have high latency
compared to the main sites) to pick a winner if the network connection fails
and both DCs remain alive. For three sites, we expect a copy of the data and
an equal number of monitors in each site.

Note that the standard Ceph configuration will survive MANY failures of
the network or data centers, if you have configured it correctly, and it will
never compromise data consistency -- if you bring back enough of the Ceph servers
following a failure, it will recover. If you lose
a data center and can still form a quorum of monitors and have all the data
available (with enough copies to satisfy min_size, or CRUSH rules that will
re-replicate to meet it), Ceph will maintain availability.

What can't it handle?

Stretch Cluster Issues
======================
No matter what happens, Ceph will not compromise on data integrity
and consistency. If there's a failure in your network or a loss of nodes and
you can restore service, Ceph will return to normal functionality on its own.

But there are scenarios where you lose data availability despite having
enough servers available to satisfy Ceph's consistency and sizing constraints, or
where you may be surprised to not satisfy Ceph's constraints.
The first important category of these failures revolves around inconsistent
networks -- if there's a netsplit, Ceph may be unable to mark OSDs down and kick
them out of the acting PG sets despite the primary being unable to replicate data.
If this happens, IO will not be permitted, because Ceph can't satisfy its durability
guarantees.

The second important category of failures is when you think you have data replicated
across data centers, but the constraints aren't sufficient to guarantee this.
For instance, you might have data centers A and B, and your CRUSH rule targets 3 copies
and places a copy in each data center with a min_size of 2. The PG may go active with
2 copies in site A and no copies in site B, which means that if you then lose site A you
have lost data and Ceph can't operate on it. This situation is surprisingly difficult
to avoid with standard CRUSH rules.
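
To make that second scenario concrete, here is a toy check (with a hypothetical OSD-to-site layout of our own) showing that both of Ceph's constraints can be satisfied while one site holds zero copies:

```python
# Toy illustration of the surprising-placement problem described above.
# osd_site is a hypothetical OSD id -> data center mapping.

osd_site = {0: "A", 1: "A", 2: "A", 3: "B"}
acting = [0, 1]   # the PG went active with 2 copies, both in site A
min_size = 2

# Ceph's sizing constraint is satisfied...
assert len(acting) >= min_size

# ...yet every acting copy lives in site A, so losing site A loses the data.
sites_with_copies = {osd_site[o] for o in acting}
assert sites_with_copies == {"A"}
```

Stretch mode (below) closes this gap by refusing to go active until the PG has peered across both data centers.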

Stretch Mode
============
The new stretch mode is designed to handle the 2-site case. (3 sites are
just as susceptible to netsplit issues, but much more resilient to surprising
data availability ones than 2-site clusters are.)

To enter stretch mode, you must set the location of each monitor, matching
your CRUSH map. For instance, to place mon.a in your first data center ::

  $ ceph mon set_location a datacenter=site1

Next, generate a CRUSH rule which will place 2 copies in each data center. This
will require editing the crush map directly::

  $ ceph osd getcrushmap > crush.map.bin
  $ crushtool -d crush.map.bin -o crush.map.txt

Then edit the crush.map.txt file to add a new rule. Here
there is only one other rule, so this is id 1, but you may need
to use a different rule id. We also have two data center buckets
named site1 and site2::

  rule stretch_rule {
          id 1
          type replicated
          min_size 1
          max_size 10
          step take site1
          step chooseleaf firstn 2 type host
          step emit
          step take site2
          step chooseleaf firstn 2 type host
          step emit
  }

Finally, inject the crushmap to make the rule available to the cluster::

  $ crushtool -c crush.map.txt -o crush2.map.bin
  $ ceph osd setcrushmap -i crush2.map.bin

If you aren't already running your monitors in connectivity mode, do so with
the instructions in `Changing Monitor Elections`_.

.. _Changing Monitor Elections: ../change-mon-elections
And last, tell the cluster to enter stretch mode. Here, mon.e is the
tiebreaker and we are splitting across data centers ::

  $ ceph mon enable_stretch_mode e stretch_rule datacenter

When stretch mode is enabled, the OSDs will only take PGs active when
they peer across data centers (or whatever other CRUSH bucket type
you specified), assuming both are alive. Pools will increase in size
from the default 3 to 4, expecting 2 copies in each site. OSDs will only
be allowed to connect to monitors in the same data center.

If all the OSDs and monitors from a data center become inaccessible
at once, the surviving data center will enter a degraded stretch mode,
reducing pool size to 2 and min_size to 1, issuing a warning, and
going active by itself.

When the missing data center comes back, the cluster will enter
recovery stretch mode. It increases the pool size back to 4 and min_size to 2,
but still only requires OSDs from the data center which was up the whole time.
It continues issuing a warning. This mode then waits until all PGs are in
a known state, and are neither degraded nor incomplete. At that point,
it transitions back to regular stretch mode and the warning ends.
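
The transitions described above can be summarized as a small state machine (our own schematic summary with illustrative names, not code from Ceph):

```python
# Schematic summary of the stretch-mode transitions described above.
# State and event names are illustrative, not Ceph identifiers.

TRANSITIONS = {
    ("stretch", "site_lost"): "degraded",      # size -> 2, min_size -> 1
    ("degraded", "site_returns"): "recovery",  # size -> 4, min_size -> 2
    ("recovery", "pgs_clean"): "stretch",      # warning clears
}

def next_mode(mode, event):
    """Return the next mode; unknown events leave the mode unchanged."""
    return TRANSITIONS.get((mode, event), mode)
```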

Stretch Mode Limitations
========================
As implied by the setup, stretch mode only handles 2 sites with OSDs.

While it is not enforced, you should run 2 monitors in each site plus
a tiebreaker, for a total of 5. This is because OSDs can only connect
to monitors in their own site when in stretch mode.

You cannot use erasure-coded pools with stretch mode. If you try, stretch mode
will refuse to enable, and it will not allow you to create EC pools once enabled.

You must create your own CRUSH rule which provides 2 copies in each site, and
you must use 4 total copies with 2 in each site. If you have existing pools
with non-default size/min_size, Ceph will object when you attempt to
enable_stretch_mode.

Because it runs with min_size 1 when degraded, you should only use stretch mode
with all-flash OSDs.

Hopefully, future development will extend this feature to support EC pools and
running with more than 2 full sites.

Other commands
==============
When in stretch degraded mode, the cluster will go into "recovery" mode automatically
when the disconnected data center comes back. If that doesn't work, or you want to
enable recovery mode early, you can invoke ::

  $ ceph osd force_recovery_stretch_mode --yes-i-really-mean-it

But this command should not be necessary; it is included to deal with
unanticipated situations.

When in recovery mode, the cluster should go back into normal stretch mode
when the PGs are healthy. If this doesn't happen, or you want to force the
cross-data-center peering early and are willing to risk data downtime (or have
verified separately that all the PGs can peer, even if they aren't fully
recovered), you can invoke ::

  $ ceph osd force_healthy_stretch_mode --yes-i-really-mean-it

This command should not be necessary; it is included to deal with
unanticipated situations. But you might wish to invoke it to remove
the HEALTH_WARN state which recovery mode generates.