Merge pull request ceph#35906 from gregsfortytwo/wip-stretch-mode

Add a new stretch mode for 2-site Ceph clusters

Reviewed-by: Josh Durgin <[email protected]>
Showing 96 changed files with 4,733 additions and 193 deletions.
=================
Monitor Elections
=================

The Original Algorithm
======================
Historically, monitor leader elections have been very simple: the lowest-ranked
monitor wins!

This is accomplished using a low-state "Elector" module (though it has now
been split into an Elector that handles message-passing, and an ElectionLogic
that makes the voting choices). It tracks the election epoch and not much
else. Odd epochs are elections; even epochs have a leader and let the monitor
do its ongoing work. When a timeout occurs or the monitor asks for a
new election, we bump the epoch and send out Propose messages to all known
monitors.
In general, if we receive an old message we either drop it or trigger a new
election (if we think the sender is newly-booted and needs to join quorum). If
we receive a message from a newer epoch, we bump up our epoch to match and
either Defer to the Proposer or else bump the epoch again and Propose
ourselves if we expect to win over them. When we receive a Propose within
our current epoch, we either Defer to the sender or ignore them (we ignore them
if they are of a higher rank than us, or higher than the rank we have already
deferred to).
(Note that if we have the highest rank it is possible for us to defer to every
other monitor in sequence within the same election epoch!)
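
That propose/defer rule can be sketched as follows (a hypothetical illustration only; the names and structure are not the real Elector code, and in Ceph lower rank values win):

```python
# Hypothetical sketch of the classic "lowest rank wins" deferral rule.
# Illustrative only -- not the real Ceph Elector/ElectionLogic code.

def handle_propose(my_rank, proposer_rank, deferred_to=None):
    """React to a Propose within the current epoch.

    deferred_to is the rank we already deferred to this epoch, or None.
    Returns "defer" or "ignore".
    """
    # We defer only if the proposer outranks (has a lower rank than)
    # both us and anyone we have already deferred to.
    best = my_rank if deferred_to is None else min(my_rank, deferred_to)
    if proposer_rank < best:
        return "defer"
    return "ignore"
```

Note that a high-ranked monitor can "defer" repeatedly within one epoch as successively lower-ranked proposers appear, which matches the parenthetical above.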

This resolves under normal circumstances because all monitors agree on the
priority voting order, and epochs are only bumped when a monitor isn't
participating or sees a possible conflict with the known proposers.

The Problems
============
The original algorithm didn't work at all under a variety of netsplit
conditions. This didn't manifest often in practice but has become
important as the community and commercial vendors move Ceph into
spaces requiring the use of "stretch clusters".

The New Algorithms
==================
We still default to the original ("classic") election algorithm, but
support letting users change to new ones via the CLI. These
algorithms are implemented as different functions and switch statements
within the ElectionLogic class.

The first algorithm is very simple: "disallow" lets you add monitors
to a list of disallowed leaders.
The second, "connectivity", incorporates connection score ratings
and elects the monitor with the best score.

Algorithm: disallow
===================
If a monitor is in the disallowed list, it always defers to another
monitor, no matter the rank. Otherwise, it behaves the same as the
classic algorithm.
Since changing the disallowed list requires a paxos update, monitors
in an election together should always have the same set. This means
the election order is constant and static across the full monitor set
and elections resolve trivially (assuming a connected network).
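
As a rough sketch (our own illustration, not the ElectionLogic code), the only new decision the disallow mode adds on top of the classic rule is:

```python
# Illustrative sketch: a monitor on the shared disallowed list never
# proposes itself and always defers; otherwise classic rules apply.

def should_propose(my_name, disallowed):
    """disallowed is the paxos-synchronized set of disallowed leaders."""
    return my_name not in disallowed
```

Because the disallowed set is agreed on via paxos, every monitor evaluates this check identically, preserving the static election order described above.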

This algorithm really just exists as a demo and stepping-stone to
the more advanced connectivity mode, but it may have utility in asymmetric
networks and clusters.

Algorithm: connectivity
=======================
This algorithm takes as input scores for each connection
(both ways, discussed in the next section) and attempts to elect the monitor
with the highest total score. We keep the same basic message-passing flow as the
classic algorithm, in which elections are driven by reacting to Propose messages.
But this has several challenges since, unlike ranks, scores are not static (and
might change during an election!). To guarantee an election epoch does not
produce multiple leaders, we must maintain two key invariants:

* Monitors must maintain static scores during an election epoch
* Any deferral must be transitive -- if A defers to B and then to C,
  B had better defer to C as well!

We handle these very explicitly: by branching a copy, stable_peer_tracker,
of our peer_tracker scoring object whenever starting an election (or
bumping the epoch), and by refusing to defer to a monitor if it won't
be deferred to by our current leader choice. (All Propose messages include
a copy of the scores the leader is working from, so peers can evaluate them.)

Of course, those modifications can easily block. To guarantee forward progress,
we make several further adjustments:

* If we want to defer to a new peer, but have already deferred to a peer
  whose scores don't allow that, we bump the election epoch and start()
  the election over again.
* All election messages include the scores the sender is aware of.

This guarantees we will resolve the election as long as the network is
reasonably stable (even if disconnected): as long as all score "views"
result in the same deferral order, an election will complete normally. And by
broadly sharing scores across the full set of monitors, monitors rapidly
converge on the global newest state.
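
A toy model of the leader choice itself (purely illustrative; it assumes total scores are frozen for the epoch, as the invariants above require) might pick the peer with the highest total score, falling back to rank on ties:

```python
# Toy model of connectivity-mode leader choice: highest total score
# wins, lowest rank breaks ties. Not the real ElectionLogic code.

def pick_leader(scores, ranks):
    """scores/ranks: dicts mapping monitor name -> total score / rank."""
    # Sort by descending score, then ascending rank, and take the best.
    return min(scores, key=lambda m: (-scores[m], ranks[m]))
```

Because every monitor evaluates this over the same (stable, shared) score set, all deferrals point at the same winner, which is what makes deferral transitive within an epoch.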

This algorithm has one further important feature compared to the classic and
disallow handlers: it can ignore out-of-quorum peers. Normally, whenever
a monitor B receives a Propose from an out-of-quorum peer C, B will itself trigger
a new election to give C an opportunity to join. But because the
highest-scoring monitor A may be netsplit from C, this is not desirable. So in
the connectivity election algorithm, B only "forwards" Propose messages when B's
scores indicate the cluster would choose a leader other than A.

Connection Scoring
==================
We implement scoring within the ConnectionTracker class, which is
driven by the Elector and provided to ElectionLogic as a resource. Elector
is responsible for sending out MMonPing messages, and for reporting the
results in to the ConnectionTracker as calls to report_[live|dead]_connection
with the relevant peer and the time units the call counts for. (These time units
are seconds in the monitor, but the ConnectionTracker is agnostic and our unit
tests count simple time steps.)

We configure a "half life" and each report updates the peer's current status
(alive or dead) and its total score. The new score is::

  new_score = current_score * (1 - units_alive / (2 * half_life)) + (units_alive / (2 * half_life))

(For a dead report, we of course subtract the new delta, rather than adding it.)
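
As a sketch of that update (our own Python, not the ConnectionTracker code; clamping a dead report's result at zero is our assumption, since the text only says the delta is subtracted):

```python
# Sketch of the half-life score update described above (illustrative).

def update_score(current, units, half_life, alive=True):
    """Apply one live/dead report's effect on a peer's score in [0, 1]."""
    delta = units / (2.0 * half_life)
    if alive:
        # Decay the old score and add the delta, trending toward 1.
        return current * (1 - delta) + delta
    # Dead report: subtract the delta instead, trending toward 0
    # (clamping at 0 is our assumption).
    return max(0.0, current * (1 - delta) - delta)
```

With this form, half_life units of live reports move a fresh peer's score from 0 to 0.5, and a peer already at 1.0 stays at 1.0.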

We can further encode and decode the ConnectionTracker for wire transmission,
and receive_peer_report()s of a full ConnectionTracker (containing all
known scores) or a ConnectionReport (representing a single peer's scores)
to slurp up the scores from peers. These scores are of course all versioned so
we are in no danger of accidentally going backwards in time.
We can query an individual connection score (if the connection is down, it's 0)
or the total score of a specific monitor, which is the connection score from all
other monitors going in to that one.
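
For instance (a sketch of the aggregation just described, not the real ConnectionTracker API), a monitor's total score sums the per-connection scores other monitors report toward it, with down connections contributing 0:

```python
# Illustrative total-score aggregation; the report layout is hypothetical.

def total_score(target, reports):
    """reports: {reporter: {peer: (score, alive)}}.

    A down connection contributes 0, per the rule above.
    """
    total = 0.0
    for reporter, peers in reports.items():
        if reporter == target:
            continue  # a monitor doesn't score its own connection
        score, alive = peers.get(target, (0.0, False))
        total += score if alive else 0.0
    return total
```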

By default, we consider pings failed after 2 seconds (mon_elector_ping_timeout)
and ping live connections every second (mon_elector_ping_divisor). The half life
is 12 hours (mon_con_tracker_score_halflife).
.. _changing_monitor_elections:

=====================================
Configure Monitor Election Strategies
=====================================

By default, the monitors will use the classic mode they have always used. We
recommend you stay in this mode unless you require features in the other
modes.

If you want to switch modes BEFORE constructing the cluster, change
the ``mon election default strategy`` option. This option is an integer value:

* 1 for "classic"
* 2 for "disallow"
* 3 for "connectivity"
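
The pairs above can be captured as a small lookup (a convenience sketch of the same table, nothing more):

```python
# The integer values of "mon election default strategy" and their names,
# exactly as listed above.
ELECTION_STRATEGIES = {1: "classic", 2: "disallow", 3: "connectivity"}

def strategy_value(name):
    """Look up the integer to set for a given strategy name."""
    return {v: k for k, v in ELECTION_STRATEGIES.items()}[name]
```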

Once your cluster is running, you can change strategies by running ::

  $ ceph mon set election_strategy {classic|disallow|connectivity}

Choosing a mode
===============
The modes other than classic provide different features. We recommend
you stay in classic mode if you don't need the extra features as it is
the simplest mode.

The disallow Mode
=================
This mode lets you mark monitors as disallowed, in which case they will
participate in the quorum and serve clients, but cannot be elected leader. You
may wish to use this if you have some monitors which are known to be far away
from clients.
You can disallow a leader by running ::

  $ ceph mon add disallowed_leader {name}

You can remove a monitor from the disallowed list, and allow it to become
a leader again, by running ::

  $ ceph mon rm disallowed_leader {name}

The list of disallowed_leaders is included when you run ::

  $ ceph mon dump
The connectivity Mode
=====================
This mode evaluates connection scores provided by each monitor for its
peers and elects the monitor with the highest score. This mode is designed
to handle netsplits, which may happen if your cluster is stretched across
multiple data centers or is otherwise susceptible to network partitions.

This mode also supports disallowing monitors from being the leader
using the same commands as in disallow mode above.

Examining connectivity scores
=============================
The monitors maintain connection scores even if they aren't in
the connectivity election mode. You can examine the scores a monitor
has by running ::

  ceph daemon mon.{name} connection scores dump

Scores for individual connections range from 0-1 inclusive, and also
include whether the connection is considered alive or dead (determined by
whether it returned its latest ping within the timeout).

While this would be an unexpected occurrence, if for some reason you experience
problems and troubleshooting makes you think your scores have become invalid,
you can forget history and reset them by running ::

  ceph daemon mon.{name} connection scores reset

While resetting scores has low risk (monitors will still quickly determine
if a connection is alive or dead, and trend back to the previous scores if they
were accurate!), it should also not be needed and is not recommended unless
requested by your support team or a developer.
@@ -41,6 +41,8 @@ CRUSH algorithm.
    upmap
    crush-map
    crush-map-edits
    stretch-mode
    change-mon-elections
.. _stretch_mode:

================
Stretch Clusters
================


Stretch Clusters
================
Ceph generally expects all parts of its network and overall cluster to be
equally reliable, with failures randomly distributed across the CRUSH map.
So you may lose a switch that knocks out a big segment of OSDs, but we expect
the remaining OSDs and monitors to route around that.

This is usually a good choice, but may not work well in some
stretched cluster configurations where a significant part of your cluster
is stuck behind a single network component. For instance, you may have a
single cluster located in multiple data centers, and want to sustain the
loss of a full DC.

There are two standard configurations we've seen deployed, with either
two or three data centers (or, in clouds, availability zones). With two
zones, we expect each site to hold a copy of the data, and for a third
site to have a tiebreaker monitor (this can be a VM, or have high latency
compared to the main sites) to pick a winner if the network connection fails
and both DCs remain alive. For three sites, we expect a copy of the data and
an equal number of monitors in each site.

Note that the standard Ceph configuration will survive MANY failures of
the network or data centers, if you have configured it correctly, and it will
never compromise data consistency -- if you bring back enough of the Ceph servers
following a failure, it will recover. If you lose
a data center and can still form a quorum of monitors and have all the data
available (with enough copies to satisfy min_size, or CRUSH rules that will
re-replicate to meet it), Ceph will maintain availability.

What can't it handle?

Stretch Cluster Issues
======================
No matter what happens, Ceph will not compromise on data integrity
and consistency. If there's a failure in your network or a loss of nodes and
you can restore service, Ceph will return to normal functionality on its own.

But there are scenarios where you lose data availability despite having
enough servers available to satisfy Ceph's consistency and sizing constraints, or
where you may be surprised to not satisfy Ceph's constraints.
The first important category of these failures revolves around inconsistent
networks -- if there's a netsplit, Ceph may be unable to mark OSDs down and kick
them out of the acting PG sets despite the primary being unable to replicate data.
If this happens, IO will not be permitted, because Ceph can't satisfy its durability
guarantees.

The second important category of failures is when you think you have data replicated
across data centers, but the constraints aren't sufficient to guarantee this.
For instance, you might have data centers A and B, and your CRUSH rule targets 3 copies
and places a copy in each data center with a min_size of 2. The PG may go active with
2 copies in site A and no copies in site B, which means that if you then lose site A you
have lost data and Ceph can't operate on it. This situation is surprisingly difficult
to avoid with standard CRUSH rules.
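
To make that second scenario concrete, here is a toy check (with a hypothetical OSD-to-site layout of our own) showing that both of Ceph's constraints can be satisfied while one site holds zero copies:

```python
# Toy illustration of the surprising-placement problem described above.
# osd_site is a hypothetical OSD id -> data center mapping.

osd_site = {0: "A", 1: "A", 2: "A", 3: "B"}
acting = [0, 1]   # the PG went active with 2 copies, both in site A
min_size = 2

# Ceph's sizing constraint is satisfied...
assert len(acting) >= min_size

# ...yet every acting copy lives in site A, so losing site A loses the data.
sites_with_copies = {osd_site[o] for o in acting}
assert sites_with_copies == {"A"}
```

Stretch mode (below) closes this gap by refusing to go active until the PG has peered across both data centers.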

Stretch Mode
============
The new stretch mode is designed to handle the 2-site case. (3 sites are
just as susceptible to netsplit issues, but much more resilient to surprising
data availability ones than 2-site clusters are.)

To enter stretch mode, you must set the location of each monitor, matching
your CRUSH map. For instance, to place mon.a in your first data center ::

  $ ceph mon set_location a datacenter=site1

Next, generate a CRUSH rule which will place 2 copies in each data center. This
will require editing the crush map directly::

  $ ceph osd getcrushmap > crush.map.bin
  $ crushtool -d crush.map.bin -o crush.map.txt

Then edit the crush.map.txt file to add a new rule. Here
there is only one other rule, so this is id 1, but you may need
to use a different rule id. We also have two data center buckets
named site1 and site2::

  rule stretch_rule {
          id 1
          type replicated
          min_size 1
          max_size 10
          step take site1
          step chooseleaf firstn 2 type host
          step emit
          step take site2
          step chooseleaf firstn 2 type host
          step emit
  }

Finally, inject the crushmap to make the rule available to the cluster::

  $ crushtool -c crush.map.txt -o crush2.map.bin
  $ ceph osd setcrushmap -i crush2.map.bin

If you aren't already running your monitors in connectivity mode, do so with
the instructions in `Changing Monitor Elections`_.

.. _Changing Monitor Elections: ../change-mon-elections
And last, tell the cluster to enter stretch mode. Here, mon.e is the
tiebreaker and we are splitting across data centers ::

  $ ceph mon enable_stretch_mode e stretch_rule datacenter

When stretch mode is enabled, the OSDs will only take PGs active when
they peer across data centers (or whatever other CRUSH bucket type
you specified), assuming both are alive. Pools will increase in size
from the default 3 to 4, expecting 2 copies in each site. OSDs will only
be allowed to connect to monitors in the same data center.

If all the OSDs and monitors from a data center become inaccessible
at once, the surviving data center will enter a degraded stretch mode,
reducing pool size to 2 and min_size to 1, issuing a warning, and
going active by itself.

When the missing data center comes back, the cluster will enter
recovery stretch mode. It increases the pool size back to 4 and min_size to 2,
but still only requires OSDs from the data center which was up the whole time.
It continues issuing a warning. This mode then waits until all PGs are in
a known state, and are neither degraded nor incomplete. At that point,
it transitions back to regular stretch mode and the warning ends.
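
The transitions described above can be summarized as a small state machine (our own schematic summary with illustrative names, not code from Ceph):

```python
# Schematic summary of the stretch-mode transitions described above.
# State and event names are illustrative, not Ceph identifiers.

TRANSITIONS = {
    ("stretch", "site_lost"): "degraded",      # size -> 2, min_size -> 1
    ("degraded", "site_returns"): "recovery",  # size -> 4, min_size -> 2
    ("recovery", "pgs_clean"): "stretch",      # warning clears
}

def next_mode(mode, event):
    """Return the next mode; unknown events leave the mode unchanged."""
    return TRANSITIONS.get((mode, event), mode)
```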

Stretch Mode Limitations
========================
As implied by the setup, stretch mode only handles 2 sites with OSDs.

While it is not enforced, you should run 2 monitors in each site plus
a tiebreaker, for a total of 5. This is because OSDs can only connect
to monitors in their own site when in stretch mode.

You cannot use erasure-coded pools with stretch mode. If you try, stretch mode
will refuse to enable, and it will not allow you to create EC pools once enabled.

You must create your own CRUSH rule which provides 2 copies in each site, and
you must use 4 total copies with 2 in each site. If you have existing pools
with non-default size/min_size, Ceph will object when you attempt to
enable_stretch_mode.

Because it runs with min_size 1 when degraded, you should only use stretch mode
with all-flash OSDs.

Hopefully, future development will extend this feature to support EC pools and
running with more than 2 full sites.

Other commands
==============
When in stretch degraded mode, the cluster will go into "recovery" mode automatically
when the disconnected data center comes back. If that doesn't work, or you want to
enable recovery mode early, you can invoke ::

  $ ceph osd force_recovery_stretch_mode --yes-i-really-mean-it

But this command should not be necessary; it is included to deal with
unanticipated situations.

When in recovery mode, the cluster should go back into normal stretch mode
when the PGs are healthy. If this doesn't happen, or you want to force the
cross-data-center peering early and are willing to risk data downtime (or have
verified separately that all the PGs can peer, even if they aren't fully
recovered), you can invoke ::

  $ ceph osd force_healthy_stretch_mode --yes-i-really-mean-it

This command should not be necessary; it is included to deal with
unanticipated situations. But you might wish to invoke it to remove
the HEALTH_WARN state which recovery mode generates.