Skip to content

Commit

Permalink
doc: document use of CRUSH tunables
Browse files Browse the repository at this point in the history
Signed-off-by: Sage Weil <[email protected]>
  • Loading branch information
Sage Weil committed Aug 14, 2012
1 parent b254ba7 commit 3267127
Showing 1 changed file with 104 additions and 0 deletions.
104 changes: 104 additions & 0 deletions doc/ops/manage/crush.rst
Original file line number Diff line number Diff line change
Expand Up @@ -93,3 +93,107 @@ You can remove a device from the crush map with::

$ ceph osd crush remove osd.123

Tunables
========

There are several magic numbers that were used in the original CRUSH
implementation that have proven to be poor choices. To support
the transition away from them, newer versions of CRUSH (starting with
the v0.48 argonaut series) allow the values to be adjusted or tuned.

Clusters running recent Ceph releases support using the tunable values
in the CRUSH maps. However, older clients and daemons will not correctly interact
with clusters using the "tuned" CRUSH maps. To detect this situation,
there is now a feature bit ``CRUSH_TUNABLES`` (value 0x40000) to
reflect support for tunables.

If the OSDMap currently used by the ``ceph-mon`` or ``ceph-osd``
daemon has non-legacy values, it will require the ``CRUSH_TUNABLES``
feature bit from clients and daemons who connect to it. This means
that old clients will not be able to connect.

At some future point in time, newly created clusters will have
improved default values for the tunables. This is a matter of waiting
until the support has been present in the Linux kernel clients long
enough to make this a painless transition for most users.

Impact of legacy values
~~~~~~~~~~~~~~~~~~~~~~~

The legacy values result in several misbehaviors:

* For hiearchies with a small number of devices in the leaf buckets,
some PGs map to fewer than the desired number of replicas. This
commonly happens for hiearchies with "host" nodes with a small
number (1-3) of OSDs nested beneath each one.

* For large clusters, some small percentages of PGs map to less than
the desired number of OSDs. This is more prevalent when there are
several layers of the hierarchy (e.g., row, rack, host, osd).

* When some OSDs are marked out, the data tends to get redistributed
to nearby OSDs instead of across the entire hierarchy.

Which client versions support tunables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* argonaut series, v0.48.1 or later
* v0.49 or later
* Linux kernel version v3.5 or later (for the file system and RBD kernel clients)

A few important points
~~~~~~~~~~~~~~~~~~~~~~

* Adjusting these values will result in the shift of some PGs between
storage nodes. If the Ceph cluster is already storing a lot of
data, be prepared for some fraction of the data to move.
* The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the
``CRUSH_TUNABLES`` feature of new connections as soon as they get
the updated map. However, already-connected clients are
effectively grandfathered in, and will misbehave if they do not
support the new feature.
* If the CRUSH tunables are set to non-legacy values and then later
changed back to the defult values, ``ceph-osd`` daemons will not be
required to support the feature. However, the OSD peering process
requires examining and understanding old maps. Therefore, you
should not run old (pre-v0.48) versions of the ``ceph-osd`` daemon
if the cluster has previosly used non-legacy CRUSH values, even if
the latest version of the map has been switched back to using the
legacy defaults.

Tuning CRUSH
~~~~~~~~~~~~

If you can ensure that all clients are running recent code, you can
adjust the tunables by extracting the CRUSH map, modifying the values,
and reinjecting it into the cluster.

* Extract the latest CRUSH map::

ceph osd getcrushmap -o /tmp/crush

* Adjust tunables. These values appear to offer the best behavior
for both large and small clusters we tested with. You will need to
additionally specify the ``--enable-unsafe-tunables`` argument to
``crushtool`` for this to work. Please use this option with
extreme care.::

crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new

* Reinject modified map::

ceph osd setcrushmap -i /tmp/crush.new

Legacy values
~~~~~~~~~~~~~

For reference, the legacy values for the CRUSH tunables can be set
with::

crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 -o /tmp/crush.legacy

Again, the special ``--enable-unsafe-tunables`` option is required.
Further, as noted above, be careful running old versions of the
``ceph-osd`` daemon after reverting to legacy values as the feature
bit is not perfectly enforced.

0 comments on commit 3267127

Please sign in to comment.