Merge branch 'wip-crush-tunables'
Reviewed-by: Greg Farnum <[email protected]>
Sage Weil committed Aug 14, 2012
2 parents 5ab4939 + 3267127 commit efe913b
Showing 10 changed files with 230 additions and 11 deletions.
104 changes: 104 additions & 0 deletions doc/ops/manage/crush.rst
@@ -93,3 +93,107 @@ You can remove a device from the crush map with::

$ ceph osd crush remove osd.123

Tunables
========

There are several magic numbers that were used in the original CRUSH
implementation that have proven to be poor choices. To support
the transition away from them, newer versions of CRUSH (starting with
the v0.48 argonaut series) allow the values to be adjusted or tuned.

Clusters running recent Ceph releases support using the tunable values
in the CRUSH maps. However, older clients and daemons will not correctly interact
with clusters using the "tuned" CRUSH maps. To detect this situation,
there is now a feature bit ``CRUSH_TUNABLES`` (value 0x40000) to
reflect support for tunables.

If the OSDMap currently used by the ``ceph-mon`` or ``ceph-osd``
daemon has non-legacy values, the daemon will require the
``CRUSH_TUNABLES`` feature bit from clients and daemons that connect
to it. This means that old clients will not be able to connect.
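
The gating described above can be sketched in a few lines of C++. This is a minimal illustration rather than Ceph's actual code: the bit value 0x40000 comes from the text, while the helper names ``peer_is_acceptable`` and ``required_features`` are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstdint>

// Bit value quoted in the text above.
constexpr std::uint64_t CRUSH_TUNABLES = 0x40000;

// Hypothetical helper (not a Ceph API): a peer is acceptable only if it
// advertises every feature bit that the daemon currently requires.
constexpr bool peer_is_acceptable(std::uint64_t required,
                                  std::uint64_t peer_features) {
  return (required & ~peer_features) == 0;
}

// Hypothetical helper: the required feature mask for a given map state.
// Non-legacy tunables add the bit; legacy tunables clear it again.
constexpr std::uint64_t required_features(std::uint64_t base_required,
                                          bool map_has_nonlegacy_tunables) {
  return map_has_nonlegacy_tunables ? (base_required | CRUSH_TUNABLES)
                                    : (base_required & ~CRUSH_TUNABLES);
}
```

Under these assumptions, a client that advertises no feature bits passes the check while the map uses legacy values and is rejected once the map carries non-legacy values.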

At some future point in time, newly created clusters will have
improved default values for the tunables. This is a matter of waiting
until the support has been present in the Linux kernel clients long
enough to make this a painless transition for most users.

Impact of legacy values
~~~~~~~~~~~~~~~~~~~~~~~

The legacy values result in several misbehaviors:

* For hierarchies with a small number of devices in the leaf buckets,
some PGs map to fewer than the desired number of replicas. This
commonly happens for hierarchies with "host" nodes with a small
number (1-3) of OSDs nested beneath each one.

* For large clusters, a small percentage of PGs map to fewer than
the desired number of OSDs. This is more prevalent when there are
several layers of the hierarchy (e.g., row, rack, host, osd).

* When some OSDs are marked out, the data tends to get redistributed
to nearby OSDs instead of across the entire hierarchy.

Which client versions support tunables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* argonaut series, v0.48.1 or later
* v0.49 or later
* Linux kernel version v3.5 or later (for the file system and RBD kernel clients)

A few important points
~~~~~~~~~~~~~~~~~~~~~~

* Adjusting these values will result in the shift of some PGs between
storage nodes. If the Ceph cluster is already storing a lot of
data, be prepared for some fraction of the data to move.
* The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the
``CRUSH_TUNABLES`` feature of new connections as soon as they get
the updated map. However, already-connected clients are
effectively grandfathered in, and will misbehave if they do not
support the new feature.
* If the CRUSH tunables are set to non-legacy values and then later
changed back to the default values, ``ceph-osd`` daemons will not be
required to support the feature. However, the OSD peering process
requires examining and understanding old maps. Therefore, you
should not run old (pre-v0.48) versions of the ``ceph-osd`` daemon
if the cluster has previously used non-legacy CRUSH values, even if
the latest version of the map has been switched back to using the
legacy defaults.
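
The "grandfathered" behavior in the second point can be illustrated with a toy model. The sketch below is not the Ceph messenger: ``ToyMessenger``, ``Policy`` and ``Connection`` are hypothetical types written for this example, and the only assumption carried over from the text is that a policy change is checked when a connection is accepted and is never re-checked for existing connections.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint64_t CRUSH_TUNABLES = 0x40000;  // value from the text above

// Hypothetical, simplified stand-ins for illustration only.
struct Policy {
  std::uint64_t features_required = 0;
};

struct Connection {
  std::uint64_t peer_features;
};

class ToyMessenger {
 public:
  // The policy is consulted only here, at connection time.
  bool accept(std::uint64_t peer_features) {
    if ((policy_.features_required & ~peer_features) != 0)
      return false;  // peer lacks a required feature
    connections_.push_back(Connection{peer_features});
    return true;
  }

  // Changing the policy does not revisit connections already accepted.
  void set_policy(Policy p) { policy_ = p; }
  Policy get_policy() const { return policy_; }
  std::size_t num_connections() const { return connections_.size(); }

 private:
  Policy policy_;
  std::vector<Connection> connections_;
};
```

In this model, an old client accepted before the policy change stays in the connection list afterwards, which is the situation the text warns about: the client remains connected but does not understand the new map.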

Tuning CRUSH
~~~~~~~~~~~~

If you can ensure that all clients are running recent code, you can
adjust the tunables by extracting the CRUSH map, modifying the values,
and reinjecting it into the cluster.

* Extract the latest CRUSH map::

ceph osd getcrushmap -o /tmp/crush

* Adjust tunables. These values appear to offer the best behavior
for both the large and small clusters we tested with. You will need to
additionally specify the ``--enable-unsafe-tunables`` argument to
``crushtool`` for this to work. Please use this option with
extreme care::

crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new

* Reinject modified map::

ceph osd setcrushmap -i /tmp/crush.new

Legacy values
~~~~~~~~~~~~~

For reference, the legacy values for the CRUSH tunables can be set
with::

crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 -o /tmp/crush.legacy

Again, the special ``--enable-unsafe-tunables`` option is required.
Further, as noted above, be careful running old versions of the
``ceph-osd`` daemon after reverting to legacy values as the feature
bit is not perfectly enforced.
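
As a rough sketch of how the two sets of values relate, the following C++ fragment models the three tunables and a check for non-legacy values. The numbers are the ones given in the commands above; the ``Tunables`` struct and the constants ``kLegacy`` and ``kTuned`` are hypothetical names for this illustration, and the free function only mirrors the shape of the ``has_nondefault_tunables()`` check, not its real class context.

```cpp
#include <cassert>

// Field names follow the crushtool options above; the struct itself is
// an illustrative assumption, not the real CRUSH map structure.
struct Tunables {
  int choose_local_tries;
  int choose_local_fallback_tries;
  int choose_total_tries;
};

constexpr Tunables kLegacy{2, 5, 19};  // legacy values from this section
constexpr Tunables kTuned{0, 0, 50};   // suggested values from "Tuning CRUSH"

// True when any tunable differs from its legacy value.
constexpr bool has_nondefault_tunables(const Tunables& t) {
  return t.choose_local_tries != kLegacy.choose_local_tries ||
         t.choose_local_fallback_tries != kLegacy.choose_local_fallback_tries ||
         t.choose_total_tries != kLegacy.choose_total_tries;
}
```

Changing even one of the three values away from its legacy setting makes the check return true, which is what triggers the ``CRUSH_TUNABLES`` requirement.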

4 changes: 1 addition & 3 deletions src/ceph_osd.cc
@@ -348,9 +348,7 @@ int main(int argc, const char **argv)
CEPH_FEATURE_PGID64;

client_messenger->set_default_policy(Messenger::Policy::stateless_server(supported, 0));
- client_messenger->set_policy(entity_name_t::TYPE_CLIENT,
- Messenger::Policy::stateless_server(supported, 0));
- client_messenger->set_policy_throttler(entity_name_t::TYPE_CLIENT, &client_throttler);
+ client_messenger->set_policy_throttler(entity_name_t::TYPE_CLIENT, &client_throttler); // default, actually
client_messenger->set_policy(entity_name_t::TYPE_MON,
Messenger::Policy::lossy_client(supported,
CEPH_FEATURE_UID |
7 changes: 7 additions & 0 deletions src/crush/CrushWrapper.h
@@ -118,6 +118,13 @@ class CrushWrapper {
crush->choose_total_tries = n;
}

bool has_nondefault_tunables() const {
return
(crush->choose_local_tries != 2 ||
crush->choose_local_fallback_tries != 5 ||
crush->choose_total_tries != 19);
}

// bucket types
int get_num_type_names() const {
return type_map.size();
32 changes: 32 additions & 0 deletions src/mon/OSDMonitor.cc
@@ -149,6 +149,38 @@ void OSDMonitor::update_from_paxos()

share_map_with_random_osd();
update_logger();

// make sure our feature bits reflect the latest map
update_msgr_features();
}

void OSDMonitor::update_msgr_features()
{
set<int> types;
types.insert((int)entity_name_t::TYPE_OSD);
types.insert((int)entity_name_t::TYPE_CLIENT);
types.insert((int)entity_name_t::TYPE_MDS);
types.insert((int)entity_name_t::TYPE_MON);

if (osdmap.crush->has_nondefault_tunables()) {
for (set<int>::iterator q = types.begin(); q != types.end(); ++q) {
if (!(mon->messenger->get_policy(*q).features_required & CEPH_FEATURE_CRUSH_TUNABLES)) {
dout(0) << "crush map has non-default tunables, requiring CRUSH_TUNABLES feature" << dendl;
Messenger::Policy p = mon->messenger->get_policy(*q);
p.features_required |= CEPH_FEATURE_CRUSH_TUNABLES;
mon->messenger->set_policy(*q, p);
}
}
} else {
for (set<int>::iterator q = types.begin(); q != types.end(); ++q) {
if (mon->messenger->get_policy(*q).features_required & CEPH_FEATURE_CRUSH_TUNABLES) {
dout(0) << "crush map has default tunables, not requiring CRUSH_TUNABLES feature" << dendl;
Messenger::Policy p = mon->messenger->get_policy(*q);
p.features_required &= ~CEPH_FEATURE_CRUSH_TUNABLES;
mon->messenger->set_policy(*q, p);
}
}
}
}

bool OSDMonitor::thrash()
2 changes: 2 additions & 0 deletions src/mon/OSDMonitor.h
@@ -64,6 +64,8 @@ class OSDMonitor : public PaxosService {
void encode_pending(bufferlist &bl);
void on_active();

void update_msgr_features();

void share_map_with_random_osd();

void update_logger();
20 changes: 20 additions & 0 deletions src/msg/Messenger.h
@@ -230,9 +230,28 @@ class Messenger {
* @param p The policy to apply.
*/
virtual void set_policy(int type, Policy p) = 0;
/**
* Set the Policy associated with a type of peer.
*
* This can be called either on initial setup, or after connections
* are already established. However, the policies for existing
* connections will not be affected; the new policy will only apply
* to future connections.
*
* @param t The peer type to get the default policy for.
* @return A const Policy reference.
*/
virtual Policy get_policy(int t) = 0;
/**
* Get the default Policy
*
* @return A const Policy reference.
*/
virtual Policy get_default_policy() = 0;
/**
* Set a Throttler which is applied to all Messages from the given
* type of peer.
*
* This is an init-time function and cannot be called after calling
* start() or bind().
*
@@ -244,6 +263,7 @@ class Messenger {
virtual void set_policy_throttler(int type, Throttle *t) = 0;
/**
* Set the default send priority
*
* This is an init-time function and must be called *before* calling
* start().
*
1 change: 1 addition & 0 deletions src/msg/SimpleMessenger.cc
@@ -46,6 +46,7 @@ SimpleMessenger::SimpleMessenger(CephContext *cct, entity_name_t name,
lock("SimpleMessenger::lock"), need_addr(true), did_bind(false),
global_seq(0),
cluster_protocol(0),
policy_lock("SimpleMessenger::policy_lock"),
dispatch_throttler(cct, string("msgr_dispatch_throttler-") + mname, cct->_conf->ms_dispatch_throttle_bytes),
reaper_started(false), reaper_stop(false),
timeout(0),
22 changes: 16 additions & 6 deletions src/msg/SimpleMessenger.h
@@ -136,7 +136,7 @@ class SimpleMessenger : public Messenger {
* @param p The Policy to apply.
*/
void set_default_policy(Policy p) {
- assert(!started && !did_bind);
+ Mutex::Locker l(policy_lock);
default_policy = p;
}
/**
@@ -148,7 +148,7 @@ class SimpleMessenger : public Messenger {
* @param p The policy to apply.
*/
void set_policy(int type, Policy p) {
- assert(!started && !did_bind);
+ Mutex::Locker l(policy_lock);
policy_map[type] = p;
}
/**
@@ -163,9 +163,11 @@ class SimpleMessenger : public Messenger {
* you destroy SimpleMessenger.
*/
void set_policy_throttler(int type, Throttle *t) {
- assert (!started && !did_bind);
- assert(policy_map.count(type));
- policy_map[type].throttler = t;
+ Mutex::Locker l(policy_lock);
+ if (policy_map.count(type))
+ policy_map[type].throttler = t;
+ else
+ default_policy.throttler = t;
}
/**
* Bind the SimpleMessenger to a specific address. If bind_addr
@@ -502,6 +504,9 @@ class SimpleMessenger : public Messenger {

/// internal cluster protocol version, if any, for talking to entities of the same type.
int cluster_protocol;

/// lock protecting policy
Mutex policy_lock;
/// the default Policy we use for Pipes
Policy default_policy;
/// map specifying different Policies for specific peer types
@@ -587,12 +592,17 @@ class SimpleMessenger : public Messenger {
*
* @return A const Policy reference.
*/
- const Policy& get_policy(int t) {
+ Policy get_policy(int t) {
+ Mutex::Locker l(policy_lock);
if (policy_map.count(t))
return policy_map[t];
else
return default_policy;
}
+ Policy get_default_policy() {
+ Mutex::Locker l(policy_lock);
+ return default_policy;
+ }

/**
* Release memory accounting back to the dispatch throttler.
45 changes: 45 additions & 0 deletions src/osd/OSD.cc
@@ -838,6 +838,8 @@ int OSD::init()
osdmap = get_map(superblock.current_epoch);
service.publish_map(osdmap);

check_osdmap_features();

bind_epoch = osdmap->get_epoch();

// load up pgs (as they previously existed)
@@ -3623,6 +3625,8 @@ void OSD::handle_osd_map(MOSDMap *m)
clear_map_bl_cache_pins();
map_lock.put_write();

check_osdmap_features();

// yay!
if (is_active())
activate_map();
@@ -3644,6 +3648,47 @@ void OSD::handle_osd_map(MOSDMap *m)
m->put();
}

void OSD::check_osdmap_features()
{
// adjust required feature bits?

// we have to be a bit careful here, because we are accessing the
// Policy structures without taking any lock. in particular, only
// modify integer values that can safely be read by a racing CPU.
// since we are only accessing existing Policy structures at their
// current memory location, and setting or clearing bits in integer
// fields, and we are the only writer, this is not a problem.

Messenger::Policy p = client_messenger->get_default_policy();
if (osdmap->crush->has_nondefault_tunables()) {
if (!(p.features_required & CEPH_FEATURE_CRUSH_TUNABLES)) {
dout(0) << "crush map has non-default tunables, requiring CRUSH_TUNABLES feature for clients" << dendl;
p.features_required |= CEPH_FEATURE_CRUSH_TUNABLES;
client_messenger->set_default_policy(p);
}
if (!(cluster_messenger->get_policy(entity_name_t::TYPE_OSD).features_required &
CEPH_FEATURE_CRUSH_TUNABLES)) {
dout(0) << "crush map has non-default tunables, requiring CRUSH_TUNABLES feature for osds" << dendl;
Messenger::Policy p = cluster_messenger->get_policy(entity_name_t::TYPE_OSD);
p.features_required |= CEPH_FEATURE_CRUSH_TUNABLES;
cluster_messenger->set_policy(entity_name_t::TYPE_OSD, p);
}
} else {
if (p.features_required & CEPH_FEATURE_CRUSH_TUNABLES) {
dout(0) << "crush map has default tunables, not requiring CRUSH_TUNABLES feature for clients" << dendl;
p.features_required &= ~CEPH_FEATURE_CRUSH_TUNABLES;
client_messenger->set_default_policy(p);
}
if (cluster_messenger->get_policy(entity_name_t::TYPE_OSD).features_required &
CEPH_FEATURE_CRUSH_TUNABLES) {
dout(0) << "crush map has default tunables, not requiring CRUSH_TUNABLES feature for osds" << dendl;
Messenger::Policy p = cluster_messenger->get_policy(entity_name_t::TYPE_OSD);
p.features_required &= ~CEPH_FEATURE_CRUSH_TUNABLES;
cluster_messenger->set_policy(entity_name_t::TYPE_OSD, p);
}
}
}

void OSD::advance_pg(epoch_t osd_epoch, PG *pg, PG::RecoveryCtx *rctx)
{
assert(pg->is_locked());
4 changes: 2 additions & 2 deletions src/osd/OSD.h
@@ -198,7 +198,6 @@ class OSDService {
osdmap = map;
}


int get_nodeid() const { return whoami; }

// -- scrub scheduling --
@@ -282,7 +281,6 @@ class OSDService {
SimpleLRU<epoch_t, bufferlist> map_bl_cache;
SimpleLRU<epoch_t, bufferlist> map_bl_inc_cache;


OSDMapRef get_map(epoch_t e);
OSDMapRef add_map(OSDMap *o) {
Mutex::Locker l(map_cache_lock);
@@ -355,6 +353,8 @@ class OSD : public Dispatcher {
void _dispatch(Message *m);
void dispatch_op(OpRequestRef op);

void check_osdmap_features();

public:
ClassHandler *class_handler;
int get_nodeid() { return whoami; }