Skip to content

Commit

Permalink
Fix SIGSEGV in ksck
Browse files Browse the repository at this point in the history
ksck will segfault when some tablet servers that host tablet
replicas are missing. This happens, for example, if the master
is still restarting and has not yet fully populated its list of
live tablet servers.

The root cause is that a vector of replicas is being sorted by
tserver uuid obtained from the master even if the master is not
aware of the tablet server was not found, causing a segmentation
fault when trying to access the uuid. The fix just checks for a
missing tserver reference and sorts such replicas first.

The included test segfaults without the fix.

Change-Id: I66ff69bc3aa567863b61ee8e686fc13c81db6fdf
Reviewed-on: http://gerrit.cloudera.org:8080/7261
Tested-by: Kudu Jenkins
Reviewed-by: David Ribeiro Alves <[email protected]>
  • Loading branch information
dralves committed Jun 23, 2017
1 parent d623942 commit def06d8
Show file tree
Hide file tree
Showing 2 changed files with 17 additions and 2 deletions.
13 changes: 13 additions & 0 deletions src/kudu/tools/ksck-test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -503,5 +503,18 @@ TEST_F(KsckTest, TestTabletNotRunning) {
" Last status: \n");
}

// Test for a bug where we weren't properly handling a tserver not reported by the master.
TEST_F(KsckTest, TestMissingTserver) {
CreateOneSmallReplicatedTable();

// Delete a tablet server from the master's list. This simulates a situation
// where the master is starting and hasn't listed all tablet servers yet, but
// tablets from other tablet servers are listing the missing tablet server as a peer.
EraseKeyReturnValuePtr(&master_->tablet_servers_, "ts-id-0");
Status s = RunKsck();
ASSERT_EQ("Corruption: 1 table(s) are bad", s.ToString());
ASSERT_STR_CONTAINS(err_stream_.str(), "Table test has 3 under-replicated tablet(s)");
}

} // namespace tools
} // namespace kudu
6 changes: 4 additions & 2 deletions src/kudu/tools/ksck.cc
Original file line number Diff line number Diff line change
Expand Up @@ -661,8 +661,8 @@ Ksck::CheckResult Ksck::VerifyTablet(const shared_ptr<KsckTablet>& tablet, int t

// Check for agreement on tablet assignment and state between the master
// and the tablet server.
auto ts = FindPtrOrNull(cluster_->tablet_servers(), replica->ts_uuid());
repl_info->ts = ts.get();
auto ts = FindPointeeOrNull(cluster_->tablet_servers(), replica->ts_uuid());
repl_info->ts = ts;
if (ts && ts->is_healthy()) {
repl_info->state = ts->ReplicaState(tablet->id());
if (ContainsKey(ts->tablet_status_map(), tablet->id())) {
Expand Down Expand Up @@ -727,6 +727,8 @@ Ksck::CheckResult Ksck::VerifyTablet(const shared_ptr<KsckTablet>& tablet, int t
}
std::sort(replica_infos.begin(), replica_infos.end(),
[](const ReplicaInfo& left, const ReplicaInfo& right) -> bool {
if (!left.ts) return true;
if (!right.ts) return false;
return left.ts->uuid() < right.ts->uuid();
});

Expand Down

0 comments on commit def06d8

Please sign in to comment.