Skip to content

Commit

Permalink
Test for bug in exactly-once during tablet bootstrap
Browse files Browse the repository at this point in the history
Here's a regression test for the bug which is causing
raft_consensus-itest to occasionally think it has inserted 23 rows when
in fact it has only inserted 20.

The issue is in the rewriting of logs during bootstrap. If we do a write
which gets a duplicate key error, the first time the COMMIT message is
written, it includes the error.

When the server restarts, it writes the COMMIT message again with only
'flushed: true' in the commit message. This is enough for bootstrap to
know not to bother to replay it on subsequent restarts, but it has lost
the error messages themselves.

If the server restarts again, at this point it doesn't rebuild a proper
response, but instead puts an errorless response into the ResultTracker.

So, if an operation hits an error, and then the tablet server restarts
twice while the client is still retrying, the client will falsely think
that its operation has succeeded.

This includes a disabled regression test which shows the bug.

Change-Id: I60b3b30b0705b4f9063b0d505cb9ab1ca24e470a
Reviewed-on: http://gerrit.cloudera.org:8080/5417
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon <[email protected]>
  • Loading branch information
toddlipcon committed May 4, 2017
1 parent a3c5983 commit c36e6c4
Show file tree
Hide file tree
Showing 2 changed files with 64 additions and 2 deletions.
6 changes: 4 additions & 2 deletions src/kudu/tserver/tablet_server-test-base.h
Original file line number Diff line number Diff line change
Expand Up @@ -387,8 +387,10 @@ class TabletServerTestBase : public KuduTest {
rb_row.Reset(&block, i);
VLOG(1) << "Verified row " << schema.DebugRow(rb_row);
ASSERT_LT(count, expected.size()) << "Got more rows than expected!";
ASSERT_EQ(expected[count].first, *schema.ExtractColumnFromRow<INT32>(rb_row, 0));
ASSERT_EQ(expected[count].second, *schema.ExtractColumnFromRow<INT32>(rb_row, 1));
EXPECT_EQ(expected[count].first, *schema.ExtractColumnFromRow<INT32>(rb_row, 0))
<< "Key mismatch at row: " << count;
EXPECT_EQ(expected[count].second, *schema.ExtractColumnFromRow<INT32>(rb_row, 1))
<< "Value mismatch at row: " << count;
count++;
}
}
Expand Down
60 changes: 60 additions & 0 deletions src/kudu/tserver/tablet_server-test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -951,6 +951,66 @@ TEST_F(TabletServerTest, TestKUDU_1341) {
ANFF(VerifyRows(schema_, { KeyValue(1, 12345) }));
}

TEST_F(TabletServerTest, DISABLED_TestExactlyOnceForErrorsAcrossRestart) {
WriteRequestPB req;
WriteResponsePB resp;
RpcController rpc;

// Set up a request to insert two rows.
req.set_tablet_id(kTabletId);
AddTestRowToPB(RowOperationsPB::INSERT, schema_, 1234, 5678, "hello world via RPC",
req.mutable_row_operations());
AddTestRowToPB(RowOperationsPB::INSERT, schema_, 12345, 5679, "hello world via RPC2",
req.mutable_row_operations());
ASSERT_OK(SchemaToPB(schema_, req.mutable_schema()));

// Insert it, assuming no errors.
{
SCOPED_TRACE(req.DebugString());
ASSERT_OK(proxy_->Write(req, &resp, &rpc));
SCOPED_TRACE(resp.DebugString());
ASSERT_FALSE(resp.has_error());
ASSERT_EQ(0, resp.per_row_errors_size());
}

// Set up a RequestID to use in the later requests.
rpc::RequestIdPB req_id;
req_id.set_client_id("client-id");
req_id.set_seq_no(1);
req_id.set_first_incomplete_seq_no(1);
req_id.set_attempt_no(1);

// Insert the row again, with the request ID specified. We should expect an
// "ALREADY_PRESENT" error.
{
rpc.Reset();
rpc.SetRequestIdPB(unique_ptr<rpc::RequestIdPB>(new rpc::RequestIdPB(req_id)));
ASSERT_OK(proxy_->Write(req, &resp, &rpc));
SCOPED_TRACE(resp.DebugString());
ASSERT_FALSE(resp.has_error());
ASSERT_EQ(2, resp.per_row_errors_size());
}

// Restart the tablet server several times, and after each restart, send a new attempt of the
// same request. We make the request itself invalid by clearing the schema and ops, but
// that shouldn't matter since it's just hitting the ResultTracker and returning the
// cached response. If the ResultTracker didn't have a cached response, then we'd get an
// error about an invalid request.
req.clear_schema();
req.clear_row_operations();
for (int i = 1; i <= 5; i++) {
SCOPED_TRACE(Substitute("restart attempt #$0", i));
NO_FATALS(ShutdownAndRebuildTablet());
rpc.Reset();
req_id.set_attempt_no(req_id.attempt_no() + 1);
rpc.SetRequestIdPB(unique_ptr<rpc::RequestIdPB>(new rpc::RequestIdPB(req_id)));
ASSERT_OK(proxy_->Write(req, &resp, &rpc));
SCOPED_TRACE(resp.DebugString());
ASSERT_FALSE(resp.has_error());
ASSERT_EQ(2, resp.per_row_errors_size());
}
}

// Regression test for KUDU-177. Ensures that after a major delta compaction,
// rows that were in the old DRS's DMS are properly replayed.
TEST_F(TabletServerTest, TestKUDU_177_RecoveryOfDMSEditsAfterMajorDeltaCompaction) {
Expand Down

0 comments on commit c36e6c4

Please sign in to comment.