Reduce severity of "Error trying to read ahead of the log" log message

In clusters under load, it is typical to see the ERROR log dominated by messages like the following: E1221 09:09:13.869918 143384 consensus_queue.cc:440] T 1d030514317942ec9d7796a8a029dace P a4eea0affa4545879c8988b24d8cb2bb [LEADER]: Error trying to read ahead of the log while preparing peer request: Incomplete: Op with index 6620 is ahead of the local log (next sequential op: 6620). Destination peer: Peer: c0a2e34b708845efb8f090459815272c, Is new: false, Last received: 2995.6619, Next index: 6621, Last known committed idx: 6620, Last exchange result: ERROR, Needs tablet copy: false This message is logged at the error level, and looks a little bit scary, but it is in fact harmless. Here's what happens: 1. A leader neglects its duties and the followers elect a new leader. 2. The new leader manages to replicate more ops (usually just the NO_OP asserting leadership). 3. The leader of the previous term tries to replicate an op to a peer in the new term. 4. From the response, it founds out that a. It is in an earlier term, so it should step down and increment its term. b. The last op the peer saw is (leader's index + k) for some k > 0 (usually k = 1). So the leader will attempt to send up ops of index up through (leader's index + k). 5. The term change is asynchronous, and before it happens the leader tries to prepare a new request to the peer whose log is ahead of the local log. This causes the ERROR message. 6. The term change happens. The leader steps down, and life goes on. Note that the leader should also have received a VoteRequest with the new term and an UpdateConsensus with the new term from the leader. In general, this message appears only when the leader is under enough stress to lose its leadership and be unable to service some consensus RPCs in a timely manner. It is not in an of itself a problem, and it's a decent indicator of stress on the leader, so I am leaving the message but downgrading it to INFO level. See KUDU-1078 for more information about this situation, especially its previous causes (which were at one time actually harmful). Change-Id: Ib841a52354e5c71450f3e364ab2fdee51ce2adb4 Reviewed-on: http://gerrit.cloudera.org:8080/12185 Reviewed-by: Adar Dembo <[email protected]> Reviewed-by: Mike Percy <[email protected]> Tested-by: Will Berkeley <[email protected]>
mojiahe · Jan 9, 2019 · 6103f87 · 6103f87
1 parent 17b5efd
commit 6103f87
Showing 1 changed file with 6 additions and 5 deletions.
diff --git a/src/kudu/consensus/consensus_queue.cc b/src/kudu/consensus/consensus_queue.cc
@@ -696,11 +696,12 @@ Status PeerMessageQueue::RequestForPeer(const string& uuid,
       }
       if (s.IsIncomplete()) {
         // IsIncomplete() means that we tried to read beyond the head of the log
-        // (in the future). See KUDU-1078.
-        LOG_WITH_PREFIX_UNLOCKED(ERROR) << "Error trying to read ahead of the log "
-                                        << "while preparing peer request: "
-                                        << s.ToString() << ". Destination peer: "
-                                        << peer_copy.ToString();
+        // (in the future). This is usually a sign that this peer is under load
+        // and is about to step down as leader. See KUDU-1078.
+        LOG_WITH_PREFIX_UNLOCKED(INFO) << "Error trying to read ahead of the log "
+                                       << "while preparing peer request: "
+                                       << s.ToString() << ". Destination peer: "
+                                       << peer_copy.ToString();
         return s;
       }
       LOG_WITH_PREFIX_UNLOCKED(FATAL) << "Error reading the log while preparing peer request: "