Tags: anuvu/cruise-control
Tags
Upgrade io.vertx to 4.4.3 due to CVE-2023-24815 (linkedin#2023)
fix/cruisecontrol: add partition movement timeout to executor. There is an edge case in which a partition leadership re-election occurs after a partition reassignment has been submitted to Kafka but before it has finished; this causes the reassignment to stall until there is another re-election. However, we do see cases where no re-election is triggered, leaving a partition reassignment IN_PROGRESS indefinitely and potentially causing new anomalies to be missed because the executor state remains INTER_BROKER_REPLICA_ACTION. By adding a max timeout, we avoid this state by cancelling such reassignments and retrying them later. Includes minor cleanup.
feat: cleanup stuck partitionReassignments. Sometimes an active partition reassignment goes into a limbo state because the destination and source brokers go offline at the same time. When this happens, a partitionReassignment remains stuck in Kafka until it is manually cleared; as a result, CC stops reacting to anomalies, broker failures, etc. This commit detects and fixes such stuck active partitionReassignments.
fix: allow multiple partition reassignments to be scheduled. The Kafka Admin API states that this should be safe. It also allows us to escape stuck states when partition movements fail due to dead brokers, similar to linkedin#664 but with the additional case of a revert not being possible because the original broker has also gone down.
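The timeout idea described in the commit messages above can be illustrated with a minimal sketch. This is not the code from these commits: the class `ReassignmentTimeoutTracker`, its method names, and the use of plain strings for partitions are all hypothetical, and the sketch deliberately avoids the Kafka Admin API so that it stays self-contained. It only shows the bookkeeping: record when each reassignment was submitted, and report those that have been in progress longer than a configured maximum so the caller can cancel and retry them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper (not part of Cruise Control): remembers when each
 * partition reassignment was submitted and flags the ones that have
 * exceeded a maximum allowed movement time.
 */
class ReassignmentTimeoutTracker {
    private final long maxMovementMs;
    // Insertion-ordered so timed-out partitions are reported oldest first.
    private final Map<String, Long> submittedAtMs = new LinkedHashMap<>();

    ReassignmentTimeoutTracker(long maxMovementMs) {
        if (maxMovementMs <= 0) {
            throw new IllegalArgumentException("maxMovementMs must be positive");
        }
        this.maxMovementMs = maxMovementMs;
    }

    /** Record that a reassignment for the given partition was submitted at nowMs. */
    void markSubmitted(String topicPartition, long nowMs) {
        submittedAtMs.put(topicPartition, nowMs);
    }

    /** Forget a reassignment that finished normally. */
    void markCompleted(String topicPartition) {
        submittedAtMs.remove(topicPartition);
    }

    /**
     * Remove and return every partition whose reassignment has been in
     * progress for strictly longer than maxMovementMs. The caller is
     * expected to cancel these reassignments and retry them later.
     */
    List<String> drainTimedOut(long nowMs) {
        List<String> timedOut = new ArrayList<>();
        submittedAtMs.entrySet().removeIf(entry -> {
            boolean expired = nowMs - entry.getValue() > maxMovementMs;
            if (expired) {
                timedOut.add(entry.getKey());
            }
            return expired;
        });
        return timedOut;
    }
}
```

In a real executor loop, the partitions returned by `drainTimedOut` would be the ones to cancel through the Kafka Admin API before resubmitting; that wiring is omitted here because it depends on the executor's internals.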