
[CURATOR-171] LeaderLatch and LeaderSelector should support Revokable behavior #692

Open
jira-importer opened this issue Dec 9, 2014 · 18 comments

Comments


Many Curator lock recipes support revoking. The leader recipes should support this as well, since they use locking internally. See http://curator.apache.org/curator-recipes/shared-reentrant-lock.html - "Revoking" (a sketch of that mechanism follows below).


Originally reported by vines, imported from: LeaderLatch and LeaderSelector should support Revokable behavior
  • status: Open
  • priority: Minor
  • resolution: Unresolved
  • imported: 2025-01-21
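For reference, the revocation mechanism the description points to (on the shared-reentrant-lock page) looks roughly like this on the lock-holder side. This is a minimal sketch, not part of the original report; it assumes an already-started CuratorFramework named client, and the lock path is illustrative:

    // Sketch of the existing lock revocation support that this issue asks to
    // extend to the leader recipes. Exception handling is abbreviated.
    InterProcessMutex lock = new InterProcessMutex(client, "/examples/revocable-lock");
    lock.makeRevocable(new RevocationListener<InterProcessMutex>()
    {
        @Override
        public void revocationRequested(InterProcessMutex forLock)
        {
            try
            {
                forLock.release();   // cooperatively give up the lock when asked
            }
            catch ( Exception e )
            {
                // log / handle as appropriate
            }
        }
    });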

[email protected]:

Is this issue being actively worked on? I just noticed something similar, where sometimes the leader latch doesn't realize that the ephemeral node associated with it has been deleted and continues as leader.

I am using version 2.11.0.


githubbot:

GitHub user oza opened a pull request:

#195

CURATOR-171 LeaderLatch isn't aware if its own ephemeral node goes away

The root cause of the problem reported in CURATOR-171 is that LeaderLatch is not aware of losing its own znode after acquiring the lock. This PR makes LeaderLatch notice that its znode is gone and mark itself "not leader".

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/oza/curator CURATOR-171

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/curator/pull/195.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #195


commit 3d9aed6
Author: Tsuyoshi Ozawa
Date: 2017-01-12T09:24:19Z

CURATOR-171 - LeaderLatch isn't aware if its own ephemeral node goes away



ozawa:

John Vines Parminder Grewal Thanks for reporting this issue. After talking with Jordan on #195, I noticed that the latch code handles connection loss / reconnection correctly: after detecting the change, the leader marks itself as "not a leader". As reported, the test case described is not handled, but that situation cannot arise through normal use of the leader latch. Am I missing something? Please correct me if I'm wrong. Thanks.


vines:

I would expect that if the ephemeral latch node were removed, as in my test code above, the leader would recognize it had its lock pulled and would no longer report itself as a master.
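(The test code referred to here is not included in the imported thread. A minimal sketch of the kind of scenario being described, where the client, latch path, and single-participant assumption are all hypothetical and exception handling is omitted, might look like:)

    // Illustration only, not the original test code.
    LeaderLatch latch = new LeaderLatch(client, "/examples/leader-latch");
    latch.start();
    latch.await();   // block until this instance becomes leader

    // Simulate an external actor "pulling the lock" by deleting the latch's
    // ephemeral znode directly (assumes this latch is the only participant).
    String child = client.getChildren().forPath("/examples/leader-latch").get(0);
    client.delete().forPath("/examples/leader-latch/" + child);

    // The behavior being reported: hasLeadership() may still return true here,
    // because the latch does not watch its own node.
    System.out.println("still leader? " + latch.hasLeadership());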


randgalt:

The leader latch has never watched its own node. That's not in its contract. Further, there's overhead to doing this. None of the Curator lock/leader recipes watch their own nodes. There's no reason to. Please see Tech Note 7: https://cwiki.apache.org/confluence/display/CURATOR/TN7


dragonsinth:

For what it's worth (this isn't a Curator issue), in Solr I've frequently run into problems as an administrator where a member's internal view of who the leader is got out of sync with the actual queue. As an admin, I've desperately wished each node would also watch its own registration. That way, if things got out of whack (no leader, too many leaders), I could just delete all the nodes and have everyone re-register and sort things out.


randgalt:

FYI - I've added persistent recursive watches to ZooKeeper (https://issues.apache.org/jira/browse/ZOOKEEPER-1416). Maybe at some point in the future we could have some Curator code that watches all appropriate nodes and notifies leaders, etc. But, IMO, this is a lot of complication for what is really an end-user error.


[email protected]:

Jordan Zimmerman While I understand that there is overhead associated with creating a watch for the nodes, I am making the assumption that the follower nodes (i.e. nodes that aren't leaders) are already either periodically checking for the existence of the ephemeral node or have a watch on it. That is how, even when we delete the node, another follower assumes the role of the leader. (Please correct me if I am wrong.)

Given the nature of leader election (number of followers > number of leaders), I would assume that the load added by the leader watching its own ephemeral node as well would be minimal. This would make the algorithm less susceptible to issues like this one (and ultimately more usable out of the box in production environments).

To address the performance concern, we could make the behavior configurable: by default the leader does not watch its own ephemeral node, but users can set a flag to enable it.

Also, it would be good to highlight this in the user docs? (Maybe it already is somewhere.)
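(Mechanically, the optional self-watch proposed above could be built on Curator's existing watch support. This is a rough sketch, not anything in Curator today; the client and ourPath names are assumptions, ZooKeeper watches are one-shot so it would have to be re-registered after firing, and exception handling is omitted:)

    // Hypothetical sketch of a leader watching its own ephemeral node.
    client.checkExists()
        .usingWatcher((CuratorWatcher) event -> {
            if ( event.getType() == Watcher.Event.EventType.NodeDeleted )
            {
                // Our node was removed externally: relinquish leadership
                // and/or re-enter the election here.
            }
        })
        .forPath(ourPath);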


randgalt:

Yes, non-leader nodes watch one other node. Adding another watcher per participant could inflate the number of watchers dramatically in a large system.

Anyway, this is a community project. If the community wants to add this we can vote on it. My vote is -1 of course.

"Also, it would be good to highlight this in the user docs? (Maybe it already is somewhere.)"

It's implied by Tech Note 7: https://cwiki.apache.org/confluence/display/CURATOR/TN7


dragonsinth:

I could totally get on board with the idea that everyone watches exactly one node, and the leader watches itself. It's one extra watch, not per node but per election. That seems a small price to pay if it allows you to externally nuke the election and retry to fix a borked state.


vines:

So, my case is in the event of a network partition occurring. If 2 (or N + 1) nodes are in a cluster, and one of them happens to be the leader, but they get partitioned from each other while maintaining access to ZK, then the leader will keep being the leader even though it can't communicate with the rest of the cluster. In this event I would have a process rip out the leader's lock to:
1. get a new leader elected that can talk to the cluster, and
2. ideally make this extensible so that I can tell the partitioned server to die, since it's not part of the cluster.


randgalt:

"the leader will keep being the leader even though it can't communicate with the rest of the cluster" - this would be a badly behaving client. If the client gets SUSPENDED or LOST it should exit the leader code and assume it is not the leader. If you are partitioned and are in the non-quorum portion of the ZK ensemble you must assume that the system is down. ZooKeeper is a CP system.


dragonsinth:

I think he's saying the leader can talk to ZK (no partition there, so it's still the leader in ZK), but the node is somehow isolated from the other (non-ZK) nodes. I.e., there is an outside constraint whereby he wants to demote the current leader from the outside.


randgalt:

Curator supports revokable locks (see http://curator.apache.org/curator-recipes/shared-reentrant-lock.html - "Revoking"). I'd rather we add something orderly like revoking to the leader recipes than support manually deleting Curator-managed nodes.
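(For comparison, the "orderly" revocation request on the lock recipes is a single call today. A sketch, assuming the znode path of the lock node to revoke is known and exception handling is omitted:)

    // Ask the current holder of the lock node at lockNodePath to release it
    // cooperatively; the holder must have called makeRevocable() for this to work.
    Revoker.attemptRevoke(client, lockNodePath);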


vines:

I'm okay with this.


githubbot:

GitHub user oza reopened pull request #195 ("CURATOR-171 LeaderLatch isn't aware if its own ephemeral node goes away"); the description and commit are the same as in the earlier comment.



githubbot:

Github user oza closed the pull request at:

#195


githubbot:

Github user jacky1193610322 commented on the issue:

#195

When I want to change the leader manually, I can delete the leader node manually and let one of the others become the leader.
