
[CURATOR-171] LeaderLatch and LeaderSelector should support Revokable behavior #692

Open
jira-importer opened this issue Dec 9, 2014 · 18 comments

Comments


Many Curator lock recipes support revoking. The leader recipes should support this as well, since they use locking internally. See http://curator.apache.org/curator-recipes/shared-reentrant-lock.html - "Revoking" (a sketch of that mechanism follows below).


Originally reported by vines, imported from: LeaderLatch and LeaderSelector should support Revokable behavior
  • status: Open
  • priority: Minor
  • resolution: Unresolved
  • imported: 2025-01-21
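For reference, the revocation mechanism the description points to (on the shared-reentrant-lock page) looks roughly like this on the lock-holder side. This is a minimal sketch, not part of the original report; it assumes an already-started CuratorFramework named client, and the lock path is illustrative:

    // Sketch of the existing lock revocation support that this issue asks to
    // extend to the leader recipes. Exception handling is abbreviated.
    InterProcessMutex lock = new InterProcessMutex(client, "/examples/revocable-lock");
    lock.makeRevocable(new RevocationListener<InterProcessMutex>()
    {
        @Override
        public void revocationRequested(InterProcessMutex forLock)
        {
            try
            {
                forLock.release();   // cooperatively give up the lock when asked
            }
            catch ( Exception e )
            {
                // log / handle as appropriate
            }
        }
    });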

[email protected]:

Is this issue being actively worked on? I just noticed something similar, where sometimes the leader latch doesn't realize that the ephemeral node associated with it has been deleted and continues as leader.

I am using version 2.11.0.


githubbot:

GitHub user oza opened a pull request:

#195

CURATOR-171 LeaderLatch isn't aware if its own ephemeral node goes away

The root cause of the problem reported in CURATOR-171 is that LeaderLatch is not aware of losing its own znode after acquiring the lock. This PR makes LeaderLatch notice that its znode is gone and mark itself "not leader".

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/oza/curator CURATOR-171

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/curator/pull/195.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #195


commit 3d9aed6
Author: Tsuyoshi Ozawa
Date: 2017-01-12T09:24:19Z

CURATOR-171 - LeaderLatch isn't aware if its own ephemeral node goes away



ozawa:

John Vines Parminder Grewal Thanks for reporting this issue. After talking with Jordan on #195, I noticed that the latch code handles connection loss / reconnection correctly: after detecting the change, the leader marks itself as "not a leader". As reported, the test case described is not handled, but that situation cannot arise through normal use of the leader latch. Am I missing something? Please correct me if I'm wrong. Thanks.


vines:

I would expect that if the ephemeral latch node were removed, as in my test code above, the leader would recognize it had its lock pulled and would no longer report itself as a master.
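(The test code referred to here is not included in the imported thread. A minimal sketch of the kind of scenario being described, where the client, latch path, and single-participant assumption are all hypothetical and exception handling is omitted, might look like:)

    // Illustration only, not the original test code.
    LeaderLatch latch = new LeaderLatch(client, "/examples/leader-latch");
    latch.start();
    latch.await();   // block until this instance becomes leader

    // Simulate an external actor "pulling the lock" by deleting the latch's
    // ephemeral znode directly (assumes this latch is the only participant).
    String child = client.getChildren().forPath("/examples/leader-latch").get(0);
    client.delete().forPath("/examples/leader-latch/" + child);

    // The behavior being reported: hasLeadership() may still return true here,
    // because the latch does not watch its own node.
    System.out.println("still leader? " + latch.hasLeadership());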


randgalt:

The leader latch has never watched its own node. That's not in its contract. Further, there's overhead to doing this. None of the Curator lock/leader recipes watch their own nodes. There's no reason to. Please see Tech Note 7: https://cwiki.apache.org/confluence/display/CURATOR/TN7


dragonsinth:

For what it's worth (this isn't a Curator issue), in Solr I've frequently run into problems as an administrator where a member's internal view of who the leader is got out of sync with the actual queue. As an admin, I've desperately wished each node would also watch its own registration. That way, if things got out of whack (no leader, too many leaders), I could just delete all the nodes and have everyone re-register and sort things out.


randgalt:

FYI - I've added persistent recursive watches to ZooKeeper (https://issues.apache.org/jira/browse/ZOOKEEPER-1416). Maybe at some point in the future we could have some Curator code that watches all appropriate nodes and notifies leaders, etc. But, IMO, this is a lot of complication for what is really an end-user error.


[email protected]:

Jordan Zimmerman While I understand that there is overhead associated with creating a watch for the nodes, I am making the assumption that the follower nodes (i.e. nodes that aren't leaders) are already either periodically checking for the existence of the ephemeral node or have a watch on it. That is how, even when we delete the node, another follower assumes the role of the leader. (Please correct me if I am wrong.)

Given the nature of leader election (number of followers > number of leaders), I would assume that the load added by the leader watching its own ephemeral node as well would be minimal. This would make the algorithm less susceptible to issues like this one (and ultimately more usable out of the box in production environments).

To address the performance concern, we could make the behavior configurable: by default the leader does not watch its own ephemeral node, but users can set a flag to enable it.

Also, it would be good to highlight this in the user docs? (Maybe it already is somewhere.)
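(Mechanically, the optional self-watch proposed above could be built on Curator's existing watch support. This is a rough sketch, not anything in Curator today; the client and ourPath names are assumptions, ZooKeeper watches are one-shot so it would have to be re-registered after firing, and exception handling is omitted:)

    // Hypothetical sketch of a leader watching its own ephemeral node.
    client.checkExists()
        .usingWatcher((CuratorWatcher) event -> {
            if ( event.getType() == Watcher.Event.EventType.NodeDeleted )
            {
                // Our node was removed externally: relinquish leadership
                // and/or re-enter the election here.
            }
        })
        .forPath(ourPath);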


randgalt:

Yes, non-leader nodes watch one other node. Adding another watcher per participant could inflate the number of watchers dramatically in a large system.

Anyway, this is a community project. If the community wants to add this we can vote on it. My vote is -1 of course.

"Also, it would be good to highlight this in the user docs? (Maybe it already is somewhere.)"

It's implied by Tech Note 7: https://cwiki.apache.org/confluence/display/CURATOR/TN7


dragonsinth:

I could totally get on board with the idea that everyone watches exactly one node, and the leader watches itself. It's one extra watch, not per node but per election. That seems a small price to pay if it allows you to externally nuke the election and retry to fix a borked state.


vines:

So, my case is in the event of a network partition occurring. If 2 (or N + 1) nodes are in a cluster, and one of them happens to be the leader, but they get partitioned from each other while maintaining access to ZK, then the leader will keep being the leader even though it can't communicate with the rest of the cluster. In this event I would have a process rip out the leader's lock to:
1. get a new leader elected that can talk to the cluster, and
2. ideally make this extensible so that I can tell the partitioned server to die, since it's not part of the cluster.


randgalt:

"the leader will keep being the leader even though it can't communicate with the rest of the cluster" - this would be a badly behaving client. If the client gets SUSPENDED or LOST it should exit the leader code and assume it is not the leader. If you are partitioned and are in the non-quorum portion of the ZK ensemble you must assume that the system is down. ZooKeeper is a CP system.


dragonsinth:

I think he's saying the leader can talk to ZK (no partition there, so it's still the leader in ZK), but the node is somehow isolated from the other (non-ZK) nodes. I.e., there is an outside constraint whereby he wants to demote the current leader from the outside.


randgalt:

Curator supports revokable locks (see http://curator.apache.org/curator-recipes/shared-reentrant-lock.html - "Revoking"). I'd rather we add something orderly like revoking to the leader recipes than support manually deleting Curator-managed nodes.
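(For comparison, the "orderly" revocation request on the lock recipes is a single call today. A sketch, assuming the znode path of the lock node to revoke is known and exception handling is omitted:)

    // Ask the current holder of the lock node at lockNodePath to release it
    // cooperatively; the holder must have called makeRevocable() for this to work.
    Revoker.attemptRevoke(client, lockNodePath);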


vines:

I'm okay with this.


githubbot:

GitHub user oza reopened pull request #195 ("CURATOR-171 LeaderLatch isn't aware if its own ephemeral node goes away"); the description and commit are the same as in the earlier comment.



githubbot:

Github user oza closed the pull request at:

#195


githubbot:

Github user jacky1193610322 commented on the issue:

#195

When I want to change the leader manually, I can delete the leader node manually and let one of the others become the leader.
