Don't close incoming connection on failed dial back. #255
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Problem is that after restarting a node it much longer than expected take time to connect.
node1 is always running, node2 is one which is reconnecting
After node2 is disconnecting,
connLoop
constantly retries this connection up until it either will be READY again or after 30*3s will pass.What happens if node2 is getting started again - node2 is dialing node1, grpc dial is successful, credentials check as well, so we store this connection to peers and start
connLoop
goroutine. Meanwhile node1 is trying to create its client back to node2 in async way, but it fails, since it ALREADY HAS connection to it inpeers
map and its broken (its in transient failure state and did not yet recover). So node1 is closing new connection from node2. Then somewhere around same time node2 after dialing node1 tries toPing
it using closed connection so ping fails and now both node1 and node2 have connections to each other which are broken and being constantly retried. After 30 tries one of the nodes creates new connection, but, if other node already has broken connection to it nothing will happen and this loop will continue. Situation resolves, but non deterministic and only when both nodes are getting fresh connections at the same time.I agree with proposed by @rkeene solution, which is in this PR - do not close incoming connection if we're not able to dial it back. This way incoming connection remains and node which tried to connect to us will have working connection and will be able to gather information from us even if we do not benefit from this communication. Meanwhile if we already have connection to this node but it in failure state (like in example above) it will be recovered by means of grpc very shortly. This also resolves cases if user's computers do not have required ports open for example.