decentralize p2p network #275

Closed
Tracked by #280
dshulyak opened this issue Oct 3, 2023 · 5 comments
dshulyak commented Oct 3, 2023

The problem with the existing setup is that many nodes are undialable, so Spacemesh has to run CDN-like nodes (which we call boosters internally) to help with network connectivity.

This is not good for the network's long-term health, and we want to fix it when time allows. In the past we failed with the libp2p hole punching protocol, but that may have been because we rushed things. We want to:

  • enable quic / webrtc transport
  • retest hole punching protocol
  • measure it

This doesn't introduce any immediate issues, but it is not good for network health in the long term.

@dshulyak dshulyak moved this to 📋 Backlog in Dev team kanban Oct 3, 2023
@lrettig lrettig mentioned this issue Oct 17, 2023
5 tasks
spacemesh-bors bot pushed a commit to spacemeshos/go-spacemesh that referenced this issue Dec 19, 2023
## Motivation

This PR implements changes needed for spacemeshos/pm#275, except for measurement

## Changes
* Introduce Routing Discovery to contact peers behind NATs
* Introduce dynamic v2 relay discovery which is needed for hole punching. The idea is to have a wider array of circuit-v2 passive relays which should be much safer than old libp2p active relays (which were disabled in e.g. Filecoin due to security concerns)
* Introduce QUIC transport to improve chances at hole punching, with testnet-mainnet "crosstalk" protection based on a transport-level handshake mechanism
  * the handshake is not used on mainnet. That way, connections between mainnet and testnet nodes are still prevented, as testnet peers expect the handshake, but if/when my libp2p changes are merged (libp2p/go-libp2p#2658) or libp2p gets private network support
* Make it possible to listen on multiple addresses and advertise multiple addresses
* Extend DebugService with additional P2P info needed for hole punching diagnostics (needs spacemeshos/api#285)
* Add `ping-peers` config option to facilitate P2P network issue diagnostics
* Add `force-dht-server` config option that is useful during troubleshooting DHT and hole-punching issues

`ping-peers` and `force-dht-server` were initially considered to be temporary features, but I think it might make sense to keep them for various P2P network troubleshooting scenarios.

All of the changes are disabled in the config by default, except for:
* libp2p Ping service is enabled by default to make diagnostics easier
* DHT Values and Providers are enabled, as these will make DHT routing peer discovery work efficiently from the start once we enable this feature in the configs
* Bootnodes aren't used as relays by default anymore. v2 relays have very limited capacity by default, and bootnode relay servers' reservations are exhausted very quickly. You need to either specify a static relay list or enable routing discovery, which searches for more available relays as needed
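
The relay requirement in the last item can be sketched as a small config check. This is a minimal illustration, not go-spacemesh code (which is written in Go): the class, the function, and all field names below are hypothetical, and the assumption that both relay sources may be used together is mine rather than something the PR states.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class P2PConfig:
    # All field names are hypothetical; they are not the real config keys.
    enable_hole_punching: bool = False
    enable_routing_discovery: bool = False
    static_relays: List[str] = field(default_factory=list)


def relay_sources(cfg: P2PConfig) -> List[str]:
    """Return the places relay candidates could come from.

    Hole punching needs relays, and bootnodes no longer act as relays by
    default, so at least one of the two sources must be configured.
    """
    if not cfg.enable_hole_punching:
        return []  # no relays needed when hole punching is off
    sources = []
    if cfg.static_relays:
        sources.append("static")
    if cfg.enable_routing_discovery:
        sources.append("routing-discovery")
    if not sources:
        raise ValueError(
            "hole punching needs a static relay list or routing discovery"
        )
    return sources
```

With hole punching on and neither source configured, the check fails early instead of leaving the node silently without relays.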

## Test Plan
* Tested using several k8s clusters with cone NATs enabled via `bridge` CNI plugin (via Multus) -- backported to v1.2.8
* Added a Mac node for testing

## TODO
- [x] Have spacemeshos/api#285 merged and updated to the new `api` release
- [x] Retest using an image based on this branch (not backport)
- [ ] Decide on whether/how to extend systests to include NAT testing
- [ ] To check: TCP holepunching tends to happen more than QUIC (might be related to the handshake mechanism)
- [ ] ~~To consider: try picking up some % (e.g.: 50%) of non-infra peers during routing discovery~~ (doesn't work too well, need something more involved for that)

Maybe as a follow-up (depending on how soon this gets reviewed):
- Include new metrics / check if they're already present
  - NAT type (UDP / TCP) - Cone / Symmetric / Unknown
  - Reachability - Public / Private / Unknown
  - N of "advertised" peers found via routing discovery
  - N of TCP and UDP (QUIC) peers
  - N of peers reached via relayed connections (these being present for a long time may indicate hole-punching troubles; relayed connections usually go away relatively quickly)
  - N of relay reservations this node managed to obtain
  - Whether routing discovery is active or suspended (e.g. b/c `low-peers` N of peers has been reached)
  - Whether DHT is in the `Server` or `Client` mode
- systests checking NATed connections
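
One possible shape for the metrics listed above, as a rough sketch only. Every name here is hypothetical, the 300-second threshold is an arbitrary placeholder, and Python is used for brevity even though go-spacemesh is written in Go. The helper flags relayed connections that have lasted unusually long, which the list suggests may indicate hole-punching trouble.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class NATType(Enum):
    CONE = "cone"
    SYMMETRIC = "symmetric"
    UNKNOWN = "unknown"


class Reachability(Enum):
    PUBLIC = "public"
    PRIVATE = "private"
    UNKNOWN = "unknown"


class DHTMode(Enum):
    SERVER = "server"
    CLIENT = "client"


@dataclass
class P2PSnapshot:
    nat_udp: NATType
    nat_tcp: NATType
    reachability: Reachability
    advertised_peers: int           # found via routing discovery
    tcp_peers: int
    quic_peers: int
    relay_reservations: int
    routing_discovery_active: bool  # False when discovery is suspended
    dht_mode: DHTMode
    # Peer ID -> time (seconds) the relayed connection was first seen.
    relayed_since: Dict[str, float]


def stale_relayed_peers(
    snap: P2PSnapshot, now: float, max_age: float = 300.0
) -> List[str]:
    """Peers whose relayed connection has lasted longer than max_age seconds."""
    return sorted(
        peer for peer, since in snap.relayed_since.items()
        if now - since > max_age
    )
```

The number of relayed peers is simply `len(snap.relayed_since)`; a non-empty result from `stale_relayed_peers` would be the signal worth alerting on.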

Co-authored-by: Ivan Shvedunov
ivan4th added a commit to spacemeshos/go-spacemesh that referenced this issue Dec 20, 2023
ivan4th added a commit to spacemeshos/go-spacemesh that referenced this issue Dec 20, 2023
ivan4th added a commit to spacemeshos/go-spacemesh that referenced this issue Dec 25, 2023
ivan4th added a commit to spacemeshos/go-spacemesh that referenced this issue Dec 26, 2023
dsmello pushed a commit to spacemeshos/go-spacemesh that referenced this issue Dec 28, 2023
ivan4th added a commit to spacemeshos/go-spacemesh that referenced this issue Dec 29, 2023
@pigmej pigmej moved this from 📋 Backlog to 🔖 Next in Dev team kanban Apr 9, 2024
@pigmej pigmej moved this from 🔖 Next to 🏗 Doing in Dev team kanban Apr 9, 2024

ivan4th commented Apr 9, 2024

With P2P decentralization (#5329) and fetch streaming (#5562) merged, we're currently trying to enable decentralization features in the network. The current problem appears to be that the routing discovery mechanism causes too much network load on user nodes.

The following items are planned:


ivan4th commented Apr 16, 2024

Testing QUIC on the testnet. There was an issue with malfeasance sync on the testnet, now fixed: spacemeshos/go-spacemesh#5851

Looking into reports that streaming mode is somehow causing too many TCP connections (?)


ivan4th commented Apr 25, 2024

spacemeshos/go-spacemesh#5792 should make it easier to diagnose possible network issues the next time we try to enable decentralization. It would be best to include corresponding views in SMAPP.

spacemeshos/go-spacemesh#5882 should reduce the network strain caused by DHT and connection activity after routing discovery is enabled.


ivan4th commented May 1, 2024

spacemeshos/go-spacemesh#5902 adds systests for QUIC mode.


pigmej commented Aug 28, 2024

This is done; the default smapp setup now has QUIC and discovery enabled. Improvements will come in separate tasks.

@pigmej pigmej closed this as completed Aug 28, 2024
@github-project-automation github-project-automation bot moved this from 🏗 Doing to ✅ Done in Dev team kanban Aug 28, 2024