
Fail to sync past epoch 22 on sui mainnet when data ingestion is enabled #20487

Closed
robmcl4 opened this issue Dec 2, 2024 · 5 comments · Fixed by #20490
robmcl4 commented Dec 2, 2024

Steps to Reproduce Issue

I am running within docker-compose. I ran a git bisect and identified that the issue was introduced somewhere among commits 76d4852 (known bad), 874da38 (does not compile), and 855aa35 (does not compile). The issue is present ONLY when the checkpoint-executor-config section is present in the config; otherwise it does not occur. The crash happens every time I delete the database and retry (it is not flaky).

I have the sui repository cloned in ./sui/

# docker-compose.yml
version: "3.9"                                                                                            
services:                                                                                                                                                                                                           
  fullnode:                                                                                               
    build:                             
      context: sui                                                                                        
      dockerfile: docker/sui-node/Dockerfile
    ports:
    - "18080:8080"
    - "18084:8084/udp"
    - "19000:9000"
    - "19184:9184"
    volumes:
    - ./config/fullnode.yaml:/opt/sui/config/fullnode.yaml:ro
    - ./genesis.blob:/opt/sui/config/genesis.blob:ro
    - ./data/:/opt/sui/db:rw
    command: ["/opt/sui/bin/sui-node", "--config-path", "/opt/sui/config/fullnode.yaml"]
    environment:
      RUST_BACKTRACE: full
# config/fullnode.yaml
# Update this value to the location you want Sui to store its database
db-path: "/opt/sui/db"

# For ipv4, update this to "/ipv4/X.X.X.X/tcp/8080/http"
# network-address: "/dns/localhost/tcp/8080/http"
network-address: "/ip4/0.0.0.0/tcp/8080/http"
metrics-address: "0.0.0.0:9184"
# this address is also used for web socket connections
json-rpc-address: "0.0.0.0:9000"
enable-event-processing: true

genesis:
  # Update this to the location of where the genesis file is stored
  genesis-file-location: "/opt/sui/config/genesis.blob"

authority-store-pruning-config:
  num-latest-epoch-dbs-to-retain: 3
  epoch-db-pruning-period-secs: 3600
  num-epochs-to-retain: 50
  num-epochs-to-retain-for-checkpoints: 60
  periodic-compaction-threshold-days: 1
  smooth: true
  max-checkpoints-in-batch: 10
  max-transactions-in-batch: 1000

checkpoint-executor-config:
  checkpoint-execution-max-concurrency: 200
  local-execution-timeout-sec: 30
  data-ingestion-dir: /opt/sui/db/ingestion

state-archive-read-config:
  - object-store-config:
      object-store: "S3"
      # Use mysten-testnet-archives for testnet 
      # Use mysten-mainnet-archives for mainnet
      bucket: "mysten-mainnet-archives"
      no-sign-request: true
      aws-region: "us-west-2"
      object-store-connection-limit: 20
    # How many objects to read ahead when catching up  
    concurrency: 20
    # Whether to prune local state based on latest checkpoint in archive.
    # This should stay false for most use cases
    use-for-pruning-watermark: false

p2p-config:
  seed-peers:
    - address: /dns/mel-00.mainnet.sui.io/udp/8084
      peer-id: d32b55bdf1737ec415df8c88b3bf91e194b59ee3127e3f38ea46fd88ba2e7849
    - address: /dns/ewr-00.mainnet.sui.io/udp/8084
      peer-id: c7bf6cb93ca8fdda655c47ebb85ace28e6931464564332bf63e27e90199c50ee
    - address: /dns/ewr-01.mainnet.sui.io/udp/8084
      peer-id: 3227f8a05f0faa1a197c075d31135a366a1c6f3d4872cb8af66c14dea3e0eb66
    - address: /dns/lhr-00.mainnet.sui.io/udp/8084
      peer-id: c619a5e0f8f36eac45118c1f8bda28f0f508e2839042781f1d4a9818043f732c
    - address: /dns/sui-mainnet-ssfn-1.nodeinfra.com/udp/8084
      peer-id: 0c52ca8d2b9f51be4a50eb44ace863c05aadc940a7bd15d4d3f498deb81d7fc6
    - address: /dns/sui-mainnet-ssfn-2.nodeinfra.com/udp/8084
      peer-id: 1dbc28c105aa7eb9d1d3ac07ae663ea638d91f2b99c076a52bbded296bd3ed5c
    - address: /dns/sui-mainnet-ssfn-ashburn-na.overclock.run/udp/8084
      peer-id: 5ff8461ab527a8f241767b268c7aaf24d0312c7b923913dd3c11ee67ef181e45
    - address: /dns/sui-mainnet-ssfn-dallas-na.overclock.run/udp/8084
      peer-id: e1a4f40d66f1c89559a195352ba9ff84aec28abab1d3aa1c491901a252acefa6
    - address: /dns/ssn01.mainnet.sui.rpcpool.com/udp/8084
      peer-id: fadb7ccb0b7fc99223419176e707f5122fef4ea686eb8e80d1778588bf5a0bcd
    - address: /dns/ssn02.mainnet.sui.rpcpool.com/udp/8084
      peer-id: 13783584a90025b87d4604f1991252221e5fd88cab40001642f4b00111ae9b7e

System Information

  • OS: Ubuntu 20.04 (host) - running in docker
  • Compiler: (whatever is specified in docker)

Exception

Below is a snippet of the exception with RUST_BACKTRACE=full (as set in docker-compose.yml above). I also attached the full log. Note that this log is from commit 76d4852; later commits also crash, and I have logs demonstrating that as well if needed.

2024-12-02T21:20:35.594194Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=2179588
2024-12-02T21:20:35.594231Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=2179589
2024-12-02T21:20:35.594270Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=2179590
2024-12-02T21:20:35.594307Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=2179591
2024-12-02T21:20:35.594343Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=2179592
2024-12-02T21:20:35.594380Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=2179593
2024-12-02T21:20:35.594417Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=2179594
2024-12-02T21:20:35.594454Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=2179595
2024-12-02T21:20:35.594491Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=2179596
2024-12-02T21:20:35.594528Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=2179597
2024-12-02T21:20:35.594567Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=2179598
2024-12-02T21:20:35.594605Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=2179599
thread 'sui-node-runtime' panicked at crates/sui-core/src/checkpoints/checkpoint_executor/mod.rs:2024-12-02T21:20:36.637449Z ERROR handle_execution_effects{seq=1727039 epoch=22}: telemetry_subscribers: panicked at crates/sui-core/src/checkpoints/checkpoint_executor/mod.rs:901:26:
Finalizing checkpoint cannot fail: UserInputError { error: ObjectNotFound { object_id: 0x70b68d5db939640143339dd72420ab785dd06d968e0f537a88ab5811922536b8, version: Some(SequenceNumber(1600521)) } } panic.file="crates/sui-core/src/checkpoints/checkpoint_executor/mod.rs" panic.line=901 panic.column=26
901:26:
Finalizing checkpoint cannot fail: UserInputError { error: ObjectNotFound { object_id: 0x70b68d5db939640143339dd72420ab785dd06d968e0f537a88ab5811922536b8, version: Some(SequenceNumber(1600521)) } }
stack backtrace:
   0:     0x564437ab862c - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hed7f999df88cc644
   1:     0x564437aeafe0 - core::fmt::write::h3a39390d8560d9c9
   2:     0x564437ab48cf - std::io::Write::write_fmt::h5fc9997dfe05f882
   3:     0x564437ab8414 - std::sys_common::backtrace::print::h23a2d212c6fff936
   4:     0x564437ab9dc7 - std::panicking::default_hook::{{closure}}::h8a1d2ee00185001a
   5:     0x564437ab9b2f - std::panicking::default_hook::h6038f2eba384e475
   6:     0x564435980f11 - telemetry_subscribers::set_panic_hook::{{closure}}::he588bce68a7a6344
   7:     0x564437aba3d8 - std::panicking::rust_panic_with_hook::h2b5517d590cab22e
   8:     0x564437aba12e - std::panicking::begin_panic_handler::{{closure}}::h233112c06e0ef43e
   9:     0x564437ab8af6 - std::sys_common::backtrace::__rust_end_short_backtrace::h6e893f24d7ebbff8
  10:     0x564437ab9e92 - rust_begin_unwind
  11:     0x5644345e7835 - core::panicking::panic_fmt::hbf0e066aabfa482c
  12:     0x5644345e7d73 - core::result::unwrap_failed::hddb4fea594200c52
  13:     0x56443542c92c - sui_core::checkpoints::checkpoint_executor::handle_execution_effects::{{closure}}::{{closure}}::h5ae147691fba6339
  14:     0x5644353d9fa9 - <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll::h485069d7b4a4c5f7
  15:     0x5644356f148a - sui_core::checkpoints::checkpoint_executor::execute_transactions::{{closure}}::{{closure}}::h912dceab8ae0aef7
  16:     0x5644356ec6a6 - sui_core::checkpoints::checkpoint_executor::CheckpointExecutor::schedule_checkpoint::{{closure}}::hbfb3217921f323bd
  17:     0x56443568e62e - tokio::runtime::task::core::Core<T,S>::poll::h4628986556ac2967
  18:     0x564434f85ef0 - tokio::runtime::task::harness::Harness<T,S>::poll::ha1189cc4ac93a1b8
  19:     0x564437a41efd - tokio::runtime::scheduler::multi_thread::worker::Context::run_task::ha9f2411bcf3e6e6d
  20:     0x564437a4143e - tokio::runtime::scheduler::multi_thread::worker::Context::run::h4e3b13bb758e7c68
  21:     0x564437a295f5 - tokio::runtime::context::set_scheduler::hd572611497be79fe
  22:     0x564437a31581 - tokio::runtime::context::runtime::enter_runtime::hc80d7d4b134cf110
  23:     0x564437a4098d - tokio::runtime::scheduler::multi_thread::worker::run::h14d6c4ee9e66e884
  24:     0x564437a28048 - tokio::runtime::task::core::Core<T,S>::poll::hc80d4a3c15397ec1
  25:     0x564437a21d1a - tokio::runtime::task::harness::Harness<T,S>::poll::h6ac58ae218508437
  26:     0x564437a3cfd5 - tokio::runtime::blocking::pool::Inner::run::h36ddb1719c744f54
  27:     0x564437a26c7a - std::sys_common::backtrace::__rust_begin_short_backtrace::h977756efa16f3622
  28:     0x564437a27119 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hcf2a9e0b415079a4
  29:     0x564437ac0965 - std::sys::unix::thread::Thread::new::thread_start::he469335aef763e45
  30:     0x7f4f807aeea7 - start_thread
  31:     0x7f4f80582acf - clone
  32:                0x0 - <unknown>

76d48528e6c11bdea99fbb26472884cf5ed97c14.crash.log.gz
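For context on the panic shape: a message of the form "Finalizing checkpoint cannot fail: UserInputError { .. }" is what `Result::expect` produces when called on an `Err` value. A minimal sketch of that mechanism, with illustrative stand-in types rather than the actual sui-core code:

```rust
use std::panic;

// Illustrative stand-in for the real sui-core error type.
#[derive(Debug)]
enum UserInputError {
    ObjectNotFound {
        object_id: &'static str,
        version: Option<u64>,
    },
}

// Hypothetical finalization step that fails because an input object is missing.
fn finalize_checkpoint() -> Result<(), UserInputError> {
    Err(UserInputError::ObjectNotFound {
        object_id: "0x70b6...",
        version: Some(1600521),
    })
}

fn main() {
    // `expect` panics with "<message>: <Debug of the Err value>", matching the
    // "Finalizing checkpoint cannot fail: UserInputError { .. }" line in the log.
    let outcome = panic::catch_unwind(|| {
        finalize_checkpoint().expect("Finalizing checkpoint cannot fail")
    });
    assert!(outcome.is_err());
    println!("panic captured as expected");
}
```

The interesting question, answered by the fix discussed below, is why the executor's input-object lookup started returning `ObjectNotFound` at all after the bisected range.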

robmcl4 commented Dec 2, 2024

Seems to have been introduced by #18144

bmwill commented Dec 3, 2024

@robmcl4 Thanks for the very detailed report and for bisecting to figure out where the bug was introduced. It's unfortunate that some of those commits didn't build (that's embarrassing on my part), but the information you provided helped me know where to look.

I haven't had time to reproduce the case you provided, but I believe #20490 (or some form of that fix) should address the change in functionality, and the bug, introduced in 874da38 (CheckpointData: fix bug when loading input objects with deleted shared objects, 2024-06-10).

If you get a chance before I do, let me know whether the patch successfully fixes the bug and lets you proceed past epoch 22 with ingestion turned on.

robmcl4 commented Dec 3, 2024

No worries, I will pull bmwill:effects-v1-modified_at_versions and test on my machine now :)

robmcl4 commented Dec 3, 2024

That seems to have fixed it!

Meurig2465 commented Dec 12, 2024

Could someone help me determine what's happening to my Sui wallet, please? I have reached out to Sui and Nightly but have heard nothing.
Sui wallet address (screenshot attached: Screenshot_20241211_210048_Chrome):

0x2994ec89c041fc1c7020b967e94ecd10cb456b0aaca452c8a9a7524733667456

I suspect this is some kind of bug, but the project has liquidity and if I can access it I would dearly like to 😅
