-
Notifications
You must be signed in to change notification settings - Fork 937
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lightningd: fix overzealous memleak report and other flakes #8112
Merged
rustyrussell
merged 8 commits into
ElementsProject:master
from
rustyrussell:flake-memleak
Mar 24, 2025
Merged
lightningd: fix overzealous memleak report and other flakes #8112
rustyrussell
merged 8 commits into
ElementsProject:master
from
rustyrussell:flake-memleak
Mar 24, 2025
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
5295e02
to
b6a5490
Compare
b6a5490
to
1b26bbb
Compare
fa935c9
to
8b98703
Compare
Found by CI, not a real leak. Signed-off-by: Rusty Russell <[email protected]>
The sleep was simply allowing gossip to propagate, so (sometimes) l4 can connect to l3/l1. We really want to suppress the blinded path on l2: ``` 2025-03-03T05:25:16.6928072Z > l4.rpc.call('fetchinvoice', {'offer': offer3['bolt12']}) 2025-03-03T05:25:16.6928280Z 2025-03-03T05:25:16.6928367Z tests/test_pay.py:4540: 2025-03-03T05:25:16.6928609Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-03-03T05:25:16.6928929Z contrib/pyln-testing/pyln/testing/utils.py:760: in call ... 2025-03-03T05:25:16.6956050Z E pyln.client.lightning.RpcError: RPC call failed: method: fetchinvoice, payload: {'offer': 'lno1qgsqvgnwgcg35z6ee2h3yczraddm72xrfua9uve2rlrm9deu7xyfzrcgqyqs5pn0venx2u3nzrhqxhftzxfdlwsnfcgw2sy8t5mxa0ytcdfat2nkdwqvpy9nnsa9mzzaqfwys22njn0v5g3jswhdh684hnlnkvd5pme28q5hgyyhshhmntd6qqsz40a4renm8a94r9xf5eez73ygcmendmd9utwmx0kzlp0lwd9sq98sqvutrneuljglxekynj2wdhpsa36ra3ae7uql9g79w9qqc0rqrunkystxgsz2reyay8hdtzwjew38w2u4xavpq2fm360hd75pkyuhpar0adgu93z8gn0jsyxganyhelch7savw6vrgqpj80cdc5qkwkaz0dk65cwyatgmhszpv72axqz6ldvcjq9crzevzpd3aeet607hq7sk0fvz0musdn7r96w6zcssytfzxcs2xkdy0lml0tzy0jzugmyj8kjn8zfzrgq9fsgurc72x82e'}, error: {'code': 1003, 'message': 'Failed: could not route or connect directly to blinded path at 035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d: {"code":400,"message":"Unable to connect, no address known for peer"}'} ``` Signed-off-by: Rusty Russell <[email protected]>
We can in fact see the new channel before this line is called: ``` 2025-03-15T12:31:04.1472196Z @pytest.mark.openchannel('v1') 2025-03-15T12:31:04.1472616Z @pytest.mark.openchannel('v2') 2025-03-15T12:31:04.1473317Z @unittest.skipIf(TEST_NETWORK != 'regtest', 'elementsd doesnt yet support PSBT features we need') 2025-03-15T12:31:04.1474271Z def test_splice_gossip(node_factory, bitcoind): 2025-03-15T12:31:04.1475078Z l1, l2, l3 = node_factory.line_graph(3, fundamount=1000000, wait_for_announce=True, opts={'experimental-splicing': None}) 2025-03-15T12:31:04.1475781Z 2025-03-15T12:31:04.1476052Z chan_id = l1.get_channel_id(l2) 2025-03-15T12:31:04.1476460Z pre_splice_scid = first_scid(l1, l2) 2025-03-15T12:31:04.1476844Z 2025-03-15T12:31:04.1477134Z # add extra sats to pay fee 2025-03-15T12:31:04.1477741Z funds_result = l1.rpc.fundpsbt("109000sat", "slow", 166, excess_as_change=True) 2025-03-15T12:31:04.1478345Z 2025-03-15T12:31:04.1478765Z result = l1.rpc.splice_init(chan_id, 100000, funds_result['psbt']) 2025-03-15T12:31:04.1479432Z result = l1.rpc.splice_update(chan_id, result['psbt']) 2025-03-15T12:31:04.1479994Z assert(result['commitments_secured'] is False) 2025-03-15T12:31:04.1480584Z result = l1.rpc.splice_update(chan_id, result['psbt']) 2025-03-15T12:31:04.1481089Z assert(result['commitments_secured'] is True) 2025-03-15T12:31:04.1481386Z result = l1.rpc.signpsbt(result['psbt']) 2025-03-15T12:31:04.1481860Z result = l1.rpc.splice_signed(chan_id, result['signed_psbt']) 2025-03-15T12:31:04.1482403Z 2025-03-15T12:31:04.1485960Z wait_for(lambda: only_one(l2.rpc.listpeerchannels(l1.info['id'])['channels'])['state'] == 'CHANNELD_AWAITING_SPLICE') 2025-03-15T12:31:04.1489978Z wait_for(lambda: only_one(l1.rpc.listpeerchannels(l2.info['id'])['channels'])['state'] == 'CHANNELD_AWAITING_SPLICE') 2025-03-15T12:31:04.1490796Z 2025-03-15T12:31:04.1491223Z bitcoind.generate_block(6, wait_for_mempool=result['txid']) 2025-03-15T12:31:04.1491767Z 2025-03-15T12:31:04.1492213Z # l3 will see channel dying, but still consider it OK for 12 blocks. 2025-03-15T12:31:04.1493174Z l3.daemon.wait_for_log(f'gossipd: channel {pre_splice_scid} closing soon due to the funding outpoint being spent') 2025-03-15T12:31:04.1494422Z assert len(l3.rpc.listchannels(short_channel_id=pre_splice_scid)['channels']) == 2 2025-03-15T12:31:04.1495293Z > assert len(l3.rpc.listchannels(source=l1.info['id'])['channels']) == 1 2025-03-15T12:31:04.1495937Z E AssertionError: assert 2 == 1 2025-03-15T12:31:04.1503185Z E + where 2 = len([{'active': True, 'amount_msat': 1000000000, 'base_fee_millisatoshi': 1, 'channel_flags': 1, ...}, {'active': True, 'amount_msat': 1100000000, 'base_fee_millisatoshi': 1, 'channel_flags': 1, ...}]) ``` Signed-off-by: Rusty Russell <[email protected]>
``` > assert only_one(l1.rpc.listpeerchannels()['channels'])['last_stable_connection'] > recon_time + STABLE_TIME E assert 1742167762 > (1742167702.0235627 + 60) tests/test_connection.py:4545: AssertionError ``` Indeed, the > should be >=. Signed-off-by: Rusty Russell <[email protected]>
We play with directory permissions, but sqlites needs that, and sometimes (due to a timer, perhaps?) it gets really upset about it: ``` lightningd-1 2025-03-16T23:29:58.506Z INFO lightningd: Server started with public key 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518, alias JUNIORBEAM-f91eb11-modded (color #0266e4) and lightningd f91eb11-modded lightningd-1 2025-03-16T23:29:58.617Z DEBUG lightningd: Adding block 101: 55af171ade598f8c4f9a494eaaffe6b9f5ea64c7b0f3b273694148adf7ad7385 lightningd-1 2025-03-16T23:29:58.752Z TRACE lightningd: Calling rpc_command hook of plugin cln-xpay lightningd-1 2025-03-16T23:29:58.764Z TRACE lightningd: Plugin cln-xpay returned from rpc_command hook call lightningd-1 2025-03-16T23:29:58.784Z TRACE lightningd: Calling rpc_command hook of plugin cln-xpay lightningd-1 2025-03-16T23:29:58.840Z TRACE lightningd: Plugin cln-xpay returned from rpc_command hook call lightningd-1 2025-03-16T23:29:58.854Z TRACE lightningd: Calling rpc_command hook of plugin cln-xpay lightningd-1 2025-03-16T23:29:58.865Z DEBUG gossipd: REPLY WIRE_GOSSIPD_NEW_BLOCKHEIGHT_REPLY with 0 fds lightningd-1 2025-03-16T23:29:58.901Z **BROKEN** lightningd: Error executing statement: db/exec.c:108: UPDATE vars SET intval = intval + 1 WHERE name = 'data_version' AND intval = ?: attempt to write a readonly database lightningd-1 2025-03-16T23:29:58.916Z **BROKEN** lightningd: Error executing statement: db/exec.c:108: UPDATE vars SET intval = intval + 1 WHERE name = 'data_version' AND intval = ?: attempt to write a readonly database lightningd-1 2025-03-16T23:29:58.941Z **BROKEN** lightningd: FATAL SIGNAL 6 (version f91eb11-modded) {'github_repository': 'ElementsProject/lightning', 'github_sha': 'f91eb118388dc1455c67d16079168fd0267ba248', 'github_ref': 'refs/pull/8112/merge', 'github_ref_name': 'HEAD', 'github_run_id': 13888173501, 'github_head_ref': 'flake-memleak', 'github_run_number': 12573, 'github_base_ref': 'master', 'github_run_attempt': '1', 'testname': 'test_setconfig_access', 'start_time': 1742167762, 'end_time': 1742167800, 'outcome': 'fail'} ----------------------------- Captured stderr call ----------------------------- Error executing statement: db/exec.c:108: UPDATE vars SET intval = intval + 1 WHERE name = 'data_version' AND intval = ?: attempt to write a readonly database lightningd: FATAL SIGNAL 6 (version f91eb11-modded) Log dumped in /tmp/lightning-crash.log.20250316232958 Lost connection to the RPC socket.Lost connection to the RPC socket. =========================== short test summary info ============================ FAILED tests/test_misc.py::test_setconfig_access - AssertionError: Regex pattern did not match. Regex: 'Cannot write to config file /tmp/ltests-22a5dz1j/test_setconfig_access_1/lightning-1/regtest/config' Input: "RPC call failed: method: check, payload: {'command_to_check': 'setconfig', 'config': 'min-capacity-sat', 'val': 1000000}, error: Connection to RPC server lost." ERROR tests/test_misc.py::test_setconfig_access - ValueError: Node errors: - lightningd-1: had BROKEN messages - lightningd-1: Node exited with return code -6 ``` So we move the db file for sqlite. Signed-off-by: Rusty Russell <[email protected]>
By submitting them all at once, rather than serially, we should *definitely* hit the ratelimit. We can also reduce the timeout (from 60 seconds) to speed the test up. ``` @pytest.mark.slow_test def test_onionmessage_ratelimit(node_factory): l1, l2 = node_factory.line_graph(2, fundchannel=False, opts={'allow_warning': True}) offer = l2.rpc.call('offer', {'amount': '2msat', 'description': 'simple test'}) # Hopefully we can do this fast enough to reach ratelimit! > with pytest.raises(RpcError, match="Timeout waiting for response"): E Failed: DID NOT RAISE <class 'pyln.client.lightning.RpcError'> tests/test_pay.py:5825: Failed ``` Signed-off-by: Rusty Russell <[email protected]>
f4623d7
to
d3aaa1a
Compare
I saw this while watching top, and tracked it down. We load all the plugins to checksum them at startup: ``` $ ms_print massif.out.3312805 | head -n 50 -------------------------------------------------------------------------------- Command: lightningd/lightningd.real --developer --log-level=trace --cltv-delta=6 --cltv-final=5 --watchtime-blocks=5 --rescan=1 --disable-dns --lightning-dir=/tmp/ltests-roazlc8h/test_xpay_fake_channeld_1/lightning-1/ --addr=127.0.0.1:46337 --allow-deprecated-apis=false --network=regtest --ignore-fee-limits=false --bitcoin-rpcuser=rpcuser --bitcoin-rpcpassword=rpcpass --bitcoin-datadir=/tmp/ltests-roazlc8h/test_xpay_fake_channeld_1/lightning-1/ --dev-fast-gossip --dev-bitcoind-poll=1 --log-file=- --log-file=/tmp/ltests-roazlc8h/test_xpay_fake_channeld_1/lightning-1/log --log-prefix=lightningd-1 --dev-fail-on-subdaemon-fail --dev-no-reconnect --autoconnect-seeker-peers=0 --subdaemon=channeld:../tests/plugins/channeld_fakenet --dev-throttle-gossip --grpc-port=37819 --dev-crash-after=3600 --bitcoin-rpcport=51623 Massif arguments: (none) ms_print arguments: massif.out.3312805 -------------------------------------------------------------------------------- MB 446.5^# |# |# |# |# |# |# |# |# |# |# |# |# |# |# |# |# |# |# |# :: : ::@@:@:::::::::::::::::::::::::::@@::::::::::::::@::::@::::::: 0 +----------------------------------------------------------------------->Gi 0 75.66 Number of snapshots: 57 Detailed snapshots: [1 (peak), 2, 10, 12, 33, 45, 49] -------------------------------------------------------------------------------- n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) -------------------------------------------------------------------------------- 0 0 0 0 0 0 1 295,677,530 468,189,872 447,087,069 21,102,803 0 95.49% (447,087,069B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc. ->95.35% (446,440,727B) 0x2EFB99: tal_resize_ (tal.c:755) | ->94.23% (441,180,860B) 0x2EBCD1: grab_fd (grab_file.c:45) | | ->94.23% (441,180,860B) 0x2EBD54: grab_file (grab_file.c:63) | | ->94.23% (441,180,860B) 0x1A5CF7: file_checksum (plugin.c:315) | | ->94.23% (441,180,860B) 0x1A5EF1: plugin_register (plugin.c:355) | | ->94.23% (441,180,860B) 0x1AC62E: plugins_set_builtin_plugins_dir (plugin.c:2532) | | ->94.23% (441,180,860B) 0x16D614: find_subdaemons_and_plugins (lightningd.c:569) | | ->94.23% (441,180,860B) 0x16E9A1: main (lightningd.c:1226) | | | ->01.11% (5,219,417B) 0x2EBC32: grab_fd (grab_file.c:38) ``` Signed-off-by: Rusty Russell <[email protected]>
We reconnect from l3->l2, but l2 is also trying to reconnect and we can race: ``` # Now try when l3 uses scid for entry point of blinded path. l3.stop() l3.daemon.opts['dev-invoice-bpath-scid'] = None l3.start() > l3.rpc.connect(l2.info['id'], 'localhost', l2.port) tests/test_pay.py:5667: ... > raise RpcError(method, payload, resp['error']) E pyln.client.lightning.RpcError: RPC call failed: method: connect, payload: {'id': '022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59', 'host': 'localhost', 'port': 33557}, error: {'code': 402, 'message': 'disconnected during connection'} ``` Signed-off-by: Rusty Russell <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Found by CI. Then I kept adding commits as more CI issues showed up!
Changelog-None