-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes for HTLC db issues #2000
Fixes for HTLC db issues #2000
Conversation
2e629ae
to
362fe44
Compare
… is True This is what the callers want; generalize it. Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
lightningd: Outstanding taken pointers: lightningd/pay.c:243:channel_update Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
This means we need to check when we've altered the state, so the checks are moved to the callers of htlc_in_update_state and htlc_out_update_state. Signed-off-by: Rusty Russell <[email protected]>
…utgoing. Usually, we only close an incoming HTLC once the outgoing HTLC is completely resolved. However, we short-cut this in the FULFILL case: we have the preimage, so might as well use it immediately (in fact, we wait for it to be committed, but we don't need to in theory). As a side-effect of this, our assumption that every outgoing HTLC has a corresponding incoming HTLC is incorrect, and this test (xfail) tickles that corner case. Signed-off-by: Rusty Russell <[email protected]>
… outgoing. We now need an explicit 'local' flag, rather than relying on the existence of the 'in' pointer. Signed-off-by: Rusty Russell <[email protected]>
…reed. Now we know this can happen (see previous patch), we need to handle it. Signed-off-by: Rusty Russell <[email protected]>
Which we don't handle, due to a separate bug, so it's xfail. Signed-off-by: Rusty Russell <[email protected]>
…hout a preimage. We need to handle this case (old db) before the next commit, which actually fixes it. Signed-off-by: Rusty Russell <[email protected]>
We can now wrap the 'missing preimage' hack in COMPAT_V061. Signed-off-by: Rusty Russell <[email protected]>
Under stress, it fails (test_restart_many_payments, the next test). I suspect a deep misunderstanding in the comparison code, will chase separately. Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
We don't save them to the database, so fix things up as we load them. Next patch will actually save them into the db, and this will become COMPAT code. Also: call htlc_in_check() with NULL on db load, as otherwise it aborts internally. Signed-off-by: Rusty Russell <[email protected]>
We don't write it in there yet, so this change currently has no effect. Signed-off-by: Rusty Russell <[email protected]>
Now we can finally move the fixup code under COMPAT_V061, so it's only for old nodes. Signed-off-by: Rusty Russell <[email protected]>
failoutchannel tells us which channel to send an update for (specifically for temporary_channel_failure); but we don't save it into the db. It's not even clear we should, since it's a corner case and the channel might not even exist when we come back. So on db restore, change such errors to WIRE_TEMPORARY_NODE_FAILURE which doesn't need an update. We also don't memset it to 0 in the normal case (we only access if it failcode has the UPDATE bit set) so valgrind will trigger if we're wrong. Signed-off-by: Rusty Russell <[email protected]>
We don't expect payment or payment->route_channels to be NULL without an old db, but putting an assert there reveals that we try to fail an HTLC which has already succeeded in 'test_onchain_unwatch'. Obviously we only want to fail an HTLC which goes onchain if we don't already have the preimage! Signed-off-by: Rusty Russell <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything looks ok from my side, just some minor nits about naming and comments. The only thing that might take some time, but I think is a nice cleanup, is to remove the preimage
from hout
altogether and making the state an explicit enum
field, to make it easier to reason about the states that an HTLC may be in.
ACK 362fe44
# Make sure everyone sees all channels: we can cheat and | ||
# simply check the ends (since it's a line). | ||
wait_for(lambda: both_dirs_ready(nodes[0], scids[-1])) | ||
wait_for(lambda: both_dirs_ready(nodes[-1], scids[0])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice simplification
|
||
if not announce: | ||
return nodes | ||
|
||
bitcoin.generate_block(5) | ||
|
||
def both_dirs_ready(n, scid): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will take a really long time without DEVELOPER=1
, since we can't set the broadcast interval, so each hop adds 30 wait time this way. But it seems we mostly wait anyway (or have a 2-line-graph which does not suffer this problem).
|
||
# Now, l2 should restore from DB fine, even though outgoing HTLC no longer | ||
# has an incoming. | ||
l2.restart() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically restarting does not trigger a reconnect, and it's not needed to check the HTLC restore from the database. It might be interesting nevertheless to have them reconnect and check that they clear correctly, not just load correctly from the DB. A bunch of wait_for
s should suffice to see the HTLC settle correctly.
Covered in 4f42de2
lightningd/htlc_end.h
Outdated
@@ -73,6 +73,9 @@ struct htlc_out { | |||
/* If we fulfilled, here's the preimage. */ | |||
struct preimage *preimage; | |||
|
|||
/* Is this local? Implies ->in is NULL. */ | |||
bool local; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're already using local
in the context of a single channel quite extensively, could we name this origin
or sender
instead? I'd like to separate the nomenclatures of channel context from the multi-channel context a bit more, to avoid confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack.
/* FIXME: COMPAT_V061 only */ | ||
log_broken(ld->log, | ||
"Missing preimage for orphaned HTLC; replacing with zeros"); | ||
hout->preimage = talz(hout, struct preimage); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears we are using the preimage
solely to test for success and to pass it to the payment. So it is never used on the downstream side, only on the upstream side (towards the sender), so why are we keeping this at all?
I might have been a bit overzealous in saving it on both sides, when it could just have been part of the state. This might be part of a bigger PR that drops the preimage
from hout
, but it'd allow us to avoid a compat flag here (and subsequent forced global reload).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a FIXME.
# Manually create routes, get invoices | ||
route = [] | ||
payment_hash = [] | ||
for i in range(len(innodes)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a tuple of (route, hash)
seems to be preferable here. it'd make the unpacking for
loop below easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also since we are iterating over innodes
we could extend this to be (innode, route, hash)
making the loop really trivial below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the spirit of pythonicity, this is a bit nicer:
import itertools
for inode, outnode, outchan, inchan in itertools.izip(innodes,outnodes,outchans,inchans):
tests/test_connection.py
Outdated
|
||
# sendpay is async. | ||
for i in range(len(payment_hash)): | ||
innodes[i // 2].rpc.sendpay(route[i], payment_hash[i]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The [i // 2]
seems strange here, shouldn't we iterate over all innodes
not just the first half?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two payments per innode...
wallet/wallet.c
Outdated
@@ -1346,6 +1346,34 @@ static bool wallet_stmt2htlc_out(struct channel *channel, | |||
return ok; | |||
} | |||
|
|||
/* We didn't used to save failcore, failuremsg... */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "didn't used to"
@@ -1378,19 +1385,20 @@ static void fixup_hin(struct wallet *wallet, struct htlc_in *hin) | |||
if (hin->failcode || hin->failuremsg) | |||
return; | |||
|
|||
hin->failcode = WIRE_TEMPORARY_CHANNEL_FAILURE; | |||
hin->failcode = WIRE_TEMPORARY_NODE_FAILURE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This and the next change could be added to the commit that added this fix in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, agreed. I only realized it when I found that issue though...
#endif | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: extra newline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll kill that, too.
Noted by @cdecker, the term 'local' is grossly overused, and the hout preimage is basically only used as a sanity check (though I've just put a FIXME there for now). Also eliminated spurious blank line which crept into wallet.c. Signed-off-by: Rusty Russell <[email protected]>
Awesome, thanks for the fixes @rustyrussell ACK 14e2798 The merge request should disappear when doing a |
tests/test_connection.py
Outdated
# We agree on fee change first, then add HTLC, then remove; stop after remove. | ||
disconnects = ['+WIRE_COMMITMENT_SIGNED*3'] | ||
# We manually reconnect l2 & l3, after 100 blocks; hence allowing manual | ||
# reconnect, but disabling auto coneect, and massive cltv so 2/3 doesn't |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: coneect -> connect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to rebase anyway due to CHANGELOG conflict, so I snuck this in too... thanks!
struct preimage *preimage; | ||
|
||
/* Is this local? Implies ->in is NULL. */ | ||
bool local; | ||
/* Is this a locally-generated payment? Implies ->in is NULL. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is much helpful 👍
# Manually create routes, get invoices | ||
route = [] | ||
payment_hash = [] | ||
for i in range(len(innodes)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the spirit of pythonicity, this is a bit nicer:
import itertools
for inode, outnode, outchan, inchan in itertools.izip(innodes,outnodes,outchans,inchans):
IIUC, namedtuple is like tuple for grown-ups: Pythonify! Suggested-by: @cdecker Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
14e2798
to
18b6c63
Compare
Puzzled about two PR's, I wrote a stress test, which revealed a number of bugs, including those which caused the horrific issues #1960 and #1993.
It also triggers a yet-to-be-diagnosed occasional issue where we dislike the peer's option_dataloss_protect; I have temporarily disabled that check in order to resolve these more important issues, but am now chasing that one.
Fixes: #1960
Fixes: #1993