From 4dd9920d991745c4a16f53a8f615f706fbe4b3f7 Mon Sep 17 00:00:00 2001 From: Robbie Ko Date: Thu, 5 Jan 2017 16:24:55 +0800 Subject: [PATCH 01/11] Btrfs: send, fix failure to rename top level inode due to name collision Under certain situations, an incremental send operation can fail due to a premature attempt to create a new top level inode (a direct child of the subvolume/snapshot root) whose name collides with another inode that was removed from the send snapshot. Consider the following example scenario. Parent snapshot: . (ino 256, gen 8) |---- a1/ (ino 257, gen 9) |---- a2/ (ino 258, gen 9) Send snapshot: . (ino 256, gen 3) |---- a2/ (ino 257, gen 7) In this scenario, when receiving the incremental send stream, the btrfs receive command fails like this (ran in verbose mode, -vv argument): rmdir a1 mkfile o257-7-0 rename o257-7-0 -> a2 ERROR: rename o257-7-0 -> a2 failed: Is a directory What happens when computing the incremental send stream is: 1) An operation to remove the directory with inode number 257 and generation 9 is issued. 2) An operation to create the inode with number 257 and generation 7 is issued. This creates the inode with an orphanized name of "o257-7-0". 3) An operation rename the new inode 257 to its final name, "a2", is issued. This is incorrect because inode 258, which has the same name and it's a child of the same parent (root inode 256), was not yet processed and therefore no rmdir operation for it was yet issued. The rename operation is issued because we fail to detect that the name of the new inode 257 collides with inode 258, because their parent, a subvolume/snapshot root (inode 256) has a different generation in both snapshots. So fix this by ignoring the generation value of a parent directory that matches a root inode (number 256) when we are checking if the name of the inode currently being processed collides with the name of some other inode that was not yet processed. We can achieve this scenario of different inodes with the same number but different generation values either by mounting a filesystem with the inode cache option (-o inode_cache) or by creating and sending snapshots across different filesystems, like in the following example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/a1 $ mkdir /mnt/a2 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/1.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ touch /mnt/a2 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs receive /mnt -f /tmp/1.snap # Take note that once the filesystem is created, its current # generation has value 7 so the inode from the second snapshot has # a generation value of 7. And after receiving the first snapshot # the filesystem is at a generation value of 10, because the call to # create the second snapshot bumps the generation to 8 (the snapshot # creation ioctl does a transaction commit), the receive command calls # the snapshot creation ioctl to create the first snapshot, which bumps # the filesystem's generation to 9, and finally when the receive # operation finishes it calls an ioctl to transition the first snapshot # (snap1) from RW mode to RO mode, which does another transaction commit # and bumps the filesystem's generation to 10. $ rm -f /tmp/1.snap $ btrfs send /mnt/snap1 -f /tmp/1.snap $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ btrfs receive /mnt /tmp/1.snap # Receive of snapshot snap2 used to fail. $ btrfs receive /mnt /tmp/2.snap Signed-off-by: Robbie Ko Reviewed-by: Filipe Manana [Rewrote changelog to be more precise and clear] Signed-off-by: Filipe Manana --- fs/btrfs/send.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c index d145ce80462021..2ae32c4f7dd7e5 100644 --- a/fs/btrfs/send.c +++ b/fs/btrfs/send.c @@ -1681,6 +1681,9 @@ static int is_inode_existent(struct send_ctx *sctx, u64 ino, u64 gen) { int ret; + if (ino == BTRFS_FIRST_FREE_OBJECTID) + return 1; + ret = get_cur_inode_state(sctx, ino, gen); if (ret < 0) goto out; @@ -1866,7 +1869,7 @@ static int will_overwrite_ref(struct send_ctx *sctx, u64 dir, u64 dir_gen, * not deleted and then re-created, if it was then we have no overwrite * and we can just unlink this entry. */ - if (sctx->parent_root) { + if (sctx->parent_root && dir != BTRFS_FIRST_FREE_OBJECTID) { ret = get_inode_info(sctx->parent_root, dir, NULL, &gen, NULL, NULL, NULL, NULL); if (ret < 0 && ret != -ENOENT) From fe9c798dbf4c78322549068255f611e4038a5b28 Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Wed, 11 Jan 2017 02:15:39 +0000 Subject: [PATCH 02/11] Btrfs: incremental send, do not delay rename when parent inode is new When we are checking if we need to delay the rename operation for an inode we not checking if a parent inode that exists in the send and parent snapshots is really the same inode or not, that is, we are not comparing the generation number of the parent inode in the send and parent snapshots. Not only this results in unnecessarily delaying a rename operation but also can later on make us generate an incorrect name for a new inode in the send snapshot that has the same number as another inode in the parent snapshot but a different generation. Here follows an example where this happens. Parent snapshot: . (ino 256, gen 3) |--- dir258/ (ino 258, gen 7) | |--- dir257/ (ino 257, gen 7) | |--- dir259/ (ino 259, gen 7) Send snapshot: . (ino 256, gen 3) |--- file258 (ino 258, gen 10) | |--- new_dir259/ (ino 259, gen 10) |--- dir257/ (ino 257, gen 7) The following steps happen when computing the incremental send stream: 1) When processing inode 257, its new parent is created using its orphan name (o257-21-0), and the rename operation for inode 257 is delayed because its new parent (inode 259) was not yet processed - this decision to delay the rename operation does not make much sense because the inode 259 in the send snapshot is a new inode, it's not the same as inode 259 in the parent snapshot. 2) When processing inode 258 we end up delaying its rmdir operation, because inode 257 was not yet renamed (moved away from the directory inode 258 represents). We also create the new inode 258 using its orphan name "o258-10-0", then rename it to its final name of "file258" and then issue a truncate operation for it. However this truncate operation contains an incorrect name, which corresponds to the orphan name and not to the final name, which makes the receiver fail. This happens because when we attempt to compute the inode's current name we verify that there's another inode with the same number (258) that has its rmdir operation pending and because of that we generate an orphan name for the new inode 258 (we do this in the function get_cur_path()). Fix this by not delayed the rename operation of an inode if it has parents with the same number but different generations in both snapshots. The following steps reproduce this example scenario. $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/dir257 $ mkdir /mnt/dir258 $ mkdir /mnt/dir259 $ mv /mnt/dir257 /mnt/dir258/dir257 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ mv /mnt/dir258/dir257 /mnt/dir257 $ rmdir /mnt/dir258 $ rmdir /mnt/dir259 # Remount the filesystem so that the next created inodes will have the # numbers 258 and 259. This is because when a filesystem is mounted, # btrfs sets the subvolume's inode counter to a value corresponding to # the highest inode number in the subvolume plus 1. This inode counter # is used to assign a unique number to each new inode and it's # incremented by 1 after very inode creation. # Note: we unmount and then mount instead of doing a mount with # "-o remount" because otherwise the inode counter remains at value 260. $ umount /mnt $ mount /dev/sdb /mnt $ touch /mnt/file258 $ mkdir /mnt/new_dir259 $ mv /mnt/dir257 /mnt/new_dir259/dir257 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send /mnt/snap1 -f /tmp/1.snap $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ btrfs receive /mnt -f /tmo/1.snap $ btrfs receive /mnt -f /tmo/2.snap -vv receiving snapshot mysnap2 uuid=e059b6d1-7f55-f140-8d7c-9a3039d23c97, ctransid=10 parent_uuid=77e98cb6-8762-814f-9e05-e8ba877fc0b0, parent_ctransid=7 utimes mkdir o259-10-0 rename dir258 -> o258-7-0 utimes mkfile o258-10-0 rename o258-10-0 -> file258 utimes truncate o258-10-0 size=0 ERROR: truncate o258-10-0 failed: No such file or directory Reported-by: Robbie Ko Signed-off-by: Filipe Manana --- fs/btrfs/send.c | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c index 2ae32c4f7dd7e5..2742324514e4fd 100644 --- a/fs/btrfs/send.c +++ b/fs/btrfs/send.c @@ -3559,6 +3559,7 @@ static int wait_for_parent_move(struct send_ctx *sctx, { int ret = 0; u64 ino = parent_ref->dir; + u64 ino_gen = parent_ref->dir_gen; u64 parent_ino_before, parent_ino_after; struct fs_path *path_before = NULL; struct fs_path *path_after = NULL; @@ -3579,6 +3580,8 @@ static int wait_for_parent_move(struct send_ctx *sctx, * at get_cur_path()). */ while (ino > BTRFS_FIRST_FREE_OBJECTID) { + u64 parent_ino_after_gen; + if (is_waiting_for_move(sctx, ino)) { /* * If the current inode is an ancestor of ino in the @@ -3601,7 +3604,7 @@ static int wait_for_parent_move(struct send_ctx *sctx, fs_path_reset(path_after); ret = get_first_ref(sctx->send_root, ino, &parent_ino_after, - NULL, path_after); + &parent_ino_after_gen, path_after); if (ret < 0) goto out; ret = get_first_ref(sctx->parent_root, ino, &parent_ino_before, @@ -3618,10 +3621,20 @@ static int wait_for_parent_move(struct send_ctx *sctx, if (ino > sctx->cur_ino && (parent_ino_before != parent_ino_after || len1 != len2 || memcmp(path_before->start, path_after->start, len1))) { - ret = 1; - break; + u64 parent_ino_gen; + + ret = get_inode_info(sctx->parent_root, ino, NULL, + &parent_ino_gen, NULL, NULL, NULL, + NULL); + if (ret < 0) + goto out; + if (ino_gen == parent_ino_gen) { + ret = 1; + break; + } } ino = parent_ino_after; + ino_gen = parent_ino_after_gen; } out: From 0191410158840d6176de3267daa4604ad6a773fb Mon Sep 17 00:00:00 2001 From: Robbie Ko Date: Thu, 5 Jan 2017 16:24:58 +0800 Subject: [PATCH 03/11] Btrfs: incremental send, do not issue invalid rmdir operations When both the parent and send snapshots have a directory inode with the same number but different generations (therefore they are different inodes) and both have an entry with the same name, an incremental send stream will contain an invalid rmdir operation that refers to the orphanized name of the inode from the parent snapshot. The following example scenario shows how this happens. Parent snapshot: . |---- d259_old/ (ino 259, gen 9) | |---- d1/ (ino 258, gen 9) | |---- f (ino 257, gen 9) Send snapshot: . |---- d258/ (ino 258, gen 7) |---- d259/ (ino 259, gen 7) |---- d1/ (ino 257, gen 7) When the kernel is processing inode 258 it notices that in both snapshots there is an inode numbered 259 that is a parent of an inode 258. However it ignores the fact that the inodes numbered 259 have different generations in both snapshots, which means they are effectively different inodes. Then it checks that both inodes 259 have a dentry named "d1" and because of that it issues a rmdir operation with orphanized name of the inode 258 from the parent snapshot. This happens at send.c:process_record_refs(), which calls send.c:did_overwrite_first_ref() that returns true and because of that later on at process_recorded_refs() such rmdir operation is issued because the inode being currently processed (258) is a directory and it was deleted in the send snapshot (and replaced with another inode that has the same number and is a directory too). Fix this issue by comparing the generations of parent directory inodes that have the same number and make send.c:did_overwrite_first_ref() when the generations are different. The following steps reproduce the problem. $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ touch /mnt/f $ mkdir /mnt/d1 $ mkdir /mnt/d259_old $ mv /mnt/d1 /mnt/d259_old/d1 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/1.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ mkdir /mnt/d1 $ mkdir /mnt/dir258 $ mkdir /mnt/dir259 $ mv /mnt/d1 /mnt/dir259/d1 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs receive /mnt/ -f /tmp/1.snap # Take note that once the filesystem is created, its current # generation has value 7 so the inodes from the second snapshot all have # a generation value of 7. And after receiving the first snapshot # the filesystem is at a generation value of 10, because the call to # create the second snapshot bumps the generation to 8 (the snapshot # creation ioctl does a transaction commit), the receive command calls # the snapshot creation ioctl to create the first snapshot, which bumps # the filesystem's generation to 9, and finally when the receive # operation finishes it calls an ioctl to transition the first snapshot # (snap1) from RW mode to RO mode, which does another transaction commit # and bumps the filesystem's generation to 10. This means all the inodes # in the first snapshot (snap1) have a generation value of 9. $ rm -f /tmp/1.snap $ btrfs send /mnt/snap1 -f /tmp/1.snap $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ btrfs receive /mnt -f /tmp/1.snap $ btrfs receive -vv /mnt -f /tmp/2.snap receiving snapshot mysnap2 uuid=9c03962f-f620-0047-9f98-32e5a87116d9, ctransid=7 parent_uuid=d17a6e3f-14e5-df4f-be39-a7951a5399aa, parent_ctransid=9 utimes unlink f mkdir o257-7-0 mkdir o259-7-0 rename o257-7-0 -> o259-7-0/d1 chown o259-7-0/d1 - uid=0, gid=0 chmod o259-7-0/d1 - mode=0755 utimes o259-7-0/d1 rmdir o258-9-0 ERROR: rmdir o258-9-0 failed: No such file or directory Signed-off-by: Robbie Ko Reviewed-by: Filipe Manana [Rewrote changelog to be more precise and clear] Signed-off-by: Filipe Manana --- fs/btrfs/send.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c index 2742324514e4fd..712922ea64d278 100644 --- a/fs/btrfs/send.c +++ b/fs/btrfs/send.c @@ -1937,6 +1937,19 @@ static int did_overwrite_ref(struct send_ctx *sctx, if (ret <= 0) goto out; + if (dir != BTRFS_FIRST_FREE_OBJECTID) { + ret = get_inode_info(sctx->send_root, dir, NULL, &gen, NULL, + NULL, NULL, NULL); + if (ret < 0 && ret != -ENOENT) + goto out; + if (ret) { + ret = 0; + goto out; + } + if (gen != dir_gen) + goto out; + } + /* check if the ref was overwritten by another ref */ ret = lookup_dir_item_inode(sctx->send_root, dir, name, name_len, &ow_inode, &other_type); From 6f546216e9f9e95d6783547ce6113eb13e2daa54 Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Sat, 28 Jan 2017 01:47:56 +0000 Subject: [PATCH 04/11] Btrfs: bulk delete checksum items in the same leaf Very often we have the checksums for an extent spread in multiple items in the checksums tree, and currently the algorithm to delete them starts by looking for them one by one and then deleting them one by one, which is not optimal since each deletion involves shifting all the other items in the leaf and when the leaf reaches some low threshold, to move items off the leaf into its left and right neighbor leafs. Also, after each item deletion we release our search path and start a new search for other checksums items. So optimize this by deleting in bulk all the items in the same leaf that contain checksums for the extent being freed. Signed-off-by: Filipe Manana Reviewed-by: Liu Bo --- fs/btrfs/file-item.c | 28 +++++++++++++++++++++++++++- 1 file changed, 27 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c index f7b9a92ad56d17..e35df48b438387 100644 --- a/fs/btrfs/file-item.c +++ b/fs/btrfs/file-item.c @@ -643,7 +643,33 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans, /* delete the entire item, it is inside our range */ if (key.offset >= bytenr && csum_end <= end_byte) { - ret = btrfs_del_item(trans, root, path); + int del_nr = 1; + + /* + * Check how many csum items preceding this one in this + * leaf correspond to our range and then delete them all + * at once. + */ + if (key.offset > bytenr && path->slots[0] > 0) { + int slot = path->slots[0] - 1; + + while (slot >= 0) { + struct btrfs_key pk; + + btrfs_item_key_to_cpu(leaf, &pk, slot); + if (pk.offset < bytenr || + pk.type != BTRFS_EXTENT_CSUM_KEY || + pk.objectid != + BTRFS_EXTENT_CSUM_OBJECTID) + break; + path->slots[0] = slot; + del_nr++; + key.offset = pk.offset; + slot--; + } + } + ret = btrfs_del_items(trans, root, path, + path->slots[0], del_nr); if (ret) goto out; if (key.offset == bytenr) From 91e1f56a8b3c94cb5ac9ce12b806134dc33c1eeb Mon Sep 17 00:00:00 2001 From: Robbie Ko Date: Fri, 7 Oct 2016 10:01:29 +0800 Subject: [PATCH 05/11] Btrfs: fix leak of subvolume writers counter When falling back from a nocow write to a regular cow write, we were leaking the subvolume writers counter in 2 situations, preventing snapshot creation from ever completing in the future, as it waits for that counter to go down to zero before the snapshot creation starts. Signed-off-by: Robbie Ko Reviewed-by: Filipe Manana [Improved changelog and subject] Signed-off-by: Filipe Manana --- fs/btrfs/inode.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index c38391e948d97e..4efe9d82944c1b 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1331,10 +1331,16 @@ static noinline int run_delalloc_nocow(struct inode *inode, * either valid or do not exist. */ if (csum_exist_in_range(fs_info, disk_bytenr, - num_bytes)) + num_bytes)) { + if (!nolock) + btrfs_end_write_no_snapshoting(root); goto out_check; - if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr)) + } + if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr)) { + if (!nolock) + btrfs_end_write_no_snapshoting(root); goto out_check; + } nocow = 1; } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) { extent_end = found_key.offset + From 3168021cf9b4906f7bd9871770235f14c5a17715 Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Wed, 1 Feb 2017 14:58:02 +0000 Subject: [PATCH 06/11] Btrfs: do not create explicit holes when replaying log tree if NO_HOLES enabled We log holes explicitly by using file extent items, however when replaying a log tree, if a logged file extent item corresponds to a hole and the NO_HOLES feature is enabled we do not need to copy the file extent item into the fs/subvolume tree, as the absence of such file extent items is the purpose of the NO_HOLES feature. So skip the copying of file extent items representing holes when the NO_HOLES feature is enabled. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 3806853cde08d8..3cc972330ab34a 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -673,6 +673,10 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans, unsigned long dest_offset; struct btrfs_key ins; + if (btrfs_file_extent_disk_bytenr(eb, item) == 0 && + btrfs_fs_incompat(fs_info, NO_HOLES)) + goto update_inode; + ret = btrfs_insert_empty_item(trans, root, path, key, sizeof(*item)); if (ret) @@ -825,6 +829,7 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans, } inode_add_bytes(inode, nbytes); +update_inode: ret = btrfs_update_inode(trans, root, inode); out: if (inode) From 5cdd7db6c5c9d33dc66ecdd81aa8020276c7d140 Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Wed, 1 Feb 2017 22:39:50 +0000 Subject: [PATCH 07/11] Btrfs: fix assertion failure when freeing block groups at close_ctree() At close_ctree() we free the block groups and then only after we wait for any running worker kthreads to finish and shutdown the workqueues. This behaviour is racy and it triggers an assertion failure when freeing block groups because while we are doing it we can have for example a block group caching kthread running, and in that case the block group's reference count can still be greater than 1 by the time we assert its reference count is 1, leading to an assertion failure: [19041.198004] assertion failed: atomic_read(&block_group->count) == 1, file: fs/btrfs/extent-tree.c, line: 9799 [19041.200584] ------------[ cut here ]------------ [19041.201692] kernel BUG at fs/btrfs/ctree.h:3418! [19041.202830] invalid opcode: 0000 [#1] PREEMPT SMP [19041.203929] Modules linked in: btrfs xor raid6_pq dm_flakey dm_mod crc32c_generic ppdev sg psmouse acpi_cpufreq pcspkr parport_pc evdev tpm_tis parport tpm_tis_core i2c_piix4 i2c_core tpm serio_raw processor button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs] [19041.208082] CPU: 6 PID: 29051 Comm: umount Not tainted 4.9.0-rc7-btrfs-next-36+ #1 [19041.208082] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [19041.208082] task: ffff88015f028980 task.stack: ffffc9000ad34000 [19041.208082] RIP: 0010:[] [] assfail.constprop.41+0x1c/0x1e [btrfs] [19041.208082] RSP: 0018:ffffc9000ad37d60 EFLAGS: 00010286 [19041.208082] RAX: 0000000000000061 RBX: ffff88015ecb4000 RCX: 0000000000000001 [19041.208082] RDX: ffff88023f392fb8 RSI: ffffffff817ef7ba RDI: 00000000ffffffff [19041.208082] RBP: ffffc9000ad37d60 R08: 0000000000000001 R09: 0000000000000000 [19041.208082] R10: ffffc9000ad37cb0 R11: ffffffff82f2b66d R12: ffff88023431d170 [19041.208082] R13: ffff88015ecb40c0 R14: ffff88023431d000 R15: ffff88015ecb4100 [19041.208082] FS: 00007f44f3d42840(0000) GS:ffff88023f380000(0000) knlGS:0000000000000000 [19041.208082] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [19041.208082] CR2: 00007f65d623b000 CR3: 00000002166f2000 CR4: 00000000000006e0 [19041.208082] Stack: [19041.208082] ffffc9000ad37d98 ffffffffa035989f ffff88015ecb4000 ffff88015ecb5630 [19041.208082] ffff88014f6be000 0000000000000000 00007ffcf0ba6a10 ffffc9000ad37df8 [19041.208082] ffffffffa0368cd4 ffff88014e9658e0 ffffc9000ad37e08 ffffffff811a634d [19041.208082] Call Trace: [19041.208082] [] btrfs_free_block_groups+0x17f/0x392 [btrfs] [19041.208082] [] close_ctree+0x1c5/0x2e1 [btrfs] [19041.208082] [] ? evict_inodes+0x132/0x141 [19041.208082] [] btrfs_put_super+0x15/0x17 [btrfs] [19041.208082] [] generic_shutdown_super+0x6a/0xeb [19041.208082] [] kill_anon_super+0x12/0x1c [19041.208082] [] btrfs_kill_super+0x16/0x21 [btrfs] [19041.208082] [] deactivate_locked_super+0x3b/0x68 [19041.208082] [] deactivate_super+0x36/0x39 [19041.208082] [] cleanup_mnt+0x58/0x76 [19041.208082] [] __cleanup_mnt+0x12/0x14 [19041.208082] [] task_work_run+0x6f/0x95 [19041.208082] [] prepare_exit_to_usermode+0xa3/0xc1 [19041.208082] [] syscall_return_slowpath+0x16e/0x1d2 [19041.208082] [] entry_SYSCALL_64_fastpath+0xab/0xad [19041.208082] Code: c7 ae a0 3e a0 48 89 e5 e8 4e 74 d4 e0 0f 0b 55 89 f1 48 c7 c2 0b a4 3e a0 48 89 fe 48 c7 c7 a4 a6 3e a0 48 89 e5 e8 30 74 d4 e0 <0f> 0b 55 31 d2 48 89 e5 e8 d5 b9 f7 ff 5d c3 48 63 f6 55 31 c9 [19041.208082] RIP [] assfail.constprop.41+0x1c/0x1e [btrfs] [19041.208082] RSP [19041.279264] ---[ end trace 23330586f16f064d ]--- This started happening as of kernel 4.8, since commit f3bca8028bd9 ("Btrfs: add ASSERT for block group's memory leak") introduced these assertions. So fix this by freeing the block groups only after waiting for all worker kthreads to complete and shutdown the workqueues. Signed-off-by: Filipe Manana Reviewed-by: Liu Bo --- fs/btrfs/disk-io.c | 6 +++--- fs/btrfs/extent-tree.c | 9 ++++++--- 2 files changed, 9 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 2b06f557c176ea..d0e61d58b12190 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3261,7 +3261,6 @@ int open_ctree(struct super_block *sb, fail_block_groups: btrfs_put_block_group_cache(fs_info); - btrfs_free_block_groups(fs_info); fail_tree_roots: free_root_pointers(fs_info, 1); @@ -3269,6 +3268,7 @@ int open_ctree(struct super_block *sb, fail_sb_buffer: btrfs_stop_all_workers(fs_info); + btrfs_free_block_groups(fs_info); fail_alloc: fail_iput: btrfs_mapping_tree_free(&fs_info->mapping_tree); @@ -3977,8 +3977,6 @@ void close_ctree(struct btrfs_fs_info *fs_info) btrfs_put_block_group_cache(fs_info); - btrfs_free_block_groups(fs_info); - /* * we must make sure there is not any read request to * submit after we stopping all workers. @@ -3986,6 +3984,8 @@ void close_ctree(struct btrfs_fs_info *fs_info) invalidate_inode_pages2(fs_info->btree_inode->i_mapping); btrfs_stop_all_workers(fs_info); + btrfs_free_block_groups(fs_info); + clear_bit(BTRFS_FS_OPEN, &fs_info->flags); free_root_pointers(fs_info, 1); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index c35b966335543c..438c7312de3381 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -9740,6 +9740,11 @@ void btrfs_put_block_group_cache(struct btrfs_fs_info *info) } } +/* + * Must be called only after stopping all workers, since we could have block + * group caching kthreads running, and therefore they could race with us if we + * freed the block groups before stopping them. + */ int btrfs_free_block_groups(struct btrfs_fs_info *info) { struct btrfs_block_group_cache *block_group; @@ -9779,9 +9784,6 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info) list_del(&block_group->list); up_write(&block_group->space_info->groups_sem); - if (block_group->cached == BTRFS_CACHE_STARTED) - wait_block_group_cache_done(block_group); - /* * We haven't cached this block group, which means we could * possibly have excluded extents on this block group. @@ -9791,6 +9793,7 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info) free_excluded_extents(info, block_group); btrfs_remove_free_space_cache(block_group); + ASSERT(block_group->cached != BTRFS_CACHE_STARTED); ASSERT(list_empty(&block_group->dirty_list)); ASSERT(list_empty(&block_group->io_list)); ASSERT(list_empty(&block_group->bg_list)); From a9b9477db2937934e469db800317ec3ef7e81b51 Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Sat, 4 Feb 2017 17:12:00 +0000 Subject: [PATCH 08/11] Btrfs: fix use-after-free due to wrong order of destroying work queues Before we destroy all work queues (and wait for their tasks to complete) we were destroying the work queues used for metadata I/O operations, which can result in a use-after-free problem because most tasks from all work queues do metadata I/O operations. For example, the tasks from the caching workers work queue (fs_info->caching_workers), which is destroyed only after the work queue used for metadata reads (fs_info->endio_meta_workers) is destroyed, do metadata reads, which result in attempts to queue tasks into the later work queue, triggering a use-after-free with a trace like the following: [23114.613543] general protection fault: 0000 [#1] PREEMPT SMP [23114.614442] Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c btrfs xor raid6_pq dm_flakey dm_mod crc32c_generic acpi_cpufreq tpm_tis tpm_tis_core tpm ppdev parport_pc parport i2c_piix4 processor sg evdev i2c_core psmouse pcspkr serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod floppy [last unloaded: scsi_debug] [23114.616932] CPU: 9 PID: 4537 Comm: kworker/u32:8 Not tainted 4.9.0-rc7-btrfs-next-36+ #1 [23114.616932] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [23114.616932] Workqueue: btrfs-cache btrfs_cache_helper [btrfs] [23114.616932] task: ffff880221d45780 task.stack: ffffc9000bc50000 [23114.616932] RIP: 0010:[] [] btrfs_queue_work+0x2c/0x190 [btrfs] [23114.616932] RSP: 0018:ffff88023f443d60 EFLAGS: 00010246 [23114.616932] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000102 [23114.616932] RDX: ffffffffa0419000 RSI: ffff88011df534f0 RDI: ffff880101f01c00 [23114.616932] RBP: ffff88023f443d80 R08: 00000000000f7000 R09: 000000000000ffff [23114.616932] R10: ffff88023f443d48 R11: 0000000000001000 R12: ffff88011df534f0 [23114.616932] R13: ffff880135963868 R14: 0000000000001000 R15: 0000000000001000 [23114.616932] FS: 0000000000000000(0000) GS:ffff88023f440000(0000) knlGS:0000000000000000 [23114.616932] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [23114.616932] CR2: 00007f0fb9f8e520 CR3: 0000000001a0b000 CR4: 00000000000006e0 [23114.616932] Stack: [23114.616932] ffff880101f01c00 ffff88011df534f0 ffff880135963868 0000000000001000 [23114.616932] ffff88023f443da0 ffffffffa03470af ffff880149b37200 ffff880135963868 [23114.616932] ffff88023f443db8 ffffffff8125293c ffff880149b37200 ffff88023f443de0 [23114.616932] Call Trace: [23114.616932] [23114.616932] [] end_workqueue_bio+0xd5/0xda [btrfs] [23114.616932] [] bio_endio+0x54/0x57 [23114.616932] [] btrfs_end_bio+0xf7/0x106 [btrfs] [23114.616932] [] bio_endio+0x54/0x57 [23114.616932] [] blk_update_request+0x21a/0x30f [23114.616932] [] scsi_end_request+0x31/0x182 [scsi_mod] [23114.616932] [] scsi_io_completion+0x1ce/0x4c8 [scsi_mod] [23114.616932] [] scsi_finish_command+0x104/0x10d [scsi_mod] [23114.616932] [] scsi_softirq_done+0x101/0x10a [scsi_mod] [23114.616932] [] blk_done_softirq+0x82/0x8d [23114.616932] [] __do_softirq+0x1ab/0x412 [23114.616932] [] irq_exit+0x49/0x99 [23114.616932] [] smp_call_function_single_interrupt+0x24/0x26 [23114.616932] [] call_function_single_interrupt+0x89/0x90 [23114.616932] [23114.616932] [] ? scsi_request_fn+0x13a/0x2a1 [scsi_mod] [23114.616932] [] ? _raw_spin_unlock_irq+0x2c/0x4a [23114.616932] [] ? _raw_spin_unlock_irq+0x32/0x4a [23114.616932] [] ? _raw_spin_unlock_irq+0x2c/0x4a [23114.616932] [] scsi_request_fn+0x13a/0x2a1 [scsi_mod] [23114.616932] [] __blk_run_queue_uncond+0x22/0x2b [23114.616932] [] __blk_run_queue+0x19/0x1b [23114.616932] [] blk_queue_bio+0x268/0x282 [23114.616932] [] generic_make_request+0xbd/0x160 [23114.616932] [] submit_bio+0x100/0x11d [23114.616932] [] ? __this_cpu_preempt_check+0x13/0x15 [23114.616932] [] ? __percpu_counter_add+0x8e/0xa7 [23114.616932] [] btrfsic_submit_bio+0x1a/0x1d [btrfs] [23114.616932] [] btrfs_map_bio+0x1f4/0x26d [btrfs] [23114.616932] [] btree_submit_bio_hook+0x74/0xbf [btrfs] [23114.616932] [] ? btrfs_wq_submit_bio+0x160/0x160 [btrfs] [23114.616932] [] submit_one_bio+0x6b/0x89 [btrfs] [23114.616932] [] read_extent_buffer_pages+0x170/0x1ec [btrfs] [23114.616932] [] ? free_root_pointers+0x64/0x64 [btrfs] [23114.616932] [] readahead_tree_block+0x3f/0x4c [btrfs] [23114.616932] [] read_block_for_search.isra.20+0x1ce/0x23d [btrfs] [23114.616932] [] btrfs_search_slot+0x65f/0x774 [btrfs] [23114.616932] [] ? free_extent_buffer+0x73/0x7e [btrfs] [23114.616932] [] btrfs_next_old_leaf+0xa1/0x33c [btrfs] [23114.616932] [] btrfs_next_leaf+0x10/0x12 [btrfs] [23114.616932] [] caching_thread+0x22d/0x416 [btrfs] [23114.616932] [] btrfs_scrubparity_helper+0x187/0x3b6 [btrfs] [23114.616932] [] btrfs_cache_helper+0xe/0x10 [btrfs] [23114.616932] [] process_one_work+0x273/0x4e4 [23114.616932] [] worker_thread+0x1eb/0x2ca [23114.616932] [] ? rescuer_thread+0x2b6/0x2b6 [23114.616932] [] kthread+0xd5/0xdd [23114.616932] [] ? __kthread_unpark+0x5a/0x5a [23114.616932] [] ret_from_fork+0x27/0x40 [23114.616932] Code: 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 53 49 89 f4 48 8b 46 70 a8 04 74 09 48 8b 5f 08 48 85 db 75 03 48 8b 1f 49 89 5c 24 68 <83> 7b 64 ff 74 04 f0 ff 43 58 49 83 7c 24 08 00 74 2c 4c 8d 6b [23114.616932] RIP [] btrfs_queue_work+0x2c/0x190 [btrfs] [23114.616932] RSP [23114.689493] ---[ end trace 6e48b6bc707ca34b ]--- [23114.690166] Kernel panic - not syncing: Fatal exception in interrupt [23114.691283] Kernel Offset: disabled [23114.691918] ---[ end Kernel panic - not syncing: Fatal exception in interrupt The following diagram shows the sequence of operations that lead to the use-after-free problem from the above trace: CPU 1 CPU 2 CPU 3 caching_thread() close_ctree() btrfs_stop_all_workers() btrfs_destroy_workqueue( fs_info->endio_meta_workers) btrfs_search_slot() read_block_for_search() readahead_tree_block() read_extent_buffer_pages() submit_one_bio() btree_submit_bio_hook() btrfs_bio_wq_end_io() --> sets the bio's bi_end_io callback to end_workqueue_bio() --> bio is submitted bio completes and its bi_end_io callback is invoked --> end_workqueue_bio() --> attempts to queue a task on fs_info->endio_meta_workers btrfs_destroy_workqueue( fs_info->caching_workers) So fix this by destroying the queues used for metadata I/O tasks only after destroying all the other queues. Signed-off-by: Filipe Manana Reviewed-by: Liu Bo --- fs/btrfs/disk-io.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index d0e61d58b12190..32a9ec11888d7b 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2205,11 +2205,9 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info) btrfs_destroy_workqueue(fs_info->delalloc_workers); btrfs_destroy_workqueue(fs_info->workers); btrfs_destroy_workqueue(fs_info->endio_workers); - btrfs_destroy_workqueue(fs_info->endio_meta_workers); btrfs_destroy_workqueue(fs_info->endio_raid56_workers); btrfs_destroy_workqueue(fs_info->endio_repair_workers); btrfs_destroy_workqueue(fs_info->rmw_workers); - btrfs_destroy_workqueue(fs_info->endio_meta_write_workers); btrfs_destroy_workqueue(fs_info->endio_write_workers); btrfs_destroy_workqueue(fs_info->endio_freespace_worker); btrfs_destroy_workqueue(fs_info->submit_workers); @@ -2219,6 +2217,13 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info) btrfs_destroy_workqueue(fs_info->flush_workers); btrfs_destroy_workqueue(fs_info->qgroup_rescan_workers); btrfs_destroy_workqueue(fs_info->extent_workers); + /* + * Now that all other work queues are destroyed, we can safely destroy + * the queues used for metadata I/O, since tasks from those other work + * queues can do metadata I/O operations. + */ + btrfs_destroy_workqueue(fs_info->endio_meta_workers); + btrfs_destroy_workqueue(fs_info->endio_meta_write_workers); } static void free_root_extent_buffers(struct btrfs_root *root) From 82bfb2e7b645c8f228dc3b6d3b27b0b10125ca4f Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Tue, 14 Feb 2017 17:56:32 +0000 Subject: [PATCH 09/11] Btrfs: incremental send, fix unnecessary hole writes for sparse files When using the NO_HOLES feature, during an incremental send we often issue write operations for holes when we should not, because that range is already a hole in the destination snapshot. While that does not change the contents of the file at the receiver, it avoids preservation of file holes, leading to wasted disk space and extra IO during send/receive. A couple examples where the holes are not preserved follows. $ mkfs.btrfs -O no-holes -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0xaa 0 4K" /mnt/foo $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "pwrite -S 0xbb 1028K 4K" /mnt/bar $ btrfs subvolume snapshot -r /mnt /mnt/snap1 # Now add one new extent to our first test file, increasing its size and # leaving a 1Mb hole between the first extent and this new extent. $ xfs_io -c "pwrite -S 0xbb 1028K 4K" /mnt/foo # Now overwrite the last extent of our second test file. $ xfs_io -c "pwrite -S 0xcc 1028K 4K" /mnt/bar $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ xfs_io -r -c "fiemap -v" /mnt/snap2/foo /mnt/snap2/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 25088..25095 8 0x2000 1: [8..2055]: hole 2048 2: [2056..2063]: 24576..24583 8 0x2001 $ xfs_io -r -c "fiemap -v" /mnt/snap2/bar /mnt/snap2/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 25096..25103 8 0x2000 1: [8..2055]: hole 2048 2: [2056..2063]: 24584..24591 8 0x2001 $ btrfs send /mnt/snap1 -f /tmp/1.snap $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap $ umount /mnt # It's not relevant to enable no-holes in the new filesystem. $ mkfs.btrfs -O no-holes -f /dev/sdc $ mount /dev/sdc /mnt $ btrfs receive /mnt -f /tmp/1.snap $ btrfs receive /mnt -f /tmp/2.snap $ xfs_io -r -c "fiemap -v" /mnt/snap2/foo /mnt/snap2/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 24576..24583 8 0x2000 1: [8..2063]: 25624..27679 2056 0x1 $ xfs_io -r -c "fiemap -v" /mnt/snap2/bar /mnt/snap2/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 24584..24591 8 0x2000 1: [8..2063]: 27680..29735 2056 0x1 The holes do not exist in the second filesystem and they were replaced with extents filled with the byte 0x00, making each file take 1032Kb of space instead of 8Kb. So fix this by not issuing the write operations consisting of buffers filled with the byte 0x00 when the destination snapshot already has a hole for the respective range. A test case for fstests will follow soon. Signed-off-by: Filipe Manana --- fs/btrfs/send.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 86 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c index 712922ea64d278..456c8901489b6c 100644 --- a/fs/btrfs/send.c +++ b/fs/btrfs/send.c @@ -5306,6 +5306,81 @@ static int get_last_extent(struct send_ctx *sctx, u64 offset) return ret; } +static int range_is_hole_in_parent(struct send_ctx *sctx, + const u64 start, + const u64 end) +{ + struct btrfs_path *path; + struct btrfs_key key; + struct btrfs_root *root = sctx->parent_root; + u64 search_start = start; + int ret; + + path = alloc_path_for_send(); + if (!path) + return -ENOMEM; + + key.objectid = sctx->cur_ino; + key.type = BTRFS_EXTENT_DATA_KEY; + key.offset = search_start; + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); + if (ret < 0) + goto out; + if (ret > 0 && path->slots[0] > 0) + path->slots[0]--; + + while (search_start < end) { + struct extent_buffer *leaf = path->nodes[0]; + int slot = path->slots[0]; + struct btrfs_file_extent_item *fi; + u64 extent_end; + + if (slot >= btrfs_header_nritems(leaf)) { + ret = btrfs_next_leaf(root, path); + if (ret < 0) + goto out; + else if (ret > 0) + break; + continue; + } + + btrfs_item_key_to_cpu(leaf, &key, slot); + if (key.objectid < sctx->cur_ino || + key.type < BTRFS_EXTENT_DATA_KEY) + goto next; + if (key.objectid > sctx->cur_ino || + key.type > BTRFS_EXTENT_DATA_KEY || + key.offset >= end) + break; + + fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item); + if (btrfs_file_extent_type(leaf, fi) == + BTRFS_FILE_EXTENT_INLINE) { + u64 size = btrfs_file_extent_inline_len(leaf, slot, fi); + + extent_end = ALIGN(key.offset + size, + root->fs_info->sectorsize); + } else { + extent_end = key.offset + + btrfs_file_extent_num_bytes(leaf, fi); + } + if (extent_end <= start) + goto next; + if (btrfs_file_extent_disk_bytenr(leaf, fi) == 0) { + search_start = extent_end; + goto next; + } + ret = 0; + goto out; +next: + path->slots[0]++; + } + ret = 1; +out: + btrfs_free_path(path); + return ret; +} + static int maybe_send_hole(struct send_ctx *sctx, struct btrfs_path *path, struct btrfs_key *key) { @@ -5350,8 +5425,17 @@ static int maybe_send_hole(struct send_ctx *sctx, struct btrfs_path *path, return ret; } - if (sctx->cur_inode_last_extent < key->offset) - ret = send_hole(sctx, key->offset); + if (sctx->cur_inode_last_extent < key->offset) { + ret = range_is_hole_in_parent(sctx, + sctx->cur_inode_last_extent, + key->offset); + if (ret < 0) + return ret; + else if (ret == 0) + ret = send_hole(sctx, key->offset); + else + ret = 0; + } sctx->cur_inode_last_extent = extent_end; return ret; } From 76b42abbf7488121c4f9f1ea5941123306e25d99 Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Tue, 14 Feb 2017 16:56:01 +0000 Subject: [PATCH 10/11] Btrfs: fix data loss after truncate when using the no-holes feature If we have a file with an implicit hole (NO_HOLES feature enabled) that has an extent following the hole, delayed writes against regions of the file behind the hole happened before but were not yet flushed and then we truncate the file to a smaller size that lies inside the hole, we end up persisting a wrong disk_i_size value for our inode that leads to data loss after umounting and mounting again the filesystem or after the inode is evicted and loaded again. This happens because at inode.c:btrfs_truncate_inode_items() we end up setting last_size to the offset of the extent that we deleted and that followed the hole. We then pass that value to btrfs_ordered_update_i_size() which updates the inode's disk_i_size to a value smaller then the offset of the buffered (delayed) writes. Example reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0x01 0K 32K" /mnt/foo $ xfs_io -d -c "pwrite -S 0x02 -b 32K 64K 32K" /mnt/foo $ xfs_io -c "truncate 60K" /mnt/foo --> inode's disk_i_size updated to 0 $ md5sum /mnt/foo 3c5ca3c3ab42f4b04d7e7eb0b0d4d806 /mnt/foo $ umount /dev/sdb $ mount /dev/sdb /mnt $ md5sum /mnt/foo d41d8cd98f00b204e9800998ecf8427e /mnt/foo --> Empty file, all data lost! Cc: # 3.14+ Fixes: 16e7549f045d ("Btrfs: incompatible format change to remove hole extents") Signed-off-by: Filipe Manana Reviewed-by: Liu Bo --- fs/btrfs/inode.c | 19 ++++++------------- 1 file changed, 6 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 4efe9d82944c1b..70df4519242406 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4418,19 +4418,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, if (found_type > min_type) { del_item = 1; } else { - if (item_end < new_size) { - /* - * With NO_HOLES mode, for the following mapping - * - * [0-4k][hole][8k-12k] - * - * if truncating isize down to 6k, it ends up - * isize being 8k. - */ - if (btrfs_fs_incompat(root->fs_info, NO_HOLES)) - last_size = new_size; + if (item_end < new_size) break; - } if (found_key.offset >= new_size) del_item = 1; else @@ -4613,8 +4602,12 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, btrfs_abort_transaction(trans, ret); } error: - if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) + if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) { + ASSERT(last_size >= new_size); + if (!err && last_size > new_size) + last_size = new_size; btrfs_ordered_update_i_size(inode, last_size, NULL); + } btrfs_free_path(path); From 263d3995c93c6020576f6c93506412a0b9d1e932 Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Fri, 17 Feb 2017 18:43:57 +0000 Subject: [PATCH 11/11] Btrfs: try harder to migrate items to left sibling before splitting a leaf Before attempting to split a leaf we try to migrate items from the leaf to its right and left siblings. We start by trying to move items into the rigth sibling and, if the new item is meant to be inserted at the end of our leaf, we try to free from our leaf an amount of bytes equal to the number of bytes used by the new item, by setting the variable space_needed to the byte size of that new item. However if we fail to move enough items to the right sibling due to lack of space in that sibling, we then try to move items into the left sibling, and in that case we try to free an amount equal to the size of the new item from our leaf, when we need only to free an amount corresponding to the size of the new item minus the current free space of our leaf. So make sure that before we try to move items to the left sibling we do set the variable space_needed with a value corresponding to the new item's size minus the leaf's current free space. Signed-off-by: Filipe Manana Reviewed-by: Liu Bo --- fs/btrfs/ctree.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 1192bc7d2ee782..1fb60ee77b4a50 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -4159,6 +4159,9 @@ static noinline int push_for_double_split(struct btrfs_trans_handle *trans, /* try to push all the items before our slot into the next leaf */ slot = path->slots[0]; + space_needed = data_size; + if (slot > 0) + space_needed -= btrfs_leaf_free_space(fs_info, path->nodes[0]); ret = push_leaf_left(trans, root, path, 1, space_needed, 0, slot); if (ret < 0) return ret; @@ -4214,6 +4217,10 @@ static noinline int split_leaf(struct btrfs_trans_handle *trans, if (wret < 0) return wret; if (wret) { + space_needed = data_size; + if (slot > 0) + space_needed -= btrfs_leaf_free_space(fs_info, + l); wret = push_leaf_left(trans, root, path, space_needed, space_needed, 0, (u32)-1); if (wret < 0)