Skip to content

Commit

Permalink
Merge tag 'xfs-5.12-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/x…
Browse files Browse the repository at this point in the history
…fs-linux

Pull xfs updates from Darrick Wong:
 "There's a lot going on this time, which seems about right for this
  drama-filled year.

  Community developers added some code to speed up freezing when
  read-only workloads are still running, refactored the logging code,
  added checks to prevent file extent counter overflow, reduced iolock
  cycling to speed up fsync and gc scans, and started the slow march
  towards supporting filesystem shrinking.

  There's a huge refactoring of the internal speculative preallocation
  garbage collection code which fixes a bunch of bugs, makes the gc
  scheduling per-AG and hence multithreaded, and standardizes the retry
  logic when we try to reserve space or quota, can't, and want to
  trigger a gc scan. We also enable multithreaded quotacheck to reduce
  mount times further. This is also preparation for background file gc,
  which may or may not land for 5.13.

  We also fixed some deadlocks in the rename code, fixed a quota
  accounting leak when FSSETXATTR fails, restored the behavior that
  write faults to an mmap'd region actually cause a SIGBUS, fixed a bug
  where sgid directory inheritance wasn't quite working properly, and
  fixed a bug where symlinks weren't working properly in ecryptfs. We
  also now advertise the inode btree counters feature that was
  introduced two cycles ago.

  Summary:

   - Fix an ABBA deadlock when renaming files on overlayfs.

   - Make sure that we can't overflow the inode extent counters when
     adding to or removing extents from a file.

   - Make directory sgid inheritance work the same way as all the other
     filesystems.

   - Don't drain the buffer cache on freeze and ro remount, which should
     reduce the amount of time if read-only workloads are continuing
     during the freeze.

   - Fix a bug where symlink size isn't reported to the vfs in ecryptfs.

   - Disentangle log cleaning from log covering. This refactoring sets
     us up for future changes to the log, though for now it simply means
     that we can use covering for freezes, and cleaning becomes
     something we only do at unmount.

   - Speed up file fsyncs by reducing iolock cycling.

   - Fix delalloc blocks leaking when changing the project id fails
     because of input validation errors in FSSETXATTR.

   - Fix oversized quota reservation when converting unwritten extents
     during a DAX write.

   - Create a transaction allocation helper function to standardize the
     idiom of allocating a transaction, reserving blocks, locking
     inodes, and reserving quota. Replace all the open-coded logic for
     file creation, file ownership changes, and file modifications to
     use them.

   - Actually shut down the fs if the incore quota reservations get
     corrupted.

   - Fix background block garbage collection scans to not block and to
     actually clean out CoW staging extents properly.

   - Run block gc scans when we run low on project quota.

   - Use the standardized transaction allocation helpers to make it so
     that ENOSPC and EDQUOT errors during reservation will back out,
     invoke the block gc scanner, and try again. This is preparation for
     introducing background inode garbage collection in the next cycle.

   - Combine speculative post-EOF block garbage collection with
     speculative copy on write block garbage collection.

   - Enable multithreaded quotacheck.

   - Allow sysadmins to tweak the CPU affinities and maximum concurrency
     levels of quotacheck and background blockgc worker pools.

   - Expose the inode btree counter feature in the fs geometry ioctl.

   - Cleanups of the growfs code in preparation for starting work on
     filesystem shrinking.

   - Fix all the bloody gcc warnings that the maintainer knows about. :P

   - Fix a RST syntax error.

   - Don't trigger bmbt corruption assertions after the fs shuts down.

   - Restore behavior of forcing SIGBUS on a shut down filesystem when
     someone triggers a mmap write fault (or really, any buffered
     write)"

* tag 'xfs-5.12-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (85 commits)
  xfs: consider shutdown in bmapbt cursor delete assert
  xfs: fix boolreturn.cocci warnings
  xfs: restore shutdown check in mapped write fault path
  xfs: fix rst syntax error in admin guide
  xfs: fix incorrect root dquot corruption error when switching group/project quota types
  xfs: get rid of xfs_growfs_{data,log}_t
  xfs: rename `new' to `delta' in xfs_growfs_data_private()
  libxfs: expose inobtcount in xfs geometry
  xfs: don't bounce the iolock between free_{eof,cow}blocks
  xfs: expose the blockgc workqueue knobs publicly
  xfs: parallelize block preallocation garbage collection
  xfs: rename block gc start and stop functions
  xfs: only walk the incore inode tree once per blockgc scan
  xfs: consolidate the eofblocks and cowblocks workers
  xfs: consolidate incore inode radix tree posteof/cowblocks tags
  xfs: remove trivial eof/cowblocks functions
  xfs: hide xfs_icache_free_cowblocks
  xfs: hide xfs_icache_free_eofblocks
  xfs: relocate the eofb/cowb workqueue functions
  xfs: set WQ_SYSFS on all workqueues in debug mode
  ...
  • Loading branch information
torvalds committed Feb 21, 2021
2 parents 4f016a3 + 1cd738b commit b52bb13
Show file tree
Hide file tree
Showing 53 changed files with 1,650 additions and 978 deletions.
42 changes: 42 additions & 0 deletions Documentation/admin-guide/xfs.rst
Original file line number Diff line number Diff line change
Expand Up @@ -495,3 +495,45 @@ the class and error context. For example, the default values for
"metadata/ENODEV" are "0" rather than "-1" so that this error handler defaults
to "fail immediately" behaviour. This is done because ENODEV is a fatal,
unrecoverable error no matter how many times the metadata IO is retried.

Workqueue Concurrency
=====================

XFS uses kernel workqueues to parallelize metadata update processes. This
enables it to take advantage of storage hardware that can service many IO
operations simultaneously. This interface exposes internal implementation
details of XFS, and as such is explicitly not part of any userspace API/ABI
guarantee the kernel may give userspace. These are undocumented features of
the generic workqueue implementation XFS uses for concurrency, and they are
provided here purely for diagnostic and tuning purposes and may change at any
time in the future.

The control knobs for a filesystem's workqueues are organized by task at hand
and the short name of the data device. They all can be found in:

/sys/bus/workqueue/devices/${task}!${device}

================ ===========
Task Description
================ ===========
xfs_iwalk-$pid Inode scans of the entire filesystem. Currently limited to
mount time quotacheck.
xfs-blockgc Background garbage collection of disk space that have been
speculatively allocated beyond EOF or for staging copy on
write operations.
================ ===========

For example, the knobs for the quotacheck workqueue for /dev/nvme0n1 would be
found in /sys/bus/workqueue/devices/xfs_iwalk-1111!nvme0n1/.

The interesting knobs for XFS workqueues are as follows:

============ ===========
Knob Description
============ ===========
max_active Maximum number of background threads that can be started to
run the work.
cpumask CPUs upon which the threads are allowed to run.
nice Relative priority of scheduling the threads. These are the
same nice levels that can be applied to userspace processes.
============ ===========
50 changes: 50 additions & 0 deletions fs/xfs/libxfs/xfs_alloc.c
Original file line number Diff line number Diff line change
Expand Up @@ -2474,6 +2474,47 @@ xfs_defer_agfl_block(
xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_AGFL_FREE, &new->xefi_list);
}

#ifdef DEBUG
/*
* Check if an AGF has a free extent record whose length is equal to
* args->minlen.
*/
STATIC int
xfs_exact_minlen_extent_available(
struct xfs_alloc_arg *args,
struct xfs_buf *agbp,
int *stat)
{
struct xfs_btree_cur *cnt_cur;
xfs_agblock_t fbno;
xfs_extlen_t flen;
int error = 0;

cnt_cur = xfs_allocbt_init_cursor(args->mp, args->tp, agbp,
args->agno, XFS_BTNUM_CNT);
error = xfs_alloc_lookup_ge(cnt_cur, 0, args->minlen, stat);
if (error)
goto out;

if (*stat == 0) {
error = -EFSCORRUPTED;
goto out;
}

error = xfs_alloc_get_rec(cnt_cur, &fbno, &flen, stat);
if (error)
goto out;

if (*stat == 1 && flen != args->minlen)
*stat = 0;

out:
xfs_btree_del_cursor(cnt_cur, error);

return error;
}
#endif

/*
* Decide whether to use this allocation group for this allocation.
* If so, fix up the btree freelist's size.
Expand Down Expand Up @@ -2545,6 +2586,15 @@ xfs_alloc_fix_freelist(
if (!xfs_alloc_space_available(args, need, flags))
goto out_agbp_relse;

#ifdef DEBUG
if (args->alloc_minlen_only) {
int stat;

error = xfs_exact_minlen_extent_available(args, agbp, &stat);
if (error || !stat)
goto out_agbp_relse;
}
#endif
/*
* Make the freelist shorter if it's too long.
*
Expand Down
3 changes: 3 additions & 0 deletions fs/xfs/libxfs/xfs_alloc.h
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,9 @@ typedef struct xfs_alloc_arg {
char wasfromfl; /* set if allocation is from freelist */
struct xfs_owner_info oinfo; /* owner of blocks being allocated */
enum xfs_ag_resv_type resv; /* block reservation to use */
#ifdef DEBUG
bool alloc_minlen_only; /* allocate exact minlen extent */
#endif
} xfs_alloc_arg_t;

/*
Expand Down
22 changes: 11 additions & 11 deletions fs/xfs/libxfs/xfs_attr.c
Original file line number Diff line number Diff line change
Expand Up @@ -396,6 +396,7 @@ xfs_attr_set(
struct xfs_trans_res tres;
bool rsvd = (args->attr_filter & XFS_ATTR_ROOT);
int error, local;
int rmt_blks = 0;
unsigned int total;

if (XFS_FORCED_SHUTDOWN(dp->i_mount))
Expand Down Expand Up @@ -442,34 +443,33 @@ xfs_attr_set(
tres.tr_logcount = XFS_ATTRSET_LOG_COUNT;
tres.tr_logflags = XFS_TRANS_PERM_LOG_RES;
total = args->total;

if (!local)
rmt_blks = xfs_attr3_rmt_blocks(mp, args->valuelen);
} else {
XFS_STATS_INC(mp, xs_attr_remove);

tres = M_RES(mp)->tr_attrrm;
total = XFS_ATTRRM_SPACE_RES(mp);
rmt_blks = xfs_attr3_rmt_blocks(mp, XFS_XATTR_SIZE_MAX);
}

/*
* Root fork attributes can use reserved data blocks for this
* operation if necessary
*/
error = xfs_trans_alloc(mp, &tres, total, 0,
rsvd ? XFS_TRANS_RESERVE : 0, &args->trans);
error = xfs_trans_alloc_inode(dp, &tres, total, 0, rsvd, &args->trans);
if (error)
return error;

xfs_ilock(dp, XFS_ILOCK_EXCL);
xfs_trans_ijoin(args->trans, dp, 0);
if (args->value) {
unsigned int quota_flags = XFS_QMOPT_RES_REGBLKS;

if (rsvd)
quota_flags |= XFS_QMOPT_FORCE_RES;
error = xfs_trans_reserve_quota_nblks(args->trans, dp,
args->total, 0, quota_flags);
if (args->value || xfs_inode_hasattr(dp)) {
error = xfs_iext_count_may_overflow(dp, XFS_ATTR_FORK,
XFS_IEXT_ATTR_MANIP_CNT(rmt_blks));
if (error)
goto out_trans_cancel;
}

if (args->value) {
error = xfs_has_attr(args);
if (error == -EEXIST && (args->attr_flags & XATTR_CREATE))
goto out_trans_cancel;
Expand Down
Loading

0 comments on commit b52bb13

Please sign in to comment.