Skip to content

Commit

Permalink
Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/…
Browse files Browse the repository at this point in the history
…git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual miscellaneous features, cleanups, and fixes
  for vfs and individual fses.

  Features:

   - Rename and export helpers that get write access to a mount. They
     are used in overlayfs to get write access to the upper mount.

   - Print the pretty name of the root device on boot failure. This
     helps in scenarios where we would usually only print
     "unknown-block(1,2)".

   - Add an internal SB_I_NOUMASK flag. This is another part in the
     endless POSIX ACL saga in a way.

     When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip
     the umask because if the relevant inode has POSIX ACLs set it might
     take the umask from there. But if the inode doesn't have any POSIX
     ACLs set then we apply the umask in the filesytem itself. So we end
     up with:

      (1) no SB_POSIXACL -> strip umask in vfs
      (2) SB_POSIXACL    -> strip umask in filesystem

     The umask semantics associated with SB_POSIXACL allowed filesystems
     that don't even support POSIX ACLs at all to raise SB_POSIXACL
     purely to avoid umask stripping. That specifically means NFS v4 and
     Overlayfs. NFS v4 does it because it delegates this to the server
     and Overlayfs because it needs to delegate umask stripping to the
     upper filesystem, i.e., the filesystem used as the writable layer.

     This went so far that SB_POSIXACL is raised eve on kernels that
     don't even have POSIX ACL support at all.

     Stop this blatant abuse and add SB_I_NOUMASK which is an internal
     superblock flag that filesystems can raise to opt out of umask
     handling. That should really only be the two mentioned above. It's
     not that we want any filesystems to do this. Ideally we have all
     umask handling always in the vfs.

   - Make overlayfs use SB_I_NOUMASK too.

   - Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in
     IS_POSIXACL() if the kernel doesn't have support for it. This is a
     very old patch but it's only possible to do this now with the wider
     cleanup that was done.

   - Follow-up work on fake path handling from last cycle. Citing mostly
     from Amir:

     When overlayfs was first merged, overlayfs files of regular files
     and directories, the ones that are installed in file table, had a
     "fake" path, namely, f_path is the overlayfs path and f_inode is
     the "real" inode on the underlying filesystem.

     In v6.5, we took another small step by introducing of the
     backing_file container and the file_real_path() helper. This change
     allowed vfs and filesystem code to get the "real" path of an
     overlayfs backing file. With this change, we were able to make
     fsnotify work correctly and report events on the "real" filesystem
     objects that were accessed via overlayfs.

     This method works fine, but it still leaves the vfs vulnerable to
     new code that is not aware of files with fake path. A recent
     example is commit db1d1e8 ("IMA: use vfs_getattr_nosec to get
     the i_version"). This commit uses direct referencing to f_path in
     IMA code that otherwise uses file_inode() and file_dentry() to
     reference the filesystem objects that it is measuring.

     This contains work to switch things around: instead of having
     filesystem code opt-in to get the "real" path, have generic code
     opt-in for the "fake" path in the few places that it is needed.

     Is it far more likely that new filesystems code that does not use
     the file_dentry() and file_real_path() helpers will end up causing
     crashes or averting LSM/audit rules if we keep the "fake" path
     exposed by default.

     This change already makes file_dentry() moot, but for now we did
     not change this helper just added a WARN_ON() in ovl_d_real() to
     catch if we have made any wrong assumptions.

     After the dust settles on this change, we can make file_dentry() a
     plain accessor and we can drop the inode argument to ->d_real().

   - Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a small
     change but it really isn't and I would like to see everyone on
     their tippie toes for any possible bugs from this work.

     Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU for
     files since a very long time because of the nasty interactions
     between the SCM_RIGHTS file descriptor garbage collection. So
     extending it makes a lot of sense but it is a subtle change. There
     are almost no places that fiddle with file rcu semantics directly
     and the ones that did mess around with struct file internal under
     rcu have been made to stop doing that because it really was always
     dodgy.

     I forgot to put in the link tag for this change and the discussion
     in the commit so adding it into the merge message:

       https://lore.kernel.org/r/[email protected]

  Cleanups:

   - Various smaller pipe cleanups including the removal of a spin lock
     that was only used to protect against writes without pipe_lock()
     from O_NOTIFICATION_PIPE aka watch queues. As that was never
     implemented remove the additional locking from pipe_write().

   - Annotate struct watch_filter with the new __counted_by attribute.

   - Clarify do_unlinkat() cleanup so that it doesn't look like an extra
     iput() is done that would cause issues.

   - Simplify file cleanup when the file has never been opened.

   - Use module helper instead of open-coding it.

   - Predict error unlikely for stale retry.

   - Use WRITE_ONCE() for mount expiry field instead of just commenting
     that one hopes the compiler doesn't get smart.

  Fixes:

   - Fix readahead on block devices.

   - Fix writeback when layztime is enabled and inodes whose timestamp
     is the only thing that changed reside on wb->b_dirty_time. This
     caused excessively large zombie memory cgroup when lazytime was
     enabled as such inodes weren't handled fast enough.

   - Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()"

* tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits)
  file, i915: fix file reference for mmap_singleton()
  vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups
  writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs
  chardev: Simplify usage of try_module_get()
  ovl: rely on SB_I_NOUMASK
  fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n
  fs: store real path instead of fake path in backing file f_path
  fs: create helper file_user_path() for user displayed mapped file path
  fs: get mnt_writers count for an open backing file's real path
  vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked
  vfs: predict the error in retry_estale as unlikely
  backing file: free directly
  vfs: fix readahead(2) on block devices
  io_uring: use files_lookup_fd_locked()
  file: convert to SLAB_TYPESAFE_BY_RCU
  vfs: shave work on failed file open
  fs: simplify misleading code to remove ambiguity regarding ihold()/iput()
  watch_queue: Annotate struct watch_filter with __counted_by
  fs/pipe: use spinlock in pipe_read() only if there is a watch_queue
  fs/pipe: remove unnecessary spinlock from pipe_write()
  ...
  • Loading branch information
torvalds committed Oct 30, 2023
2 parents 0d63d8b + 61d4fb0 commit 3b3f874
Show file tree
Hide file tree
Showing 39 changed files with 479 additions and 270 deletions.
53 changes: 24 additions & 29 deletions Documentation/filesystems/files.rst
Original file line number Diff line number Diff line change
Expand Up @@ -62,51 +62,30 @@ the fdtable structure -
be held.

4. To look up the file structure given an fd, a reader
must use either lookup_fd_rcu() or files_lookup_fd_rcu() APIs. These
must use either lookup_fdget_rcu() or files_lookup_fdget_rcu() APIs. These
take care of barrier requirements due to lock-free lookup.

An example::

struct file *file;
rcu_read_lock();
file = lookup_fd_rcu(fd);
if (file) {
...
}
....
file = lookup_fdget_rcu(fd);
rcu_read_unlock();

5. Handling of the file structures is special. Since the look-up
of the fd (fget()/fget_light()) are lock-free, it is possible
that look-up may race with the last put() operation on the
file structure. This is avoided using atomic_long_inc_not_zero()
on ->f_count::

rcu_read_lock();
file = files_lookup_fd_rcu(files, fd);
if (file) {
if (atomic_long_inc_not_zero(&file->f_count))
*fput_needed = 1;
else
/* Didn't get the reference, someone's freed */
file = NULL;
...
fput(file);
}
rcu_read_unlock();
....
return file;
atomic_long_inc_not_zero() detects if refcounts is already zero or
goes to zero during increment. If it does, we fail
fget()/fget_light().

6. Since both fdtable and file structures can be looked up
5. Since both fdtable and file structures can be looked up
lock-free, they must be installed using rcu_assign_pointer()
API. If they are looked up lock-free, rcu_dereference()
must be used. However it is advisable to use files_fdtable()
and lookup_fd_rcu()/files_lookup_fd_rcu() which take care of these issues.
and lookup_fdget_rcu()/files_lookup_fdget_rcu() which take care of these
issues.

7. While updating, the fdtable pointer must be looked up while
6. While updating, the fdtable pointer must be looked up while
holding files->file_lock. If ->file_lock is dropped, then
another thread expand the files thereby creating a new
fdtable and making the earlier fdtable pointer stale.
Expand All @@ -126,3 +105,19 @@ the fdtable structure -
Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
the fdtable pointer (fdt) must be loaded after locate_fd().

On newer kernels rcu based file lookup has been switched to rely on
SLAB_TYPESAFE_BY_RCU instead of call_rcu(). It isn't sufficient anymore
to just acquire a reference to the file in question under rcu using
atomic_long_inc_not_zero() since the file might have already been
recycled and someone else might have bumped the reference. In other
words, callers might see reference count bumps from newer users. For
this is reason it is necessary to verify that the pointer is the same
before and after the reference count increment. This pattern can be seen
in get_file_rcu() and __files_get_rcu().

In addition, it isn't possible to access or check fields in struct file
without first aqcuiring a reference on it under rcu lookup. Not doing
that was always very dodgy and it was only usable for non-pointer data
in struct file. With SLAB_TYPESAFE_BY_RCU it is necessary that callers
either first acquire a reference or they must hold the files_lock of the
fdtable.
6 changes: 4 additions & 2 deletions arch/arc/kernel/troubleshoot.c
Original file line number Diff line number Diff line change
Expand Up @@ -90,10 +90,12 @@ static void show_faulting_vma(unsigned long address)
*/
if (vma) {
char buf[ARC_PATH_MAX];
char *nm = "?";
char *nm = "anon";

if (vma->vm_file) {
nm = file_path(vma->vm_file, buf, ARC_PATH_MAX-1);
/* XXX: can we use %pD below and get rid of buf? */
nm = d_path(file_user_path(vma->vm_file), buf,
ARC_PATH_MAX-1);
if (IS_ERR(nm))
nm = "?";
}
Expand Down
11 changes: 7 additions & 4 deletions arch/powerpc/platforms/cell/spufs/coredump.c
Original file line number Diff line number Diff line change
Expand Up @@ -66,18 +66,21 @@ static int match_context(const void *v, struct file *file, unsigned fd)
*/
static struct spu_context *coredump_next_context(int *fd)
{
struct spu_context *ctx;
struct spu_context *ctx = NULL;
struct file *file;
int n = iterate_fd(current->files, *fd, match_context, NULL);
if (!n)
return NULL;
*fd = n - 1;

rcu_read_lock();
file = lookup_fd_rcu(*fd);
ctx = SPUFS_I(file_inode(file))->i_ctx;
get_spu_context(ctx);
file = lookup_fdget_rcu(*fd);
rcu_read_unlock();
if (file) {
ctx = SPUFS_I(file_inode(file))->i_ctx;
get_spu_context(ctx);
fput(file);
}

return ctx;
}
Expand Down
6 changes: 1 addition & 5 deletions drivers/gpu/drm/i915/gem/i915_gem_mman.c
Original file line number Diff line number Diff line change
Expand Up @@ -916,11 +916,7 @@ static struct file *mmap_singleton(struct drm_i915_private *i915)
{
struct file *file;

rcu_read_lock();
file = READ_ONCE(i915->gem.mmap_singleton);
if (file && !get_file_rcu(file))
file = NULL;
rcu_read_unlock();
file = get_file_active(&i915->gem.mmap_singleton);
if (file)
return file;

Expand Down
2 changes: 1 addition & 1 deletion fs/char_dev.c
Original file line number Diff line number Diff line change
Expand Up @@ -350,7 +350,7 @@ static struct kobject *cdev_get(struct cdev *p)
struct module *owner = p->owner;
struct kobject *kobj;

if (owner && !try_module_get(owner))
if (!try_module_get(owner))
return NULL;
kobj = kobject_get_unless_zero(&p->kobj);
if (!kobj)
Expand Down
153 changes: 135 additions & 18 deletions fs/file.c
Original file line number Diff line number Diff line change
Expand Up @@ -604,6 +604,9 @@ void fd_install(unsigned int fd, struct file *file)
struct files_struct *files = current->files;
struct fdtable *fdt;

if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
return;

rcu_read_lock_sched();

if (unlikely(files->resize_in_progress)) {
Expand Down Expand Up @@ -853,8 +856,104 @@ void do_close_on_exec(struct files_struct *files)
spin_unlock(&files->file_lock);
}

static struct file *__get_file_rcu(struct file __rcu **f)
{
struct file __rcu *file;
struct file __rcu *file_reloaded;
struct file __rcu *file_reloaded_cmp;

file = rcu_dereference_raw(*f);
if (!file)
return NULL;

if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
return ERR_PTR(-EAGAIN);

file_reloaded = rcu_dereference_raw(*f);

/*
* Ensure that all accesses have a dependency on the load from
* rcu_dereference_raw() above so we get correct ordering
* between reuse/allocation and the pointer check below.
*/
file_reloaded_cmp = file_reloaded;
OPTIMIZER_HIDE_VAR(file_reloaded_cmp);

/*
* atomic_long_inc_not_zero() above provided a full memory
* barrier when we acquired a reference.
*
* This is paired with the write barrier from assigning to the
* __rcu protected file pointer so that if that pointer still
* matches the current file, we know we have successfully
* acquired a reference to the right file.
*
* If the pointers don't match the file has been reallocated by
* SLAB_TYPESAFE_BY_RCU.
*/
if (file == file_reloaded_cmp)
return file_reloaded;

fput(file);
return ERR_PTR(-EAGAIN);
}

/**
* get_file_rcu - try go get a reference to a file under rcu
* @f: the file to get a reference on
*
* This function tries to get a reference on @f carefully verifying that
* @f hasn't been reused.
*
* This function should rarely have to be used and only by users who
* understand the implications of SLAB_TYPESAFE_BY_RCU. Try to avoid it.
*
* Return: Returns @f with the reference count increased or NULL.
*/
struct file *get_file_rcu(struct file __rcu **f)
{
for (;;) {
struct file __rcu *file;

file = __get_file_rcu(f);
if (unlikely(!file))
return NULL;

if (unlikely(IS_ERR(file)))
continue;

return file;
}
}
EXPORT_SYMBOL_GPL(get_file_rcu);

/**
* get_file_active - try go get a reference to a file
* @f: the file to get a reference on
*
* In contast to get_file_rcu() the pointer itself isn't part of the
* reference counting.
*
* This function should rarely have to be used and only by users who
* understand the implications of SLAB_TYPESAFE_BY_RCU. Try to avoid it.
*
* Return: Returns @f with the reference count increased or NULL.
*/
struct file *get_file_active(struct file **f)
{
struct file __rcu *file;

rcu_read_lock();
file = __get_file_rcu(f);
rcu_read_unlock();
if (IS_ERR(file))
file = NULL;
return file;
}
EXPORT_SYMBOL_GPL(get_file_active);

static inline struct file *__fget_files_rcu(struct files_struct *files,
unsigned int fd, fmode_t mask)
unsigned int fd, fmode_t mask)
{
for (;;) {
struct file *file;
Expand All @@ -865,12 +964,6 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
return NULL;

fdentry = fdt->fd + array_index_nospec(fd, fdt->max_fds);
file = rcu_dereference_raw(*fdentry);
if (unlikely(!file))
return NULL;

if (unlikely(file->f_mode & mask))
return NULL;

/*
* Ok, we have a file pointer. However, because we do
Expand All @@ -879,10 +972,15 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
*
* Such a race can take two forms:
*
* (a) the file ref already went down to zero,
* and get_file_rcu() fails. Just try again:
* (a) the file ref already went down to zero and the
* file hasn't been reused yet or the file count
* isn't zero but the file has already been reused.
*/
if (unlikely(!get_file_rcu(file)))
file = __get_file_rcu(fdentry);
if (unlikely(!file))
return NULL;

if (unlikely(IS_ERR(file)))
continue;

/*
Expand All @@ -893,12 +991,20 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
*
* If so, we need to put our ref and try again.
*/
if (unlikely(rcu_dereference_raw(files->fdt) != fdt) ||
unlikely(rcu_dereference_raw(*fdentry) != file)) {
if (unlikely(rcu_dereference_raw(files->fdt) != fdt)) {
fput(file);
continue;
}

/*
* This isn't the file we're looking for or we're not
* allowed to get a reference to it.
*/
if (unlikely(file->f_mode & mask)) {
fput(file);
return NULL;
}

/*
* Ok, we have a ref to the file, and checked that it
* still exists.
Expand Down Expand Up @@ -948,7 +1054,14 @@ struct file *fget_task(struct task_struct *task, unsigned int fd)
return file;
}

struct file *task_lookup_fd_rcu(struct task_struct *task, unsigned int fd)
struct file *lookup_fdget_rcu(unsigned int fd)
{
return __fget_files_rcu(current->files, fd, 0);

}
EXPORT_SYMBOL_GPL(lookup_fdget_rcu);

struct file *task_lookup_fdget_rcu(struct task_struct *task, unsigned int fd)
{
/* Must be called with rcu_read_lock held */
struct files_struct *files;
Expand All @@ -957,13 +1070,13 @@ struct file *task_lookup_fd_rcu(struct task_struct *task, unsigned int fd)
task_lock(task);
files = task->files;
if (files)
file = files_lookup_fd_rcu(files, fd);
file = __fget_files_rcu(files, fd, 0);
task_unlock(task);

return file;
}

struct file *task_lookup_next_fd_rcu(struct task_struct *task, unsigned int *ret_fd)
struct file *task_lookup_next_fdget_rcu(struct task_struct *task, unsigned int *ret_fd)
{
/* Must be called with rcu_read_lock held */
struct files_struct *files;
Expand All @@ -974,7 +1087,7 @@ struct file *task_lookup_next_fd_rcu(struct task_struct *task, unsigned int *ret
files = task->files;
if (files) {
for (; fd < files_fdtable(files)->max_fds; fd++) {
file = files_lookup_fd_rcu(files, fd);
file = __fget_files_rcu(files, fd, 0);
if (file)
break;
}
Expand All @@ -983,7 +1096,7 @@ struct file *task_lookup_next_fd_rcu(struct task_struct *task, unsigned int *ret
*ret_fd = fd;
return file;
}
EXPORT_SYMBOL(task_lookup_next_fd_rcu);
EXPORT_SYMBOL(task_lookup_next_fdget_rcu);

/*
* Lightweight file lookup - no refcnt increment if fd table isn't shared.
Expand Down Expand Up @@ -1272,12 +1385,16 @@ SYSCALL_DEFINE2(dup2, unsigned int, oldfd, unsigned int, newfd)
{
if (unlikely(newfd == oldfd)) { /* corner case */
struct files_struct *files = current->files;
struct file *f;
int retval = oldfd;

rcu_read_lock();
if (!files_lookup_fd_rcu(files, oldfd))
f = __fget_files_rcu(files, oldfd, 0);
if (!f)
retval = -EBADF;
rcu_read_unlock();
if (f)
fput(f);
return retval;
}
return ksys_dup3(oldfd, newfd, 0);
Expand Down
Loading

0 comments on commit 3b3f874

Please sign in to comment.