Skip to content

Commit

Permalink
Merge tag 'v6.6-vfs.tmpfs' of git://git.kernel.org/pub/scm/linux/kern…
Browse files Browse the repository at this point in the history
…el/git/vfs/vfs

Pull libfs and tmpfs updates from Christian Brauner:
 "This cycle saw a lot of work for tmpfs that required changes to the
  vfs layer. Andrew, Hugh, and I decided to take tmpfs through vfs this
  cycle. Things will go back to mm next cycle.

  Features
  ========

   - By far the biggest work is the quota support for tmpfs. New tmpfs
     quota infrastructure is added to support it and a new QFMT_SHMEM
     uapi option is exposed.

     This offers user and group quotas to tmpfs (project quotas will be
     added later). Similar to other filesystems tmpfs quota are not
     supported within user namespaces yet.

   - Add support for user xattrs. While tmpfs already supports security
     xattrs (security.*) and POSIX ACLs for a long time it lacked
     support for user xattrs (user.*). With this pull request tmpfs will
     be able to support a limited number of user xattrs.

     This is accompanied by a fix (see below) to limit persistent simple
     xattr allocations.

   - Add support for stable directory offsets. Currently tmpfs relies on
     the libfs provided cursor-based mechanism for readdir. This causes
     issues when a tmpfs filesystem is exported via NFS.

     NFS clients do not open directories. Instead, each server-side
     readdir operation opens the directory, reads it, and then closes
     it. Since the cursor state for that directory is associated with
     the opened file it is discarded after each readdir operation. Such
     directory offsets are not just cached by NFS clients but also
     various userspace libraries based on these clients.

     As it stands there is no way to invalidate the caches when
     directory offsets have changed and the whole application depends on
     unchanging directory offsets.

     At LSFMM we discussed how to solve this problem and decided to
     support stable directory offsets. libfs now allows filesystems like
     tmpfs to use an xarrary to map a directory offset to a dentry. This
     mechanism is currently only used by tmpfs but can be supported by
     others as well.

  Fixes
  =====

   - Change persistent simple xattrs allocations in libfs from
     GFP_KERNEL to GPF_KERNEL_ACCOUNT so they're subject to memory
     cgroup limits. Since this is a change to libfs it affects both
     tmpfs and kernfs.

   - Correctly verify {g,u}id mount options.

     A new filesystem context is created via fsopen() which records the
     namespace that becomes the owning namespace of the superblock when
     fsconfig(FSCONFIG_CMD_CREATE) is called for filesystems that are
     mountable in namespaces. However, fsconfig() calls can occur in a
     namespace different from the namespace where fsopen() has been
     called.

     Currently, when fsconfig() is called to set {g,u}id mount options
     the requested {g,u}id is mapped into a k{g,u}id according to the
     namespace where fsconfig() was called from. The resulting k{g,u}id
     is not guaranteed to be resolvable in the namespace of the
     filesystem (the one that fsopen() was called in).

     This means it's possible for an unprivileged user to create files
     owned by any group in a tmpfs mount since it's possible to set the
     setid bits on the tmpfs directory.

     The contract for {g,u}id mount options and {g,u}id values in
     general set from userspace has always been that they are translated
     according to the caller's idmapping. In so far, tmpfs has been
     doing the correct thing. But since tmpfs is mountable in
     unprivileged contexts it is also necessary to verify that the
     resulting {k,g}uid is representable in the namespace of the
     superblock to avoid such bugs.

     The new mount api's cross-namespace delegation abilities are
     already widely used. Having talked to a bunch of userspace this is
     the most faithful solution with minimal regression risks"

* tag 'v6.6-vfs.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  tmpfs,xattr: GFP_KERNEL_ACCOUNT for simple xattrs
  mm: invalidation check mapping before folio_contains
  tmpfs: trivial support for direct IO
  tmpfs,xattr: enable limited user extended attributes
  tmpfs: track free_ispace instead of free_inodes
  xattr: simple_xattr_set() return old_xattr to be freed
  tmpfs: verify {g,u}id mount options correctly
  shmem: move spinlock into shmem_recalc_inode() to fix quota support
  libfs: Remove parent dentry locking in offset_iterate_dir()
  libfs: Add a lock class for the offset map's xa_lock
  shmem: stable directory offsets
  shmem: Refactor shmem_symlink()
  libfs: Add directory operations for stable offsets
  shmem: fix quota lock nesting in huge hole handling
  shmem: Add default quota limit mount options
  shmem: quota support
  shmem: prepare shmem quota infrastructure
  quota: Check presence of quota operation structures instead of ->quota_read and ->quota_write callbacks
  shmem: make shmem_get_inode() return ERR_PTR instead of NULL
  shmem: make shmem_inode_acct_block() return error
  • Loading branch information
torvalds committed Aug 28, 2023
2 parents 615e958 + 572a3d1 commit ecd7db2
Show file tree
Hide file tree
Showing 19 changed files with 1,412 additions and 297 deletions.
8 changes: 5 additions & 3 deletions Documentation/filesystems/locking.rst
Original file line number Diff line number Diff line change
Expand Up @@ -85,13 +85,14 @@ prototypes::
struct dentry *dentry, struct fileattr *fa);
int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int);
struct offset_ctx *(*get_offset_ctx)(struct inode *inode);

locking rules:
all may block

============== =============================================
============== ==================================================
ops i_rwsem(inode)
============== =============================================
============== ==================================================
lookup: shared
create: exclusive
link: exclusive (both)
Expand All @@ -115,7 +116,8 @@ atomic_open: shared (exclusive if O_CREAT is set in open flags)
tmpfile: no
fileattr_get: no or exclusive
fileattr_set: exclusive
============== =============================================
get_offset_ctx no
============== ==================================================


Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem
Expand Down
38 changes: 36 additions & 2 deletions Documentation/filesystems/tmpfs.rst
Original file line number Diff line number Diff line change
Expand Up @@ -21,8 +21,8 @@ explained further below, some of which can be reconfigured dynamically on the
fly using a remount ('mount -o remount ...') of the filesystem. A tmpfs
filesystem can be resized but it cannot be resized to a size below its current
usage. tmpfs also supports POSIX ACLs, and extended attributes for the
trusted.* and security.* namespaces. ramfs does not use swap and you cannot
modify any parameter for a ramfs filesystem. The size limit of a ramfs
trusted.*, security.* and user.* namespaces. ramfs does not use swap and you
cannot modify any parameter for a ramfs filesystem. The size limit of a ramfs
filesystem is how much memory you have available, and so care must be taken if
used so to not run out of memory.

Expand Down Expand Up @@ -97,6 +97,9 @@ mount with such options, since it allows any user with write access to
use up all the memory on the machine; but enhances the scalability of
that instance in a system with many CPUs making intensive use of it.

If nr_inodes is not 0, that limited space for inodes is also used up by
extended attributes: "df -i"'s IUsed and IUse% increase, IFree decreases.

tmpfs blocks may be swapped out, when there is a shortage of memory.
tmpfs has a mount option to disable its use of swap:

Expand All @@ -123,6 +126,37 @@ sysfs file /sys/kernel/mm/transparent_hugepage/shmem_enabled: which can
be used to deny huge pages on all tmpfs mounts in an emergency, or to
force huge pages on all tmpfs mounts for testing.

tmpfs also supports quota with the following mount options

======================== =================================================
quota User and group quota accounting and enforcement
is enabled on the mount. Tmpfs is using hidden
system quota files that are initialized on mount.
usrquota User quota accounting and enforcement is enabled
on the mount.
grpquota Group quota accounting and enforcement is enabled
on the mount.
usrquota_block_hardlimit Set global user quota block hard limit.
usrquota_inode_hardlimit Set global user quota inode hard limit.
grpquota_block_hardlimit Set global group quota block hard limit.
grpquota_inode_hardlimit Set global group quota inode hard limit.
======================== =================================================

None of the quota related mount options can be set or changed on remount.

Quota limit parameters accept a suffix k, m or g for kilo, mega and giga
and can't be changed on remount. Default global quota limits are taking
effect for any and all user/group/project except root the first time the
quota entry for user/group/project id is being accessed - typically the
first time an inode with a particular id ownership is being created after
the mount. In other words, instead of the limits being initialized to zero,
they are initialized with the particular value provided with these mount
options. The limits can be changed for any user/group id at any time as they
normally can be.

Note that tmpfs quotas do not support user namespaces so no uid/gid
translation is done if quotas are enabled inside user namespaces.

tmpfs has a mount option to set the NUMA memory allocation policy for
all files in that instance (if CONFIG_NUMA is enabled) - which can be
adjusted on the fly via 'mount -o remount ...'
Expand Down
6 changes: 5 additions & 1 deletion Documentation/filesystems/vfs.rst
Original file line number Diff line number Diff line change
Expand Up @@ -515,6 +515,7 @@ As of kernel 2.6.22, the following members are defined:
int (*fileattr_set)(struct mnt_idmap *idmap,
struct dentry *dentry, struct fileattr *fa);
int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
};
Again, all methods are called without any locks being held, unless
Expand Down Expand Up @@ -675,7 +676,10 @@ otherwise noted.
called on ioctl(FS_IOC_SETFLAGS) and ioctl(FS_IOC_FSSETXATTR) to
change miscellaneous file flags and attributes. Callers hold
i_rwsem exclusive. If unset, then fall back to f_op->ioctl().

``get_offset_ctx``
called to get the offset context for a directory inode. A
filesystem must define this operation to use
simple_offset_dir_operations.

The Address Space Object
========================
Expand Down
16 changes: 14 additions & 2 deletions fs/Kconfig
Original file line number Diff line number Diff line change
Expand Up @@ -205,8 +205,8 @@ config TMPFS_XATTR
Extended attributes are name:value pairs associated with inodes by
the kernel or by users (see the attr(5) manual page for details).

Currently this enables support for the trusted.* and
security.* namespaces.
This enables support for the trusted.*, security.* and user.*
namespaces.

You need this for POSIX ACL support on tmpfs.

Expand All @@ -233,6 +233,18 @@ config TMPFS_INODE64

If unsure, say N.

config TMPFS_QUOTA
bool "Tmpfs quota support"
depends on TMPFS
select QUOTA
help
Quota support allows to set per user and group limits for tmpfs
usage. Say Y to enable quota support. Once enabled you can control
user and group quota enforcement with quota, usrquota and grpquota
mount options.

If unsure, say N.

config ARCH_SUPPORTS_HUGETLBFS
def_bool n

Expand Down
2 changes: 1 addition & 1 deletion fs/kernfs/dir.c
Original file line number Diff line number Diff line change
Expand Up @@ -556,7 +556,7 @@ void kernfs_put(struct kernfs_node *kn)
kfree_const(kn->name);

if (kn->iattr) {
simple_xattrs_free(&kn->iattr->xattrs);
simple_xattrs_free(&kn->iattr->xattrs, NULL);
kmem_cache_free(kernfs_iattrs_cache, kn->iattr);
}
spin_lock(&kernfs_idr_lock);
Expand Down
46 changes: 29 additions & 17 deletions fs/kernfs/inode.c
Original file line number Diff line number Diff line change
Expand Up @@ -305,11 +305,17 @@ int kernfs_xattr_get(struct kernfs_node *kn, const char *name,
int kernfs_xattr_set(struct kernfs_node *kn, const char *name,
const void *value, size_t size, int flags)
{
struct simple_xattr *old_xattr;
struct kernfs_iattrs *attrs = kernfs_iattrs(kn);
if (!attrs)
return -ENOMEM;

return simple_xattr_set(&attrs->xattrs, name, value, size, flags, NULL);
old_xattr = simple_xattr_set(&attrs->xattrs, name, value, size, flags);
if (IS_ERR(old_xattr))
return PTR_ERR(old_xattr);

simple_xattr_free(old_xattr);
return 0;
}

static int kernfs_vfs_xattr_get(const struct xattr_handler *handler,
Expand Down Expand Up @@ -341,7 +347,7 @@ static int kernfs_vfs_user_xattr_add(struct kernfs_node *kn,
{
atomic_t *sz = &kn->iattr->user_xattr_size;
atomic_t *nr = &kn->iattr->nr_user_xattrs;
ssize_t removed_size;
struct simple_xattr *old_xattr;
int ret;

if (atomic_inc_return(nr) > KERNFS_MAX_USER_XATTRS) {
Expand All @@ -354,13 +360,18 @@ static int kernfs_vfs_user_xattr_add(struct kernfs_node *kn,
goto dec_size_out;
}

ret = simple_xattr_set(xattrs, full_name, value, size, flags,
&removed_size);

if (!ret && removed_size >= 0)
size = removed_size;
else if (!ret)
old_xattr = simple_xattr_set(xattrs, full_name, value, size, flags);
if (!old_xattr)
return 0;

if (IS_ERR(old_xattr)) {
ret = PTR_ERR(old_xattr);
goto dec_size_out;
}

ret = 0;
size = old_xattr->size;
simple_xattr_free(old_xattr);
dec_size_out:
atomic_sub(size, sz);
dec_count_out:
Expand All @@ -375,18 +386,19 @@ static int kernfs_vfs_user_xattr_rm(struct kernfs_node *kn,
{
atomic_t *sz = &kn->iattr->user_xattr_size;
atomic_t *nr = &kn->iattr->nr_user_xattrs;
ssize_t removed_size;
int ret;
struct simple_xattr *old_xattr;

ret = simple_xattr_set(xattrs, full_name, value, size, flags,
&removed_size);
old_xattr = simple_xattr_set(xattrs, full_name, value, size, flags);
if (!old_xattr)
return 0;

if (removed_size >= 0) {
atomic_sub(removed_size, sz);
atomic_dec(nr);
}
if (IS_ERR(old_xattr))
return PTR_ERR(old_xattr);

return ret;
atomic_sub(old_xattr->size, sz);
atomic_dec(nr);
simple_xattr_free(old_xattr);
return 0;
}

static int kernfs_vfs_user_xattr_set(const struct xattr_handler *handler,
Expand Down
Loading

0 comments on commit ecd7db2

Please sign in to comment.