Skip to content

Commit

Permalink
Merge tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel…
Browse files Browse the repository at this point in the history
…/git/brauner/linux

Pull thread management updates from Christian Brauner:
 "Sargun Dhillon over the last cycle has worked on the pidfd_getfd()
  syscall.

  This syscall allows for the retrieval of file descriptors of a process
  based on its pidfd. A task needs to have ptrace_may_access()
  permissions with PTRACE_MODE_ATTACH_REALCREDS (suggested by Oleg and
  Andy) on the target.

  One of the main use-cases is in combination with seccomp's user
  notification feature. As a reminder, seccomp's user notification
  feature was made available in v5.0. It allows a task to retrieve a
  file descriptor for its seccomp filter. The file descriptor is usually
  handed of to a more privileged supervising process. The supervisor can
  then listen for syscall events caught by the seccomp filter of the
  supervisee and perform actions in lieu of the supervisee, usually
  emulating syscalls. pidfd_getfd() is needed to expand its uses.

  There are currently two major users that wait on pidfd_getfd() and one
  future user:

   - Netflix, Sargun said, is working on a service mesh where users
     should be able to connect to a dns-based VIP. When a user connects
     to e.g. 1.2.3.4:80 that runs e.g. service "foo" they will be
     redirected to an envoy process. This service mesh uses seccomp user
     notifications and pidfd to intercept all connect calls and instead
     of connecting them to 1.2.3.4:80 connects them to e.g.
     127.0.0.1:8080.

   - LXD uses the seccomp notifier heavily to intercept and emulate
     mknod() and mount() syscalls for unprivileged containers/processes.
     With pidfd_getfd() more uses-cases e.g. bridging socket connections
     will be possible.

   - The patchset has also seen some interest from the browser corner.
     Right now, Firefox is using a SECCOMP_RET_TRAP sandbox managed by a
     broker process. In the future glibc will start blocking all signals
     during dlopen() rendering this type of sandbox impossible. Hence,
     in the future Firefox will switch to a seccomp-user-nofication
     based sandbox which also makes use of file descriptor retrieval.
     The thread for this can be found at
     https://sourceware.org/ml/libc-alpha/2019-12/msg00079.html

  With pidfd_getfd() it is e.g. possible to bridge socket connections
  for the supervisee (binding to a privileged port) and taking actions
  on file descriptors on behalf of the supervisee in general.

  Sargun's first version was using an ioctl on pidfds but various people
  pushed for it to be a proper syscall which he duely implemented as
  well over various review cycles. Selftests are of course included.
  I've also added instructions how to deal with merge conflicts below.

  There's also a small fix coming from the kernel mentee project to
  correctly annotate struct sighand_struct with __rcu to fix various
  sparse warnings. We've received a few more such fixes and even though
  they are mostly trivial I've decided to postpone them until after -rc1
  since they came in rather late and I don't want to risk introducing
  build warnings.

  Finally, there's a new prctl() command PR_{G,S}ET_IO_FLUSHER which is
  needed to avoid allocation recursions triggerable by storage drivers
  that have userspace parts that run in the IO path (e.g. dm-multipath,
  iscsi, etc). These allocation recursions deadlock the device.

  The new prctl() allows such privileged userspace components to avoid
  allocation recursions by setting the PF_MEMALLOC_NOIO and
  PF_LESS_THROTTLE flags. The patch carries the necessary acks from the
  relevant maintainers and is routed here as part of prctl()
  thread-management."

* tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim
  sched.h: Annotate sighand_struct with __rcu
  test: Add test for pidfd getfd
  arch: wire up pidfd_getfd syscall
  pid: Implement pidfd_getfd syscall
  vfs, fdtable: Add fget_task helper
  • Loading branch information
torvalds committed Jan 30, 2020
2 parents 896f8d2 + 8d19f1c commit 83fa805
Show file tree
Hide file tree
Showing 32 changed files with 427 additions and 7 deletions.
1 change: 1 addition & 0 deletions arch/alpha/kernel/syscalls/syscall.tbl
Original file line number Diff line number Diff line change
Expand Up @@ -476,3 +476,4 @@
544 common pidfd_open sys_pidfd_open
# 545 reserved for clone3
547 common openat2 sys_openat2
548 common pidfd_getfd sys_pidfd_getfd
1 change: 1 addition & 0 deletions arch/arm/tools/syscall.tbl
Original file line number Diff line number Diff line change
Expand Up @@ -450,3 +450,4 @@
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
2 changes: 1 addition & 1 deletion arch/arm64/include/asm/unistd.h
Original file line number Diff line number Diff line change
Expand Up @@ -38,7 +38,7 @@
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)

#define __NR_compat_syscalls 438
#define __NR_compat_syscalls 439
#endif

#define __ARCH_WANT_SYS_CLONE
Expand Down
2 changes: 2 additions & 0 deletions arch/arm64/include/asm/unistd32.h
Original file line number Diff line number Diff line change
Expand Up @@ -881,6 +881,8 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
__SYSCALL(__NR_clone3, sys_clone3)
#define __NR_openat2 437
__SYSCALL(__NR_openat2, sys_openat2)
#define __NR_pidfd_getfd 438
__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)

/*
* Please add new compat syscalls above this comment and update
Expand Down
1 change: 1 addition & 0 deletions arch/ia64/kernel/syscalls/syscall.tbl
Original file line number Diff line number Diff line change
Expand Up @@ -357,3 +357,4 @@
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
1 change: 1 addition & 0 deletions arch/m68k/kernel/syscalls/syscall.tbl
Original file line number Diff line number Diff line change
Expand Up @@ -436,3 +436,4 @@
434 common pidfd_open sys_pidfd_open
435 common clone3 __sys_clone3
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
1 change: 1 addition & 0 deletions arch/microblaze/kernel/syscalls/syscall.tbl
Original file line number Diff line number Diff line change
Expand Up @@ -442,3 +442,4 @@
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
1 change: 1 addition & 0 deletions arch/mips/kernel/syscalls/syscall_n32.tbl
Original file line number Diff line number Diff line change
Expand Up @@ -375,3 +375,4 @@
434 n32 pidfd_open sys_pidfd_open
435 n32 clone3 __sys_clone3
437 n32 openat2 sys_openat2
438 n32 pidfd_getfd sys_pidfd_getfd
1 change: 1 addition & 0 deletions arch/mips/kernel/syscalls/syscall_n64.tbl
Original file line number Diff line number Diff line change
Expand Up @@ -351,3 +351,4 @@
434 n64 pidfd_open sys_pidfd_open
435 n64 clone3 __sys_clone3
437 n64 openat2 sys_openat2
438 n64 pidfd_getfd sys_pidfd_getfd
1 change: 1 addition & 0 deletions arch/mips/kernel/syscalls/syscall_o32.tbl
Original file line number Diff line number Diff line change
Expand Up @@ -424,3 +424,4 @@
434 o32 pidfd_open sys_pidfd_open
435 o32 clone3 __sys_clone3
437 o32 openat2 sys_openat2
438 o32 pidfd_getfd sys_pidfd_getfd
1 change: 1 addition & 0 deletions arch/parisc/kernel/syscalls/syscall.tbl
Original file line number Diff line number Diff line change
Expand Up @@ -434,3 +434,4 @@
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3_wrapper
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
1 change: 1 addition & 0 deletions arch/powerpc/kernel/syscalls/syscall.tbl
Original file line number Diff line number Diff line change
Expand Up @@ -518,3 +518,4 @@
434 common pidfd_open sys_pidfd_open
435 nospu clone3 ppc_clone3
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
1 change: 1 addition & 0 deletions arch/s390/kernel/syscalls/syscall.tbl
Original file line number Diff line number Diff line change
Expand Up @@ -439,3 +439,4 @@
434 common pidfd_open sys_pidfd_open sys_pidfd_open
435 common clone3 sys_clone3 sys_clone3
437 common openat2 sys_openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd sys_pidfd_getfd
1 change: 1 addition & 0 deletions arch/sh/kernel/syscalls/syscall.tbl
Original file line number Diff line number Diff line change
Expand Up @@ -439,3 +439,4 @@
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
1 change: 1 addition & 0 deletions arch/sparc/kernel/syscalls/syscall.tbl
Original file line number Diff line number Diff line change
Expand Up @@ -482,3 +482,4 @@
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
1 change: 1 addition & 0 deletions arch/x86/entry/syscalls/syscall_32.tbl
Original file line number Diff line number Diff line change
Expand Up @@ -441,3 +441,4 @@
434 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
435 i386 clone3 sys_clone3 __ia32_sys_clone3
437 i386 openat2 sys_openat2 __ia32_sys_openat2
438 i386 pidfd_getfd sys_pidfd_getfd __ia32_sys_pidfd_getfd
1 change: 1 addition & 0 deletions arch/x86/entry/syscalls/syscall_64.tbl
Original file line number Diff line number Diff line change
Expand Up @@ -358,6 +358,7 @@
434 common pidfd_open __x64_sys_pidfd_open
435 common clone3 __x64_sys_clone3/ptregs
437 common openat2 __x64_sys_openat2
438 common pidfd_getfd __x64_sys_pidfd_getfd

#
# x32-specific system call numbers start at 512 to avoid cache impact
Expand Down
1 change: 1 addition & 0 deletions arch/xtensa/kernel/syscalls/syscall.tbl
Original file line number Diff line number Diff line change
Expand Up @@ -407,3 +407,4 @@
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
22 changes: 20 additions & 2 deletions fs/file.c
Original file line number Diff line number Diff line change
Expand Up @@ -708,9 +708,9 @@ void do_close_on_exec(struct files_struct *files)
spin_unlock(&files->file_lock);
}

static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs)
static struct file *__fget_files(struct files_struct *files, unsigned int fd,
fmode_t mask, unsigned int refs)
{
struct files_struct *files = current->files;
struct file *file;

rcu_read_lock();
Expand All @@ -731,6 +731,12 @@ static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs)
return file;
}

static inline struct file *__fget(unsigned int fd, fmode_t mask,
unsigned int refs)
{
return __fget_files(current->files, fd, mask, refs);
}

struct file *fget_many(unsigned int fd, unsigned int refs)
{
return __fget(fd, FMODE_PATH, refs);
Expand All @@ -748,6 +754,18 @@ struct file *fget_raw(unsigned int fd)
}
EXPORT_SYMBOL(fget_raw);

struct file *fget_task(struct task_struct *task, unsigned int fd)
{
struct file *file = NULL;

task_lock(task);
if (task->files)
file = __fget_files(task->files, fd, 0, 1);
task_unlock(task);

return file;
}

/*
* Lightweight file lookup - no refcnt increment if fd table isn't shared.
*
Expand Down
2 changes: 2 additions & 0 deletions include/linux/file.h
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ extern void fput(struct file *);
extern void fput_many(struct file *, unsigned int);

struct file_operations;
struct task_struct;
struct vfsmount;
struct dentry;
struct inode;
Expand Down Expand Up @@ -47,6 +48,7 @@ static inline void fdput(struct fd fd)
extern struct file *fget(unsigned int fd);
extern struct file *fget_many(unsigned int fd, unsigned int refs);
extern struct file *fget_raw(unsigned int fd);
extern struct file *fget_task(struct task_struct *task, unsigned int fd);
extern unsigned long __fdget(unsigned int fd);
extern unsigned long __fdget_raw(unsigned int fd);
extern unsigned long __fdget_pos(unsigned int fd);
Expand Down
2 changes: 1 addition & 1 deletion include/linux/sched.h
Original file line number Diff line number Diff line change
Expand Up @@ -917,7 +917,7 @@ struct task_struct {

/* Signal handlers: */
struct signal_struct *signal;
struct sighand_struct *sighand;
struct sighand_struct __rcu *sighand;
sigset_t blocked;
sigset_t real_blocked;
/* Restored if set_restore_sigmask() was used: */
Expand Down
1 change: 1 addition & 0 deletions include/linux/syscalls.h
Original file line number Diff line number Diff line change
Expand Up @@ -1002,6 +1002,7 @@ asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags)
asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
siginfo_t __user *info,
unsigned int flags);
asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);

/*
* Architecture-specific system calls
Expand Down
4 changes: 3 additions & 1 deletion include/uapi/asm-generic/unistd.h
Original file line number Diff line number Diff line change
Expand Up @@ -853,9 +853,11 @@ __SYSCALL(__NR_clone3, sys_clone3)

#define __NR_openat2 437
__SYSCALL(__NR_openat2, sys_openat2)
#define __NR_pidfd_getfd 438
__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)

#undef __NR_syscalls
#define __NR_syscalls 438
#define __NR_syscalls 439

/*
* 32 bit systems traditionally used different
Expand Down
1 change: 1 addition & 0 deletions include/uapi/linux/capability.h
Original file line number Diff line number Diff line change
Expand Up @@ -301,6 +301,7 @@ struct vfs_ns_cap_data {
/* Allow more than 64hz interrupts from the real-time clock */
/* Override max number of consoles on console allocation */
/* Override max number of keymaps */
/* Control memory reclaim behavior */

#define CAP_SYS_RESOURCE 24

Expand Down
4 changes: 4 additions & 0 deletions include/uapi/linux/prctl.h
Original file line number Diff line number Diff line change
Expand Up @@ -234,4 +234,8 @@ struct prctl_mm_map {
#define PR_GET_TAGGED_ADDR_CTRL 56
# define PR_TAGGED_ADDR_ENABLE (1UL << 0)

/* Control reclaim behavior when allocating memory */
#define PR_SET_IO_FLUSHER 57
#define PR_GET_IO_FLUSHER 58

#endif /* _LINUX_PRCTL_H */
90 changes: 90 additions & 0 deletions kernel/pid.c
Original file line number Diff line number Diff line change
Expand Up @@ -578,3 +578,93 @@ void __init pid_idr_init(void)
init_pid_ns.pid_cachep = KMEM_CACHE(pid,
SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
}

static struct file *__pidfd_fget(struct task_struct *task, int fd)
{
struct file *file;
int ret;

ret = mutex_lock_killable(&task->signal->cred_guard_mutex);
if (ret)
return ERR_PTR(ret);

if (ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
file = fget_task(task, fd);
else
file = ERR_PTR(-EPERM);

mutex_unlock(&task->signal->cred_guard_mutex);

return file ?: ERR_PTR(-EBADF);
}

static int pidfd_getfd(struct pid *pid, int fd)
{
struct task_struct *task;
struct file *file;
int ret;

task = get_pid_task(pid, PIDTYPE_PID);
if (!task)
return -ESRCH;

file = __pidfd_fget(task, fd);
put_task_struct(task);
if (IS_ERR(file))
return PTR_ERR(file);

ret = security_file_receive(file);
if (ret) {
fput(file);
return ret;
}

ret = get_unused_fd_flags(O_CLOEXEC);
if (ret < 0)
fput(file);
else
fd_install(ret, file);

return ret;
}

/**
* sys_pidfd_getfd() - Get a file descriptor from another process
*
* @pidfd: the pidfd file descriptor of the process
* @fd: the file descriptor number to get
* @flags: flags on how to get the fd (reserved)
*
* This syscall gets a copy of a file descriptor from another process
* based on the pidfd, and file descriptor number. It requires that
* the calling process has the ability to ptrace the process represented
* by the pidfd. The process which is having its file descriptor copied
* is otherwise unaffected.
*
* Return: On success, a cloexec file descriptor is returned.
* On error, a negative errno number will be returned.
*/
SYSCALL_DEFINE3(pidfd_getfd, int, pidfd, int, fd,
unsigned int, flags)
{
struct pid *pid;
struct fd f;
int ret;

/* flags is currently unused - make sure it's unset */
if (flags)
return -EINVAL;

f = fdget(pidfd);
if (!f.file)
return -EBADF;

pid = pidfd_pid(f.file);
if (IS_ERR(pid))
ret = PTR_ERR(pid);
else
ret = pidfd_getfd(pid, fd);

fdput(f);
return ret;
}
2 changes: 1 addition & 1 deletion kernel/signal.c
Original file line number Diff line number Diff line change
Expand Up @@ -1383,7 +1383,7 @@ struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
* must see ->sighand == NULL.
*/
spin_lock_irqsave(&sighand->siglock, *flags);
if (likely(sighand == tsk->sighand))
if (likely(sighand == rcu_access_pointer(tsk->sighand)))
break;
spin_unlock_irqrestore(&sighand->siglock, *flags);
}
Expand Down
25 changes: 25 additions & 0 deletions kernel/sys.c
Original file line number Diff line number Diff line change
Expand Up @@ -2261,6 +2261,8 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
return -EINVAL;
}

#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LESS_THROTTLE)

SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
unsigned long, arg4, unsigned long, arg5)
{
Expand Down Expand Up @@ -2488,6 +2490,29 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
error = GET_TAGGED_ADDR_CTRL();
break;
case PR_SET_IO_FLUSHER:
if (!capable(CAP_SYS_RESOURCE))
return -EPERM;

if (arg3 || arg4 || arg5)
return -EINVAL;

if (arg2 == 1)
current->flags |= PR_IO_FLUSHER;
else if (!arg2)
current->flags &= ~PR_IO_FLUSHER;
else
return -EINVAL;
break;
case PR_GET_IO_FLUSHER:
if (!capable(CAP_SYS_RESOURCE))
return -EPERM;

if (arg2 || arg3 || arg4 || arg5)
return -EINVAL;

error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
break;
default:
error = -EINVAL;
break;
Expand Down
1 change: 1 addition & 0 deletions tools/testing/selftests/pidfd/.gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -2,3 +2,4 @@ pidfd_open_test
pidfd_poll_test
pidfd_test
pidfd_wait
pidfd_getfd_test
2 changes: 1 addition & 1 deletion tools/testing/selftests/pidfd/Makefile
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0-only
CFLAGS += -g -I../../../../usr/include/ -pthread

TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test pidfd_poll_test pidfd_wait
TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test pidfd_poll_test pidfd_wait pidfd_getfd_test

include ../lib.mk

9 changes: 9 additions & 0 deletions tools/testing/selftests/pidfd/pidfd.h
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,10 @@
#define __NR_clone3 -1
#endif

#ifndef __NR_pidfd_getfd
#define __NR_pidfd_getfd -1
#endif

/*
* The kernel reserves 300 pids via RESERVED_PIDS in kernel/pid.c
* That means, when it wraps around any pid < 300 will be skipped.
Expand Down Expand Up @@ -84,4 +88,9 @@ static inline int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
}

static inline int sys_pidfd_getfd(int pidfd, int fd, int flags)
{
return syscall(__NR_pidfd_getfd, pidfd, fd, flags);
}

#endif /* __PIDFD_H */
Loading

0 comments on commit 83fa805

Please sign in to comment.