Commit

Merge tag 'core-entry-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core entry/exit updates from Thomas Gleixner:
 "A set of updates for entry/exit handling:

   - More generalization of entry/exit functionality

   - The consolidation work to reclaim TIF flags on x86 and also for
     non-x86 specific TIF flags which are solely relevant for syscall
     related work and have been moved into their own storage space. The
     x86 specific part had to be merged in to avoid a major conflict.

   - The TIF_NOTIFY_SIGNAL work which replaces the inefficient signal
     delivery mode of task work and results in an impressive performance
     improvement for io_uring. The non-x86 consolidation of this is
     going to come separately via Jens.

   - The selective syscall redirection facility which provides a clean
     and efficient way to support the non-Linux syscalls of WINE by
     catching them at syscall entry and redirecting them to the user
     space emulation. This can be utilized for other purposes as well
     and has been designed carefully to avoid overhead for the regular
     fastpath. This includes the core changes and the x86 support code.

   - Simplification of the context tracking entry/exit handling for the
     users of the generic entry code which guarantee the proper ordering
     and protection.

   - Preparatory changes to make the generic entry code accommodate S390
     specific requirements which are mostly related to their syscall
     restart mechanism"

* tag 'core-entry-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  entry: Add syscall_exit_to_user_mode_work()
  entry: Add exit_to_user_mode() wrapper
  entry: Add enter_from_user_mode() wrapper
  entry: Rename exit_to_user_mode()
  entry: Rename enter_from_user_mode()
  docs: Document Syscall User Dispatch
  selftests: Add benchmark for syscall user dispatch
  selftests: Add kselftest for syscall user dispatch
  entry: Support Syscall User Dispatch on common syscall entry
  kernel: Implement selective syscall userspace redirection
  signal: Expose SYS_USER_DISPATCH si_code type
  x86: vdso: Expose sigreturn address on vdso to the kernel
  MAINTAINERS: Add entry for common entry code
  entry: Fix boot for !CONFIG_GENERIC_ENTRY
  x86: Support HAVE_CONTEXT_TRACKING_OFFSTACK
  context_tracking: Only define schedule_user() on !HAVE_CONTEXT_TRACKING_OFFSTACK archs
  sched: Detect call to schedule from critical entry code
  context_tracking: Don't implement exception_enter/exit() on CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK
  context_tracking: Introduce HAVE_CONTEXT_TRACKING_OFFSTACK
  x86: Reclaim unused x86 TI flags
  ...
torvalds committed Dec 15, 2020
2 parents ff61359 + c6156e1 commit 1ac0884
Showing 54 changed files with 1,320 additions and 232 deletions.
1 change: 1 addition & 0 deletions Documentation/admin-guide/index.rst
@@ -113,6 +113,7 @@ configure specific aspects of kernel behavior to your liking.
rtc
serial-console
svga
syscall-user-dispatch
sysrq
thunderbolt
ufs
90 changes: 90 additions & 0 deletions Documentation/admin-guide/syscall-user-dispatch.rst
@@ -0,0 +1,90 @@
.. SPDX-License-Identifier: GPL-2.0
=====================
Syscall User Dispatch
=====================

Background
----------

Compatibility layers like Wine need a way to efficiently emulate system
calls of only a part of their process - the part that has the
incompatible code - while being able to execute native syscalls without
a high performance penalty on the native part of the process. Seccomp
falls short on this task, since it has limited support to efficiently
filter syscalls based on memory regions, and it doesn't support removing
filters. Therefore a new mechanism is necessary.

Syscall User Dispatch brings the filtering of the syscall dispatcher
address back to userspace. The application is in control of a flip
switch, indicating the current personality of the process. A
multiple-personality application can then flip the switch without
invoking the kernel, when crossing the compatibility layer API
boundaries, to enable/disable the syscall redirection and execute
syscalls directly (disabled) or send them to be emulated in userspace
through a SIGSYS.

The goal of this design is to provide very quick compatibility layer
boundary crosses, which is achieved by not executing a syscall to change
personality every time the compatibility layer executes. Instead, a
userspace memory region exposed to the kernel indicates the current
personality, and the application simply modifies that variable to
configure the mechanism.
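
As a rough illustration (an editorial sketch, not part of this patch): assuming
the PR_SYS_DISPATCH_* definitions introduced by this series are available,
crossing the compatibility layer boundary reduces to a plain store to the
selector variable, with no syscall involved:

  /* Illustrative only; 'selector' is the char-sized variable that is later
   * registered with the kernel via prctl() (see the Interface section). */
  static volatile char selector;

  static inline void enter_native_part(void)
  {
          selector = PR_SYS_DISPATCH_OFF;  /* Linux syscalls execute directly */
  }

  static inline void enter_emulated_part(void)
  {
          selector = PR_SYS_DISPATCH_ON;   /* syscalls are redirected via SIGSYS */
  }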

There is a relatively high cost associated with handling signals on most
architectures, like x86, but at least for Wine, syscalls issued by
native Windows code are currently not known to be a performance problem,
since they are quite rare, at least for modern gaming applications.

Since this mechanism is designed to capture syscalls issued by
non-native applications, it must function on syscalls whose invocation
ABI is completely unexpected to Linux. Syscall User Dispatch, therefore,
doesn't rely on any of the syscall ABI to perform the filtering. It uses
only the syscall dispatcher address and the userspace key.

As the ABI of these intercepted syscalls is unknown to Linux, these
syscalls are not instrumentable via ptrace or the syscall tracepoints.

Interface
---------

A thread can set up this mechanism on supported kernels by executing the
following prctl:

prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])

<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
disable the mechanism globally for that thread. When
PR_SYS_DISPATCH_OFF is used, the other fields must be zero.

[<offset>, <offset>+<length>) delimit a memory region interval
from which syscalls are always executed directly, regardless of the
userspace selector. This provides a fast path for the C library, which
includes the most common syscall dispatchers in the native code
applications, and also provides a way for the signal handler to return
without triggering a nested SIGSYS on (rt\_)sigreturn. Users of this
interface should make sure that at least the signal trampoline code is
included in this region. In addition, for syscalls that implement the
trampoline code on the vDSO, that trampoline is never intercepted.

[selector] is a pointer to a char-sized region in the process memory
that provides a quick way to enable/disable syscall redirection
thread-wide, without the need to invoke the kernel directly. selector
can be set to PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF. Any other
value should terminate the program with a SIGSYS.
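
A minimal usage sketch (an editorial illustration, not part of this patch),
assuming prctl() and the PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON and
PR_SYS_DISPATCH_OFF definitions are visible through the installed headers,
and that the SIGSYS siginfo reports the intercepted syscall with si_code set
to SYS_USER_DISPATCH, as exposed by this series:

  #include <signal.h>
  #include <stddef.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/prctl.h>

  static volatile char selector = PR_SYS_DISPATCH_OFF;

  static void handle_sigsys(int sig, siginfo_t *info, void *ucontext)
  {
          (void)sig; (void)info; (void)ucontext;
          /* Emulate the intercepted syscall here; info->si_code is expected
           * to be SYS_USER_DISPATCH for this mechanism. */
  }

  /* [dispatcher, dispatcher + len) is the native region whose syscalls must
   * keep executing directly, e.g. the libc syscall dispatcher and the signal
   * trampoline. */
  static void setup_dispatch(void *dispatcher, size_t len)
  {
          struct sigaction act;

          memset(&act, 0, sizeof(act));
          act.sa_sigaction = handle_sigsys;
          act.sa_flags = SA_SIGINFO;
          sigemptyset(&act.sa_mask);
          sigaction(SIGSYS, &act, NULL);

          if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
                    (unsigned long)dispatcher, (unsigned long)len, &selector))
                  abort();
  }

Redirection then stays dormant until the thread flips selector to
PR_SYS_DISPATCH_ON when entering emulated code, as sketched in the Background
section above.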

Security Notes
--------------

Syscall User Dispatch provides functionality for compatibility layers to
quickly capture system calls issued by a non-native part of the
application, while not impacting the Linux native regions of the
process. It is not a mechanism for sandboxing system calls, and it
should not be seen as a security mechanism, since it is trivial for a
malicious application to subvert the mechanism by jumping to an allowed
dispatcher region prior to executing the syscall, or to discover the
address and modify the selector value. If the use case requires any
kind of security sandboxing, Seccomp should be used instead.

Any fork or exec of the existing process resets the mechanism to
PR_SYS_DISPATCH_OFF.
11 changes: 11 additions & 0 deletions MAINTAINERS
@@ -7361,6 +7361,17 @@ S: Maintained
F: drivers/base/arch_topology.c
F: include/linux/arch_topology.h

GENERIC ENTRY CODE
M: Thomas Gleixner <[email protected]>
M: Peter Zijlstra <[email protected]>
M: Andy Lutomirski <[email protected]>
L: [email protected]
S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git core/entry
F: include/linux/entry-common.h
F: include/linux/entry-kvm.h
F: kernel/entry/

GENERIC GPIO I2C DRIVER
M: Wolfram Sang <[email protected]>
S: Supported
17 changes: 17 additions & 0 deletions arch/Kconfig
@@ -618,6 +618,23 @@ config HAVE_CONTEXT_TRACKING
protected inside rcu_irq_enter/rcu_irq_exit() but preemption or signal
handling on irq exit still need to be protected.

config HAVE_CONTEXT_TRACKING_OFFSTACK
bool
help
Architecture neither relies on exception_enter()/exception_exit()
nor on schedule_user(). Also preempt_schedule_notrace() and
preempt_schedule_irq() can't be called in a preemptible section
while context tracking is CONTEXT_USER. This feature reflects a sane
entry implementation where the following requirements are met on
critical entry code, ie: before user_exit() or after user_enter():

- Critical entry code isn't preemptible (or better yet:
not interruptible).
- No use of RCU read side critical sections, unless rcu_nmi_enter()
got called.
- No use of instrumentation, unless instrumentation_begin() got
called.

config HAVE_TIF_NOHZ
bool
help
1 change: 1 addition & 0 deletions arch/x86/Kconfig
@@ -163,6 +163,7 @@ config X86
select HAVE_CMPXCHG_DOUBLE
select HAVE_CMPXCHG_LOCAL
select HAVE_CONTEXT_TRACKING if X86_64
select HAVE_CONTEXT_TRACKING_OFFSTACK if HAVE_CONTEXT_TRACKING
select HAVE_C_RECORDMCOUNT
select HAVE_DEBUG_KMEMLEAK
select HAVE_DMA_CONTIGUOUS
34 changes: 0 additions & 34 deletions arch/x86/entry/common.c
@@ -209,40 +209,6 @@ SYSCALL_DEFINE0(ni_syscall)
return -ENOSYS;
}

noinstr bool idtentry_enter_nmi(struct pt_regs *regs)
{
bool irq_state = lockdep_hardirqs_enabled();

__nmi_enter();
lockdep_hardirqs_off(CALLER_ADDR0);
lockdep_hardirq_enter();
rcu_nmi_enter();

instrumentation_begin();
trace_hardirqs_off_finish();
ftrace_nmi_enter();
instrumentation_end();

return irq_state;
}

noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
{
instrumentation_begin();
ftrace_nmi_exit();
if (restore) {
trace_hardirqs_on_prepare();
lockdep_hardirqs_on_prepare(CALLER_ADDR0);
}
instrumentation_end();

rcu_nmi_exit();
lockdep_hardirq_exit();
if (restore)
lockdep_hardirqs_on(CALLER_ADDR0);
__nmi_exit();
}

#ifdef CONFIG_XEN_PV
#ifndef CONFIG_PREEMPTION
/*
2 changes: 2 additions & 0 deletions arch/x86/entry/vdso/vdso2c.c
@@ -101,6 +101,8 @@ struct vdso_sym required_syms[] = {
{"__kernel_sigreturn", true},
{"__kernel_rt_sigreturn", true},
{"int80_landing_pad", true},
{"vdso32_rt_sigreturn_landing_pad", true},
{"vdso32_sigreturn_landing_pad", true},
};

__attribute__((format(printf, 1, 2))) __attribute__((noreturn))
2 changes: 2 additions & 0 deletions arch/x86/entry/vdso/vdso32/sigreturn.S
@@ -18,6 +18,7 @@ __kernel_sigreturn:
movl $__NR_sigreturn, %eax
SYSCALL_ENTER_KERNEL
.LEND_sigreturn:
SYM_INNER_LABEL(vdso32_sigreturn_landing_pad, SYM_L_GLOBAL)
nop
.size __kernel_sigreturn,.-.LSTART_sigreturn

@@ -29,6 +30,7 @@ __kernel_rt_sigreturn:
movl $__NR_rt_sigreturn, %eax
SYSCALL_ENTER_KERNEL
.LEND_rt_sigreturn:
SYM_INNER_LABEL(vdso32_rt_sigreturn_landing_pad, SYM_L_GLOBAL)
nop
.size __kernel_rt_sigreturn,.-.LSTART_rt_sigreturn
.previous
15 changes: 15 additions & 0 deletions arch/x86/entry/vdso/vma.c
@@ -436,6 +436,21 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
}
#endif

bool arch_syscall_is_vdso_sigreturn(struct pt_regs *regs)
{
#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
const struct vdso_image *image = current->mm->context.vdso_image;
unsigned long vdso = (unsigned long) current->mm->context.vdso;

if (in_ia32_syscall() && image == &vdso_image_32) {
if (regs->ip == vdso + image->sym_vdso32_sigreturn_landing_pad ||
regs->ip == vdso + image->sym_vdso32_rt_sigreturn_landing_pad)
return true;
}
#endif
return false;
}

#ifdef CONFIG_X86_64
static __init int vdso_setup(char *s)
{
2 changes: 2 additions & 0 deletions arch/x86/include/asm/elf.h
@@ -388,6 +388,8 @@ extern int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
compat_arch_setup_additional_pages(bprm, interpreter, \
(ex->e_machine == EM_X86_64))

extern bool arch_syscall_is_vdso_sigreturn(struct pt_regs *regs);

/* Do not change the values. See get_align_mask() */
enum align_flags {
ALIGN_VA_32 = BIT(0),
3 changes: 0 additions & 3 deletions arch/x86/include/asm/idtentry.h
@@ -11,9 +11,6 @@

#include <asm/irq_stack.h>

bool idtentry_enter_nmi(struct pt_regs *regs);
void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);

/**
* DECLARE_IDTENTRY - Declare functions for simple IDT entry points
* No error code pushed by hardware
13 changes: 3 additions & 10 deletions arch/x86/include/asm/thread_info.h
@@ -55,6 +55,7 @@ struct task_struct;

struct thread_info {
unsigned long flags; /* low level flags */
unsigned long syscall_work; /* SYSCALL_WORK_ flags */
u32 status; /* thread synchronous flags */
};

@@ -74,15 +75,11 @@ struct thread_info {
* - these are process state flags that various assembly files
* may need to access
*/
#define TIF_SYSCALL_TRACE 0 /* syscall trace active */
#define TIF_NOTIFY_RESUME 1 /* callback before returning to user */
#define TIF_SIGPENDING 2 /* signal pending */
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
#define TIF_SSBD 5 /* Speculative store bypass disable */
#define TIF_SYSCALL_EMU 6 /* syscall emulation active */
#define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */
#define TIF_SECCOMP 8 /* secure computing */
#define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
#define TIF_SPEC_FORCE_UPDATE 10 /* Force speculation MSR update in context switch */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
@@ -91,25 +88,21 @@ struct thread_info {
#define TIF_NEED_FPU_LOAD 14 /* load FPU on return to userspace */
#define TIF_NOCPUID 15 /* CPUID is not accessible in userland */
#define TIF_NOTSC 16 /* TSC is not accessible in userland */
#define TIF_NOTIFY_SIGNAL 17 /* signal notifications exist */
#define TIF_SLD 18 /* Restore split lock detection on context switch */
#define TIF_MEMDIE 20 /* is terminating due to OOM killer */
#define TIF_POLLING_NRFLAG 21 /* idle is polling for TIF_NEED_RESCHED */
#define TIF_IO_BITMAP 22 /* uses I/O bitmap */
#define TIF_FORCED_TF 24 /* true if TF in eflags artificially */
#define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */
#define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
#define TIF_SYSCALL_TRACEPOINT 28 /* syscall tracepoint instrumentation */
#define TIF_ADDR32 29 /* 32-bit address space on 64 bits */

#define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
#define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
#define _TIF_SSBD (1 << TIF_SSBD)
#define _TIF_SYSCALL_EMU (1 << TIF_SYSCALL_EMU)
#define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
#define _TIF_SECCOMP (1 << TIF_SECCOMP)
#define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
#define _TIF_SPEC_FORCE_UPDATE (1 << TIF_SPEC_FORCE_UPDATE)
#define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY)
@@ -118,13 +111,13 @@ struct thread_info {
#define _TIF_NEED_FPU_LOAD (1 << TIF_NEED_FPU_LOAD)
#define _TIF_NOCPUID (1 << TIF_NOCPUID)
#define _TIF_NOTSC (1 << TIF_NOTSC)
#define _TIF_NOTIFY_SIGNAL (1 << TIF_NOTIFY_SIGNAL)
#define _TIF_SLD (1 << TIF_SLD)
#define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG)
#define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP)
#define _TIF_FORCED_TF (1 << TIF_FORCED_TF)
#define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
#define _TIF_LAZY_MMU_UPDATES (1 << TIF_LAZY_MMU_UPDATES)
#define _TIF_SYSCALL_TRACEPOINT (1 << TIF_SYSCALL_TRACEPOINT)
#define _TIF_ADDR32 (1 << TIF_ADDR32)

/* flags to check in __switch_to() */
2 changes: 2 additions & 0 deletions arch/x86/include/asm/vdso.h
@@ -29,6 +29,8 @@ struct vdso_image {
long sym___kernel_rt_sigreturn;
long sym___kernel_vsyscall;
long sym_int80_landing_pad;
long sym_vdso32_sigreturn_landing_pad;
long sym_vdso32_rt_sigreturn_landing_pad;
};

#ifdef CONFIG_X86_64
6 changes: 3 additions & 3 deletions arch/x86/kernel/cpu/mce/core.c
@@ -1974,7 +1974,7 @@ void (*machine_check_vector)(struct pt_regs *) = unexpected_machine_check;

static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
{
bool irq_state;
irqentry_state_t irq_state;

WARN_ON_ONCE(user_mode(regs));

@@ -1986,7 +1986,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
mce_check_crashing_cpu())
return;

irq_state = idtentry_enter_nmi(regs);
irq_state = irqentry_nmi_enter(regs);
/*
* The call targets are marked noinstr, but objtool can't figure
* that out because it's an indirect call. Annotate it.
@@ -1997,7 +1997,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
if (regs->flags & X86_EFLAGS_IF)
trace_hardirqs_on_prepare();
instrumentation_end();
idtentry_exit_nmi(regs, irq_state);
irqentry_nmi_exit(regs, irq_state);
}

static __always_inline void exc_machine_check_user(struct pt_regs *regs)
6 changes: 3 additions & 3 deletions arch/x86/kernel/nmi.c
@@ -475,7 +475,7 @@ static DEFINE_PER_CPU(unsigned long, nmi_dr7);

DEFINE_IDTENTRY_RAW(exc_nmi)
{
bool irq_state;
irqentry_state_t irq_state;

/*
* Re-enable NMIs right here when running as an SEV-ES guest. This might
@@ -502,14 +502,14 @@ DEFINE_IDTENTRY_RAW(exc_nmi)

this_cpu_write(nmi_dr7, local_db_save());

irq_state = idtentry_enter_nmi(regs);
irq_state = irqentry_nmi_enter(regs);

inc_irq_stat(__nmi_count);

if (!ignore_nmis)
default_do_nmi(regs);

idtentry_exit_nmi(regs, irq_state);
irqentry_nmi_exit(regs, irq_state);

local_db_restore(this_cpu_read(nmi_dr7));

4 changes: 2 additions & 2 deletions arch/x86/kernel/signal.c
@@ -804,11 +804,11 @@ static inline unsigned long get_nr_restart_syscall(const struct pt_regs *regs)
* want to handle. Thus you cannot kill init even with a SIGKILL even by
* mistake.
*/
void arch_do_signal(struct pt_regs *regs)
void arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal)
{
struct ksignal ksig;

if (get_signal(&ksig)) {
if (has_signal && get_signal(&ksig)) {
/* Whee! Actually deliver the signal. */
handle_signal(&ksig, regs);
return;