Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 asm updates from Ingo Molnar:

 - Introduce the ORC unwinder, which can be enabled via
   CONFIG_ORC_UNWINDER=y.

   The ORC unwinder is a lightweight, Linux-kernel-specific debuginfo
   implementation, which aims to be DWARF done right for unwinding.
   Objtool is used to generate the ORC unwinder tables during build, so
   the data format is flexible and kernel-internal: there's no
   dependency on debuginfo created by an external toolchain.

   The ORC unwinder is almost two orders of magnitude faster than the
   (out-of-tree) DWARF unwinder, which is important for perf call graph
   profiling. It is also significantly simpler and is coded defensively:
   there has not been a single ORC-related kernel crash so far, even
   with early versions. (knock on wood!)

   But the main advantage is that enabling the ORC unwinder allows
   CONFIG_FRAME_POINTER to be turned off, which speeds up the kernel
   measurably:

   With frame pointers disabled, GCC does not have to add frame pointer
   instrumentation code to every function in the kernel. The kernel's
   .text size decreases by about 3.2%, which improves cache utilization
   and reduces the number of instructions executed, yielding a broad
   kernel-wide speedup. Average speedup of system calls should be
   roughly in the 1-3% range; measurements by Mel Gorman [1] have shown
   a speedup of 5-10% for some function-execution-intensive workloads.

   The main cost of the unwinder is that the unwinder data has to be
   stored in RAM: the memory cost is 2-4MB of RAM, depending on kernel
   config - which is a modest cost on modern x86 systems.

   Given how young the ORC unwinder code is, it's not enabled by
   default, but given the performance advantages, the plan is to
   eventually make it the default unwinder on x86.

   See Documentation/x86/orc-unwinder.txt for more details.

 - Remove lguest support: it was intended as a temporary proof of
   concept for virtualization, and removing it also enables the
   reduction (or removal) of the paravirt API, so Rusty agreed to its
   removal. (Juergen Gross)

 - Clean up and fix FSGS related functionality (Andy Lutomirski)

 - Clean up IO access APIs (Andy Shevchenko)

 - Enhance the symbol namespace (Jiri Slaby)

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits)
  objtool: Handle GCC stack pointer adjustment bug
  x86/entry/64: Use ENTRY() instead of ALIGN+GLOBAL for stub32_clone()
  x86/fpu/math-emu: Add ENDPROC to functions
  x86/boot/64: Extract efi_pe_entry() from startup_64()
  x86/boot/32: Extract efi_pe_entry() from startup_32()
  x86/lguest: Remove lguest support
  x86/paravirt/xen: Remove xen_patch()
  objtool: Fix objtool fallthrough detection with function padding
  x86/xen/64: Fix the reported SS and CS in SYSCALL
  objtool: Track DRAP separately from callee-saved registers
  objtool: Fix validate_branch() return codes
  x86: Clarify/fix no-op barriers for text_poke_bp()
  x86/switch_to/64: Rewrite FS/GS switching yet again to fix AMD CPUs
  selftests/x86/fsgsbase: Test selectors 1, 2, and 3
  x86/fsgsbase/64: Report FSBASE and GSBASE correctly in core dumps
  x86/fsgsbase/64: Fully initialize FS and GS state in start_thread_common
  x86/asm: Fix UNWIND_HINT_REGS macro for older binutils
  x86/asm/32: Fix regs_get_register() on segment registers
  x86/xen/64: Rearrange the SYSCALL entries
  x86/asm/32: Remove a bunch of '& 0xffff' from pt_regs segment reads
  ...
torvalds committed Sep 4, 2017
2 parents f213a6c + dd88a0a commit b0c79f4
Showing 125 changed files with 3,325 additions and 11,368 deletions.
179 changes: 179 additions & 0 deletions Documentation/x86/orc-unwinder.txt
@@ -0,0 +1,179 @@
ORC unwinder
============

Overview
--------

The kernel CONFIG_ORC_UNWINDER option enables the ORC unwinder, which is
similar in concept to a DWARF unwinder. The difference is that the
format of the ORC data is much simpler than DWARF, which in turn allows
the ORC unwinder to be much simpler and faster.

The ORC data consists of unwind tables which are generated by objtool.
They contain out-of-band data which is used by the in-kernel ORC
unwinder. Objtool generates the ORC data by first doing compile-time
stack metadata validation (CONFIG_STACK_VALIDATION). After analyzing
all the code paths of a .o file, it determines information about the
stack state at each instruction address in the file and outputs that
information to the .orc_unwind and .orc_unwind_ip sections.

The per-object ORC sections are combined at link time and are sorted and
post-processed at boot time. The unwinder uses the resulting data to
correlate instruction addresses with their stack states at run time.
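To make the run-time correlation concrete, here is a minimal userspace sketch, assuming the combined .orc_unwind_ip data has already been turned into a sorted array of absolute instruction addresses (the in-kernel encoding differs). The function name orc_find_index is hypothetical. An instruction is described by the last entry whose start address is at or below it, which a binary search finds in O(log n).

```c
#include <assert.h>
#include <stddef.h>

/*
 * ips[] holds the start address of each ORC entry, sorted ascending.
 * The entry describing the instruction at 'addr' is the last one whose
 * start address is <= addr. Returns its index, or -1 if addr precedes
 * every entry.
 */
static long orc_find_index(const unsigned long *ips, size_t n,
			   unsigned long addr)
{
	size_t lo = 0, hi = n;	/* search window is [lo, hi) */

	assert(n == 0 || ips != NULL);

	/* Find the first index whose start address is > addr. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ips[mid] <= addr)
			lo = mid + 1;
		else
			hi = mid;
	}
	/* lo is now the number of entries starting at or below addr. */
	return lo == 0 ? -1 : (long)(lo - 1);
}
```

The same index is then used to read the parallel .orc_unwind array, which is what lets the searchable array stay small.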


ORC vs frame pointers
---------------------

With frame pointers enabled, GCC adds instrumentation code to every
function in the kernel. The kernel's .text size increases by about
3.2%, resulting in a broad kernel-wide slowdown. Measurements by Mel
Gorman [1] have shown a slowdown of 5-10% for some workloads.

In contrast, the ORC unwinder has no effect on text size or runtime
performance, because the debuginfo is out of band. So if you disable
frame pointers and enable the ORC unwinder, you get a nice performance
improvement across the board, and still have reliable stack traces.

Ingo Molnar says:

"Note that it's not just a performance improvement, but also an
instruction cache locality improvement: 3.2% .text savings almost
directly transform into a similarly sized reduction in cache
footprint. That can transform to even higher speedups for workloads
whose cache locality is borderline."

Another benefit of ORC compared to frame pointers is that it can
reliably unwind across interrupts and exceptions. Frame pointer based
unwinds can sometimes skip the caller of the interrupted function, if it
was a leaf function or if the interrupt hit before the frame pointer was
saved.
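The frame pointer chain, and the way it can skip a caller, can be modeled as a linked list. This is a hypothetical sketch (struct frame and fp_unwind are illustrative names), assuming the usual x86-64 prologue of push %rbp; mov %rsp, %rbp, which leaves the saved frame pointer at 0(%rbp) and the return address at 8(%rbp).

```c
#include <assert.h>
#include <stddef.h>

/* Memory layout at %rbp after a standard x86-64 prologue. */
struct frame {
	const struct frame *next;	/* caller's saved frame pointer */
	unsigned long ret_addr;		/* return address into the caller */
};

/* Follow the chain from 'fp', storing up to 'max' return addresses. */
static size_t fp_unwind(const struct frame *fp, unsigned long *out,
			size_t max)
{
	size_t n = 0;

	while (fp != NULL && n < max) {
		out[n++] = fp->ret_addr;
		fp = fp->next;
	}
	return n;
}
```

If an interrupt lands before the interrupted function's prologue has run (or in a leaf function that never sets up a frame), %rbp still points at the caller's frame. Starting the walk there yields the caller's own return address first, which points into the caller's caller, so the immediate caller is silently missing from the trace.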

The main disadvantage of the ORC unwinder compared to frame pointers is
that it needs more memory to store the ORC unwind tables: roughly 2-4MB
depending on the kernel config.


ORC vs DWARF
------------

ORC debuginfo's advantage over DWARF itself is that it's much simpler.
It gets rid of the complex DWARF CFI state machine and also gets rid of
the tracking of unnecessary registers. This allows the unwinder to be
much simpler, meaning fewer bugs, which is especially important for
mission critical oops code.

The simpler debuginfo format also enables the unwinder to be much faster
than DWARF, which is important for perf and lockdep. In a basic
performance test by Jiri Slaby [2], the ORC unwinder was about 20x
faster than an out-of-tree DWARF unwinder. (Note: That measurement was
taken before some performance tweaks were added, which doubled
performance, so the speedup over DWARF may be closer to 40x.)

The ORC data format does have a few downsides compared to DWARF. ORC
unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig kernel)
than DWARF-based eh_frame tables.

Another potential downside is that, as GCC evolves, it's conceivable
that the ORC data may end up being *too* simple to describe the state of
the stack for certain optimizations. But IMO this is unlikely because
GCC saves the frame pointer for any unusual stack adjustments it does,
so I suspect we'll really only ever need to keep track of the stack
pointer and the frame pointer between call frames. But even if we do
end up having to track all the registers DWARF tracks, at least we will
still be able to control the format, e.g. no complex state machines.


ORC unwind table generation
---------------------------

The ORC data is generated by objtool. With the existing compile-time
stack metadata validation feature, objtool already follows all code
paths, and so it already has all the information it needs to be able to
generate ORC data from scratch. So it's an easy step to go from stack
validation to ORC data generation.
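As a rough illustration of the kind of bookkeeping objtool does while following code paths, the sketch below tracks how far the stack pointer is from the CFA (the caller's stack pointer before the call) at each instruction of a straight-line function. The instruction model (enum insn_kind, struct insn, track_sp) is hypothetical and vastly simplified: real objtool decodes actual x86 instructions, follows every branch, and also tracks the frame pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified instruction kinds. */
enum insn_kind { INSN_PUSH, INSN_POP, INSN_SUB_SP, INSN_ADD_SP, INSN_OTHER };

struct insn {
	enum insn_kind kind;
	int imm;	/* byte count for INSN_SUB_SP / INSN_ADD_SP */
};

/*
 * For each instruction, record the CFA - SP distance as it is *before*
 * that instruction runs. On entry only the 8-byte return address has
 * been pushed, so the distance starts at 8.
 */
static void track_sp(const struct insn *insns, size_t n, int *cfa_off)
{
	int off = 8;

	assert(n == 0 || (insns != NULL && cfa_off != NULL));
	for (size_t i = 0; i < n; i++) {
		cfa_off[i] = off;
		switch (insns[i].kind) {
		case INSN_PUSH:   off += 8; break;
		case INSN_POP:    off -= 8; break;
		case INSN_SUB_SP: off += insns[i].imm; break;
		case INSN_ADD_SP: off -= insns[i].imm; break;
		default:          break;	/* no stack effect */
		}
	}
}
```

For a function that pushes %rbp and reserves 16 bytes of locals, the recorded offsets run 8, 16, 32 through the prologue and fall back to 8 at the ret. Offsets like these are what end up in the per-instruction ORC entries.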

It should be possible to instead generate the ORC data with a simple
tool which converts DWARF to ORC data. However, such a solution would
be incomplete due to the kernel's extensive use of asm, inline asm, and
special sections like exception tables.

That could be rectified by manually annotating those special code paths
using GNU assembler .cfi annotations in .S files, and homegrown
annotations for inline asm in .c files. But asm annotations were tried
in the past and were found to be unmaintainable. They were often
incorrect/incomplete and made the code harder to read and keep updated.
And based on looking at glibc code, annotating inline asm in .c files
might be even worse.

Objtool still needs a few annotations, but only in code which does
unusual things to the stack, such as entry code. And even then, far fewer
annotations are needed than what DWARF would need, so they're much more
maintainable than DWARF CFI annotations.

So the advantages of using objtool to generate ORC data are that it
gives more accurate debuginfo, with very few annotations. It also
insulates the kernel from toolchain bugs, which can be very painful to
deal with in the kernel, since we often have to work around issues in
older versions of the toolchain for years.

The downside is that the unwinder now becomes dependent on objtool's
ability to reverse engineer GCC code flow. If GCC optimizations become
too complicated for objtool to follow, the ORC data generation might
stop working or become incomplete. (It's worth noting that livepatch
already has such a dependency on objtool's ability to follow GCC code
flow.)

If newer versions of GCC come up with some optimizations which break
objtool, we may need to revisit the current implementation. Some
possible solutions would be asking GCC to make the optimizations more
palatable, or having objtool use DWARF as an additional input, or
creating a GCC plugin to assist objtool with its analysis. But for now,
objtool follows GCC code quite well.


Unwinder implementation details
-------------------------------

Objtool generates the ORC data by integrating with the compile-time
stack metadata validation feature, which is described in detail in
tools/objtool/Documentation/stack-validation.txt. After analyzing all
the code paths of a .o file, it creates an array of orc_entry structs,
and a parallel array of instruction addresses associated with those
structs, and writes them to the .orc_unwind and .orc_unwind_ip sections
respectively.
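One unwind step with such an entry can be sketched as follows. This uses a simplified, hypothetical entry format that only handles offsets relative to the stack pointer; the kernel's real struct orc_entry carries more information. It assumes x86-64, where the return address is the word just below the CFA and the caller's stack pointer equals the CFA.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, SP-based-only entry; not the kernel's exact layout. */
struct orc_entry_sketch {
	int16_t sp_offset;	/* CFA = SP + sp_offset */
	int16_t bp_offset;	/* caller's BP saved at CFA + bp_offset; 0 = not saved */
};

struct regs_sketch {
	uintptr_t ip, sp, bp;
};

/* Read one 64-bit word of stack memory. */
static uint64_t read_word(uintptr_t addr)
{
	uint64_t v;

	memcpy(&v, (const void *)addr, sizeof(v));
	return v;
}

/* Replace r with the caller's registers, using the entry covering r->ip. */
static void orc_step(const struct orc_entry_sketch *e, struct regs_sketch *r)
{
	uintptr_t cfa = r->sp + e->sp_offset;

	assert(e != NULL && r != NULL);
	r->ip = (uintptr_t)read_word(cfa - 8);	/* return address */
	if (e->bp_offset != 0)
		r->bp = (uintptr_t)read_word(cfa + e->bp_offset);
	r->sp = cfa;				/* caller's SP */
}
```

For example, right after push %rbp the entry would be sp_offset = 16, bp_offset = -16: the return address is at SP + 8 and the saved %rbp at SP. Repeating lookup-then-step produces the whole trace without ever relying on %rbp being a frame pointer.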

The ORC data is split into the two arrays for performance reasons, to
make the searchable part of the data (.orc_unwind_ip) more compact. The
arrays are sorted in parallel at boot time.
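Sorting the two arrays in parallel means applying the same permutation to both, so that entry i of the orc array keeps describing address i of the ip array. A minimal sketch, assuming absolute addresses and a hypothetical two-field entry; insertion sort is used only to keep it short, and any sort that moves both arrays together would do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified entry; the real layout differs. */
struct orc_entry_sketch {
	short sp_offset;
	short bp_offset;
};

/*
 * Sort ips[] ascending and move orc[] entries along with them, so that
 * orc[i] still describes the instruction at ips[i] afterwards.
 */
static void orc_sort_parallel(unsigned long *ips,
			      struct orc_entry_sketch *orc, size_t n)
{
	assert(n == 0 || (ips != NULL && orc != NULL));
	for (size_t i = 1; i < n; i++) {
		unsigned long ip = ips[i];
		struct orc_entry_sketch e = orc[i];
		size_t j = i;

		/* Shift larger addresses, and their entries, one slot right. */
		while (j > 0 && ips[j - 1] > ip) {
			ips[j] = ips[j - 1];
			orc[j] = orc[j - 1];
			j--;
		}
		ips[j] = ip;
		orc[j] = e;
	}
}
```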

Performance is further improved by the use of a fast lookup table which
is created at runtime. The fast lookup table associates a given address
with a range of indices for the .orc_unwind table, so that only a small
subset of the table needs to be searched.
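The fast lookup table can be sketched as one slot per fixed-size block of text, holding the index of the last entry at or before the block's start. Any address in block b is then covered by an entry whose index lies between lookup[b] and lookup[b + 1] inclusive, so only that slice needs searching. The block size, the names, and the assumption of at least one entry with addresses relative to the start of text are hypothetical choices for this sketch, not the kernel's actual parameters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block size in bytes; addresses are offsets from text start. */
#define LOOKUP_BLOCK_SIZE 64UL

/*
 * Fill lookup[0..nblocks] so that lookup[b] is the index of the last
 * entry whose ip is <= b * LOOKUP_BLOCK_SIZE (0 if there is none).
 * ips[] must be sorted ascending and hold at least one entry.
 */
static void build_lookup(const unsigned long *ips, size_t n,
			 size_t *lookup, size_t nblocks)
{
	size_t idx = 0;

	assert(n > 0);
	for (size_t b = 0; b <= nblocks; b++) {
		unsigned long start = b * LOOKUP_BLOCK_SIZE;

		while (idx + 1 < n && ips[idx + 1] <= start)
			idx++;
		lookup[b] = idx;
	}
}

/* Return the index of the entry covering addr, or -1 if there is none. */
static long fast_find(const unsigned long *ips, const size_t *lookup,
		      size_t nblocks, unsigned long addr)
{
	size_t b = addr / LOOKUP_BLOCK_SIZE;
	size_t lo, hi;

	if (b >= nblocks)
		return -1;
	lo = lookup[b];
	hi = lookup[b + 1];	/* inclusive upper bound of the slice */
	if (ips[lo] > addr)
		return -1;	/* addr precedes the first entry */
	/* Scan only the small slice lo..hi for the last ip <= addr. */
	while (lo < hi && ips[lo + 1] <= addr)
		lo++;
	return (long)lo;
}
```

The table costs one slot per block of text, in exchange for replacing a search over the whole table with a search over a slice that is typically only a handful of entries long.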


Etymology
---------

Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural
enemies. Similarly, the ORC unwinder was created in opposition to the
complexity and slowness of DWARF.

"Although Orcs rarely consider multiple solutions to a problem, they do
excel at getting things done because they are creatures of action, not
thought." [3] Similarly, unlike the esoteric DWARF unwinder, the
veracious ORC unwinder wastes no time or siloconic effort decoding
variable-length zero-extended unsigned-integer byte-coded
state-machine-based debug information entries.

Similar to how Orcs frequently unravel the well-intentioned plans of
their adversaries, the ORC unwinder frequently unravels stacks with
brutal, unyielding efficiency.

ORC stands for Oops Rewind Capability.


[1] https://lkml.kernel.org/r/[email protected]
[2] https://lkml.kernel.org/r/[email protected]
[3] http://dustin.wikidot.com/half-orcs-and-orcs
11 changes: 0 additions & 11 deletions MAINTAINERS
@@ -7660,17 +7660,6 @@ T: git git://linuxtv.org/mkrufky/tuners.git
S: Maintained
F: drivers/media/dvb-frontends/lgdt3305.*

LGUEST
M: Rusty Russell <[email protected]>
L: [email protected]
W: http://lguest.ozlabs.org/
S: Odd Fixes
F: arch/x86/include/asm/lguest*.h
F: arch/x86/lguest/
F: drivers/lguest/
F: include/linux/lguest*.h
F: tools/lguest/

LIBATA PATA ARASAN COMPACT FLASH CONTROLLER
M: Viresh Kumar <[email protected]>
L: [email protected]
8 changes: 8 additions & 0 deletions arch/um/include/asm/unwind.h
@@ -0,0 +1,8 @@
#ifndef _ASM_UML_UNWIND_H
#define _ASM_UML_UNWIND_H

static inline void
unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
                   void *orc, size_t orc_size) {}

#endif /* _ASM_UML_UNWIND_H */
3 changes: 0 additions & 3 deletions arch/x86/Kbuild
@@ -10,9 +10,6 @@ obj-$(CONFIG_XEN) += xen/
# Hyper-V paravirtualization support
obj-$(CONFIG_HYPERVISOR_GUEST) += hyperv/

# lguest paravirtualization support
obj-$(CONFIG_LGUEST_GUEST) += lguest/

obj-y += realmode/
obj-y += kernel/
obj-y += mm/
6 changes: 2 additions & 4 deletions arch/x86/Kconfig
@@ -73,7 +73,6 @@ config X86
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANTS_DYNAMIC_TASK_STRUCT
select ARCH_WANTS_THP_SWAP if X86_64
select BUILDTIME_EXTABLE_SORT
@@ -158,6 +157,7 @@ config X86
select HAVE_MEMBLOCK
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_MIXED_BREAKPOINTS_REGS
select HAVE_MOD_ARCH_SPECIFIC
select HAVE_NMI
select HAVE_OPROFILE
select HAVE_OPTPROBES
@@ -168,7 +168,7 @@ config X86
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if X86_64 && FRAME_POINTER && STACK_VALIDATION
select HAVE_RELIABLE_STACKTRACE if X86_64 && FRAME_POINTER_UNWINDER && STACK_VALIDATION
select HAVE_STACK_VALIDATION if X86_64
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UNSTABLE_SCHED_CLOCK
@@ -778,8 +778,6 @@ config KVM_DEBUG_FS
Statistics are displayed in debugfs filesystem. Enabling this option
may incur significant overhead.

source "arch/x86/lguest/Kconfig"

config PARAVIRT_TIME_ACCOUNTING
bool "Paravirtual steal time accounting"
depends on PARAVIRT
59 changes: 57 additions & 2 deletions arch/x86/Kconfig.debug
@@ -305,8 +305,6 @@ config DEBUG_ENTRY
Some of these sanity checks may slow down kernel entries and
exits or otherwise impact performance.

This is currently used to help test NMI code.

If unsure, say N.

config DEBUG_NMI_SELFTEST
@@ -358,4 +356,61 @@ config PUNIT_ATOM_DEBUG
The current power state can be read from
/sys/kernel/debug/punit_atom/dev_power_state

choice
prompt "Choose kernel unwinder"
default FRAME_POINTER_UNWINDER
---help---
This determines which method will be used for unwinding kernel stack
traces for panics, oopses, bugs, warnings, perf, /proc/<pid>/stack,
livepatch, lockdep, and more.

config FRAME_POINTER_UNWINDER
bool "Frame pointer unwinder"
select FRAME_POINTER
---help---
This option enables the frame pointer unwinder for unwinding kernel
stack traces.

The unwinder itself is fast and it uses less RAM than the ORC
unwinder, but the kernel text size will grow by ~3% and the kernel's
overall performance will degrade by roughly 5-10%.

This option is recommended if you want to use the livepatch
consistency model, as this is currently the only way to get a
reliable stack trace (CONFIG_HAVE_RELIABLE_STACKTRACE).

config ORC_UNWINDER
bool "ORC unwinder"
depends on X86_64
select STACK_VALIDATION
---help---
This option enables the ORC (Oops Rewind Capability) unwinder for
unwinding kernel stack traces. It uses a custom data format which is
a simplified version of the DWARF Call Frame Information standard.

This unwinder is more accurate across interrupt entry frames than the
frame pointer unwinder. It also enables a 5-10% performance
improvement across the entire kernel compared to frame pointers.

Enabling this option will increase the kernel's runtime memory usage
by roughly 2-4MB, depending on your kernel config.

config GUESS_UNWINDER
bool "Guess unwinder"
depends on EXPERT
---help---
This option enables the "guess" unwinder for unwinding kernel stack
traces. It scans the stack and reports every kernel text address it
finds. Some of the addresses it reports may be incorrect.

While this option often produces false positives, it can still be
useful in many cases. Unlike the other unwinders, it has no runtime
overhead.

endchoice

config FRAME_POINTER
depends on !ORC_UNWINDER && !GUESS_UNWINDER
bool

endmenu
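For comparison, the core of a guess unwinder, as described in the GUESS_UNWINDER help text above, is just a filtered scan of the stack. This hypothetical sketch (guess_unwind is an illustrative name) treats any word inside [text_start, text_end) as a return address, which is exactly why stale values and stored function pointers show up as false positives.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Report every stack word that looks like a kernel text address, in
 * stack order, storing at most 'max' of them in out[].
 */
static size_t guess_unwind(const unsigned long *stack, size_t nwords,
			   unsigned long text_start, unsigned long text_end,
			   unsigned long *out, size_t max)
{
	size_t n = 0;

	assert(text_start <= text_end);
	for (size_t i = 0; i < nwords && n < max; i++) {
		/* No frame metadata is consulted: a text address is a "hit". */
		if (stack[i] >= text_start && stack[i] < text_end)
			out[n++] = stack[i];
	}
	return n;
}
```

Because it needs no frame pointers and no unwind tables, this approach costs nothing until a trace is actually requested, at the price of reporting addresses that may not be real callers.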