Commit

Merge tag 'arm64-mmiowb' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull mmiowb removal from Will Deacon:
 "Remove Mysterious Macro Intended to Obscure Weird Behaviours (mmiowb())

  Remove mmiowb() from the kernel memory barrier API and instead, for
  architectures that need it, hide the barrier inside spin_unlock() when
  MMIO has been performed inside the critical section.

  The only relatively recent changes have been addressing review
  comments on the documentation, which is in a much better shape thanks
  to the efforts of Ben and Ingo.

  I was initially planning to split this into two pull requests so that
  you could run the coccinelle script yourself, however it's been plain
  sailing in linux-next so I've just included the whole lot here to keep
  things simple"
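
Editorial sketch (not part of the pull message): the idea of hiding the
barrier inside spin_unlock() can be modelled in user space roughly as below.
This is a simplified illustration under stated assumptions, not the kernel's
implementation. The struct and the model_*() helpers are hypothetical names
invented for this sketch, and the "barrier" is only a counter increment
standing in for whatever instruction an architecture would need.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-CPU bookkeeping; all names here are illustrative only. */
struct mmiowb_model {
	uint16_t nesting_count;  /* spinlocks currently held on this CPU */
	uint16_t mmiowb_pending; /* non-zero if MMIO was written under a lock */
	unsigned int barriers;   /* number of barriers the model has issued */
};

/* Models spin_lock(): enter a (possibly nested) critical section. */
static inline void model_spin_lock(struct mmiowb_model *ms)
{
	ms->nesting_count++;
}

/* Models an MMIO write accessor such as writel(). */
static inline void model_mmio_write(struct mmiowb_model *ms)
{
	/* A barrier becomes pending only if at least one lock is held. */
	ms->mmiowb_pending = ms->nesting_count;
}

/* Models spin_unlock(): issue the barrier only if MMIO happened inside. */
static inline void model_spin_unlock(struct mmiowb_model *ms)
{
	assert(ms->nesting_count > 0);

	if (ms->mmiowb_pending) {
		ms->mmiowb_pending = 0;
		ms->barriers++; /* stands in for the architecture's barrier */
	}

	ms->nesting_count--;
}
```

In this model a lock/unlock pair with no MMIO write in between issues no
barrier, while a pair that wraps model_mmio_write() issues exactly one, which
is why callers no longer need to place a barrier by hand before unlocking.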

* tag 'arm64-mmiowb' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (23 commits)
  docs/memory-barriers.txt: Update I/O section to be clearer about CPU vs thread
  docs/memory-barriers.txt: Fix style, spacing and grammar in I/O section
  arch: Remove dummy mmiowb() definitions from arch code
  net/ethernet/silan/sc92031: Remove stale comment about mmiowb()
  i40iw: Redefine i40iw_mmiowb() to do nothing
  scsi/qla1280: Remove stale comment about mmiowb()
  drivers: Remove explicit invocations of mmiowb()
  drivers: Remove useless trailing comments from mmiowb() invocations
  Documentation: Kill all references to mmiowb()
  riscv/mmiowb: Hook up mmwiob() implementation to asm-generic code
  powerpc/mmiowb: Hook up mmwiob() implementation to asm-generic code
  ia64/mmiowb: Add unconditional mmiowb() to arch_spin_unlock()
  mips/mmiowb: Add unconditional mmiowb() to arch_spin_unlock()
  sh/mmiowb: Add unconditional mmiowb() to arch_spin_unlock()
  m68k/io: Remove useless definition of mmiowb()
  nds32/io: Remove useless definition of mmiowb()
  x86/io: Remove useless definition of mmiowb()
  arm64/io: Remove useless definition of mmiowb()
  ARM/io: Remove useless definition of mmiowb()
  mmiowb: Hook up mmiowb helpers to spinlocks and generic I/O accessors
  ...
torvalds committed May 6, 2019
2 parents 14be4c6 + 9726840 commit dd4e5d6
Showing 181 changed files with 343 additions and 828 deletions.
45 changes: 0 additions & 45 deletions Documentation/driver-api/device-io.rst
@@ -103,51 +103,6 @@ continuing execution::
ha->flags.ints_enabled = 0;
}

In addition to write posting, on some large multiprocessing systems
(e.g. SGI Challenge, Origin and Altix machines) posted writes won't be
strongly ordered coming from different CPUs. Thus it's important to
properly protect parts of your driver that do memory-mapped writes with
locks and use the :c:func:`mmiowb()` to make sure they arrive in the
order intended. Issuing a regular readX() will also ensure write ordering,
but should only be used when the
driver has to be sure that the write has actually arrived at the device
(not that it's simply ordered with respect to other writes), since a
full readX() is a relatively expensive operation.

Generally, one should use :c:func:`mmiowb()` prior to releasing a spinlock
that protects regions using :c:func:`writeb()` or similar functions that
aren't surrounded by readb() calls, which will ensure ordering
and flushing. The following pseudocode illustrates what might occur if
write ordering isn't guaranteed via :c:func:`mmiowb()` or one of the
readX() functions::

CPU A: spin_lock_irqsave(&dev_lock, flags)
CPU A: ...
CPU A: writel(newval, ring_ptr);
CPU A: spin_unlock_irqrestore(&dev_lock, flags)
...
CPU B: spin_lock_irqsave(&dev_lock, flags)
CPU B: writel(newval2, ring_ptr);
CPU B: ...
CPU B: spin_unlock_irqrestore(&dev_lock, flags)

In the case above, newval2 could be written to ring_ptr before newval.
Fixing it is easy though::

CPU A: spin_lock_irqsave(&dev_lock, flags)
CPU A: ...
CPU A: writel(newval, ring_ptr);
CPU A: mmiowb(); /* ensure no other writes beat us to the device */
CPU A: spin_unlock_irqrestore(&dev_lock, flags)
...
CPU B: spin_lock_irqsave(&dev_lock, flags)
CPU B: writel(newval2, ring_ptr);
CPU B: ...
CPU B: mmiowb();
CPU B: spin_unlock_irqrestore(&dev_lock, flags)

See tg3.c for a real world example of how to use :c:func:`mmiowb()`

PCI ordering rules also guarantee that PIO read responses arrive after any
outstanding DMA writes from that bus, since for some devices the result of
a readb() call may signal to the driver that a DMA transaction is
4 changes: 0 additions & 4 deletions Documentation/driver-api/pci/p2pdma.rst
@@ -132,10 +132,6 @@ precludes passing these pages to userspace.
P2P memory is also technically IO memory but should never have any side
effects behind it. Thus, the order of loads and stores should not be important
and ioreadX(), iowriteX() and friends should not be necessary.
However, as the memory is not cache coherent, if access ever needs to
be protected by a spinlock then :c:func:`mmiowb()` must be used before
unlocking the lock. (See ACQUIRES VS I/O ACCESSES in
Documentation/memory-barriers.txt)


P2P DMA Support Library
249 changes: 100 additions & 149 deletions Documentation/memory-barriers.txt
@@ -1937,21 +1937,6 @@ There are some more advanced barrier functions:
information on consistent memory.


MMIO WRITE BARRIER
------------------

The Linux kernel also has a special barrier for use with memory-mapped I/O
writes:

mmiowb();

This is a variation on the mandatory write barrier that causes writes to weakly
ordered I/O regions to be partially ordered. Its effects may go beyond the
CPU->Hardware interface and actually affect the hardware at some level.

See the subsection "Acquires vs I/O accesses" for more information.


===============================
IMPLICIT KERNEL MEMORY BARRIERS
===============================
@@ -2317,75 +2302,6 @@ But it won't see any of:
*E, *F or *G following RELEASE Q



ACQUIRES VS I/O ACCESSES
------------------------

Under certain circumstances (especially involving NUMA), I/O accesses within
two spinlocked sections on two different CPUs may be seen as interleaved by the
PCI bridge, because the PCI bridge does not necessarily participate in the
cache-coherence protocol, and is therefore incapable of issuing the required
read memory barriers.

For example:

CPU 1 CPU 2
=============================== ===============================
spin_lock(Q)
writel(0, ADDR)
writel(1, DATA);
spin_unlock(Q);
spin_lock(Q);
writel(4, ADDR);
writel(5, DATA);
spin_unlock(Q);

may be seen by the PCI bridge as follows:

STORE *ADDR = 0, STORE *ADDR = 4, STORE *DATA = 1, STORE *DATA = 5

which would probably cause the hardware to malfunction.


What is necessary here is to intervene with an mmiowb() before dropping the
spinlock, for example:

CPU 1 CPU 2
=============================== ===============================
spin_lock(Q)
writel(0, ADDR)
writel(1, DATA);
mmiowb();
spin_unlock(Q);
spin_lock(Q);
writel(4, ADDR);
writel(5, DATA);
mmiowb();
spin_unlock(Q);

this will ensure that the two stores issued on CPU 1 appear at the PCI bridge
before either of the stores issued on CPU 2.


Furthermore, following a store by a load from the same device obviates the need
for the mmiowb(), because the load forces the store to complete before the load
is performed:

CPU 1 CPU 2
=============================== ===============================
spin_lock(Q)
writel(0, ADDR)
a = readl(DATA);
spin_unlock(Q);
spin_lock(Q);
writel(4, ADDR);
b = readl(DATA);
spin_unlock(Q);


See Documentation/driver-api/device-io.rst for more information.


=================================
WHERE ARE MEMORY BARRIERS NEEDED?
=================================
@@ -2532,16 +2448,9 @@ the device to malfunction.
Inside of the Linux kernel, I/O should be done through the appropriate accessor
routines - such as inb() or writel() - which know how to make such accesses
appropriately sequential. While this, for the most part, renders the explicit
use of memory barriers unnecessary, there are a couple of situations where they
might be needed:

(1) On some systems, I/O stores are not strongly ordered across all CPUs, and
so for _all_ general drivers locks should be used and mmiowb() must be
issued prior to unlocking the critical section.

(2) If the accessor functions are used to refer to an I/O memory window with
relaxed memory access properties, then _mandatory_ memory barriers are
required to enforce ordering.
use of memory barriers unnecessary, if the accessor functions are used to refer
to an I/O memory window with relaxed memory access properties, then _mandatory_
memory barriers are required to enforce ordering.

See Documentation/driver-api/device-io.rst for more information.

@@ -2586,8 +2495,7 @@ explicit barriers are used.

Normally this won't be a problem because the I/O accesses done inside such
sections will include synchronous load operations on strictly ordered I/O
registers that form implicit I/O barriers. If this isn't sufficient then an
mmiowb() may need to be used explicitly.
registers that form implicit I/O barriers.


A similar situation may occur between an interrupt routine and two routines
@@ -2599,71 +2507,114 @@ likely, then interrupt-disabling locks should be used to guarantee ordering.
KERNEL I/O BARRIER EFFECTS
==========================

When accessing I/O memory, drivers should use the appropriate accessor
functions:

(*) inX(), outX():

These are intended to talk to I/O space rather than memory space, but
that's primarily a CPU-specific concept. The i386 and x86_64 processors
do indeed have special I/O space access cycles and instructions, but many
CPUs don't have such a concept.

The PCI bus, amongst others, defines an I/O space concept which - on such
CPUs as i386 and x86_64 - readily maps to the CPU's concept of I/O
space. However, it may also be mapped as a virtual I/O space in the CPU's
memory map, particularly on those CPUs that don't support alternate I/O
spaces.

Accesses to this space may be fully synchronous (as on i386), but
intermediary bridges (such as the PCI host bridge) may not fully honour
that.

They are guaranteed to be fully ordered with respect to each other.

They are not guaranteed to be fully ordered with respect to other types of
memory and I/O operation.
Interfacing with peripherals via I/O accesses is deeply architecture and device
specific. Therefore, drivers which are inherently non-portable may rely on
specific behaviours of their target systems in order to achieve synchronization
in the most lightweight manner possible. For drivers intending to be portable
between multiple architectures and bus implementations, the kernel offers a
series of accessor functions that provide various degrees of ordering
guarantees:

(*) readX(), writeX():

Whether these are guaranteed to be fully ordered and uncombined with
respect to each other on the issuing CPU depends on the characteristics
defined for the memory window through which they're accessing. On later
i386 architecture machines, for example, this is controlled by way of the
MTRR registers.
The readX() and writeX() MMIO accessors take a pointer to the
peripheral being accessed as an __iomem * parameter. For pointers
mapped with the default I/O attributes (e.g. those returned by
ioremap()), the ordering guarantees are as follows:

1. All readX() and writeX() accesses to the same peripheral are ordered
with respect to each other. This ensures that MMIO register accesses
by the same CPU thread to a particular device will arrive in program
order.

2. A writeX() issued by a CPU thread holding a spinlock is ordered
before a writeX() to the same peripheral from another CPU thread
issued after a later acquisition of the same spinlock. This ensures
that MMIO register writes to a particular device issued while holding
a spinlock will arrive in an order consistent with acquisitions of
the lock.

3. A writeX() by a CPU thread to the peripheral will first wait for the
completion of all prior writes to memory either issued by, or
propagated to, the same thread. This ensures that writes by the CPU
to an outbound DMA buffer allocated by dma_alloc_coherent() will be
visible to a DMA engine when the CPU writes to its MMIO control
register to trigger the transfer.

4. A readX() by a CPU thread from the peripheral will complete before
any subsequent reads from memory by the same thread can begin. This
ensures that reads by the CPU from an incoming DMA buffer allocated
by dma_alloc_coherent() will not see stale data after reading from
the DMA engine's MMIO status register to establish that the DMA
transfer has completed.

5. A readX() by a CPU thread from the peripheral will complete before
any subsequent delay() loop can begin execution on the same thread.
This ensures that two MMIO register writes by the CPU to a peripheral
will arrive at least 1us apart if the first write is immediately read
back with readX() and udelay(1) is called prior to the second
writeX():

writel(42, DEVICE_REGISTER_0); // Arrives at the device...
readl(DEVICE_REGISTER_0);
udelay(1);
writel(42, DEVICE_REGISTER_1); // ...at least 1us before this.

The ordering properties of __iomem pointers obtained with non-default
attributes (e.g. those returned by ioremap_wc()) are specific to the
underlying architecture and therefore the guarantees listed above cannot
generally be relied upon for accesses to these types of mappings.

(*) readX_relaxed(), writeX_relaxed():

These are similar to readX() and writeX(), but provide weaker memory
ordering guarantees. Specifically, they do not guarantee ordering with
respect to locking, normal memory accesses or delay() loops (i.e.
bullets 2-5 above) but they are still guaranteed to be ordered with
respect to other accesses from the same CPU thread to the same
peripheral when operating on __iomem pointers mapped with the default
I/O attributes.

(*) readsX(), writesX():

The readsX() and writesX() MMIO accessors are designed for accessing
register-based, memory-mapped FIFOs residing on peripherals that are not
capable of performing DMA. Consequently, they provide only the ordering
guarantees of readX_relaxed() and writeX_relaxed(), as documented above.

Ordinarily, these will be guaranteed to be fully ordered and uncombined,
provided they're not accessing a prefetchable device.
(*) inX(), outX():

However, intermediary hardware (such as a PCI bridge) may indulge in
deferral if it so wishes; to flush a store, a load from the same location
is preferred[*], but a load from the same device or from configuration
space should suffice for PCI.
The inX() and outX() accessors are intended to access legacy port-mapped
I/O peripherals, which may require special instructions on some
architectures (notably x86). The port number of the peripheral being
accessed is passed as an argument.

[*] NOTE! attempting to load from the same location as was written to may
cause a malfunction - consider the 16550 Rx/Tx serial registers for
example.
Since many CPU architectures ultimately access these peripherals via an
internal virtual memory mapping, the portable ordering guarantees
provided by inX() and outX() are the same as those provided by readX()
and writeX() respectively when accessing a mapping with the default I/O
attributes.

Used with prefetchable I/O memory, an mmiowb() barrier may be required to
force stores to be ordered.
Device drivers may expect outX() to emit a non-posted write transaction
that waits for a completion response from the I/O peripheral before
returning. This is not guaranteed by all architectures and is therefore
not part of the portable ordering semantics.

Please refer to the PCI specification for more information on interactions
between PCI transactions.
(*) insX(), outsX():

(*) readX_relaxed(), writeX_relaxed()
As above, the insX() and outsX() accessors provide the same ordering
guarantees as readsX() and writesX() respectively when accessing a
mapping with the default I/O attributes.

These are similar to readX() and writeX(), but provide weaker memory
ordering guarantees. Specifically, they do not guarantee ordering with
respect to normal memory accesses (e.g. DMA buffers) nor do they guarantee
ordering with respect to LOCK or UNLOCK operations. If the latter is
required, an mmiowb() barrier can be used. Note that relaxed accesses to
the same peripheral are guaranteed to be ordered with respect to each
other.
(*) ioreadX(), iowriteX():

(*) ioreadX(), iowriteX()
These will perform appropriately for the type of access they're actually
doing, be it inX()/outX() or readX()/writeX().

These will perform appropriately for the type of access they're actually
doing, be it inX()/outX() or readX()/writeX().
With the exception of the string accessors (insX(), outsX(), readsX() and
writesX()), all of the above assume that the underlying peripheral is
little-endian and will therefore perform byte-swapping operations on big-endian
architectures.


========================================
1 change: 1 addition & 0 deletions arch/alpha/include/asm/Kbuild
@@ -9,6 +9,7 @@ generic-y += irq_work.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += mm-arch-hooks.h
generic-y += mmiowb.h
generic-y += preempt.h
generic-y += sections.h
generic-y += trace_clock.h
2 changes: 0 additions & 2 deletions arch/alpha/include/asm/io.h
@@ -513,8 +513,6 @@ extern inline void writeq(u64 b, volatile void __iomem *addr)
#define writel_relaxed(b, addr) __raw_writel(b, addr)
#define writeq_relaxed(b, addr) __raw_writeq(b, addr)

#define mmiowb()

/*
* String version of IO memory access ops:
*/
1 change: 1 addition & 0 deletions arch/arc/include/asm/Kbuild
@@ -16,6 +16,7 @@ generic-y += local.h
generic-y += local64.h
generic-y += mcs_spinlock.h
generic-y += mm-arch-hooks.h
generic-y += mmiowb.h
generic-y += msi.h
generic-y += parport.h
generic-y += percpu.h
1 change: 1 addition & 0 deletions arch/arm/include/asm/Kbuild
@@ -9,6 +9,7 @@ generic-y += kdebug.h
generic-y += local.h
generic-y += local64.h
generic-y += mm-arch-hooks.h
generic-y += mmiowb.h
generic-y += msi.h
generic-y += parport.h
generic-y += preempt.h