Merge tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/sc…
…m/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Some swap cleanups from Ma Wupeng ("fix WARN_ON in
   add_to_avail_list")

 - Peter Xu has a series ("mm/gup: Unify hugetlb, speed up thp") which
   reduces the special-case code for handling hugetlb pages in GUP. It
   also speeds up GUP handling of transparent hugepages.

 - Peng Zhang provides some maple tree speedups ("Optimize the fast path
   of mas_store()").

 - Sergey Senozhatsky has improved the performance of zsmalloc during
   compaction ("zsmalloc: small compaction improvements").

 - Domenico Cerasuolo has developed additional selftest code for zswap
   ("selftests: cgroup: add zswap test program").

 - xu xin has done some work on KSM's handling of zero pages. These
   changes are mainly to enable the user to better understand the
   effectiveness of KSM's treatment of zero pages ("ksm: support
   tracking KSM-placed zero-pages").

 - Jeff Xu has fixed the behaviour of memfd's
   MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
   MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").

 - David Howells has fixed an fscache optimization ("mm, netfs, fscache:
   Stop read optimisation when folio removed from pagecache").

 - Axel Rasmussen has given userfaultfd the ability to simulate memory
   poisoning ("add UFFDIO_POISON to simulate memory poisoning with
   UFFD").

 - Miaohe Lin has contributed some routine maintenance work on the
   memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
   check").

 - Peng Zhang has contributed some maintenance work on the maple tree
   code ("Improve the validation for maple tree and some cleanup").

 - Hugh Dickins has optimized the collapsing of shmem or file pages into
   THPs ("mm: free retracted page table by RCU").

 - Jiaqi Yan has a patch series which permits us to use the healthy
   subpages within a hardware poisoned huge page for general purposes
   ("Improve hugetlbfs read on HWPOISON hugepages").

 - Kemeng Shi has done some maintenance work on the pagetable-check code
   ("Remove unused parameters in page_table_check").

 - More folioification work from Matthew Wilcox ("More filesystem folio
   conversions for 6.6"), ("Followup folio conversions for zswap"). And
   from ZhangPeng ("Convert several functions in page_io.c to use a
   folio").

 - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").

 - Baoquan He has converted some architectures to use the
   GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert
   architectures to take GENERIC_IOREMAP way").

 - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
   batched/deferred tlb shootdown during page reclamation/migration").

 - Better maple tree lockdep checking from Liam Howlett ("More strict
   maple tree lockdep"). Liam also developed some efficiency
   improvements ("Reduce preallocations for maple tree").

 - Cleanup and optimization to the secondary IOMMU TLB invalidation,
   from Alistair Popple ("Invalidate secondary IOMMU TLB on permission
   upgrade").

 - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
   for arm64").

 - Kemeng Shi provides some maintenance work on the compaction code
   ("Two minor cleanups for compaction").

 - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle
   most file-backed faults under the VMA lock").

 - Aneesh Kumar contributes code to use the vmemmap optimization for DAX
   on ppc64, under some circumstances ("Add support for DAX vmemmap
   optimization for ppc64").

 - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
   data in page_ext"), ("minor cleanups to page_ext header").

 - Some zswap cleanups from Johannes Weiner ("mm: zswap: three
   cleanups").

 - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").

 - VMA handling cleanups from Kefeng Wang ("mm: convert to
   vma_is_initial_heap/stack()").

 - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
   implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
   address ranges and DAMON monitoring targets").

 - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").

 - Liam Howlett has improved the maple tree node replacement code
   ("maple_tree: Change replacement strategy").

 - ZhangPeng has a general code cleanup - use the K() macro more widely
   ("cleanup with helper macro K()").

 - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for
   memmap on memory feature on ppc64").

 - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
   in page_alloc"), ("Two minor cleanups for get pageblock
   migratetype").

 - Vishal Moola introduces a memory descriptor for page table tracking,
   "struct ptdesc" ("Split ptdesc from struct page").

 - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
   for vm.memfd_noexec").

 - MM include file rationalization from Hugh Dickins ("arch: include
   asm/cacheflush.h in asm/hugetlb.h").

 - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
   output").

 - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
   object_cache instead of kmemleak_initialized").

 - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
   and _folio_order").

 - A VMA locking scalability improvement from Suren Baghdasaryan
   ("Per-VMA lock support for swap and userfaults").

 - pagetable handling cleanups from Matthew Wilcox ("New page table
   range API").

 - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
   using page->private on tail pages for THP_SWAP + cleanups").

 - Cleanups and speedups to the hugetlb fault handling from Matthew
   Wilcox ("Change calling convention for ->huge_fault").

 - Matthew Wilcox has also done some maintenance work on the MM
   subsystem documentation ("Improve mm documentation").

* tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits)
  maple_tree: shrink struct maple_tree
  maple_tree: clean up mas_wr_append()
  secretmem: convert page_is_secretmem() to folio_is_secretmem()
  nios2: fix flush_dcache_page() for usage from irq context
  hugetlb: add documentation for vma_kernel_pagesize()
  mm: add orphaned kernel-doc to the rst files.
  mm: fix clean_record_shared_mapping_range kernel-doc
  mm: fix get_mctgt_type() kernel-doc
  mm: fix kernel-doc warning from tlb_flush_rmaps()
  mm: remove enum page_entry_size
  mm: allow ->huge_fault() to be called without the mmap_lock held
  mm: move PMD_ORDER to pgtable.h
  mm: remove checks for pte_index
  memcg: remove duplication detection for mem_cgroup_uncharge_swap
  mm/huge_memory: work on folio->swap instead of page->private when splitting folio
  mm/swap: inline folio_set_swap_entry() and folio_swap_entry()
  mm/swap: use dedicated entry for swap in folio
  mm/swap: stop using page->private on tail pages for THP_SWAP
  selftests/mm: fix WARNING comparing pointer to 0
  selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check
  ...
torvalds committed Aug 29, 2023
2 parents 651a00b + 52ae298 commit b96a3e9
Showing 471 changed files with 9,558 additions and 7,074 deletions.
40 changes: 36 additions & 4 deletions Documentation/ABI/testing/sysfs-kernel-mm-damon
@@ -29,8 +29,10 @@ Description: Writing 'on' or 'off' to this file makes the kdamond starts or
file updates contents of schemes stats files of the kdamond.
Writing 'update_schemes_tried_regions' to the file updates
contents of 'tried_regions' directory of every scheme directory
of this kdamond. Writing 'clear_schemes_tried_regions' to the
file removes contents of the 'tried_regions' directory.
of this kdamond. Writing 'update_schemes_tried_bytes' to the
file updates only '.../tried_regions/total_bytes' files of this
kdamond. Writing 'clear_schemes_tried_regions' to the file
removes contents of the 'tried_regions' directory.

What: /sys/kernel/mm/damon/admin/kdamonds/<K>/pid
Date: Mar 2022
@@ -269,8 +271,10 @@ What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/
Date: Dec 2022
Contact: SeongJae Park <[email protected]>
Description: Writing to and reading from this file sets and gets the type of
the memory of the interest. 'anon' for anonymous pages, or
'memcg' for specific memory cgroup can be written and read.
the memory of the interest. 'anon' for anonymous pages,
'memcg' for specific memory cgroup, 'addr' for address range
(an open-ended interval), or 'target' for DAMON monitoring
target can be written and read.

What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/memcg_path
Date: Dec 2022
@@ -279,6 +283,27 @@ Description: If 'memcg' is written to the 'type' file, writing to and
reading from this file sets and gets the path to the memory
cgroup of the interest.

What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/addr_start
Date: Jul 2023
Contact: SeongJae Park <[email protected]>
Description: If 'addr' is written to the 'type' file, writing to or reading
from this file sets or gets the start address of the address
range for the filter.

What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/addr_end
Date: Jul 2023
Contact: SeongJae Park <[email protected]>
Description: If 'addr' is written to the 'type' file, writing to or reading
from this file sets or gets the end address of the address
range for the filter.

What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/target_idx
Date: Dec 2022
Contact: SeongJae Park <[email protected]>
Description: If 'target' is written to the 'type' file, writing to or
reading from this file sets or gets the index of the DAMON
monitoring target of the interest.

What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/matching
Date: Dec 2022
Contact: SeongJae Park <[email protected]>
@@ -317,6 +342,13 @@ Contact: SeongJae Park <[email protected]>
Description: Reading this file returns the number of the exceed events of
the scheme's quotas.

What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/total_bytes
Date: Jul 2023
Contact: SeongJae Park <[email protected]>
Description: Reading this file returns the total amount of memory that
corresponding DAMON-based Operation Scheme's action has tried
to be applied.

What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/<R>/start
Date: Oct 2022
Contact: SeongJae Park <[email protected]>
4 changes: 2 additions & 2 deletions Documentation/ABI/testing/sysfs-memory-page-offline
@@ -10,7 +10,7 @@ Description:
dropping it if possible. The kernel will then be placed
on the bad page list and never be reused.

The offlining is done in kernel specific granuality.
The offlining is done in kernel specific granularity.
Normally it's the base page size of the kernel, but
this might change.

@@ -35,7 +35,7 @@ Description:
to access this page assuming it's poisoned by the
hardware.

The offlining is done in kernel specific granuality.
The offlining is done in kernel specific granularity.
Normally it's the base page size of the kernel, but
this might change.

2 changes: 0 additions & 2 deletions Documentation/admin-guide/cgroup-v1/memory.rst
@@ -92,8 +92,6 @@ Brief summary of control files.
memory.oom_control set/show oom controls.
memory.numa_stat show the number of memory usage per numa
node
memory.kmem.limit_in_bytes This knob is deprecated and writing to
it will return -ENOTSUPP.
memory.kmem.usage_in_bytes show current kernel memory allocation
memory.kmem.failcnt show the number of kernel memory usage
hits limits
14 changes: 4 additions & 10 deletions Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -141,8 +141,8 @@ nodemask_t
The size of a nodemask_t type. Used to compute the number of online
nodes.

(page, flags|_refcount|mapping|lru|_mapcount|private|compound_dtor|compound_order|compound_head)
-------------------------------------------------------------------------------------------------
(page, flags|_refcount|mapping|lru|_mapcount|private|compound_order|compound_head)
----------------------------------------------------------------------------------

User-space tools compute their values based on the offset of these
variables. The variables are used when excluding unnecessary pages.
@@ -325,8 +325,8 @@ NR_FREE_PAGES
On linux-2.6.21 or later, the number of free pages is in
vm_stat[NR_FREE_PAGES]. Used to get the number of free pages.

PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask
------------------------------------------------------------------------------
PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask|PG_hugetlb
-----------------------------------------------------------------------------------------

Page attributes. These flags are used to filter various unnecessary for
dumping pages.
@@ -338,12 +338,6 @@ More page attributes. These flags are used to filter various unnecessary for
dumping pages.


HUGETLB_PAGE_DTOR
-----------------

The HUGETLB_PAGE_DTOR flag denotes hugetlbfs pages. Makedumpfile
excludes these pages.

x86_64
======

76 changes: 50 additions & 26 deletions Documentation/admin-guide/mm/damon/usage.rst
@@ -87,7 +87,7 @@ comma (","). ::
│ │ │ │ │ │ │ filters/nr_filters
│ │ │ │ │ │ │ │ 0/type,matching,memcg_id
│ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
│ │ │ │ │ │ │ tried_regions/
│ │ │ │ │ │ │ tried_regions/total_bytes
│ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age
│ │ │ │ │ │ │ │ ...
│ │ │ │ │ │ ...
@@ -127,14 +127,18 @@ in the state. Writing ``commit`` to the ``state`` file makes kdamond reads the
user inputs in the sysfs files except ``state`` file again. Writing
``update_schemes_stats`` to ``state`` file updates the contents of stats files
for each DAMON-based operation scheme of the kdamond. For details of the
stats, please refer to :ref:`stats section <sysfs_schemes_stats>`. Writing
``update_schemes_tried_regions`` to ``state`` file updates the DAMON-based
operation scheme action tried regions directory for each DAMON-based operation
scheme of the kdamond. Writing ``clear_schemes_tried_regions`` to ``state``
file clears the DAMON-based operating scheme action tried regions directory for
each DAMON-based operation scheme of the kdamond. For details of the
DAMON-based operation scheme action tried regions directory, please refer to
:ref:`tried_regions section <sysfs_schemes_tried_regions>`.
stats, please refer to :ref:`stats section <sysfs_schemes_stats>`.

Writing ``update_schemes_tried_regions`` to ``state`` file updates the
DAMON-based operation scheme action tried regions directory for each
DAMON-based operation scheme of the kdamond. Writing
``update_schemes_tried_bytes`` to ``state`` file updates only
``.../tried_regions/total_bytes`` files. Writing
``clear_schemes_tried_regions`` to ``state`` file clears the DAMON-based
operating scheme action tried regions directory for each DAMON-based operation
scheme of the kdamond. For details of the DAMON-based operation scheme action
tried regions directory, please refer to :ref:`tried_regions section
<sysfs_schemes_tried_regions>`.

If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.

@@ -359,15 +363,21 @@ number (``N``) to the file creates the number of child directories named ``0``
to ``N-1``. Each directory represents each filter. The filters are evaluated
in the numeric order.

Each filter directory contains three files, namely ``type``, ``matcing``, and
``memcg_path``. You can write one of two special keywords, ``anon`` for
anonymous pages, or ``memcg`` for specific memory cgroup filtering. In case of
the memory cgroup filtering, you can specify the memory cgroup of the interest
by writing the path of the memory cgroup from the cgroups mount point to
``memcg_path`` file. You can write ``Y`` or ``N`` to ``matching`` file to
filter out pages that does or does not match to the type, respectively. Then,
the scheme's action will not be applied to the pages that specified to be
filtered out.
Each filter directory contains six files, namely ``type``, ``matching``,
``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``. To ``type``
file, you can write one of four special keywords: ``anon`` for anonymous pages,
``memcg`` for specific memory cgroup, ``addr`` for specific address range (an
open-ended interval), or ``target`` for specific DAMON monitoring target
filtering. In case of the memory cgroup filtering, you can specify the memory
cgroup of the interest by writing the path of the memory cgroup from the
cgroups mount point to ``memcg_path`` file. In case of the address range
filtering, you can specify the start and end address of the range to
``addr_start`` and ``addr_end`` files, respectively. For the DAMON monitoring
target filtering, you can specify the index of the target between the list of
the DAMON context's monitoring targets list to ``target_idx`` file. You can
write ``Y`` or ``N`` to ``matching`` file to filter out pages that does or does
not match to the type, respectively. Then, the scheme's action will not be
applied to the pages that specified to be filtered out.

For example, below restricts a DAMOS action to be applied to only non-anonymous
pages of all memory cgroups except ``/having_care_already``.::
@@ -381,8 +391,14 @@ pages of all memory cgroups except ``/having_care_already``.::
echo /having_care_already > 1/memcg_path
echo N > 1/matching

Note that filters are currently supported only when ``paddr``
`implementation <sysfs_contexts>` is being used.
Note that ``anon`` and ``memcg`` filters are currently supported only when
``paddr`` `implementation <sysfs_contexts>` is being used.

Also, memory regions that are filtered out by ``addr`` or ``target`` filters
are not counted as the scheme has tried to those, while regions that filtered
out by other type filters are counted as the scheme has tried to. The
difference is applied to :ref:`stats <damos_stats>` and
:ref:`tried regions <sysfs_schemes_tried_regions>`.
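
As an illustration of the ``addr`` filter files described above, here is a minimal Python sketch that fills in one ``filters/<F>/`` directory. The helper name ``configure_addr_filter`` is hypothetical, the filter directory is assumed to exist already (created by writing to ``nr_filters``), and writing the addresses as decimal text is an assumption, not something the documentation above specifies.

```python
from pathlib import Path


def configure_addr_filter(filter_dir: Path, start: int, end: int,
                          matching: bool) -> None:
    """Write an ``addr`` type filter into one existing filters/<F>/ directory.

    Hypothetical helper: the file names come from the documentation above,
    the decimal address format is an assumption.
    """
    (filter_dir / "type").write_text("addr\n")
    (filter_dir / "addr_start").write_text(f"{start}\n")
    (filter_dir / "addr_end").write_text(f"{end}\n")
    # 'Y' filters out memory that matches the range, 'N' memory that does not.
    (filter_dir / "matching").write_text("Y\n" if matching else "N\n")
```

Taking the directory as a parameter keeps the sketch usable against a scratch directory for dry runs, and against a path such as ``.../schemes/0/filters/0`` on a system where the DAMON sysfs interface is available.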

.. _sysfs_schemes_stats:

@@ -406,13 +422,21 @@ stats by writing a special keyword, ``update_schemes_stats`` to the relevant
schemes/<N>/tried_regions/
--------------------------

This directory initially has one file, ``total_bytes``.

When a special keyword, ``update_schemes_tried_regions``, is written to the
relevant ``kdamonds/<N>/state`` file, DAMON creates directories named integer
starting from ``0`` under this directory. Each directory contains files
exposing detailed information about each of the memory region that the
corresponding scheme's ``action`` has tried to be applied under this directory,
during next :ref:`aggregation interval <sysfs_monitoring_attrs>`. The
information includes address range, ``nr_accesses``, and ``age`` of the region.
relevant ``kdamonds/<N>/state`` file, DAMON updates the ``total_bytes`` file so
that reading it returns the total size of the scheme tried regions, and creates
directories named integer starting from ``0`` under this directory. Each
directory contains files exposing detailed information about each of the memory
region that the corresponding scheme's ``action`` has tried to be applied under
this directory, during next :ref:`aggregation interval
<sysfs_monitoring_attrs>`. The information includes address range,
``nr_accesses``, and ``age`` of the region.

Writing ``update_schemes_tried_bytes`` to the relevant ``kdamonds/<N>/state``
file will only update the ``total_bytes`` file, and will not create the
subdirectories.
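
The ``update_schemes_tried_bytes`` flow can be sketched in Python as below. This is only a sketch under stated assumptions: the helper name ``read_tried_bytes`` and the ``root`` parameter are hypothetical, the directory layout follows the paths quoted in this documentation, and ``total_bytes`` is assumed to contain a single decimal integer.

```python
from pathlib import Path

# Location of the kdamonds directory as given in the ABI document above.
DAMON_KDAMONDS = Path("/sys/kernel/mm/damon/admin/kdamonds")


def read_tried_bytes(kdamond: int, context: int, scheme: int,
                     root: Path = DAMON_KDAMONDS) -> int:
    """Refresh and read tried_regions/total_bytes for one scheme.

    Hypothetical helper; assumes total_bytes holds one decimal integer.
    """
    kdamond_dir = root / str(kdamond)
    # Ask the kdamond to update only the total_bytes files,
    # without creating the per-region subdirectories.
    (kdamond_dir / "state").write_text("update_schemes_tried_bytes\n")
    total = (kdamond_dir / "contexts" / str(context) / "schemes"
             / str(scheme) / "tried_regions" / "total_bytes")
    return int(total.read_text().strip())
```

On a real system this needs sufficient privileges and a running kdamond; passing a different ``root`` allows the same code to be exercised against a mock directory tree.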

The directories will be removed when another special keyword,
``clear_schemes_tried_regions``, is written to the relevant
27 changes: 20 additions & 7 deletions Documentation/admin-guide/mm/ksm.rst
@@ -159,6 +159,8 @@ The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:

general_profit
how effective is KSM. The calculation is explained below.
pages_scanned
how many pages are being scanned for ksm
pages_shared
how many shared pages are being used
pages_sharing
@@ -173,6 +175,13 @@ stable_node_chains
the number of KSM pages that hit the ``max_page_sharing`` limit
stable_node_dups
number of duplicated KSM pages
ksm_zero_pages
how many zero pages that are still mapped into processes were mapped by
KSM when deduplicating.

When ``use_zero_pages`` is/was enabled, the sum of ``pages_sharing`` +
``ksm_zero_pages`` represents the actual number of pages saved by KSM.
If ``use_zero_pages`` has never been enabled, ``ksm_zero_pages`` is 0.

A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good
sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing``
@@ -196,21 +205,25 @@ several times, which are unprofitable memory consumed.
1) How to determine whether KSM save memory or consume memory in system-wide
range? Here is a simple approximate calculation for reference::

general_profit =~ pages_sharing * sizeof(page) - (all_rmap_items) *
general_profit =~ ksm_saved_pages * sizeof(page) - (all_rmap_items) *
sizeof(rmap_item);

where all_rmap_items can be easily obtained by summing ``pages_sharing``,
``pages_shared``, ``pages_unshared`` and ``pages_volatile``.
where ksm_saved_pages equals to the sum of ``pages_sharing`` +
``ksm_zero_pages`` of the system, and all_rmap_items can be easily
obtained by summing ``pages_sharing``, ``pages_shared``, ``pages_unshared``
and ``pages_volatile``.

2) The KSM profit inner a single process can be similarly obtained by the
following approximate calculation::

process_profit =~ ksm_merging_pages * sizeof(page) -
process_profit =~ ksm_saved_pages * sizeof(page) -
ksm_rmap_items * sizeof(rmap_item).

where ksm_merging_pages is shown under the directory ``/proc/<pid>/``,
and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``. The process profit
is also shown in ``/proc/<pid>/ksm_stat`` as ksm_process_profit.
where ksm_saved_pages equals to the sum of ``ksm_merging_pages`` and
``ksm_zero_pages``, both of which are shown under the directory
``/proc/<pid>/ksm_stat``, and ksm_rmap_items is also shown in
``/proc/<pid>/ksm_stat``. The process profit is also shown in
``/proc/<pid>/ksm_stat`` as ksm_process_profit.
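
The two updated approximations can be written out as a short Python sketch. The function names are hypothetical, and ``rmap_item_size`` stands in for ``sizeof(rmap_item)``, whose real value depends on the kernel build; the numbers in the usage note are made up for illustration.

```python
def ksm_general_profit(pages_sharing: int, ksm_zero_pages: int,
                       pages_shared: int, pages_unshared: int,
                       pages_volatile: int, page_size: int,
                       rmap_item_size: int) -> int:
    """System-wide approximation from the updated formula above."""
    ksm_saved_pages = pages_sharing + ksm_zero_pages
    all_rmap_items = (pages_sharing + pages_shared
                      + pages_unshared + pages_volatile)
    return ksm_saved_pages * page_size - all_rmap_items * rmap_item_size


def ksm_process_profit(ksm_merging_pages: int, ksm_zero_pages: int,
                       ksm_rmap_items: int, page_size: int,
                       rmap_item_size: int) -> int:
    """Per-process approximation using the /proc/<pid>/ksm_stat counters."""
    ksm_saved_pages = ksm_merging_pages + ksm_zero_pages
    return ksm_saved_pages * page_size - ksm_rmap_items * rmap_item_size
```

For example, with 1000 sharing pages, 200 KSM-placed zero pages, 100 shared, 50 unshared and 50 volatile pages, a 4096-byte page and an assumed 64-byte ``rmap_item``, ``ksm_general_profit`` gives 1200 * 4096 - 1200 * 64 = 4838400 bytes.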

From the perspective of application, a high ratio of ``ksm_rmap_items`` to
``ksm_merging_pages`` means a bad madvise-applied policy, so developers or
14 changes: 13 additions & 1 deletion Documentation/admin-guide/mm/memory-hotplug.rst
@@ -433,6 +433,18 @@ The following module parameters are currently defined:
memory in a way that huge pages in bigger
granularity cannot be formed on hotplugged
memory.

With value "force" it could result in memory
wastage due to memmap size limitations. For
example, if the memmap for a memory block
requires 1 MiB, but the pageblock size is 2
MiB, 1 MiB of hotplugged memory will be wasted.
Note that there are still cases where the
feature cannot be enforced: for example, if the
memmap is smaller than a single page, or if the
architecture does not support the forced mode
in all configurations.

``online_policy`` read-write: Set the basic policy used for
automatic zone selection when onlining memory
blocks without specifying a target zone.
@@ -669,7 +681,7 @@ when still encountering permanently unmovable pages within ZONE_MOVABLE
(-> BUG), memory offlining will keep retrying until it eventually succeeds.

When offlining is triggered from user space, the offlining context can be
terminated by sending a fatal signal. A timeout based offlining can easily be
terminated by sending a signal. A timeout based offlining can easily be
implemented via::

% timeout $TIMEOUT offline_block | failure_handling
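
The wastage example given for the "force" value in the module parameter text earlier in this file's diff can be reproduced with a small Python calculation. This is a sketch that assumes the waste comes from padding the memmap up to a whole number of pageblocks, which matches the 1 MiB / 2 MiB example but is not spelled out in the text; the function name is hypothetical.

```python
MIB = 1 << 20


def memmap_force_wastage(memmap_bytes: int, pageblock_bytes: int) -> int:
    """Bytes wasted if the memmap is padded up to whole pageblocks.

    Assumption: padding to a pageblock multiple is the only source of waste.
    """
    remainder = memmap_bytes % pageblock_bytes
    return 0 if remainder == 0 else pageblock_bytes - remainder
```

With a 1 MiB memmap and a 2 MiB pageblock, ``memmap_force_wastage(1 * MIB, 2 * MIB)`` returns 1 MiB, matching the documented example; a memmap that is already a pageblock multiple wastes nothing under this model.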
15 changes: 15 additions & 0 deletions Documentation/admin-guide/mm/userfaultfd.rst
@@ -244,6 +244,21 @@ write-protected (so future writes will also result in a WP fault). These ioctls
support a mode flag (``UFFDIO_COPY_MODE_WP`` or ``UFFDIO_CONTINUE_MODE_WP``
respectively) to configure the mapping this way.

Memory Poisoning Emulation
---------------------------

In response to a fault (either missing or minor), an action userspace can
take to "resolve" it is to issue a ``UFFDIO_POISON``. This will cause any
future faulters to either get a SIGBUS, or in KVM's case the guest will
receive an MCE as if there were hardware memory poisoning.

This is used to emulate hardware memory poisoning. Imagine a VM running on a
machine which experiences a real hardware memory error. Later, we live migrate
the VM to another physical machine. Since we want the migration to be
transparent to the guest, we want that same address range to act as if it was
still poisoned, even though it's on a new physical host which ostensibly
doesn't have a memory error in the exact same spot.

QEMU/KVM
========

14 changes: 7 additions & 7 deletions Documentation/admin-guide/mm/zswap.rst
@@ -49,7 +49,7 @@ compressed pool.
Design
======

Zswap receives pages for compression through the Frontswap API and is able to
Zswap receives pages for compression from the swap subsystem and is able to
evict pages from its own compressed pool on an LRU basis and write them back to
the backing swap device in the case that the compressed pool is full.

@@ -70,19 +70,19 @@ means the compression ratio will always be 2:1 or worse (because of half-full
zbud pages). The zsmalloc type zpool has a more complex compressed page
storage method, and it can achieve greater storage densities.

When a swap page is passed from frontswap to zswap, zswap maintains a mapping
When a swap page is passed from swapout to zswap, zswap maintains a mapping
of the swap entry, a combination of the swap type and swap offset, to the zpool
handle that references that compressed swap page. This mapping is achieved
with a red-black tree per swap type. The swap offset is the search key for the
tree nodes.

During a page fault on a PTE that is a swap entry, frontswap calls the zswap
load function to decompress the page into the page allocated by the page fault
handler.
During a page fault on a PTE that is a swap entry, the swapin code calls the
zswap load function to decompress the page into the page allocated by the page
fault handler.

Once there are no PTEs referencing a swap page stored in zswap (i.e. the count
in the swap_map goes to 0) the swap code calls the zswap invalidate function,
via frontswap, to free the compressed entry.
in the swap_map goes to 0) the swap code calls the zswap invalidate function
to free the compressed entry.

Zswap seeks to be simple in its policies. Sysfs attributes allow for one user
controlled policy:
1 change: 1 addition & 0 deletions Documentation/block/biovecs.rst
@@ -134,6 +134,7 @@ Usage of helpers:
bio_for_each_bvec_all()
bio_first_bvec_all()
bio_first_page_all()
bio_first_folio_all()
bio_last_bvec_all()

* The following helpers iterate over single-page segment. The passed 'struct