Physical Memory

Linux is available for a wide range of architectures so there is a need for an architecture-independent abstraction to represent the physical memory. This chapter describes the structures used to manage physical memory in a running system.

The first principal concept prevalent in the memory management is Non-Uniform Memory Access (NUMA). With multi-core and multi-socket machines, memory may be arranged into banks that incur a different cost to access depending on the “distance” from the processor. For example, there might be a bank of memory assigned to each CPU or a bank of memory very suitable for DMA near peripheral devices.

Each bank is called a node and the concept is represented under Linux by a struct pglist_data even if the architecture is UMA. This structure is always referenced by its typedef pg_data_t. A pg_data_t structure for a particular node can be referenced by NODE_DATA(nid) macro where nid is the ID of that node.

For NUMA architectures, the node structures are allocated by the architecture specific code early during boot. Usually, these structures are allocated locally on the memory bank they represent. For UMA architectures, only one static pg_data_t structure called contig_page_data is used. Nodes will be discussed further in Section :ref:`Nodes <nodes>`

The entire physical address space is partitioned into one or more blocks called zones which represent ranges within memory. These ranges are usually determined by architectural constraints for accessing the physical memory. The memory range within a node that corresponds to a particular zone is described by a struct zone, typedeffed to zone_t. Each zone has one of the types described below.

ZONE_DMA and ZONE_DMA32 historically represented memory suitable for DMA by peripheral devices that cannot access all of the addressable memory. For many years there are better more and robust interfaces to get memory with DMA specific requirements (Documentation/core-api/dma-api.rst), but ZONE_DMA and ZONE_DMA32 still represent memory ranges that have restrictions on how they can be accessed. Depending on the architecture, either of these zone types or even they both can be disabled at build time using CONFIG_ZONE_DMA and CONFIG_ZONE_DMA32 configuration options. Some 64-bit platforms may need both zones as they support peripherals with different DMA addressing limitations.
ZONE_NORMAL is for normal memory that can be accessed by the kernel all the time. DMA operations can be performed on pages in this zone if the DMA devices support transfers to all addressable memory. ZONE_NORMAL is always enabled.
ZONE_HIGHMEM is the part of the physical memory that is not covered by a permanent mapping in the kernel page tables. The memory in this zone is only accessible to the kernel using temporary mappings. This zone is available only on some 32-bit architectures and is enabled with CONFIG_HIGHMEM.
ZONE_MOVABLE is for normal accessible memory, just like ZONE_NORMAL. The difference is that the contents of most pages in ZONE_MOVABLE is movable. That means that while virtual addresses of these pages do not change, their content may move between different physical pages. Often ZONE_MOVABLE is populated during memory hotplug, but it may be also populated on boot using one of kernelcore, movablecore and movable_node kernel command line parameters. See Documentation/mm/page_migration.rst and Documentation/admin-guide/mm/memory-hotplug.rst for additional details.
ZONE_DEVICE represents memory residing on devices such as PMEM and GPU. It has different characteristics than RAM zone types and it exists to provide :ref:`struct page <Pages>` and memory map services for device driver identified physical address ranges. ZONE_DEVICE is enabled with configuration option CONFIG_ZONE_DEVICE.

It is important to note that many kernel operations can only take place using ZONE_NORMAL so it is the most performance critical zone. Zones are discussed further in Section :ref:`Zones <zones>`.

The relation between node and zone extents is determined by the physical memory map reported by the firmware, architectural constraints for memory addressing and certain parameters in the kernel command line.

For example, with 32-bit kernel on an x86 UMA machine with 2 Gbytes of RAM the entire memory will be on node 0 and there will be three zones: ZONE_DMA, ZONE_NORMAL and ZONE_HIGHMEM:

0                                                            2G
+-------------------------------------------------------------+
|                            node 0                           |
+-------------------------------------------------------------+

0         16M                    896M                        2G
+----------+-----------------------+--------------------------+
| ZONE_DMA |      ZONE_NORMAL      |       ZONE_HIGHMEM       |
+----------+-----------------------+--------------------------+

With a kernel built with ZONE_DMA disabled and ZONE_DMA32 enabled and booted with movablecore=80% parameter on an arm64 machine with 16 Gbytes of RAM equally split between two nodes, there will be ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE on node 0, and ZONE_NORMAL and ZONE_MOVABLE on node 1:

1G                                9G                         17G
+--------------------------------+ +--------------------------+
|              node 0            | |          node 1          |
+--------------------------------+ +--------------------------+

1G       4G        4200M          9G          9320M          17G
+---------+----------+-----------+ +------------+-------------+
|  DMA32  |  NORMAL  |  MOVABLE  | |   NORMAL   |   MOVABLE   |
+---------+----------+-----------+ +------------+-------------+

Memory banks may belong to interleaving nodes. In the example below an x86 machine has 16 Gbytes of RAM in 4 memory banks, even banks belong to node 0 and odd banks belong to node 1:

0              4G              8G             12G            16G
+-------------+ +-------------+ +-------------+ +-------------+
|    node 0   | |    node 1   | |    node 0   | |    node 1   |
+-------------+ +-------------+ +-------------+ +-------------+

0   16M      4G
+-----+-------+ +-------------+ +-------------+ +-------------+
| DMA | DMA32 | |    NORMAL   | |    NORMAL   | |    NORMAL   |
+-----+-------+ +-------------+ +-------------+ +-------------+

In this case node 0 will span from 0 to 12 Gbytes and node 1 will span from 4 to 16 Gbytes.

Nodes

As we have mentioned, each node in memory is described by a pg_data_t which is a typedef for a struct pglist_data. When allocating a page, by default Linux uses a node-local allocation policy to allocate memory from the node closest to the running CPU. As processes tend to run on the same CPU, it is likely the memory from the current node will be used. The allocation policy can be controlled by users as described in Documentation/admin-guide/mm/numa_memory_policy.rst.

Most NUMA architectures maintain an array of pointers to the node structures. The actual structures are allocated early during boot when architecture specific code parses the physical memory map reported by the firmware. The bulk of the node initialization happens slightly later in the boot process by free_area_init() function, described later in Section :ref:`Initialization <initialization>`.

Along with the node structures, kernel maintains an array of nodemask_t bitmasks called node_states. Each bitmask in this array represents a set of nodes with particular properties as defined by enum node_states:

N_POSSIBLE: The node could become online at some point.
N_ONLINE: The node is online.
N_NORMAL_MEMORY: The node has regular memory.
N_HIGH_MEMORY: The node has regular or high memory. When CONFIG_HIGHMEM is disabled aliased to N_NORMAL_MEMORY.
N_MEMORY: The node has memory(regular, high, movable)
N_CPU: The node has one or more CPUs

For each node that has a property described above, the bit corresponding to the node ID in the node_states[<property>] bitmask is set.

For example, for node 2 with normal memory and CPUs, bit 2 will be set in

node_states[N_POSSIBLE]
node_states[N_ONLINE]
node_states[N_NORMAL_MEMORY]
node_states[N_HIGH_MEMORY]
node_states[N_MEMORY]
node_states[N_CPU]

For various operations possible with nodemasks please refer to include/linux/nodemask.h.

Among other things, nodemasks are used to provide macros for node traversal, namely for_each_node() and for_each_online_node().

For instance, to call a function foo() for each online node:

for_each_online_node(nid) {
        pg_data_t *pgdat = NODE_DATA(nid);

        foo(pgdat);
}

Node structure

The nodes structure struct pglist_data is declared in include/linux/mmzone.h. Here we briefly describe fields of this structure:

General

node_zones: The zones for this node. Not all of the zones may be populated, but it is the full list. It is referenced by this node's node_zonelists as well as other node's node_zonelists.
node_zonelists: The list of all zones in all nodes. This list defines the order of zones that allocations are preferred from. The node_zonelists is set up by build_zonelists() in mm/page_alloc.c during the initialization of core memory management structures.
nr_zones: Number of populated zones in this node.
node_mem_map: For UMA systems that use FLATMEM memory model the 0's node node_mem_map is array of struct pages representing each physical frame.
node_page_ext: For UMA systems that use FLATMEM memory model the 0's node node_page_ext is array of extensions of struct pages. Available only in the kernels built with CONFIG_PAGE_EXTENSION enabled.
node_start_pfn: The page frame number of the starting page frame in this node.
node_present_pages: Total number of physical pages present in this node.
node_spanned_pages: Total size of physical page range, including holes.
node_size_lock: A lock that protects the fields defining the node extents. Only defined when at least one of CONFIG_MEMORY_HOTPLUG or CONFIG_DEFERRED_STRUCT_PAGE_INIT configuration options are enabled. pgdat_resize_lock() and pgdat_resize_unlock() are provided to manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG or CONFIG_DEFERRED_STRUCT_PAGE_INIT.
node_id: The Node ID (NID) of the node, starts at 0.
totalreserve_pages: This is a per-node reserve of pages that are not available to userspace allocations.
first_deferred_pfn: If memory initialization on large machines is deferred then this is the first PFN that needs to be initialized. Defined only when CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled
deferred_split_queue: Per-node queue of huge pages that their split was deferred. Defined only when CONFIG_TRANSPARENT_HUGEPAGE is enabled.
__lruvec: Per-node lruvec holding LRU lists and related parameters. Used only when memory cgroups are disabled. It should not be accessed directly, use mem_cgroup_lruvec() to look up lruvecs instead.

Reclaim control

See also Documentation/mm/page_reclaim.rst.

kswapd: Per-node instance of kswapd kernel thread.
kswapd_wait, pfmemalloc_wait, reclaim_wait: Workqueues used to synchronize memory reclaim tasks
nr_writeback_throttled: Number of tasks that are throttled waiting on dirty pages to clean.
nr_reclaim_start: Number of pages written while reclaim is throttled waiting for writeback.
kswapd_order: Controls the order kswapd tries to reclaim
kswapd_highest_zoneidx: The highest zone index to be reclaimed by kswapd
kswapd_failures: Number of runs kswapd was unable to reclaim any pages
min_unmapped_pages: Minimal number of unmapped file backed pages that cannot be reclaimed. Determined by vm.min_unmapped_ratio sysctl. Only defined when CONFIG_NUMA is enabled.
min_slab_pages: Minimal number of SLAB pages that cannot be reclaimed. Determined by vm.min_slab_ratio sysctl. Only defined when CONFIG_NUMA is enabled
flags: Flags controlling reclaim behavior.

Compaction control

kcompactd_max_order: Page order that kcompactd should try to achieve.
kcompactd_highest_zoneidx: The highest zone index to be compacted by kcompactd.
kcompactd_wait: Workqueue used to synchronize memory compaction tasks.
kcompactd: Per-node instance of kcompactd kernel thread.
proactive_compact_trigger: Determines if proactive compaction is enabled. Controlled by vm.compaction_proactiveness sysctl.

Statistics

per_cpu_nodestats: Per-CPU VM statistics for the node
vm_stat: VM statistics for the node.

Zones

Stub

This section is incomplete. Please list and describe the appropriate fields.

Pages

Stub

This section is incomplete. Please list and describe the appropriate fields.

Folios

Stub

This section is incomplete. Please list and describe the appropriate fields.

Initialization

Stub

This section is incomplete. Please list and describe the appropriate fields.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

physical_memory.rst

physical_memory.rst

Physical Memory

Nodes

Node structure

General

Reclaim control

Compaction control

Statistics

Zones

Pages

Folios

Initialization

Files

physical_memory.rst

Latest commit

History

physical_memory.rst

File metadata and controls

Physical Memory

Nodes

Node structure

General

Reclaim control

Compaction control

Statistics

Zones

Pages

Folios

Initialization